# ADR-005: High Availability and Operations

Status: Accepted
Date: 2026-03-26
Deciders: Alexandre Philippi
## Context

The current platform has several operational gaps that need addressing before bare metal (Phase 2):
- Single point of failure — one k3s server node (`vm-sandbox-1`). If it goes down, the entire cluster is unavailable. Acceptable for Azure staging, not for bare metal production.
- No alerting — VictoriaMetrics and VictoriaLogs collect data, but nobody gets notified when things break. We only discover issues when someone checks manually.
- No CI/CD for sandbox images — images are built and pushed manually. No automated pipeline from code push to deployed image.
- kata-deploy CrashLooping — kata-deploy pods restart repeatedly after completing their work (cosmetic — the DaemonSet re-runs on nodes that already have Kata installed). Not a real failure, but it pollutes logs and triggers false alarms.
- Manual VM start every morning — Azure VMs auto-shutdown at 20:00 BRT to save costs, but there's no auto-start. Someone has to manually run `az vm start` or click through the portal each morning.
## Decision

### 1. k3s HA with embedded etcd (bare metal only)

For bare metal (Phase 2), run 3 k3s server nodes with embedded etcd for control plane HA. All 4 mini PCs participate as workers; 3 of them also run the k3s server.
- mini-pc-1: k3s server (etcd) + worker
- mini-pc-2: k3s server (etcd) + worker
- mini-pc-3: k3s server (etcd) + worker
- mini-pc-4: k3s worker only
- dgx-spark-1: k3s worker (sandbox + GPU)
- dgx-spark-2: k3s worker (sandbox + GPU)
- rtx-host: k3s worker (GPU only, tainted)

The 3-server topology gives etcd quorum (tolerates 1 server failure). Workers point to a virtual IP or DNS round-robin across the 3 servers. If one server dies, the remaining 2 maintain quorum and the cluster continues operating.
Azure staging stays single control plane — it’s a staging environment and the added complexity isn’t justified.
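Assuming the nodes are configured through k3s's `config.yaml` mechanism, the topology above might look like the sketch below. The token, hostnames, and the round-robin DNS name are placeholders, not values from this deployment:

```yaml
# /etc/rancher/k3s/config.yaml on mini-pc-1 (first server, bootstraps embedded etcd)
cluster-init: true
token: "<shared-cluster-token>"   # placeholder; shared by all nodes

# On mini-pc-2 and mini-pc-3 (joining servers), replace cluster-init with:
#   server: "https://mini-pc-1:6443"
#   token: "<shared-cluster-token>"

# On worker-only nodes (k3s agent), point at all 3 servers via DNS:
#   server: "https://k3s-api.internal:6443"   # round-robin A records for the 3 servers
#   token: "<shared-cluster-token>"
```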
### 2. Alerting via vmalert

Deploy vmalert alongside VictoriaMetrics. vmalert evaluates recording and alerting rules against VictoriaMetrics data, then fires notifications via webhook to Slack or Discord.
Key alerting rules:
| Alert | Condition | Severity |
|---|---|---|
| NodeNotReady | `kube_node_status_condition{condition="Ready",status="true"} == 0` for 2m | critical |
| PodCrashLooping | `increase(kube_pod_container_status_restarts_total[15m]) > 5` | warning |
| PVCAlmostFull | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` | warning |
| SandboxCreationFailures | `rate(opensandbox_sandbox_create_errors_total[5m]) > 0` | critical |
| VLLMLatencyHigh | `histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 2` | warning |
| NodeDiskPressure | `kube_node_status_condition{condition="DiskPressure",status="true"} == 1` | critical |
| KataDeployRestarting | `kube_pod_container_status_restarts_total{pod=~"kata-deploy.*"} > 10` AND node already has Kata installed | info (suppressed) |
The KataDeployRestarting alert is explicitly suppressed or set to info — it’s a known cosmetic issue where kata-deploy pods restart after completing installation. No action needed.
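As a sketch, the first rule from the table could be expressed in vmalert's Prometheus-compatible rule format like this (the file name and group name are illustrative):

```yaml
# alerts.yaml — sketch of the NodeNotReady rule for vmalert
groups:
  - name: cluster-health
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m                 # node must be NotReady for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 2 minutes"
```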
### 3. CI/CD for sandbox images

GitHub Actions pipeline:
push to `sandbox/images/**` → GitHub Actions → build multi-arch image → push to GHCR → Ansible deploys

- Trigger: push to the `sandbox/images/` directory
- Build: `docker buildx` for amd64 + arm64 (required for DGX Spark GB10 ARM nodes)
- Registry: GitHub Container Registry (GHCR) — free for public repos, no self-hosted registry to maintain
- Deploy: Ansible playbook pulls new images and restarts affected pods
This solves the multi-arch requirement from CLAUDE.md (“Sandbox images must be multi-arch”) as part of the standard build pipeline.
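A minimal workflow implementing this pipeline could look like the following. The workflow file name, image name, and build context are assumptions; only the trigger path and the amd64 + arm64 platforms come from this ADR:

```yaml
# .github/workflows/sandbox-images.yaml — sketch, adjust names and paths to the repo
name: sandbox-images
on:
  push:
    paths:
      - "sandbox/images/**"
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write          # needed to push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3      # emulation for the arm64 build
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: sandbox/images
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/${{ github.repository }}/sandbox:${{ github.sha }}
```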
### 4. Azure auto-start

Two options, in order of preference:
Option A: Azure Automation Account with managed identity
- Automation runbook triggers at 08:00 BRT on weekdays
- Managed identity has `Virtual Machine Contributor` on the resource group
- Runs `Start-AzVM` for both VMs
Option B: Cron on a local machine

- `crontab` entry on a developer machine: `0 8 * * 1-5 az vm start -g sharpi-compute-staging --name vm-sandbox-1 && az vm start -g sharpi-compute-staging --name vm-sandbox-2`
- Simpler, but depends on the local machine being on and connected
Start with Option B (zero setup cost), move to Option A if reliability matters.
## Consequences

### What HA buys

- Survive single node failure — one mini PC dying doesn't take down the cluster. etcd quorum (2/3) is maintained, and the API server remains available on the other two servers.
- Rolling upgrades — k3s servers can be upgraded one at a time without cluster downtime.
- Worker continuity — all 4 mini PCs remain workers, so sandbox capacity is unaffected by the HA topology.
### What HA does NOT buy

- Network partition tolerance — if the network between servers partitions, the minority side loses quorum. All nodes are on the same L2 segment (10GbE switch), so this is unlikely but not impossible.
- Multi-site DR — all hardware is in one physical location. A power outage or switch failure takes everything down.
- Automatic failover for stateful workloads — running sandboxes on a dying node are lost. OpenSandbox must handle sandbox recreation on a healthy node (this is expected behavior — sandboxes are ephemeral).
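The quorum arithmetic behind both lists is simple enough to sketch (the helper name is illustrative):

```python
def etcd_quorum(servers: int) -> tuple[int, int]:
    """Return (quorum size, tolerated server failures) for an etcd cluster."""
    quorum = servers // 2 + 1          # majority of voting members
    return quorum, servers - quorum

# 3 servers (bare metal): quorum 2, tolerates 1 failure.
# 1 server (Azure staging): quorum 1, tolerates 0 failures.
```

Note that 4 servers would still tolerate only 1 failure (quorum of 3), which is why the design runs the server role on 3 of the mini PCs rather than all 4.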
### What alerting buys

- Faster incident response — issues surface in Slack/Discord instead of waiting for someone to check dashboards.
- Noise reduction — suppressing known cosmetic issues (kata-deploy restarts) means alerts are actionable.
### What CI/CD buys

- Reproducible builds — the same image from the same commit, every time.
- Multi-arch by default — ARM support for DGX Spark is built into the pipeline, not a manual step.
## Alternatives Considered

### A. External etcd cluster (separate from k3s)

Running etcd as a standalone cluster on dedicated nodes or containers.
Rejected. More operational complexity (separate etcd backup/restore, TLS cert management, version upgrades). k3s embedded etcd is simpler and sufficient for our scale (7 nodes).
### B. k3s with external database (PostgreSQL)

k3s supports using PostgreSQL or MySQL as the datastore instead of etcd.
Rejected. Introduces a database dependency that needs its own HA (PostgreSQL replication). Heavier than embedded etcd for a cluster this small. Would make sense at 50+ nodes, not 7.
### C. Single control plane with automated backup/restore

Keep one k3s server, but snapshot etcd regularly and automate restore to a standby node if the primary fails.
Rejected for bare metal. Recovery time is minutes to tens of minutes (detect failure, spin up server, restore snapshot, rejoin workers). With 3 servers, failover is automatic and takes seconds at most (an etcd leader election). The hardware cost is zero — mini-pc-1/2/3 are already in the cluster as workers; running k3s server on them adds negligible overhead.
Acceptable for Azure staging — where we do run single control plane, since staging downtime doesn’t affect users.
### D. Keepalived / HAProxy for k3s server VIP

A virtual IP that floats between the 3 k3s servers using keepalived, with HAProxy load balancing API requests.
Not rejected; deferred. This is the right approach for production but adds a setup step. For initial bare metal deployment, DNS round-robin (or a simple `/etc/hosts` entry on workers pointing to all 3 server IPs) is sufficient. Add keepalived + VIP when the cluster is stable.
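When that step comes, the keepalived side could be sketched as below. The VIP, interface name, and priority are placeholders; mini-pc-2 and mini-pc-3 would run the same config with `state BACKUP` and lower priorities:

```
# /etc/keepalived/keepalived.conf on mini-pc-1 — sketch only
vrrp_instance k3s_api {
    state MASTER
    interface eth0              # placeholder: the 10GbE interface name
    virtual_router_id 51
    priority 150                # highest priority holds the VIP
    advert_int 1
    virtual_ipaddress {
        10.0.0.50/24            # placeholder VIP workers would point at
    }
}
```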