
ADR-005: High Availability and Operations

Status: Accepted
Date: 2026-03-26
Deciders: Alexandre Philippi

The current platform has several operational gaps that need addressing before bare metal (Phase 2):

  1. Single point of failure — one k3s server node (vm-sandbox-1). If it goes down, the entire cluster is unavailable. Acceptable for Azure staging, not for bare metal production.

  2. No alerting — VictoriaMetrics and VictoriaLogs collect data, but nobody gets notified when things break. We only discover issues when someone checks manually.

  3. No CI/CD for sandbox images — images are built and pushed manually. No automated pipeline from code push to deployed image.

  4. kata-deploy CrashLooping — kata-deploy pods restart repeatedly after completing their work (cosmetic — the DaemonSet re-runs on nodes that already have Kata installed). Not a real failure, but pollutes logs and triggers false alarms.

  5. Manual VM start every morning — Azure VMs auto-shutdown at 20:00 BRT to save costs, but there’s no auto-start. Someone has to manually run az vm start or click the portal each morning.

1. k3s HA with embedded etcd (bare metal only)

For bare metal (Phase 2), run 3 k3s server nodes with embedded etcd for control plane HA. All 4 mini PCs participate as workers; 3 of them also run the k3s server.

```
mini-pc-1:   k3s server (etcd) + worker
mini-pc-2:   k3s server (etcd) + worker
mini-pc-3:   k3s server (etcd) + worker
mini-pc-4:   k3s worker only
dgx-spark-1: k3s worker (sandbox + GPU)
dgx-spark-2: k3s worker (sandbox + GPU)
rtx-host:    k3s worker (GPU only, tainted)
```

The 3-server topology gives etcd quorum (tolerates 1 server failure). Workers point to a virtual IP or DNS round-robin across the 3 servers. If one server dies, the remaining 2 maintain quorum and the cluster continues operating.
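The bootstrap sequence can be sketched as follows. The LAN addresses (10.0.0.11–13 for mini-pc-1/2/3) are placeholders, and the commands are printed rather than executed so they can be reviewed per node; the `--cluster-init`, `--server`, and `K3S_TOKEN` mechanics are standard k3s.

```shell
# Hypothetical server addresses for mini-pc-1/2/3; adjust for the real LAN.
SERVERS=(10.0.0.11 10.0.0.12 10.0.0.13)

# mini-pc-1 initializes the embedded etcd cluster:
BOOTSTRAP="curl -sfL https://get.k3s.io | sh -s - server --cluster-init"

# mini-pc-2/3 join as additional servers; workers run 'agent' against the same URL.
JOIN=()
for ip in "${SERVERS[@]:1}"; do
  JOIN+=("curl -sfL https://get.k3s.io | K3S_TOKEN=\$K3S_TOKEN sh -s - server --server https://${SERVERS[0]}:6443")
done

# Print the per-node commands for review.
printf '%s\n' "$BOOTSTRAP" "${JOIN[@]}"
```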

Azure staging stays single control plane — it’s a staging environment and the added complexity isn’t justified.

2. Alerting with vmalert

Deploy vmalert alongside VictoriaMetrics. vmalert evaluates recording and alerting rules against VictoriaMetrics data, then fires notifications via webhook to Slack or Discord.

Key alerting rules:

| Alert | Condition | Severity |
| --- | --- | --- |
| NodeNotReady | `kube_node_status_condition{condition="Ready",status="true"} == 0` for 2m | critical |
| PodCrashLooping | `increase(kube_pod_container_status_restarts_total[15m]) > 5` | warning |
| PVCAlmostFull | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` | warning |
| SandboxCreationFailures | `rate(opensandbox_sandbox_create_errors_total[5m]) > 0` | critical |
| VLLMLatencyHigh | `histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 2` | warning |
| NodeDiskPressure | `kube_node_status_condition{condition="DiskPressure",status="true"} == 1` | critical |
| KataDeployRestarting | `kube_pod_container_status_restarts_total{pod=~"kata-deploy.*"} > 10` on nodes that already have Kata installed | info (suppressed) |

The KataDeployRestarting alert is explicitly suppressed or set to info — it’s a known cosmetic issue where kata-deploy pods restart after completing installation. No action needed.
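vmalert consumes Prometheus-style rules files, so the table above translates directly. A fragment covering the first two rules (the group name and `severity` label values are our convention, not fixed by vmalert):

```yaml
groups:
  - name: cluster-health
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m
        labels:
          severity: critical
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
        labels:
          severity: warning
```

The notification side (Slack/Discord webhook) is configured in the Alertmanager-compatible receiver, not in the rules file.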

3. CI/CD for sandbox images

GitHub Actions pipeline:

push to sandbox/images/** → GitHub Actions → build multi-arch image → push to GHCR → Ansible deploys
  • Trigger: push to sandbox/images/ directory
  • Build: docker buildx for amd64 + arm64 (required for DGX Spark GB10 ARM nodes)
  • Registry: GitHub Container Registry (GHCR) — free for public repos, no self-hosted registry to maintain
  • Deploy: Ansible playbook pulls new images and restarts affected pods

This solves the multi-arch requirement from CLAUDE.md (“Sandbox images must be multi-arch”) as part of the standard build pipeline.
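A minimal workflow sketch for the pipeline above. The image directory (`sandbox/images/python`), image name, and tag scheme are assumptions for illustration; the actions and their inputs are the standard Docker buildx actions.

```yaml
name: sandbox-images
on:
  push:
    paths:
      - "sandbox/images/**"

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      # QEMU + buildx enable the arm64 cross-build for the DGX Spark nodes.
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: sandbox/images/python   # hypothetical image directory
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/${{ github.repository }}/sandbox-python:${{ github.sha }}
```

The Ansible deploy step would run as a follow-up job or be triggered out of band; it is not shown here.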

4. Azure VM auto-start

Two options, in order of long-term preference:

Option A: Azure Automation Account with managed identity

  • Automation runbook triggers at 08:00 BRT on weekdays
  • Managed identity has Virtual Machine Contributor on the resource group
  • Runs Start-AzVM for both VMs

Option B: Cron on a local machine

  • crontab entry on a developer machine: 0 8 * * 1-5 az vm start -g sharpi-compute-staging --name vm-sandbox-1 && az vm start -g sharpi-compute-staging --name vm-sandbox-2
  • Simpler but depends on the local machine being on and connected

Start with Option B (zero setup cost), move to Option A if reliability matters.
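Option B can be a small script rather than a one-line crontab entry, so a failure on one VM is reported instead of silently short-circuiting the other. A sketch, using the resource group and VM names from this ADR; the `command -v az` guard makes it a no-op on machines without the Azure CLI:

```shell
#!/usr/bin/env bash
# Weekday auto-start for the staging VMs (Option B).
start_staging_vms() {
  local rg="sharpi-compute-staging" vm
  for vm in vm-sandbox-1 vm-sandbox-2; do
    # --no-wait: don't serialize on each VM's boot time.
    az vm start -g "$rg" -n "$vm" --no-wait || echo "failed to start $vm" >&2
  done
}

# Only attempt the start when the Azure CLI is available.
if command -v az >/dev/null 2>&1; then
  start_staging_vms
fi
```

The crontab entry then becomes `0 8 * * 1-5 /path/to/start-staging-vms.sh`.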

What the HA topology gives us:

  • Survive single node failure — one mini PC dying doesn’t take down the cluster. etcd quorum (2/3) is maintained, and the API server remains available on the other two servers.
  • Rolling upgrades — k3s servers can be upgraded one at a time without cluster downtime.
  • Worker continuity — all 4 mini PCs remain workers, so sandbox capacity is unaffected by the HA topology.
  • Network partition tolerance — if the network between servers partitions, the minority side loses quorum while the majority keeps serving. All nodes are on the same L2 (10GbE switch), so a partition is unlikely but not impossible.

What it does not give us:

  • Multi-site DR — all hardware is in one physical location. A power outage or switch failure takes everything down.
  • Automatic failover for stateful workloads — running sandboxes on a dying node are lost. OpenSandbox must handle sandbox recreation on a healthy node (this is expected behavior — sandboxes are ephemeral).

What alerting and CI/CD give us:

  • Faster incident response — issues surface in Slack/Discord instead of waiting for someone to check dashboards.
  • Noise reduction — suppressing known cosmetic issues (kata-deploy restarts) means the remaining alerts are actionable.
  • Reproducible builds — the same image from the same commit, every time.
  • Multi-arch by default — ARM support for DGX Spark is built into the pipeline, not a manual step.

Alternatives considered

A. External etcd cluster (separate from k3s)

Running etcd as a standalone cluster on dedicated nodes or containers.

Rejected. More operational complexity (separate etcd backup/restore, TLS cert management, version upgrades). k3s embedded etcd is simpler and sufficient for our scale (7 nodes).

B. k3s with external database (PostgreSQL)

k3s supports using PostgreSQL or MySQL as the datastore instead of etcd.

Rejected. Introduces a database dependency that needs its own HA (PostgreSQL replication). Heavier than embedded etcd for a cluster this small. Would make sense at 50+ nodes, not 7.

C. Single control plane with automated backup/restore

Keep one k3s server, but snapshot etcd regularly and automate restore to a standby node if the primary fails.

Rejected for bare metal. Recovery time is minutes to tens of minutes (detect failure, spin up server, restore snapshot, rejoin workers). With 3 servers, failover is automatic and completes within seconds. The hardware cost is zero — mini-pc-1/2/3 are already in the cluster as workers; running k3s server on them adds negligible overhead.

Acceptable for Azure staging — where we do run single control plane, since staging downtime doesn’t affect users.

D. Keepalived / HAProxy for k3s server VIP

A virtual IP that floats between the 3 k3s servers using keepalived, with HAProxy load balancing API requests.

Not rejected, deferred. This is the right approach for production but adds a setup step. For initial bare metal deployment, DNS round-robin (or a simple /etc/hosts entry on workers pointing to all 3 server IPs) is sufficient. Add keepalived + VIP when the cluster is stable.
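When that step lands, a minimal keepalived instance on each of the 3 servers might look like the following. The interface name, virtual router ID, and VIP are placeholders; priorities are staggered so one node wins the VRRP election and holds the VIP until it fails.

```
vrrp_instance k3s_api {
    state BACKUP            # all nodes start BACKUP; highest priority wins
    interface eth0          # placeholder: the 10GbE interface name
    virtual_router_id 51
    priority 100            # stagger per node, e.g. 100 / 90 / 80
    advert_int 1
    virtual_ipaddress {
        10.0.0.10/24        # placeholder VIP the workers would point at
    }
}
```

Workers would then register with `--server https://10.0.0.10:6443` instead of a DNS round-robin name.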