
ADR-005: High Availability and Operations

Status: Accepted
Date: 2026-03-26
Deciders: Alexandre Philippi

The current platform has several operational gaps that need addressing before bare metal (Phase 2):

  1. Single point of failure — one k3s server node (vm-sandbox-1). If it goes down, the entire cluster is unavailable. Acceptable for Azure staging, not for bare metal production.

  2. No alerting — VictoriaMetrics and VictoriaLogs collect data, but nobody gets notified when things break. We only discover issues when someone checks manually.

  3. No CI/CD for sandbox images — images are built and pushed manually. No automated pipeline from code push to deployed image.

  4. kata-deploy CrashLooping — kata-deploy pods restart repeatedly after completing their work (cosmetic — the DaemonSet re-runs on nodes that already have Kata installed). Not a real failure, but pollutes logs and triggers false alarms.

  5. Manual VM start every morning — Azure VMs auto-shutdown at 20:00 BRT to save costs, but there’s no auto-start. Someone has to manually run az vm start or click the portal each morning.

1. k3s HA with embedded etcd (bare metal only)

For bare metal (Phase 2), run 3 k3s server nodes with embedded etcd for control plane HA. All 4 mini PCs participate as workers; 3 of them also run the k3s server.

```
mini-pc-1:   k3s server (etcd) + worker
mini-pc-2:   k3s server (etcd) + worker
mini-pc-3:   k3s server (etcd) + worker
mini-pc-4:   k3s worker only
dgx-spark-1: k3s worker (sandbox + GPU)
dgx-spark-2: k3s worker (sandbox + GPU)
rtx-host:    k3s worker (GPU only, tainted)
```

The 3-server topology gives etcd quorum (tolerates 1 server failure). Workers point to a virtual IP or DNS round-robin across the 3 servers. If one server dies, the remaining 2 maintain quorum and the cluster continues operating.
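The bootstrap sequence can be sketched as follows. The LAN addresses (10.0.0.11–13 for mini-pc-1/2/3) are placeholders, and the commands are printed rather than executed so they can be reviewed per node; the `--cluster-init`, `--server`, and `K3S_TOKEN` mechanics are standard k3s.

```shell
# Hypothetical server addresses for mini-pc-1/2/3; adjust for the real LAN.
SERVERS=(10.0.0.11 10.0.0.12 10.0.0.13)

# mini-pc-1 initializes the embedded etcd cluster:
BOOTSTRAP="curl -sfL https://get.k3s.io | sh -s - server --cluster-init"

# mini-pc-2/3 join as additional servers; workers run 'agent' against the same URL.
JOIN=()
for ip in "${SERVERS[@]:1}"; do
  JOIN+=("curl -sfL https://get.k3s.io | K3S_TOKEN=\$K3S_TOKEN sh -s - server --server https://${SERVERS[0]}:6443")
done

# Print the per-node commands for review.
printf '%s\n' "$BOOTSTRAP" "${JOIN[@]}"
```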

Azure staging stays single control plane — it’s a staging environment and the added complexity isn’t justified.

2. Alerting with vmalert

Deploy vmalert alongside VictoriaMetrics. vmalert evaluates recording and alerting rules against VictoriaMetrics data, then fires notifications via webhook to Slack or Discord.

Key alerting rules:

| Alert | Condition | Severity |
| --- | --- | --- |
| NodeNotReady | `kube_node_status_condition{condition="Ready",status="true"} == 0` for 2m | critical |
| PodCrashLooping | `increase(kube_pod_container_status_restarts_total[15m]) > 5` | warning |
| PVCAlmostFull | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` | warning |
| SandboxCreationFailures | `rate(opensandbox_sandbox_create_errors_total[5m]) > 0` | critical |
| VLLMLatencyHigh | `histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 2` | warning |
| NodeDiskPressure | `kube_node_status_condition{condition="DiskPressure",status="true"} == 1` | critical |
| KataDeployRestarting | `kube_pod_container_status_restarts_total{pod=~"kata-deploy.*"} > 10` on nodes that already have Kata installed | info (suppressed) |

The KataDeployRestarting alert is explicitly suppressed or set to info — it’s a known cosmetic issue where kata-deploy pods restart after completing installation. No action needed.
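vmalert consumes Prometheus-style rules files, so the table above translates directly. A fragment covering the first two rules (the group name and `severity` label values are our convention, not fixed by vmalert):

```yaml
groups:
  - name: cluster-health
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 2m
        labels:
          severity: critical
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
        labels:
          severity: warning
```

The notification side (Slack/Discord webhook) is configured in the Alertmanager-compatible receiver, not in the rules file.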

3. CI/CD for sandbox images

GitHub Actions pipeline:

push to sandbox/images/** → GitHub Actions → build multi-arch image → push to GHCR → Ansible deploys
  • Trigger: push to sandbox/images/ directory
  • Build: docker buildx for amd64 + arm64 (required for DGX Spark GB10 ARM nodes)
  • Registry: GitHub Container Registry (GHCR) — free for public repos, no self-hosted registry to maintain
  • Deploy: Ansible playbook pulls new images and restarts affected pods

This solves the multi-arch requirement from CLAUDE.md (“Sandbox images must be multi-arch”) as part of the standard build pipeline.
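A minimal workflow sketch for the pipeline above. The image directory (`sandbox/images/python`), image name, and tag scheme are assumptions for illustration; the actions and their inputs are the standard Docker buildx actions.

```yaml
name: sandbox-images
on:
  push:
    paths:
      - "sandbox/images/**"

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      # QEMU + buildx enable the arm64 cross-build for the DGX Spark nodes.
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: sandbox/images/python   # hypothetical image directory
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/${{ github.repository }}/sandbox-python:${{ github.sha }}
```

The Ansible deploy step would run as a follow-up job or be triggered out of band; it is not shown here.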

4. Azure VM auto-start

Two options, in order of long-term preference:

Option A: Azure Automation Account with managed identity

  • Automation runbook triggers at 08:00 BRT on weekdays
  • Managed identity has Virtual Machine Contributor on the resource group
  • Runs Start-AzVM for both VMs

Option B: Cron on a local machine

  • crontab entry on a developer machine: 0 8 * * 1-5 az vm start -g sharpi-compute-staging --name vm-sandbox-1 && az vm start -g sharpi-compute-staging --name vm-sandbox-2
  • Simpler but depends on the local machine being on and connected

Start with Option B (zero setup cost), move to Option A if reliability matters.
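Option B can be a small script rather than a one-line crontab entry, so a failure on one VM is reported instead of silently short-circuiting the other. A sketch, using the resource group and VM names from this ADR; the `command -v az` guard makes it a no-op on machines without the Azure CLI:

```shell
#!/usr/bin/env bash
# Weekday auto-start for the staging VMs (Option B).
start_staging_vms() {
  local rg="sharpi-compute-staging" vm
  for vm in vm-sandbox-1 vm-sandbox-2; do
    # --no-wait: don't serialize on each VM's boot time.
    az vm start -g "$rg" -n "$vm" --no-wait || echo "failed to start $vm" >&2
  done
}

# Only attempt the start when the Azure CLI is available.
if command -v az >/dev/null 2>&1; then
  start_staging_vms
fi
```

The crontab entry then becomes `0 8 * * 1-5 /path/to/start-staging-vms.sh`.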

What the HA topology gives us:

  • Survive single node failure — one mini PC dying doesn’t take down the cluster. etcd quorum (2/3) is maintained, and the API server remains available on the other two servers.
  • Rolling upgrades — k3s servers can be upgraded one at a time without cluster downtime.
  • Worker continuity — all 4 mini PCs remain workers, so sandbox capacity is unaffected by the HA topology.
  • Network partition tolerance — if the network between servers partitions, the minority side loses quorum while the majority keeps serving. All nodes are on the same L2 (10GbE switch), so a partition is unlikely but not impossible.

What it does not give us:

  • Multi-site DR — all hardware is in one physical location. A power outage or switch failure takes everything down.
  • Automatic failover for stateful workloads — running sandboxes on a dying node are lost. OpenSandbox must handle sandbox recreation on a healthy node (this is expected behavior — sandboxes are ephemeral).

What alerting and CI/CD give us:

  • Faster incident response — issues surface in Slack/Discord instead of waiting for someone to check dashboards.
  • Noise reduction — suppressing known cosmetic issues (kata-deploy restarts) means the remaining alerts are actionable.
  • Reproducible builds — the same image from the same commit, every time.
  • Multi-arch by default — ARM support for DGX Spark is built into the pipeline, not a manual step.

Alternatives considered

A. External etcd cluster (separate from k3s)

Running etcd as a standalone cluster on dedicated nodes or containers.

Rejected. More operational complexity (separate etcd backup/restore, TLS cert management, version upgrades). k3s embedded etcd is simpler and sufficient for our scale (7 nodes).

B. k3s with external database (PostgreSQL)

k3s supports using PostgreSQL or MySQL as the datastore instead of etcd.

Rejected. Introduces a database dependency that needs its own HA (PostgreSQL replication). Heavier than embedded etcd for a cluster this small. Would make sense at 50+ nodes, not 7.

C. Single control plane with automated backup/restore

Keep one k3s server, but snapshot etcd regularly and automate restore to a standby node if the primary fails.

Rejected for bare metal. Recovery time is minutes to tens of minutes (detect failure, spin up server, restore snapshot, rejoin workers). With 3 servers, failover is automatic and completes within seconds. The hardware cost is zero — mini-pc-1/2/3 are already in the cluster as workers; running k3s server on them adds negligible overhead.

Acceptable for Azure staging — where we do run single control plane, since staging downtime doesn’t affect users.

D. Keepalived / HAProxy for k3s server VIP

A virtual IP that floats between the 3 k3s servers using keepalived, with HAProxy load balancing API requests.

Not rejected, deferred. This is the right approach for production but adds a setup step. For initial bare metal deployment, DNS round-robin (or a simple /etc/hosts entry on workers pointing to all 3 server IPs) is sufficient. Add keepalived + VIP when the cluster is stable.
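When that step lands, a minimal keepalived instance on each of the 3 servers might look like the following. The interface name, virtual router ID, and VIP are placeholders; priorities are staggered so one node wins the VRRP election and holds the VIP until it fails.

```
vrrp_instance k3s_api {
    state BACKUP            # all nodes start BACKUP; highest priority wins
    interface eth0          # placeholder: the 10GbE interface name
    virtual_router_id 51
    priority 100            # stagger per node, e.g. 100 / 90 / 80
    advert_int 1
    virtual_ipaddress {
        10.0.0.10/24        # placeholder VIP the workers would point at
    }
}
```

Workers would then register with `--server https://10.0.0.10:6443` instead of a DNS round-robin name.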