# ADR-003: Storage and Backups

Status: Proposed
Date: 2026-03-26
Deciders: Alexandre Philippi
## Context

The compute platform has several storage needs that are currently unaddressed:
- VictoriaMetrics and VictoriaLogs use local-path PVCs — both are single-replica Deployments with 50Gi PVCs backed by k3s’s default local-path provisioner. If the node dies, all metrics and logs are lost. There is no replication and no backup strategy.
- CUA sandboxes produce durable artifacts — per ADR-001, every sandbox runs an ffmpeg VNC recorder (screen.mp4) and an eBPF audit agent (structured JSONL events). Tier 2 sandboxes additionally produce mitmproxy HAR files. These artifacts must survive sandbox termination and be queryable for compliance, debugging, and incident response.
- Sandbox ephemeral storage is disposable — each sandbox gets 10GB of local disk for its working directory. This data has no value after the sandbox is killed.
- Bare metal has no cloud storage — in Phase 2, there is no Azure Blob or managed disk. All storage must run on the NVMe drives in the mini PCs and DGX Sparks.
## Decision

### 1. Longhorn for k8s persistent volumes on bare metal

Replace k3s’s default local-path provisioner with Longhorn for stateful workloads (VictoriaMetrics, VictoriaLogs, MinIO). Longhorn provides:
- Replication — 2 replicas across nodes, survives single node failure
- Snapshots — scheduled snapshots before backups
- Backup to S3 — native backup to any S3-compatible target (MinIO or offsite B2)
- No external dependencies — runs as k8s workloads on the same cluster
On Azure staging (Phase 1), keep local-path provisioner (data loss is acceptable in staging). Longhorn is deployed only on bare metal (Phase 2).
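The scheduled snapshot and S3 backup behavior above can be declared with Longhorn's `RecurringJob` CRD. A minimal sketch, assuming the default volume group and an illustrative schedule (the actual cron, retention, and job name are not specified in this ADR):

```yaml
# Illustrative Longhorn RecurringJob: nightly backup of all volumes in
# the "default" group to whatever backup target (e.g. the MinIO or B2
# S3 endpoint) is configured in Longhorn's settings.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"   # 02:00 daily — assumed schedule
  task: backup        # use "snapshot" for local-only snapshots
  groups:
    - default
  retain: 7           # keep the last 7 backups
  concurrency: 2      # back up at most 2 volumes in parallel
```

Note that the S3 backup target itself is configured separately in Longhorn's settings, not per job.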
StorageClass configuration:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  dataLocality: "best-effort"
reclaimPolicy: Retain
```

### 2. MinIO for object storage (audit artifacts, screen recordings)

Self-hosted MinIO provides S3-compatible object storage for artifacts that must outlive sandboxes:
| Bucket | Contents | Retention | Expected volume |
|---|---|---|---|
| screen-recordings | ffmpeg VNC captures (mp4) | 90 days | ~50MB per sandbox session |
| audit-events | eBPF structured JSONL per sandbox | 1 year | ~5MB per sandbox session |
| mitm-captures | mitmproxy HAR files (Tier 2 only) | 90 days | ~20MB per sandbox session |
MinIO runs as a 4-node distributed deployment across sandbox nodes, using Longhorn-backed PVCs. This gives erasure coding (data survives loss of 1 node out of 4) on top of Longhorn replication.
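The 4-node distributed deployment on Longhorn-backed PVCs could be expressed roughly as follows. This is a sketch, not the actual manifest — the namespace matches the ADR, but the image tag, PVC size, and service name are assumptions:

```yaml
# Illustrative StatefulSet excerpt: 4 MinIO replicas, each with its own
# longhorn-replicated PVC. The {0...3} server-pool URL tells MinIO to
# form a distributed (erasure-coded) deployment across all four pods.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: storage
spec:
  serviceName: minio
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio
          args:
            - server
            - http://minio-{0...3}.minio.storage.svc.cluster.local/data
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn-replicated   # from section 1
        resources:
          requests:
            storage: 100Gi                      # assumed per-node size
```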
Artifact upload flow:
Sandbox terminates → OpenSandbox lifecycle hook copies artifacts out of Kata VM → uploaded to MinIO via S3 API (mc cp or SDK) → indexed by sandbox ID, timestamp, tenant.

MinIO is deployed in the storage namespace with an internal ClusterIP service. The API layer (Caddy) can optionally expose presigned URLs for artifact download.
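The upload step of this flow could be sketched as a one-shot Job that the lifecycle hook launches after staging artifacts out of the VM. The bucket names match the table above; the sandbox ID, tenant, Secret name, and staging path are all hypothetical:

```yaml
# Illustrative upload Job: push a terminated sandbox's artifacts to
# MinIO with mc, keyed by tenant/sandbox-id so they are queryable later.
apiVersion: batch/v1
kind: Job
metadata:
  name: upload-artifacts-sbx-1234   # hypothetical sandbox ID
  namespace: storage
spec:
  ttlSecondsAfterFinished: 3600     # garbage-collect the Job after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: upload
          image: minio/mc
          command: ["/bin/sh", "-c"]
          args:
            - |
              mc alias set local http://minio.storage.svc:9000 "$ACCESS_KEY" "$SECRET_KEY"
              mc cp /artifacts/screen.mp4   local/screen-recordings/tenant-a/sbx-1234/
              mc cp /artifacts/events.jsonl local/audit-events/tenant-a/sbx-1234/
          envFrom:
            - secretRef:
                name: minio-credentials    # assumed Secret with ACCESS_KEY/SECRET_KEY
          volumeMounts:
            - name: artifacts
              mountPath: /artifacts
      volumes:
        - name: artifacts
          hostPath:
            path: /var/lib/sandbox/sbx-1234/artifacts   # assumed staging dir
```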
### 3. restic for offsite backups

restic backs up critical data to Backblaze B2 (or any S3 target) on a schedule:
| What | Frequency | Target | Retention |
|---|---|---|---|
| VictoriaMetrics snapshots | Daily | B2 bucket compute-backups/metrics | 30 days |
| VictoriaLogs data | Daily | B2 bucket compute-backups/logs | 14 days |
| MinIO audit-events bucket | Daily | B2 bucket compute-backups/audit | 1 year |
| MinIO screen-recordings | Weekly | B2 bucket compute-backups/recordings | 90 days |
| k3s etcd snapshots | Daily | B2 bucket compute-backups/etcd | 30 days |
VictoriaMetrics supports native snapshots (/snapshot/create API) — restic backs up the snapshot directory, not the live data path. VictoriaLogs supports the same mechanism.
restic runs as a CronJob in the backup namespace. Backup credentials (B2 app key) are stored as a Kubernetes Secret.
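The metrics backup CronJob might look like the following sketch. The `/snapshot/create` endpoint is the documented VictoriaMetrics API; the service hostname, PVC name, mount path, and retention flags are assumptions:

```yaml
# Illustrative CronJob: trigger a consistent VictoriaMetrics snapshot,
# then back up the snapshot directory (not the live data path) to B2.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-metrics
  namespace: backup
spec:
  schedule: "0 3 * * *"            # daily at 03:00 — assumed schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: restic
              image: restic/restic
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Ask VictoriaMetrics to materialize a snapshot
                  wget -qO- http://victoria-metrics.monitoring.svc:8428/snapshot/create
                  # Back up the snapshot directory, then prune per the 30-day policy
                  restic -r b2:compute-backups:metrics backup /storage/snapshots
                  restic -r b2:compute-backups:metrics forget --keep-daily 30 --prune
              envFrom:
                - secretRef:
                    name: b2-credentials   # B2_ACCOUNT_ID, B2_ACCOUNT_KEY, RESTIC_PASSWORD
              volumeMounts:
                - name: vm-data
                  mountPath: /storage
          volumes:
            - name: vm-data
              persistentVolumeClaim:
                claimName: victoria-metrics-data   # assumed PVC name
```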
### 4. Keep local-path for sandbox ephemeral storage

Sandbox pods continue using the local-path provisioner for their 10GB working directories. This data is disposable — no replication, no backup. Longhorn overhead is unnecessary for ephemeral data.
RuntimeClass + StorageClass pairing:
```yaml
# Sandbox pods (ephemeral, disposable)
storageClassName: local-path          # default k3s provisioner

# Stateful services (VictoriaMetrics, VictoriaLogs, MinIO)
storageClassName: longhorn-replicated
```

## Storage Categories

| Category | Technology | Retention | Replication | Backup | Size estimate |
|---|---|---|---|---|---|
| Metrics | VictoriaMetrics on Longhorn PVC | 30 days | 2 replicas | Daily to B2 | ~50GB |
| Logs | VictoriaLogs on Longhorn PVC | 14 days | 2 replicas | Daily to B2 | ~50GB |
| Audit artifacts | MinIO (eBPF JSONL, HAR) | 1 year | Erasure coded | Daily to B2 | ~500GB/year at scale |
| Screen recordings | MinIO (mp4) | 90 days | Erasure coded | Weekly to B2 | ~5TB/year at scale |
| Sandbox working dirs | local-path | Session only | None | None | 10GB per sandbox |
| k3s etcd | etcd snapshots | N/A | Built-in | Daily to B2 | <1GB |
## Bare Metal Storage

Hardware:
- 4x mini PCs: 1TB NVMe each (assumed Minisforum MS-01 config)
- 2x DGX Spark GB10: 512GB-1TB NVMe each
Allocation per mini PC (1TB):
- 100GB — OS + k3s + container images
- 200GB — Longhorn replicated storage pool
- 600GB — local-path (sandbox ephemeral, ~60 concurrent sandboxes at 10GB each)
- 100GB — reserved
RAID: Not used. Longhorn provides replication at the application level across nodes. Single-disk NVMe per node is acceptable because Longhorn replicas on other nodes survive a drive failure. RAID would reduce usable capacity without adding meaningful redundancy beyond what Longhorn already provides.
Total cluster capacity:
- Longhorn pool: ~1.2TB raw across 6 nodes, ~600GB usable with 2x replication
- MinIO: runs on Longhorn, so shares the replicated pool. ~400GB usable for object storage.
- Ephemeral: ~3.6TB across 6 nodes for sandbox working dirs
## Consequences

### Positive

- No data loss on single node failure — Longhorn replication covers VictoriaMetrics, VictoriaLogs, and MinIO
- Offsite backups — restic to B2 protects against cluster-wide failure (fire, power, theft)
- S3-compatible API — MinIO lets any tool (ffmpeg upload scripts, audit dashboards, CLI) use standard S3 SDKs
- Audit durability — eBPF events and screen recordings survive sandbox lifecycle, queryable by sandbox ID
- No cloud dependency — entire stack runs on-prem, B2 is the only external service (and is replaceable)
### Negative

- Longhorn resource overhead — runs a storage controller, replica manager, and engine per volume. Adds ~500MB RAM and some CPU per node. Acceptable for 6 nodes, but noticeable.
- MinIO operational complexity — distributed mode requires 4 nodes minimum, needs monitoring for disk health and erasure coding status
- B2 storage and egress costs — restic backups are incremental (deduplicated), but large screen-recording backups could cost $5-15/mo in B2 storage plus egress
- Capacity ceiling — 600GB usable Longhorn is tight if screen recordings are high-volume. May need to add NVMe drives or external storage in Phase 3.
## Alternatives Considered

### A. OpenEBS instead of Longhorn

OpenEBS (cStor or Mayastor) provides similar replication. Rejected because Longhorn has better k3s integration (lightweight, Rancher-maintained, simpler install). OpenEBS Mayastor requires huge pages and dedicated NVMe devices, which conflicts with sandbox ephemeral usage on the same drives.
### B. Ceph instead of MinIO

Ceph (via Rook) provides both block storage (replacing Longhorn) and object storage (replacing MinIO) in one system. Rejected because Ceph is operationally heavy for a 6-node cluster — minimum 3 MON + 3 OSD daemons, significant RAM overhead (~4GB per OSD), complex failure recovery. MinIO + Longhorn is simpler to operate at this scale.
### C. Cloud blob storage (Azure Blob / S3) for staging

Considered using Azure Blob Storage for audit artifacts during Phase 1 staging. Rejected as the default because it introduces a cloud dependency that doesn’t exist on bare metal. However, this remains an option if staging needs durable artifact storage before bare metal is ready — MinIO can be swapped for Azure Blob via the S3-compatible gateway or by changing the upload endpoint.
### D. local-path with manual rsync instead of Longhorn

Run everything on local-path and use rsync cron jobs to copy data between nodes. Rejected because it provides no automatic failover (manual intervention required if a node dies), no snapshot support, and rsync of live VictoriaMetrics data risks corruption. Longhorn handles all of this natively.
### E. NFS server for shared storage

Run an NFS server on one node and mount it cluster-wide. Rejected because it creates a single point of failure (the NFS node), and NFS performance over 10GbE is worse than local NVMe for VictoriaMetrics write patterns. Longhorn’s distributed approach is more resilient.