
ADR-003: Storage and Backups

Status: Proposed
Date: 2026-03-26
Deciders: Alexandre Philippi

The compute platform has several storage needs that are currently unaddressed:

  1. VictoriaMetrics and VictoriaLogs use local-path PVCs — both are single-replica Deployments with 50Gi PVCs backed by k3s’s default local-path provisioner. If the node dies, all metrics and logs are lost. There is no replication and no backup strategy.

  2. CUA sandboxes produce durable artifacts — per ADR-001, every sandbox runs an ffmpeg VNC recorder (screen.mp4) and an eBPF audit agent (structured JSONL events). Tier 2 sandboxes additionally produce mitmproxy HAR files. These artifacts must survive sandbox termination and be queryable for compliance, debugging, and incident response.

  3. Sandbox ephemeral storage is disposable — each sandbox gets 10GB of local disk for its working directory. This data has no value after the sandbox is killed.

  4. Bare metal has no cloud storage — in Phase 2, there is no Azure Blob or managed disk. All storage must run on the NVMe drives in the mini PCs and DGX Sparks.

1. Longhorn for k8s persistent volumes on bare metal


Replace k3s’s default local-path provisioner with Longhorn for stateful workloads (VictoriaMetrics, VictoriaLogs, MinIO). Longhorn provides:

  • Replication — 2 replicas across nodes, survives single node failure
  • Snapshots — scheduled snapshots before backups
  • Backup to S3 — native backup to any S3-compatible target (MinIO or offsite B2)
  • No external dependencies — runs as k8s workloads on the same cluster

On Azure staging (Phase 1), keep local-path provisioner (data loss is acceptable in staging). Longhorn is deployed only on bare metal (Phase 2).
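The "backup to S3" capability above is normally enabled at install time. As a sketch, the Longhorn Helm values would look roughly like this (bucket path, region, and secret name are illustrative assumptions, not settled choices):

```yaml
# Sketch: Longhorn Helm values pointing native volume backups at an
# S3-compatible target (B2 or MinIO). Values are illustrative.
defaultSettings:
  backupTarget: "s3://compute-backups@us-west-002/longhorn"
  backupTargetCredentialSecret: "b2-credentials" # Secret with S3 keys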

StorageClass configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  dataLocality: "best-effort"
reclaimPolicy: Retain
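Stateful workloads then opt in by class name in their claims. For example, a sketch of the 50Gi VictoriaMetrics claim (claim name and namespace are illustrative):

```yaml
# Sketch: a stateful service requests a replicated volume by class name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: victoriametrics-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-replicated
  resources:
    requests:
      storage: 50Gi
```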

2. MinIO for object storage (audit artifacts, screen recordings)


Self-hosted MinIO provides S3-compatible object storage for artifacts that must outlive sandboxes:

| Bucket | Contents | Retention | Expected volume |
| --- | --- | --- | --- |
| screen-recordings | ffmpeg VNC captures (mp4) | 90 days | ~50MB per sandbox session |
| audit-events | eBPF structured JSONL per sandbox | 1 year | ~5MB per sandbox session |
| mitm-captures | mitmproxy HAR files (Tier 2 only) | 90 days | ~20MB per sandbox session |

MinIO runs as a 4-node distributed deployment across sandbox nodes, using Longhorn-backed PVCs. This gives erasure coding (data survives loss of 1 node out of 4) on top of Longhorn replication.
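A minimal sketch of that deployment, assuming a headless `minio` Service in the `storage` namespace (names, image tag, and sizes are illustrative):

```yaml
# Sketch: 4-replica distributed MinIO on Longhorn-backed PVCs.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: storage
spec:
  serviceName: minio # headless Service providing stable peer DNS names
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio
          # The {0...3} expansion enumerates the 4 peers; MinIO applies
          # erasure coding across them automatically.
          args:
            - server
            - http://minio-{0...3}.minio.storage.svc.cluster.local/data
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn-replicated
        resources:
          requests:
            storage: 100Gi
```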

Artifact upload flow:

Sandbox terminates
→ OpenSandbox lifecycle hook copies artifacts out of Kata VM
→ Uploaded to MinIO via S3 API (mc cp or SDK)
→ Indexed by sandbox ID, timestamp, tenant
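The upload step could look roughly like the following Job (ADR-001 owns the actual lifecycle hook; the image, env vars, paths, and Secret name here are illustrative assumptions):

```yaml
# Sketch of a per-sandbox artifact upload step using mc.
apiVersion: batch/v1
kind: Job
metadata:
  name: upload-artifacts
  namespace: storage
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: upload
          image: minio/mc
          envFrom:
            - secretRef:
                name: minio-credentials # access + secret key
          command:
            - sh
            - -c
            - |
              # Keys are prefixed tenant/sandbox-id/timestamp so artifacts
              # stay queryable by sandbox ID later.
              PREFIX="${TENANT}/${SANDBOX_ID}/$(date -u +%Y%m%dT%H%M%SZ)"
              mc alias set minio http://minio.storage.svc:9000 \
                "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
              mc cp /artifacts/screen.mp4 \
                "minio/screen-recordings/${PREFIX}/screen.mp4"
              mc cp /artifacts/audit.jsonl \
                "minio/audit-events/${PREFIX}/audit.jsonl"
```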

MinIO is deployed in the storage namespace with an internal ClusterIP service. The API layer (Caddy) can optionally expose presigned URLs for artifact download.

3. restic for offsite backups to Backblaze B2

restic backs up critical data to Backblaze B2 (or any S3 target) on a schedule:

| What | Frequency | Target | Retention |
| --- | --- | --- | --- |
| VictoriaMetrics snapshots | Daily | B2 bucket compute-backups/metrics | 30 days |
| VictoriaLogs data | Daily | B2 bucket compute-backups/logs | 14 days |
| MinIO audit-events bucket | Daily | B2 bucket compute-backups/audit | 1 year |
| MinIO screen-recordings | Weekly | B2 bucket compute-backups/recordings | 90 days |
| k3s etcd snapshots | Daily | B2 bucket compute-backups/etcd | 30 days |

VictoriaMetrics supports native snapshots (/snapshot/create API) — restic backs up the snapshot directory, not the live data path. VictoriaLogs supports the same mechanism.

restic runs as a CronJob in the backup namespace. Backup credentials (B2 app key) are stored as a Kubernetes Secret.
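A sketch of the metrics CronJob, combining the snapshot call with the restic run (service names and the claim name are illustrative; `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, and the B2 keys are assumed to live in the `b2-credentials` Secret, and in practice the job must be able to reach the snapshot directory, e.g. by co-scheduling with VictoriaMetrics):

```yaml
# Sketch: daily VictoriaMetrics snapshot + restic backup to B2.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-metrics
  namespace: backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: restic
              image: restic/restic
              envFrom:
                - secretRef:
                    name: b2-credentials
              volumeMounts:
                - name: vmdata
                  mountPath: /storage
                  readOnly: true
              command:
                - sh
                - -c
                - |
                  # 1. Ask VictoriaMetrics for a consistent snapshot.
                  wget -qO- http://victoriametrics.monitoring.svc:8428/snapshot/create
                  # 2. Back up the snapshot directory, not the live data path.
                  restic backup /storage/snapshots
                  # 3. Enforce the 30-day retention from the table above.
                  restic forget --keep-daily 30 --prune
          volumes:
            - name: vmdata
              persistentVolumeClaim:
                claimName: victoriametrics-data
```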

4. Keep local-path for sandbox ephemeral storage


Sandbox pods continue using local-path provisioner for their 10GB working directories. This data is disposable — no replication, no backup. Longhorn overhead is unnecessary for ephemeral data.

RuntimeClass + StorageClass pairing:

# Sandbox pods (ephemeral, disposable)
storageClassName: local-path # default k3s provisioner

# Stateful services (VictoriaMetrics, VictoriaLogs, MinIO)
storageClassName: longhorn-replicated
Storage summary:

| Category | Technology | Retention | Replication | Backup | Size estimate |
| --- | --- | --- | --- | --- | --- |
| Metrics | VictoriaMetrics on Longhorn PVC | 30 days | 2 replicas | Daily to B2 | ~50GB |
| Logs | VictoriaLogs on Longhorn PVC | 14 days | 2 replicas | Daily to B2 | ~50GB |
| Audit artifacts | MinIO (eBPF JSONL, HAR) | 1 year | Erasure coded | Daily to B2 | ~500GB/year at scale |
| Screen recordings | MinIO (mp4) | 90 days | Erasure coded | Weekly to B2 | ~5TB/year at scale |
| Sandbox working dirs | local-path | Session only | None | None | 10GB per sandbox |
| k3s etcd | etcd snapshots | N/A | Built-in | Daily to B2 | <1GB |

Hardware:

  • 4x mini PCs: 1TB NVMe each (assumed Minisforum MS-01 config)
  • 2x DGX Spark GB10: 512GB-1TB NVMe each

Allocation per mini PC (1TB):

  • 100GB — OS + k3s + container images
  • 200GB — Longhorn replicated storage pool
  • 600GB — local-path (sandbox ephemeral, ~60 concurrent sandboxes at 10GB each)
  • 100GB — reserved

RAID: Not used. Longhorn provides replication at the application level across nodes. Single-disk NVMe per node is acceptable because Longhorn replicas on other nodes survive a drive failure. RAID would reduce usable capacity without adding meaningful redundancy beyond what Longhorn already provides.

Total cluster capacity:

  • Longhorn pool: ~1.2TB raw across 6 nodes, ~600GB usable with 2x replication
  • MinIO: runs on Longhorn, so shares the replicated pool. ~400GB usable for object storage.
  • Ephemeral: ~3.6TB across 6 nodes for sandbox working dirs
Positive:

  • No data loss on single node failure — Longhorn replication covers VictoriaMetrics, VictoriaLogs, and MinIO
  • Offsite backups — restic to B2 protects against cluster-wide failure (fire, power, theft)
  • S3-compatible API — MinIO lets any tool (ffmpeg upload scripts, audit dashboards, CLI) use standard S3 SDKs
  • Audit durability — eBPF events and screen recordings survive the sandbox lifecycle, queryable by sandbox ID
  • No cloud dependency — the entire stack runs on-prem; B2 is the only external service (and is replaceable)

Negative:

  • Longhorn resource overhead — runs a storage controller, replica manager, and an engine per volume; adds ~500MB RAM and some CPU per node. Acceptable for 6 nodes, but noticeable.
  • MinIO operational complexity — distributed mode requires 4 nodes minimum and needs monitoring for disk health and erasure-coding status
  • B2 costs — restic backups are incremental (deduplicated), but large screen-recording backups could cost $5-15/mo in B2 storage plus egress
  • Capacity ceiling — 600GB of usable Longhorn storage is tight if screen recordings are high-volume; may need additional NVMe drives or external storage in Phase 3

A. OpenEBS (cStor or Mayastor)

OpenEBS (cStor or Mayastor) provides similar replication. Rejected because Longhorn has better k3s integration (lightweight, Rancher-maintained, simpler install). OpenEBS Mayastor requires huge pages and dedicated NVMe devices, which conflicts with sandbox ephemeral usage on the same drives.

B. Ceph (via Rook)

Ceph (via Rook) provides both block storage (replacing Longhorn) and object storage (replacing MinIO) in one system. Rejected because Ceph is operationally heavy for a 6-node cluster — minimum 3 MON + 3 OSD daemons, significant RAM overhead (~4GB per OSD), complex failure recovery. MinIO + Longhorn is simpler to operate at this scale.

C. Cloud blob storage (Azure Blob / S3) for staging


Considered using Azure Blob Storage for audit artifacts during Phase 1 staging. Rejected as the default because it introduces a cloud dependency that doesn’t exist on bare metal. However, this remains an option if staging needs durable artifact storage before bare metal is ready — MinIO can be swapped for Azure Blob via the S3-compatible gateway or by changing the upload endpoint.

D. local-path with manual rsync instead of Longhorn


Run everything on local-path and use rsync cron jobs to copy data between nodes. Rejected because it provides no automatic failover (manual intervention required if a node dies), no snapshot support, and rsync of live VictoriaMetrics data risks corruption. Longhorn handles all of this natively.

E. NFS server on a single node

Run an NFS server on one node and mount it cluster-wide. Rejected because it creates a single point of failure (the NFS node), and NFS performance over 10GbE is worse than local NVMe for VictoriaMetrics write patterns. Longhorn's distributed approach is more resilient.