
ADR-003: Storage and Backups

Status: Proposed
Date: 2026-03-26
Deciders: Alexandre Philippi

The compute platform has several storage needs that are currently unaddressed:

  1. VictoriaMetrics and VictoriaLogs use local-path PVCs — both are single-replica Deployments with 50Gi PVCs backed by k3s’s default local-path provisioner. If the node dies, all metrics and logs are lost. There is no replication and no backup strategy.

  2. CUA sandboxes produce durable artifacts — per ADR-001, every sandbox runs an ffmpeg VNC recorder (screen.mp4) and an eBPF audit agent (structured JSONL events). Tier 2 sandboxes additionally produce mitmproxy HAR files. These artifacts must survive sandbox termination and be queryable for compliance, debugging, and incident response.

  3. Sandbox ephemeral storage is disposable — each sandbox gets 10GB of local disk for its working directory. This data has no value after the sandbox is killed.

  4. Bare metal has no cloud storage — in Phase 2, there is no Azure Blob or managed disk. All storage must run on the NVMe drives in the mini PCs and DGX Sparks.

1. Longhorn for k8s persistent volumes on bare metal


Replace k3s’s default local-path provisioner with Longhorn for stateful workloads (VictoriaMetrics, VictoriaLogs, MinIO). Longhorn provides:

  • Replication — 2 replicas across nodes, survives single node failure
  • Snapshots — scheduled snapshots before backups
  • Backup to S3 — native backup to any S3-compatible target (MinIO or offsite B2)
  • No external dependencies — runs as k8s workloads on the same cluster

On Azure staging (Phase 1), keep local-path provisioner (data loss is acceptable in staging). Longhorn is deployed only on bare metal (Phase 2).
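The "backup to S3" capability above is normally enabled at install time. As a sketch, the Longhorn Helm values would look roughly like this (bucket path, region, and secret name are illustrative assumptions, not settled choices):

```yaml
# Sketch: Longhorn Helm values pointing native volume backups at an
# S3-compatible target (B2 or MinIO). Values are illustrative.
defaultSettings:
  backupTarget: "s3://compute-backups@us-west-002/longhorn"
  backupTargetCredentialSecret: "b2-credentials" # Secret with S3 keys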

StorageClass configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  dataLocality: "best-effort"
reclaimPolicy: Retain
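Stateful workloads then opt in by class name in their claims. For example, a sketch of the 50Gi VictoriaMetrics claim (claim name and namespace are illustrative):

```yaml
# Sketch: a stateful service requests a replicated volume by class name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: victoriametrics-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-replicated
  resources:
    requests:
      storage: 50Gi
```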

2. MinIO for object storage (audit artifacts, screen recordings)


Self-hosted MinIO provides S3-compatible object storage for artifacts that must outlive sandboxes:

| Bucket | Contents | Retention | Expected volume |
| --- | --- | --- | --- |
| screen-recordings | ffmpeg VNC captures (mp4) | 90 days | ~50MB per sandbox session |
| audit-events | eBPF structured JSONL per sandbox | 1 year | ~5MB per sandbox session |
| mitm-captures | mitmproxy HAR files (Tier 2 only) | 90 days | ~20MB per sandbox session |

MinIO runs as a 4-node distributed deployment across sandbox nodes, using Longhorn-backed PVCs. This gives erasure coding (data survives loss of 1 node out of 4) on top of Longhorn replication.
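A minimal sketch of that deployment, assuming a headless `minio` Service in the `storage` namespace (names, image tag, and sizes are illustrative):

```yaml
# Sketch: 4-replica distributed MinIO on Longhorn-backed PVCs.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: storage
spec:
  serviceName: minio # headless Service providing stable peer DNS names
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio
          # The {0...3} expansion enumerates the 4 peers; MinIO applies
          # erasure coding across them automatically.
          args:
            - server
            - http://minio-{0...3}.minio.storage.svc.cluster.local/data
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn-replicated
        resources:
          requests:
            storage: 100Gi
```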

Artifact upload flow:

Sandbox terminates
→ OpenSandbox lifecycle hook copies artifacts out of Kata VM
→ Uploaded to MinIO via S3 API (mc cp or SDK)
→ Indexed by sandbox ID, timestamp, tenant
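The upload step could look roughly like the following Job (ADR-001 owns the actual lifecycle hook; the image, env vars, paths, and Secret name here are illustrative assumptions):

```yaml
# Sketch of a per-sandbox artifact upload step using mc.
apiVersion: batch/v1
kind: Job
metadata:
  name: upload-artifacts
  namespace: storage
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: upload
          image: minio/mc
          envFrom:
            - secretRef:
                name: minio-credentials # access + secret key
          command:
            - sh
            - -c
            - |
              # Keys are prefixed tenant/sandbox-id/timestamp so artifacts
              # stay queryable by sandbox ID later.
              PREFIX="${TENANT}/${SANDBOX_ID}/$(date -u +%Y%m%dT%H%M%SZ)"
              mc alias set minio http://minio.storage.svc:9000 \
                "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
              mc cp /artifacts/screen.mp4 \
                "minio/screen-recordings/${PREFIX}/screen.mp4"
              mc cp /artifacts/audit.jsonl \
                "minio/audit-events/${PREFIX}/audit.jsonl"
```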

MinIO is deployed in the storage namespace with an internal ClusterIP service. The API layer (Caddy) can optionally expose presigned URLs for artifact download.

3. restic for offsite backups to Backblaze B2

restic backs up critical data to Backblaze B2 (or any S3 target) on a schedule:

| What | Frequency | Target | Retention |
| --- | --- | --- | --- |
| VictoriaMetrics snapshots | Daily | B2 bucket compute-backups/metrics | 30 days |
| VictoriaLogs data | Daily | B2 bucket compute-backups/logs | 14 days |
| MinIO audit-events bucket | Daily | B2 bucket compute-backups/audit | 1 year |
| MinIO screen-recordings | Weekly | B2 bucket compute-backups/recordings | 90 days |
| k3s etcd snapshots | Daily | B2 bucket compute-backups/etcd | 30 days |

VictoriaMetrics supports native snapshots (/snapshot/create API) — restic backs up the snapshot directory, not the live data path. VictoriaLogs supports the same mechanism.

restic runs as a CronJob in the backup namespace. Backup credentials (B2 app key) are stored as a Kubernetes Secret.
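A sketch of the metrics CronJob, combining the snapshot call with the restic run (service names and the claim name are illustrative; `RESTIC_REPOSITORY`, `RESTIC_PASSWORD`, and the B2 keys are assumed to live in the `b2-credentials` Secret, and in practice the job must be able to reach the snapshot directory, e.g. by co-scheduling with VictoriaMetrics):

```yaml
# Sketch: daily VictoriaMetrics snapshot + restic backup to B2.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-metrics
  namespace: backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: restic
              image: restic/restic
              envFrom:
                - secretRef:
                    name: b2-credentials
              volumeMounts:
                - name: vmdata
                  mountPath: /storage
                  readOnly: true
              command:
                - sh
                - -c
                - |
                  # 1. Ask VictoriaMetrics for a consistent snapshot.
                  wget -qO- http://victoriametrics.monitoring.svc:8428/snapshot/create
                  # 2. Back up the snapshot directory, not the live data path.
                  restic backup /storage/snapshots
                  # 3. Enforce the 30-day retention from the table above.
                  restic forget --keep-daily 30 --prune
          volumes:
            - name: vmdata
              persistentVolumeClaim:
                claimName: victoriametrics-data
```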

4. Keep local-path for sandbox ephemeral storage


Sandbox pods continue using local-path provisioner for their 10GB working directories. This data is disposable — no replication, no backup. Longhorn overhead is unnecessary for ephemeral data.

RuntimeClass + StorageClass pairing:

# Sandbox pods (ephemeral, disposable)
storageClassName: local-path # default k3s provisioner

# Stateful services (VictoriaMetrics, VictoriaLogs, MinIO)
storageClassName: longhorn-replicated
Storage summary:

| Category | Technology | Retention | Replication | Backup | Size estimate |
| --- | --- | --- | --- | --- | --- |
| Metrics | VictoriaMetrics on Longhorn PVC | 30 days | 2 replicas | Daily to B2 | ~50GB |
| Logs | VictoriaLogs on Longhorn PVC | 14 days | 2 replicas | Daily to B2 | ~50GB |
| Audit artifacts | MinIO (eBPF JSONL, HAR) | 1 year | Erasure coded | Daily to B2 | ~500GB/year at scale |
| Screen recordings | MinIO (mp4) | 90 days | Erasure coded | Weekly to B2 | ~5TB/year at scale |
| Sandbox working dirs | local-path | Session only | None | None | 10GB per sandbox |
| k3s etcd | etcd snapshots | N/A | Built-in | Daily to B2 | <1GB |

Hardware:

  • 4x mini PCs: 1TB NVMe each (assumed Minisforum MS-01 config)
  • 2x DGX Spark GB10: 512GB-1TB NVMe each

Allocation per mini PC (1TB):

  • 100GB — OS + k3s + container images
  • 200GB — Longhorn replicated storage pool
  • 600GB — local-path (sandbox ephemeral, ~60 concurrent sandboxes at 10GB each)
  • 100GB — reserved

RAID: Not used. Longhorn provides replication at the application level across nodes. Single-disk NVMe per node is acceptable because Longhorn replicas on other nodes survive a drive failure. RAID would reduce usable capacity without adding meaningful redundancy beyond what Longhorn already provides.

Total cluster capacity:

  • Longhorn pool: ~1.2TB raw across 6 nodes, ~600GB usable with 2x replication
  • MinIO: runs on Longhorn, so shares the replicated pool. ~400GB usable for object storage.
  • Ephemeral: ~3.6TB across 6 nodes for sandbox working dirs
Positive:

  • No data loss on single node failure — Longhorn replication covers VictoriaMetrics, VictoriaLogs, and MinIO
  • Offsite backups — restic to B2 protects against cluster-wide failure (fire, power, theft)
  • S3-compatible API — MinIO lets any tool (ffmpeg upload scripts, audit dashboards, CLI) use standard S3 SDKs
  • Audit durability — eBPF events and screen recordings survive the sandbox lifecycle, queryable by sandbox ID
  • No cloud dependency — the entire stack runs on-prem; B2 is the only external service (and is replaceable)

Negative:

  • Longhorn resource overhead — runs a storage controller, replica manager, and an engine per volume; adds ~500MB RAM and some CPU per node. Acceptable for 6 nodes, but noticeable.
  • MinIO operational complexity — distributed mode requires 4 nodes minimum and needs monitoring for disk health and erasure-coding status
  • B2 costs — restic backups are incremental (deduplicated), but large screen-recording backups could cost $5-15/mo in B2 storage plus egress
  • Capacity ceiling — 600GB of usable Longhorn storage is tight if screen recordings are high-volume; may need additional NVMe drives or external storage in Phase 3

A. OpenEBS (cStor or Mayastor)

OpenEBS (cStor or Mayastor) provides similar replication. Rejected because Longhorn has better k3s integration (lightweight, Rancher-maintained, simpler install). OpenEBS Mayastor requires huge pages and dedicated NVMe devices, which conflicts with sandbox ephemeral usage on the same drives.

B. Ceph (via Rook)

Ceph (via Rook) provides both block storage (replacing Longhorn) and object storage (replacing MinIO) in one system. Rejected because Ceph is operationally heavy for a 6-node cluster — minimum 3 MON + 3 OSD daemons, significant RAM overhead (~4GB per OSD), complex failure recovery. MinIO + Longhorn is simpler to operate at this scale.

C. Cloud blob storage (Azure Blob / S3) for staging


Considered using Azure Blob Storage for audit artifacts during Phase 1 staging. Rejected as the default because it introduces a cloud dependency that doesn’t exist on bare metal. However, this remains an option if staging needs durable artifact storage before bare metal is ready — MinIO can be swapped for Azure Blob via the S3-compatible gateway or by changing the upload endpoint.

D. local-path with manual rsync instead of Longhorn


Run everything on local-path and use rsync cron jobs to copy data between nodes. Rejected because it provides no automatic failover (manual intervention required if a node dies), no snapshot support, and rsync of live VictoriaMetrics data risks corruption. Longhorn handles all of this natively.

E. NFS server on a single node

Run an NFS server on one node and mount it cluster-wide. Rejected because it creates a single point of failure (the NFS node), and NFS performance over 10GbE is worse than local NVMe for VictoriaMetrics write patterns. Longhorn's distributed approach is more resilient.