# ADR-002: TLS and Auth — Caddy + API Keys + NSG Lockdown

Status: Accepted
Date: 2026-03-28
Deciders: Alexandre Philippi
## Context

The compute platform currently has no TLS or authentication:
- All services (OpenSandbox on :8080, vLLM on :8000, VictoriaMetrics on :8428) are plain HTTP
- No authentication — anyone who can reach the IPs can create sandboxes, run inference, or read metrics
- Azure NSG allows SSH (port 22) from 0.0.0.0/0
- k3s API (port 6443) is accessible from any IP in the VNET
This is acceptable for initial bootstrapping but must be locked down before any external consumer (CLI, agent orchestrator, partner integration) connects to the platform.
## Decision

### 1. Caddy as edge reverse proxy with automatic TLS
Caddy sits at the edge of the cluster, terminates TLS via Let’s Encrypt (ACME), and proxies to internal services over plain HTTP.

```
Internet → Caddy (:443, TLS) ├→ OpenSandbox (:8080, plain HTTP)
                             ├→ vLLM (:8000, plain HTTP)
                             └→ VictoriaMetrics (:8428, plain HTTP)
```

Caddy runs as a k8s Deployment in the `ingress` namespace with a `hostNetwork: true` pod (or NodePort 443) on a designated ingress node. A DNS A record points to the node’s public IP.
### 2. API key authentication at the Caddy layer

External consumers authenticate via an `Authorization: Bearer <api-key>` header. Caddy validates the key before proxying the request.
Two implementation options (final choice to be made during implementation):
- Option A — Caddy `forward_auth`: Caddy sends each request to a lightweight auth sidecar that checks the key against a store. More flexible (can add rate limiting, per-route key scoping).
- Option B — Caddy `basicauth` with key-as-password: Simpler, no sidecar. The API key is the password, the username is the key ID. Limited — no per-route scoping, no metadata on keys.
Recommendation: Start with Option A. The auth sidecar is ~100 lines (validate key against a k8s Secret or SQLite, return 200/401). This gives us key metadata (created_at, scopes, last_used) from day one without refactoring later.
### 3. API key storage

Keys are stored in a Kubernetes Secret (`api-keys` in the `ingress` namespace) as a JSON map:
```json
{
  "sk_live_abc123": {"name": "cli-prod", "scopes": ["sandboxes", "inference"], "created": "2026-03-28"},
  "sk_live_def456": {"name": "partner-acme", "scopes": ["inference"], "created": "2026-03-28"}
}
```

The auth sidecar watches this Secret and reloads on change. Key management is via `kubectl edit secret` or a simple CLI wrapper — no database, no external dependency.
If the number of keys or complexity grows (50+ keys, per-request rate limits, usage tracking), migrate to SQLite mounted as a PVC. The forward_auth interface stays the same.
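New keys only need a CSPRNG; a sketch of a minting helper for the CLI wrapper, where the `sk_live_` prefix and token length are illustrative choices matching the examples above:

```python
import secrets
from datetime import date

def mint_key(name: str, scopes: list[str]) -> tuple[str, dict]:
    """Generate an API key plus its metadata entry for the api-keys Secret."""
    key = "sk_live_" + secrets.token_urlsafe(24)  # 24 random bytes, URL-safe base64
    meta = {"name": name, "scopes": scopes, "created": date.today().isoformat()}
    return key, meta
```

The returned pair can be merged into the Secret's JSON map and applied with `kubectl`.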
### 4. NSG lockdown

| Rule | Port | Source | Justification |
|---|---|---|---|
| SSH | 22 | Operator IP(s) only | No reason for SSH to be open to the world |
| k3s API | 6443 | Operator IP(s) only | Only operators need kubectl access |
| HTTPS | 443 | 0.0.0.0/0 | Public API surface — Caddy terminates TLS, auth validates keys |
| HTTP | 80 | 0.0.0.0/0 | ACME HTTP-01 challenge (Caddy redirects to 443) |
All other inbound traffic is denied by default (Azure NSG default deny).
Operator IPs are maintained as a variable in the Ansible inventory. When an operator’s IP changes, update inventory and re-run the NSG playbook.
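As an illustration, the NSG task in that playbook might look like the following, assuming the `azure.azcollection` collection; the rule names, resource names, and `operator_ips` variable are hypothetical:

```yaml
- name: Lock down inbound NSG rules
  azure.azcollection.azure_rm_securitygroup:
    resource_group: "{{ resource_group }}"
    name: compute-nsg
    purge_rules: true          # drop any rule not listed here (default deny remains)
    rules:
      - name: allow-ssh-operators
        priority: 100
        direction: Inbound
        access: Allow
        protocol: Tcp
        destination_port_range: 22
        source_address_prefix: "{{ operator_ips }}"
      - name: allow-k3s-api-operators
        priority: 110
        direction: Inbound
        access: Allow
        protocol: Tcp
        destination_port_range: 6443
        source_address_prefix: "{{ operator_ips }}"
      - name: allow-https
        priority: 200
        direction: Inbound
        access: Allow
        protocol: Tcp
        destination_port_range: 443
        source_address_prefix: "*"
      - name: allow-http-acme
        priority: 210
        direction: Inbound
        access: Allow
        protocol: Tcp
        destination_port_range: 80
        source_address_prefix: "*"
```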
### 5. Internal cluster traffic stays plain HTTP

Services within the k3s cluster communicate over plain HTTP on the pod network:
- OpenSandbox → Kata pods (cluster-internal)
- Sandbox agents → vLLM (cluster-internal, HTTP)
- OTel Collector → VictoriaMetrics/VictoriaLogs (cluster-internal)
The cluster network is a single-tenant, private network. Encrypting intra-cluster traffic adds latency and operational complexity with minimal security benefit at this scale.
## Architecture

```
┌─ External ───────────────────────────────────────────────┐
│                                                          │
│  Consumer (CLI, agent, partner)                          │
│    Authorization: Bearer sk_live_abc123                  │
│                                                          │
└──────────────────┬───────────────────────────────────────┘
                   │ HTTPS (443)
                   ▼
┌─ Edge (Caddy) ───────────────────────────────────────────┐
│                                                          │
│  TLS termination (Let's Encrypt, auto-renew)             │
│   ├── forward_auth → auth sidecar (validate API key)     │
│   ├── /sandboxes/* → OpenSandbox (:8080)                 │
│   ├── /v1/* → vLLM (:8000)                               │
│   └── /metrics → VictoriaMetrics (:8428)                 │
│                                                          │
│  WebSocket: /sandboxes/{id}/exec proxied as-is           │
│  (Caddy natively handles WS upgrade after auth)          │
│                                                          │
└──────────────────┬───────────────────────────────────────┘
                   │ plain HTTP (cluster network)
                   ▼
┌─ k3s cluster (internal) ─────────────────────────────────┐
│                                                          │
│  OpenSandbox (:8080)     — sandbox lifecycle             │
│  vLLM (:8000)            — LLM inference                 │
│  VictoriaMetrics (:8428) — metrics                       │
│  VictoriaLogs (:9428)    — logs (not exposed externally) │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## Caddyfile
```
{
    email ops@sharpi.dev
}

compute.sharpi.dev {
    # Auth: validate API key on every request
    forward_auth localhost:9090 {
        uri /auth
        copy_headers X-Key-Name X-Key-Scopes
    }

    # OpenSandbox API
    handle /sandboxes/* {
        reverse_proxy opensandbox.sandbox.svc.cluster.local:8080
    }

    # vLLM (OpenAI-compatible)
    handle /v1/* {
        reverse_proxy vllm.inference.svc.cluster.local:8000
    }

    # Metrics (read-only)
    handle /metrics/* {
        reverse_proxy victoria-metrics.monitoring.svc.cluster.local:8428
    }

    # Default — 404
    respond "Not Found" 404
}
```

## Consequences
### Positive

- Automatic TLS — Caddy handles Let’s Encrypt cert issuance and renewal with zero manual intervention
- Single auth layer — all external traffic goes through one validation point, no per-service auth logic
- WebSocket support — Caddy natively proxies WebSocket connections (sandbox exec) after auth
- NSG lockdown — SSH and k3s API no longer open to the internet, reduces attack surface significantly
- Simple key management — k8s Secret is sufficient for 10-50 keys, no external dependencies
- Audit-friendly — Caddy access logs include key identity (`X-Key-Name`), correlating API calls with consumers
### Negative / Not Covered

- No mTLS between services — internal cluster traffic is plain HTTP. Acceptable for a single-tenant, private network. If multi-tenant or cross-network traffic is added, revisit.
- No zero-trust networking — we trust the cluster network. A compromised pod could sniff internal traffic. Mitigation: network policies restrict pod-to-pod communication (OpenSandbox can reach Kata pods, sandboxes can reach vLLM, nothing else).
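The network-policy mitigation can be sketched as a standard Kubernetes NetworkPolicy; the namespace and label names here are hypothetical, and this shows only the sandboxes-to-vLLM rule:

```yaml
# Allow sandbox pods egress to vLLM only; all other egress is denied
# once the pods are selected by an Egress-type policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-egress
  namespace: sandbox
spec:
  podSelector:
    matchLabels:
      app: sandbox
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: inference
      ports:
        - protocol: TCP
          port: 8000
```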
- No OAuth / JWT for end users — API keys are for service-to-service auth (CLI, orchestrators). If human users need interactive login, add an OAuth flow in front of Caddy later.
- No key rotation automation — keys are rotated manually via `kubectl`. Acceptable at current scale; add automation if key count exceeds 50.
- Single point of ingress — Caddy is a single pod. If it goes down, all external access is lost. For Phase 1 (staging), this is acceptable. For production, run 2 replicas behind a floating IP or load balancer.
## Alternatives Considered

### A. Traefik (k3s default ingress)
k3s ships with Traefik as the default ingress controller. We disable it (`--disable=traefik` in k3s server args).
Rejected because:
- Traefik’s ACME integration requires more configuration than Caddy’s (storage backends, resolver config)
- `forward_auth` in Traefik requires middleware chain configuration — more YAML, same result
- Caddy’s Caddyfile is dramatically simpler than Traefik’s dynamic config or IngressRoute CRDs
- We don’t need Traefik’s advanced features (canary routing, traffic mirroring)
### B. nginx

Rejected because:
- No built-in ACME — requires certbot sidecar or cert-manager CRD
- Config syntax is verbose for reverse proxy + auth
- WebSocket proxying requires explicit `proxy_set_header Upgrade` configuration
- More moving parts for the same outcome
### C. Kong

Rejected because:
- Heavy — Kong Gateway requires a database (Postgres) or runs in DB-less mode with declarative config
- Designed for API gateway at scale (rate limiting, plugins, developer portal) — overkill for 3 upstream services
- Operational complexity far exceeds what we need in Phase 1
### D. Custom auth service (standalone)

Rejected as initial approach. A standalone auth service (with its own database, admin API, and key lifecycle) is premature. The `forward_auth` sidecar gives us the same validation with ~100 lines of code and a k8s Secret. If requirements grow (OAuth, RBAC, usage billing), replace the sidecar with a real auth service — the Caddy `forward_auth` interface doesn’t change.
### E. No reverse proxy — TLS at each service

Rejected. Each service (OpenSandbox, vLLM, VictoriaMetrics) would need its own cert management and auth logic. Duplicated effort, inconsistent auth, no single access log.