# Adopt NATS message bus with A2A protocol for worker-to-worker communication

Status: Accepted

## Context and Problem Statement
arpi spawns workers in sandboxes but has no mechanism for workers to communicate with each other or for humans to message workers asynchronously. The current model is terminal-only, single-worker, and synchronous — you watch one worker's stdout in one terminal tab. This breaks down when:
- Multiple workers work on the same project and need to coordinate (“what’s the API schema?”, “I’m done, you can integrate now”).
- A worker is blocked and needs human input, but the human is away from the terminal.
- Humans want to redirect, monitor, or query workers without being tethered to a terminal session.
The industry has no standard agent message bus as of March 2026. Google’s A2A protocol (v0.2.x) defines agent-to-agent task delegation over HTTP. Anthropic’s MCP defines agent-to-tool integration. Neither defines the messaging infrastructure layer. This is the gap arpi fills.
Sharpi already operates auau — a self-hosted WhatsApp API platform with MCP integration. This provides an existing human-to-agent bridge over WhatsApp, eliminating the need for Slack.
## Decision Drivers

- Go-first stack — the message bus must have a production Go client and ideally be embeddable in the arpi process.
- Two-wall security — messages between workers in sandboxes must route through approved channels; the bus must not create a bypass around Wall 1 (gateway) or Wall 2 (sandbox).
- OpenSandbox as sandbox standard — the arpi-agent sidecar runs alongside execd inside each sandbox; the messaging client lives in the sidecar, not in execd.
- auau already exists — WhatsApp transport is solved. The bus needs a bridge to auau, not a replacement for it.
- Provider-agnostic interfaces — existing pattern from the toolchain ADR. The bus should be behind a `MessageBusProvider` interface.
- Async-first — workers run for hours. Humans are not always at a terminal. The system must handle store-and-forward, not just real-time streaming.
- Control plane owns the bus — the message bus is a platform service (Connectivity domain), not a CLI feature. The CLI is a thin client that connects to it.
## Considered Options

- NATS JetStream + A2A protocol
- Matrix homeserver
- Custom HTTP/WebSocket point-to-point
- SSH + terminal only (status quo)
## Decision Outcome

Chosen option: "NATS JetStream + A2A protocol", because it matches arpi's Go stack (NATS is Go-native and embeddable), provides the right weight class (lighter than Kafka/Matrix, stronger than Redis Streams), and pairs a proven message transport (NATS) with the emerging agent interop standard (A2A) for structured task delegation.
A2A defines the protocol workers speak (Agent Cards, task lifecycle, discovery). NATS provides the transport and persistence (pub/sub, request-reply, JetStream for replay). auau provides the WhatsApp human bridge. MCP remains the tool integration layer — these are complementary, not competing.
A2A maturity note: A2A is v0.2.x as of March 2026. The adapter layer isolates arpi from spec changes — A2A types are defined locally and translated at the boundary. If the spec breaks compatibility at v1.0, only the adapter needs updating. If an official Go SDK ships, it replaces the local types.
## Consequences

- Good, because workers can discover and delegate work to each other via A2A Agent Cards without arpi-specific coupling.
- Good, because NATS JetStream provides message replay — a worker that restarts catches up on missed messages automatically.
- Good, because NATS can be embedded in the control plane process (no separate server for local dev) or run standalone for production.
- Good, because subject-based addressing (`worker.<uid>.inbox`, `project.<id>.feed`) maps naturally to an IRC-like channel UX for humans.
- Good, because the auau bridge means humans can message workers from WhatsApp today — no need to build or adopt Slack.
- Bad, because this adds a new runtime dependency (NATS) to the arpi stack.
- Bad, because A2A is still maturing (v0.2.x) — the spec may change, requiring adapter updates. Mitigated by the adapter isolation pattern.
- Neutral, because cross-organization federation (workers across different companies) is deferred. NATS leaf nodes and A2A’s HTTP transport can support this later.
## Implementation Plan

### Architecture

```text
Human interfaces       Control Plane                    Worker sandboxes
----------------       -------------                    ----------------
WhatsApp (auau) --+                                     +-- Sandbox A
Terminal (CLI)  --+--> Connectivity domain <----------> +-- Sandbox B
Web UI (future) --+    (NATS embedded or standalone)    +-- Sandbox C
                         subjects:                          |
                         worker.<uid>.inbox                 +-- execd (files/cmds)
                         worker.<uid>.outbox                +-- arpi-agent (sidecar)
                         project.<id>.feed                      +-- A2A adapter
                         system.alerts
```

The message bus is a control plane service in the Connectivity domain (per template-schema ontology). The CLI connects to it as a thin client. Workers in sandboxes connect via gateway-approved egress.
### Components

#### 1. `MessageBusProvider` interface (Connectivity domain: `api/connectivity/bus/`)

```go
type MessageBusProvider interface {
    Publish(ctx context.Context, subject string, msg Message) error
    Subscribe(ctx context.Context, subject string, handler MessageHandler) (Subscription, error)
    Request(ctx context.Context, subject string, msg Message, timeout time.Duration) (Message, error)
    Close() error
}

// Message includes sender identity for authentication verification.
type Message struct {
    ID        string            // unique message ID
    SenderUID string            // worker uid (verified by bus, not self-reported)
    Subject   string            // NATS subject
    Payload   []byte            // A2A task or plain message
    Metadata  map[string]string // headers, correlation IDs
    Timestamp time.Time
}
```

First implementation: a NATS client wrapping nats.go + JetStream.
Error handling and resilience:
- JetStream consumers with explicit ack and max delivery attempts (default 3). Messages that exceed retries go to a dead-letter subject (`system.deadletter.<original-subject>`).
- Request-reply uses context deadlines. Timeout returns a typed error, not a generic failure.
- Connection loss triggers automatic reconnect (NATS client built-in). Missed messages during disconnect are replayed from JetStream on reconnect.
- Circuit breaker on the auau bridge: if auau is unreachable, buffer messages locally and retry with exponential backoff. Alert on `system.alerts` after N failures.
#### 2. arpi-agent sidecar (Compute domain: `compute/sidecar/`)
A Go binary injected into every sandbox alongside execd. Responsibilities:
- Read the injected template TOML.
- Connect to NATS via gateway-approved egress (see “Security model” below).
- Expose an A2A server (Agent Card + task handler) on a local port inside the sandbox.
- Bridge CLI agent stdio to A2A tasks (translate A2A task requests into CLI commands via execd, stream output back as task updates).
- Report status/health to the `worker.<uid>.status` subject.
#### 3. A2A adapter (inside arpi-agent)
Each sandbox exposes an A2A Agent Card describing the worker's capabilities (derived from the template's `[workstation]` section — `mcps` and `skills` map to A2A capability descriptors). The adapter:
- Serves the Agent Card at `/.well-known/agent.json` inside the sandbox.
- Translates incoming A2A tasks to NATS messages (and vice versa).
- Implements the A2A task lifecycle: `submitted -> working -> completed | failed | input-required`.
- `input-required` triggers a message to `system.alerts`, which bridges to auau (WhatsApp notification).
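Inside the adapter, the lifecycle can be enforced as a transition table. The sketch below uses the states listed above; the exact set of legal transitions should be verified against the A2A v0.2.x spec:

```go
package main

import "fmt"

// Legal task state transitions as used by the adapter. Terminal states
// (completed, failed) have no outgoing edges.
var transitions = map[string][]string{
	"submitted":      {"working"},
	"working":        {"completed", "failed", "input-required"},
	"input-required": {"working"}, // resumes once the human replies
}

// canTransition reports whether moving from one task state to another is legal.
func canTransition(from, to string) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("working", "input-required")) // true
	fmt.Println(canTransition("completed", "working"))      // false
}
```

Rejecting illegal transitions at the adapter boundary keeps spec drift visible: if a future A2A version adds states, the table fails loudly instead of silently accepting unknown input.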
#### 4. auau bridge (Connectivity domain: `api/connectivity/bridges/auau/`)
A bridge service in the control plane that:
- Subscribes to NATS subjects (`system.alerts`, `worker.*.outbox`).
- Forwards relevant messages to auau's REST API (which delivers via WhatsApp).
- Receives auau webhooks (human WhatsApp replies) and publishes to the target worker’s NATS inbox.
- Uses auau’s existing MCP integration for structured tool calls when needed.
#### 5. `arpi chat` CLI command (`cli/cmd/chat.go`)
An IRC-style terminal interface. The CLI connects to the control plane’s NATS endpoint as a thin client:
- `arpi chat #project-foo` — subscribe to a project feed, see all worker activity.
- `arpi chat @codex-1` — direct message to a specific worker.
- Messages from workers rendered with nick-style prefixes (`[codex-1]`, `[claude-2]`).
- Human messages published to the target's NATS inbox.
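The IRC-style addressing reduces to a small target parser in the CLI. A sketch, with assumed error handling:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// chatSubject maps an arpi chat target onto a NATS subject:
// "#<name>" -> project feed, "@<uid>" -> worker inbox.
// Assumption: the text after '#' is used verbatim as the project id.
func chatSubject(target string) (string, error) {
	switch {
	case strings.HasPrefix(target, "#"):
		return "project." + target[1:] + ".feed", nil
	case strings.HasPrefix(target, "@"):
		return "worker." + target[1:] + ".inbox", nil
	default:
		return "", errors.New("target must start with # (project) or @ (worker)")
	}
}

func main() {
	s, _ := chatSubject("#foo")
	fmt.Println(s) // project.foo.feed
	s, _ = chatSubject("@codex-1")
	fmt.Println(s) // worker.codex-1.inbox
}
```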
### Security model

Worker-to-worker authentication uses NATS credential-based auth, tied to arpi's identity domain:
- Connection auth: Each worker gets a NATS credential (NKey or JWT) derived from its uid at spawn time. The control plane provisions this alongside other credentials. Workers cannot connect to NATS without a valid credential.
- Subject ACLs: NATS authorization restricts each worker to:
  - Publish to: `worker.<own-uid>.outbox`, `worker.<own-uid>.status`, `project.<assigned-project>.feed`
  - Subscribe to: `worker.<own-uid>.inbox`, `project.<assigned-project>.feed`, `system.discovery`
  - Request-reply: `worker.<target-uid>.inbox` (for A2A task delegation — target must be in the same org)
- Identity verification: `Message.SenderUID` is set by the bus from the authenticated connection, not by the sender. Workers cannot spoof identity.
- Wall 2 egress: Sandbox network rules allow egress only to the gateway (Wall 1). NATS traffic routes through the gateway. The sandbox cannot reach NATS directly — the gateway proxies and logs all bus traffic.
- TLS: All NATS connections use TLS. No cleartext bus traffic.
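The subject ACLs can be generated per worker at spawn time and handed to the NATS server's authorization config. The sketch below expresses them as plain data; the wiring into nats-server options is omitted, and the request-reply grant is noted rather than modeled:

```go
package main

import "fmt"

// workerPermissions mirrors the ACLs above: what a worker with the given
// uid, assigned to the given project, may publish and subscribe to.
type workerPermissions struct {
	Publish   []string
	Subscribe []string
}

func permissionsFor(uid, project string) workerPermissions {
	return workerPermissions{
		Publish: []string{
			"worker." + uid + ".outbox",
			"worker." + uid + ".status",
			"project." + project + ".feed",
			// Request-reply delegation targets (worker.<target-uid>.inbox)
			// are granted separately, scoped to same-org workers.
		},
		Subscribe: []string{
			"worker." + uid + ".inbox",
			"project." + project + ".feed",
			"system.discovery",
		},
	}
}

func main() {
	p := permissionsFor("codex-1", "foo")
	fmt.Println(p.Publish[0])   // worker.codex-1.outbox
	fmt.Println(p.Subscribe[2]) // system.discovery
}
```

Keeping the ACLs as data means the same source of truth can feed both the server config and a unit test asserting that no worker is ever granted another worker's outbox.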
### NATS subject schema

```text
worker.<uid>.inbox        # messages TO this worker (direct)
worker.<uid>.outbox       # messages FROM this worker (broadcast)
worker.<uid>.status       # health/state updates
project.<id>.feed         # all activity for a project
system.alerts             # blocked workers, errors, human-needed
system.discovery          # A2A Agent Card announcements
system.deadletter.<subj>  # failed messages after max retries
```

### Local dev vs production

| Aspect | Local dev | Production |
|---|---|---|
| NATS server | Embedded in control plane process (in-memory, no disk) | Standalone NATS cluster (JetStream with file storage) |
| Connection | `nats://localhost:4222` (auto-started by `arpi spawn`) | Configured via `ARPI_NATS_URL` or control plane config |
| Auth | No auth (single-user, localhost only) | NKey/JWT per worker, TLS required |
| Persistence | In-memory JetStream (messages lost on restart) | File-backed JetStream (durable) |
| auau bridge | Optional (only if auau is configured) | Always running |
### Affected paths

- `api/connectivity/bus/` — `MessageBusProvider` interface + NATS implementation
- `api/connectivity/bridges/auau/` — auau webhook bridge service
- `compute/sidecar/` — arpi-agent Go binary (the sandbox sidecar)
- `cli/cmd/chat.go` — new `arpi chat` command (thin client)
- `cli/cmd/spawn.go` — spawn registers worker with bus via control plane API
- `cli/cmd/status.go` — status reads from `worker.*.status` via control plane API
### Dependencies

- Add `github.com/nats-io/nats.go` (NATS client)
- Add `github.com/nats-io/nats-server/v2` (embedded server for local dev)
- A2A types: define locally from the spec (no official Go SDK yet), or generate from the JSON-RPC schema
### Patterns to follow

- Provider interface pattern from `toolchain.md` — `MessageBusProvider` alongside `IAMProvider`, `GatewayProvider`, `SandboxProvider`.
- Sidecar pattern: arpi-agent alongside execd, not replacing it. Call execd's API for low-level ops.
- Subject naming: dot-separated hierarchy matching NATS conventions.
- Domain ownership: bus logic in Connectivity, sidecar in Compute, CLI is just a client.
### Patterns to avoid

- Do not put bus logic in `cli/internal/` — the bus is a control plane service, not a CLI feature.
- Do not embed messaging logic in execd — it's OpenSandbox's daemon, not ours.
- Do not bypass Wall 2 egress filtering — NATS traffic from sandbox routes through gateway-approved egress.
- Do not build a Slack bot — auau (WhatsApp) is the human bridge.
- Do not replace MCP with A2A — MCP is worker-to-tool, A2A is worker-to-worker. They coexist.
- Do not let workers self-report identity in messages — the bus sets `Message.SenderUID` from the authenticated connection.
### Migration steps

This is additive — no existing functionality is replaced. Build order:

1. `MessageBusProvider` interface + NATS implementation (can be tested standalone).
2. arpi-agent sidecar with NATS client (deploy in sandbox alongside execd).
3. A2A adapter in arpi-agent (workers discover and delegate to each other).
4. auau bridge (NATS <-> WhatsApp via auau webhooks).
5. `arpi chat` CLI command (IRC-style terminal view).
6. Control plane spawn integration (auto-register worker with bus on `POST /v1/workers`).
### Verification

- `MessageBusProvider` interface defined with Publish, Subscribe, Request, Close methods
- NATS implementation passes unit tests for pub/sub, request-reply, and JetStream replay
- Embedded NATS server starts within the control plane for local dev (no external NATS required)
- arpi-agent sidecar runs inside the sandbox alongside execd without conflicts
- arpi-agent connects to NATS via gateway-approved egress (not bypassing Wall 2)
- Worker cannot connect to NATS without a valid NKey/JWT credential
- Worker cannot publish to another worker's outbox (subject ACL enforced)
- `Message.SenderUID` is set by the bus from the authenticated connection, not self-reported
- A2A Agent Card served from the sandbox at `/.well-known/agent.json`, discoverable via `system.discovery`
- A2A task delegation works: worker A sends a task to worker B, receives the result
- `input-required` state triggers a message to `system.alerts`
- auau bridge forwards `system.alerts` to WhatsApp via auau's REST API
- Human WhatsApp reply (via auau webhook) arrives at the target worker's NATS inbox
- `arpi chat #project-foo` shows real-time worker activity in the terminal
- `arpi chat @codex-1` sends a direct message to a worker and receives a reply
- `arpi status` reflects worker presence from NATS status subjects
- Worker restart replays missed JetStream messages (no message loss)
- NATS traffic from the sandbox is visible in the gateway audit log
- All NATS connections use TLS (no cleartext bus traffic)
- Dead-letter subject receives messages after max delivery attempts are exceeded
## Pros and Cons of the Options

### NATS JetStream + A2A protocol

NATS provides the message transport (pub/sub, request-reply, persistence via JetStream). A2A provides the worker interop protocol (Agent Cards, task lifecycle, discovery). Combined, they give both infrastructure and semantics.
- Good, because NATS is a single Go binary, embeddable in arpi control plane (no separate server for dev).
- Good, because JetStream provides at-least-once delivery and message replay — workers that restart don’t lose context.
- Good, because NATS subjects (`worker.<uid>.inbox`) map naturally to IRC-style channels for human UX.
- Good, because A2A is backed by Google plus 50 enterprise partners (spec) — the strongest interop trajectory.
- Good, because A2A's `input-required` task state maps perfectly to "worker needs human help" notifications.
- Good, because NATS leaf nodes enable future cross-org federation without architecture changes.
- Bad, because A2A spec is still v0.2.x — breaking changes possible. Mitigated by adapter isolation.
- Bad, because no official A2A Go SDK — must define types from spec.
- Neutral, because adds NATS as a runtime dependency (but embeddable, so no ops burden for local dev).
### Matrix homeserver

Federated messaging protocol with rooms, E2EE, presence, and custom event types. Mature spec (Matrix 1.x).
- Good, because built-in federation — workers across organizations can communicate.
- Good, because E2EE for sensitive worker communication.
- Good, because DAG-based event model provides strong auditability.
- Good, because bridges to IRC, Slack, Discord exist.
- Bad, because heavyweight — Synapse/Dendrite is a full server, not embeddable in a CLI.
- Bad, because overkill for intra-system worker messaging (arpi workers talking to each other).
- Bad, because there is no embeddable Go homeserver library — Dendrite is written in Go but must run as a separate process.
- Bad, because the DAG model adds latency and complexity unnecessary for simple task delegation.
### Custom HTTP/WebSocket point-to-point

Build a bespoke messaging layer using HTTP REST + WebSocket, with no standard protocol.
- Good, because full control over the protocol and implementation.
- Good, because no external dependencies.
- Bad, because reinventing pub/sub, persistence, replay, presence, discovery.
- Bad, because no interoperability with external workers (vendor lock-in to arpi’s custom protocol).
- Bad, because significant engineering effort for table-stakes messaging features.
### SSH + terminal only (status quo)

Keep the current model: one terminal per worker, stdout streaming, no worker-to-worker communication.
- Good, because zero additional complexity.
- Good, because SSH is universal and works today.
- Bad, because no worker-to-worker communication — workers are isolated.
- Bad, because human must be at terminal to interact — no async notifications.
- Bad, because doesn’t scale beyond 2-3 workers (too many terminal tabs).
- Bad, because no message persistence — if you miss stdout, it’s gone.
## More Information

### Relationship to existing ADRs

- `two-wall-security.md` — NATS traffic from sandboxes must route through Wall 2 egress. The bus does not create a new network path that bypasses sandbox isolation. See "Security model" above for specifics.
- `sandbox-strategy.md` — arpi-agent is a sidecar alongside OpenSandbox's execd. It uses execd's API for file/command operations and adds messaging on top.
- `toolchain.md` — `MessageBusProvider` follows the same provider interface pattern as `IAMProvider`, `GatewayProvider`, `SandboxProvider`.
- `template-schema.md` — `arpi spawn` (via control plane `POST /v1/workers`) gains responsibility for registering the worker with the bus and publishing its Agent Card. arpi-agent lives in the Compute domain; the bus service lives in the Connectivity domain.
### Relationship to MCP and A2A

MCP and A2A are complementary:

| | MCP | A2A |
|---|---|---|
| What | Worker-to-tool integration | Worker-to-worker delegation |
| Direction | Worker calls tool | Worker delegates to worker |
| Transport | stdio or HTTP+SSE | HTTP + JSON-RPC (over NATS in arpi) |
| Example | Worker calls Sentry MCP to fetch errors | Worker asks review-bot to check its PR |
Both coexist in arpi. MCP servers are configured via the registry. A2A tasks flow over the NATS bus.
### Conditions to revisit

- If the A2A spec reaches v1.0 with breaking changes from v0.2.x, update the adapter (isolated by design).
- If an official A2A Go SDK is released, replace local type definitions.
- If cross-org federation becomes a requirement, evaluate NATS leaf nodes vs. A2A’s native HTTP transport for cross-boundary communication.
- If OpenSandbox ships its Go SDK with built-in messaging hooks, evaluate whether arpi-agent can delegate bus connectivity to execd.