Adopt NATS message bus with A2A protocol for worker-to-worker communication

Status: Accepted

arpi spawns workers in sandboxes but has no mechanism for workers to communicate with each other or for humans to message workers asynchronously. The current model is terminal-only, single-worker, synchronous — you watch one worker’s stdout in one terminal tab. This breaks down when:

  • Multiple workers work on the same project and need to coordinate (“what’s the API schema?”, “I’m done, you can integrate now”).
  • A worker is blocked and needs human input, but the human is away from the terminal.
  • Humans want to redirect, monitor, or query workers without being tethered to a terminal session.

The industry has no standard agent message bus as of March 2026. Google’s A2A protocol (v0.2.x) defines agent-to-agent task delegation over HTTP. Anthropic’s MCP defines agent-to-tool integration. Neither defines the messaging infrastructure layer. This is the gap arpi fills.

Sharpi already operates auau — a self-hosted WhatsApp API platform with MCP integration. This provides an existing human-to-agent bridge over WhatsApp, eliminating the need for Slack.

Decision drivers:

  • Go-first stack — the message bus must have a production Go client and ideally be embeddable in the arpi process.
  • Two-wall security — messages between workers in sandboxes must route through approved channels; the bus must not create a bypass around Wall 1 (gateway) or Wall 2 (sandbox).
  • OpenSandbox as sandbox standard — the arpi-agent sidecar runs alongside execd inside each sandbox; the messaging client lives in the sidecar, not in execd.
  • auau already exists — WhatsApp transport is solved. The bus needs a bridge to auau, not a replacement for it.
  • Provider-agnostic interfaces — existing pattern from the toolchain ADR. The bus should be behind a MessageBusProvider interface.
  • Async-first — workers run for hours. Humans are not always at a terminal. The system must handle store-and-forward, not just real-time streaming.
  • Control plane owns the bus — the message bus is a platform service (Connectivity domain), not a CLI feature. The CLI is a thin client that connects to it.

Considered options:

  • NATS JetStream + A2A protocol
  • Matrix homeserver
  • Custom HTTP/WebSocket point-to-point
  • SSH + terminal only (status quo)

Chosen option: “NATS JetStream + A2A protocol”, because it matches arpi’s Go stack (NATS is Go-native, embeddable), provides the right weight class (lighter than Kafka/Matrix, stronger than Redis Streams), and pairs a proven message transport (NATS) with the emerging agent interop standard (A2A) for structured task delegation.

A2A defines the protocol workers speak (Agent Cards, task lifecycle, discovery). NATS provides the transport and persistence (pub/sub, request-reply, JetStream for replay). auau provides the WhatsApp human bridge. MCP remains the tool integration layer — these are complementary, not competing.

A2A maturity note: A2A is v0.2.x as of March 2026. The adapter layer isolates arpi from spec changes — A2A types are defined locally and translated at the boundary. If the spec breaks compatibility at v1.0, only the adapter needs updating. If an official Go SDK ships, it replaces the local types.

  • Good, because workers can discover and delegate work to each other via A2A Agent Cards without arpi-specific coupling.
  • Good, because NATS JetStream provides message replay — a worker that restarts catches up on missed messages automatically.
  • Good, because NATS can be embedded in the control plane process (no separate server for local dev) or run standalone for production.
  • Good, because the subject-based addressing (worker.<uid>.inbox, project.<id>.feed) maps naturally to an IRC-like channel UX for humans.
  • Good, because auau bridge means humans can message workers from WhatsApp today — no need to build or adopt Slack.
  • Bad, because this adds a new runtime dependency (NATS) to the arpi stack.
  • Bad, because A2A is still maturing (v0.2.x) — the spec may change, requiring adapter updates. Mitigated by the adapter isolation pattern.
  • Neutral, because cross-organization federation (workers across different companies) is deferred. NATS leaf nodes and A2A’s HTTP transport can support this later.

Human interfaces       Control Plane                        Worker sandboxes
-----------------      ---------------                      ----------------
WhatsApp (auau) --+                                         +-- Sandbox A
Terminal (CLI)  --+--> Connectivity domain <---->     ------+-- Sandbox B
Web UI (future) --+    (NATS embedded or standalone)        +-- Sandbox C
                       subjects:                                |
                         worker.<uid>.inbox                     +-- execd (files/cmds)
                         worker.<uid>.outbox                    +-- arpi-agent (sidecar)
                         project.<id>.feed                      +-- A2A adapter
                         system.alerts

The message bus is a control plane service in the Connectivity domain (per template-schema ontology). The CLI connects to it as a thin client. Workers in sandboxes connect via gateway-approved egress.

1. MessageBusProvider interface (Connectivity domain: api/connectivity/bus/)

type MessageBusProvider interface {
	Publish(ctx context.Context, subject string, msg Message) error
	Subscribe(ctx context.Context, subject string, handler MessageHandler) (Subscription, error)
	Request(ctx context.Context, subject string, msg Message, timeout time.Duration) (Message, error)
	Close() error
}

// Message includes sender identity for authentication verification.
type Message struct {
	ID        string            // unique message ID
	SenderUID string            // worker uid (verified by bus, not self-reported)
	Subject   string            // NATS subject
	Payload   []byte            // A2A task or plain message
	Metadata  map[string]string // headers, correlation IDs
	Timestamp time.Time
}

First implementation: NATS client wrapping nats.go + JetStream.
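
Because the bus sits behind an interface, unit tests for callers do not need a NATS server. The following is a minimal in-memory sketch covering only Publish and Subscribe (Request and Close are omitted, so it does not satisfy the full interface). The shapes of MessageHandler and Subscription are assumptions, since the ADR does not define them, and memBus, memSub, and demo are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Message is trimmed to the fields this sketch needs.
type Message struct {
	SenderUID string
	Subject   string
	Payload   []byte
}

// MessageHandler and Subscription are assumed shapes; the ADR does not define them.
type MessageHandler func(ctx context.Context, msg Message)

type Subscription interface{ Unsubscribe() error }

// memBus delivers synchronously to exact-match subscribers (no wildcards, no replay).
type memBus struct {
	mu     sync.Mutex
	nextID int
	subs   map[string]map[int]MessageHandler
}

func newMemBus() *memBus {
	return &memBus{subs: make(map[string]map[int]MessageHandler)}
}

func (b *memBus) Publish(ctx context.Context, subject string, msg Message) error {
	b.mu.Lock()
	handlers := make([]MessageHandler, 0, len(b.subs[subject]))
	for _, h := range b.subs[subject] {
		handlers = append(handlers, h)
	}
	b.mu.Unlock() // release before calling handlers so they may publish too

	msg.Subject = subject
	for _, h := range handlers {
		h(ctx, msg)
	}
	return nil
}

func (b *memBus) Subscribe(ctx context.Context, subject string, h MessageHandler) (Subscription, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.subs[subject] == nil {
		b.subs[subject] = make(map[int]MessageHandler)
	}
	b.nextID++
	b.subs[subject][b.nextID] = h
	return &memSub{bus: b, subject: subject, id: b.nextID}, nil
}

type memSub struct {
	bus     *memBus
	subject string
	id      int
}

func (s *memSub) Unsubscribe() error {
	s.bus.mu.Lock()
	defer s.bus.mu.Unlock()
	delete(s.bus.subs[s.subject], s.id)
	return nil
}

// demo publishes twice; only the message sent before Unsubscribe is received.
func demo() []string {
	bus := newMemBus()
	ctx := context.Background()
	var got []string
	sub, _ := bus.Subscribe(ctx, "worker.w1.inbox", func(_ context.Context, m Message) {
		got = append(got, m.SenderUID+": "+string(m.Payload))
	})
	_ = bus.Publish(ctx, "worker.w1.inbox", Message{SenderUID: "w2", Payload: []byte("schema ready")})
	_ = sub.Unsubscribe()
	_ = bus.Publish(ctx, "worker.w1.inbox", Message{SenderUID: "w2", Payload: []byte("not delivered")})
	return got
}

func main() {
	fmt.Println(demo()) // prints [w2: schema ready]
}
```

A fake like this only exercises caller logic. Replay, acknowledgements, and identity stamping still need tests against the real NATS implementation.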

Error handling and resilience:

  • JetStream consumers with explicit ack and max delivery attempts (default 3). Messages that exceed retries go to a dead-letter subject (system.deadletter.<original-subject>).
  • Request-reply uses context deadlines. Timeout returns a typed error, not a generic failure.
  • Connection loss triggers automatic reconnect (NATS client built-in). Missed messages during disconnect are replayed from JetStream on reconnect.
  • Circuit breaker on the auau bridge: if auau is unreachable, buffer messages locally and retry with exponential backoff. Alert on system.alerts after N failures.
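
The retry and dead-letter policy in the first bullet can be simulated without NATS. This sketch assumes the handler returns an error to signal a missing ack. The names deliver and deadLetterSubject are hypothetical, and in a real deployment the attempt limit would be configured on the JetStream consumer rather than looped in application code.

```go
package main

import (
	"errors"
	"fmt"
)

const defaultMaxDeliver = 3 // default number of delivery attempts stated above

var errTransient = errors.New("handler did not ack")

// deadLetterSubject builds system.deadletter.<original-subject>.
func deadLetterSubject(original string) string {
	return "system.deadletter." + original
}

// deliver calls handle up to maxDeliver times. It returns the subject the
// message ends up on and the attempts used: the original subject on success,
// the dead-letter subject once all attempts have failed.
func deliver(subject string, maxDeliver int, handle func(attempt int) error) (string, int) {
	for attempt := 1; attempt <= maxDeliver; attempt++ {
		if err := handle(attempt); err == nil {
			return subject, attempt // handler acked
		}
	}
	return deadLetterSubject(subject), maxDeliver
}

func main() {
	// Fails once, then succeeds on the second attempt.
	flaky := func(attempt int) error {
		if attempt < 2 {
			return errTransient
		}
		return nil
	}
	fmt.Println(deliver("worker.w1.inbox", defaultMaxDeliver, flaky))
	// prints: worker.w1.inbox 2

	broken := func(int) error { return errTransient }
	fmt.Println(deliver("worker.w1.inbox", defaultMaxDeliver, broken))
	// prints: system.deadletter.worker.w1.inbox 3
}
```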

2. arpi-agent sidecar (Compute domain: compute/sidecar/)

A Go binary injected into every sandbox alongside execd. Responsibilities:

  • Read the injected template TOML.
  • Connect to NATS via gateway-approved egress (see “Security model” below).
  • Expose an A2A server (Agent Card + task handler) on a local port inside the sandbox.
  • Bridge CLI agent stdio to A2A tasks (translate A2A task requests into CLI commands via execd, stream output back as task updates).
  • Report status/health to worker.<uid>.status subject.

3. A2A adapter (inside arpi-agent)

Each sandbox exposes an A2A Agent Card describing the worker’s capabilities (derived from the template’s [workstation] section — mcps and skills map to A2A capability descriptors). The adapter:

  • Serves the Agent Card at /.well-known/agent.json inside the sandbox.
  • Translates incoming A2A tasks to NATS messages (and vice versa).
  • Implements the A2A task lifecycle: submitted -> working -> completed | failed | input-required.
  • input-required triggers a message to system.alerts which bridges to auau (WhatsApp notification).
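
The task lifecycle above can be read as a small state machine. The sketch below encodes the listed transitions. The input-required -> working edge (resuming after a human reply) is an assumption not stated in this ADR, and TaskState, canTransition, and alertSubject are hypothetical names.

```go
package main

import "fmt"

type TaskState string

const (
	Submitted     TaskState = "submitted"
	Working       TaskState = "working"
	Completed     TaskState = "completed"
	Failed        TaskState = "failed"
	InputRequired TaskState = "input-required"
)

// transitions encodes submitted -> working -> completed | failed | input-required.
// Assumption: a task resumes from input-required back to working.
var transitions = map[TaskState][]TaskState{
	Submitted:     {Working},
	Working:       {Completed, Failed, InputRequired},
	InputRequired: {Working},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to TaskState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// alertSubject returns the subject to notify when a task enters a state,
// or "" when no notification is needed.
func alertSubject(to TaskState) string {
	if to == InputRequired {
		return "system.alerts"
	}
	return ""
}

func main() {
	fmt.Println(canTransition(Submitted, Working))   // prints true
	fmt.Println(canTransition(Submitted, Completed)) // prints false
	fmt.Println(canTransition(Working, InputRequired), alertSubject(InputRequired))
	// prints: true system.alerts
}
```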

4. auau bridge (Connectivity domain: api/connectivity/bridges/auau/)

A bridge service in the control plane that:

  • Subscribes to NATS subjects (system.alerts, worker.*.outbox).
  • Forwards relevant messages to auau’s REST API (which delivers via WhatsApp).
  • Receives auau webhooks (human WhatsApp replies) and publishes to the target worker’s NATS inbox.
  • Uses auau’s existing MCP integration for structured tool calls when needed.
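
The bridge's retry policy from the error-handling notes (exponential backoff, alert after N failures) might look like the sketch below. The base delay, cap, and threshold values are illustrative assumptions, and backoff and breaker are hypothetical names rather than part of the design.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnreachable = errors.New("auau unreachable")

// backoff returns base doubled once per attempt (attempt starts at 0),
// capped at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// breaker counts consecutive failures. threshold is the "N" from the text.
type breaker struct {
	failures  int
	threshold int
}

// record notes the outcome of one forward attempt and reports whether an
// alert should be published to system.alerts (exactly once, at the threshold).
func (b *breaker) record(err error) bool {
	if err == nil {
		b.failures = 0
		return false
	}
	b.failures++
	return b.failures == b.threshold
}

func main() {
	for attempt := 0; attempt <= 5; attempt++ {
		fmt.Print(backoff(attempt, time.Second, 30*time.Second), " ")
	}
	fmt.Println() // the loop prints: 1s 2s 4s 8s 16s 30s

	b := &breaker{threshold: 3}
	for i := 1; i <= 3; i++ {
		fmt.Println(i, b.record(errUnreachable)) // alert is true only on the third failure
	}
}
```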

5. arpi chat CLI command (cli/cmd/chat.go)

An IRC-style terminal interface. The CLI connects to the control plane’s NATS endpoint as a thin client:

  • arpi chat #project-foo — subscribe to a project feed, see all worker activity.
  • arpi chat @codex-1 — direct message to a specific worker.
  • Messages from workers rendered with nick-style prefixes ([codex-1], [claude-2]).
  • Human messages published to the target’s NATS inbox.
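
How a chat target becomes a NATS subject is not specified here. One possible mapping is sketched below, assuming the text after # or @ is used verbatim as the project id or worker uid. The functions targetSubject and renderLine are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// targetSubject maps "#<project-id>" to project.<id>.feed and
// "@<worker-uid>" to worker.<uid>.inbox.
// Assumption: the name after the prefix is the id or uid itself.
func targetSubject(target string) (string, error) {
	if len(target) < 2 {
		return "", fmt.Errorf("invalid chat target %q", target)
	}
	name := target[1:]
	// Reject characters that carry meaning in NATS subjects.
	if strings.ContainsAny(name, ".*> \t") {
		return "", fmt.Errorf("invalid name in chat target %q", target)
	}
	switch target[0] {
	case '#':
		return "project." + name + ".feed", nil
	case '@':
		return "worker." + name + ".inbox", nil
	}
	return "", fmt.Errorf("chat target %q must start with # or @", target)
}

// renderLine formats a worker message with a nick-style prefix.
func renderLine(senderUID, text string) string {
	return "[" + senderUID + "] " + text
}

func main() {
	for _, t := range []string{"#project-foo", "@codex-1", "codex-1"} {
		subject, err := targetSubject(t)
		fmt.Println(subject, err)
	}
	fmt.Println(renderLine("codex-1", "tests are green"))
	// the last line prints: [codex-1] tests are green
}
```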

Security model

Worker-to-worker authentication uses NATS credential-based auth, tied to arpi’s identity domain:

  1. Connection auth: Each worker gets a NATS credential (NKey or JWT) derived from its uid at spawn time. The control plane provisions this alongside other credentials. Workers cannot connect to NATS without a valid credential.
  2. Subject ACLs: NATS authorization restricts each worker to:
    • Publish to: worker.<own-uid>.outbox, worker.<own-uid>.status, project.<assigned-project>.feed
    • Subscribe to: worker.<own-uid>.inbox, project.<assigned-project>.feed, system.discovery
    • Request-reply: worker.<target-uid>.inbox (for A2A task delegation — target must be in the same org)
  3. Identity verification: Message.SenderUID is set by the bus from the authenticated connection, not by the sender. Workers cannot spoof identity.
  4. Wall 2 egress: Sandbox network rules allow egress only to the gateway (Wall 1). NATS traffic routes through the gateway. The sandbox cannot reach NATS directly — the gateway proxies and logs all bus traffic.
  5. TLS: All NATS connections use TLS. No cleartext bus traffic.
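
To make the subject ACLs in item 2 concrete, the sketch below models the publish and subscribe allow-lists and checks subjects against them with NATS-style wildcard matching ("*" for one token, ">" for one or more trailing tokens). It illustrates the rules only: in the design above, enforcement belongs to NATS authorization, not application code. The request-reply rule is omitted because it needs an org lookup. workerACL, aclFor, allowed, and subjectMatches are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether subject matches a NATS-style pattern.
// "*" matches exactly one token; ">" matches one or more trailing tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) || (tok != "*" && tok != s[i]) {
			return false
		}
	}
	return len(p) == len(s)
}

// workerACL holds the allow-lists for one worker.
type workerACL struct {
	publish   []string
	subscribe []string
}

// aclFor builds the publish/subscribe lists from item 2 for a worker uid
// and its assigned project id.
func aclFor(uid, project string) workerACL {
	return workerACL{
		publish: []string{
			"worker." + uid + ".outbox",
			"worker." + uid + ".status",
			"project." + project + ".feed",
		},
		subscribe: []string{
			"worker." + uid + ".inbox",
			"project." + project + ".feed",
			"system.discovery",
		},
	}
}

// allowed reports whether subject matches any pattern in the list.
func allowed(patterns []string, subject string) bool {
	for _, p := range patterns {
		if subjectMatches(p, subject) {
			return true
		}
	}
	return false
}

func main() {
	acl := aclFor("w1", "foo")
	fmt.Println(allowed(acl.publish, "worker.w1.outbox"))              // prints true
	fmt.Println(allowed(acl.publish, "worker.w2.outbox"))              // prints false
	fmt.Println(allowed(acl.subscribe, "system.alerts"))               // prints false
	fmt.Println(subjectMatches("worker.*.outbox", "worker.w2.outbox")) // prints true
}
```

The last call shows why a wildcard such as worker.*.outbox suits the control-plane bridge but should not appear in a worker's own permissions.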

Subject hierarchy:

worker.<uid>.inbox        # messages TO this worker (direct)
worker.<uid>.outbox       # messages FROM this worker (broadcast)
worker.<uid>.status       # health/state updates
project.<id>.feed         # all activity for a project
system.alerts             # blocked workers, errors, human-needed
system.discovery          # A2A Agent Card announcements
system.deadletter.<subj>  # failed messages after max retries

| Aspect | Local dev | Production |
|---|---|---|
| NATS server | Embedded in control plane process (in-memory, no disk) | Standalone NATS cluster (JetStream with file storage) |
| Connection | nats://localhost:4222 (auto-started by arpi spawn) | Configured via ARPI_NATS_URL or control plane config |
| Auth | No auth (single-user, localhost only) | NKey/JWT per worker, TLS required |
| Persistence | In-memory JetStream (messages lost on restart) | File-backed JetStream (durable) |
| auau bridge | Optional (only if auau is configured) | Always running |
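
The two modes could be selected at startup roughly as follows. This sketch assumes that a non-empty ARPI_NATS_URL means production mode, which the ADR does not state. busConfig and resolveConfig are hypothetical names, and the environment lookup is passed in as a function so the logic can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

const localDevURL = "nats://localhost:4222"

// busConfig captures the rows of the table that differ between modes.
type busConfig struct {
	URL         string
	Embedded    bool // start the NATS server inside the control plane process
	RequireAuth bool // NKey/JWT per worker plus TLS
	FileStorage bool // file-backed JetStream instead of in-memory
}

// resolveConfig picks production settings when ARPI_NATS_URL is set
// (an assumption of this sketch) and local-dev defaults otherwise.
func resolveConfig(getenv func(string) string) busConfig {
	if url := getenv("ARPI_NATS_URL"); url != "" {
		return busConfig{URL: url, RequireAuth: true, FileStorage: true}
	}
	return busConfig{URL: localDevURL, Embedded: true}
}

func main() {
	cfg := resolveConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```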

Affected paths:

  • api/connectivity/bus/: MessageBusProvider interface + NATS implementation
  • api/connectivity/bridges/auau/: auau webhook bridge service
  • compute/sidecar/: arpi-agent Go binary (the sandbox sidecar)
  • cli/cmd/chat.go: new arpi chat command (thin client)
  • cli/cmd/spawn.go: spawn registers worker with bus via control plane API
  • cli/cmd/status.go: status reads from worker.*.status via control plane API

Dependencies:

  • Add github.com/nats-io/nats.go (NATS client)
  • Add github.com/nats-io/nats-server/v2 (embedded server for local dev)
  • A2A types: define locally from spec (no official Go SDK yet), or generate from JSON-RPC schema

Patterns to follow:

  • Provider interface pattern from toolchain.md: MessageBusProvider alongside IAMProvider, GatewayProvider, SandboxProvider.
  • Sidecar pattern: arpi-agent alongside execd, not replacing it. Call execd’s API for low-level ops.
  • Subject naming: dot-separated hierarchy matching NATS conventions.
  • Domain ownership: bus logic in Connectivity, sidecar in Compute, CLI is just a client.

Patterns to avoid:

  • Do not put bus logic in cli/internal/. The bus is a control plane service, not a CLI feature.
  • Do not embed messaging logic in execd. It is OpenSandbox’s daemon, not ours.
  • Do not bypass Wall 2 egress filtering. NATS traffic from the sandbox routes through gateway-approved egress.
  • Do not build a Slack bot. auau (WhatsApp) is the human bridge.
  • Do not replace MCP with A2A. MCP is worker-to-tool, A2A is worker-to-worker. They coexist.
  • Do not let workers self-report identity in messages. The bus sets SenderUID from the authenticated connection.

This is additive — no existing functionality is replaced. Build order:

  1. MessageBusProvider interface + NATS implementation (can test standalone).
  2. arpi-agent sidecar with NATS client (deploy in sandbox alongside execd).
  3. A2A adapter in arpi-agent (workers discover and delegate to each other).
  4. auau bridge (NATS <-> WhatsApp via auau webhooks).
  5. arpi chat CLI command (IRC-style terminal view).
  6. Control plane spawn integration (auto-register worker with bus on POST /v1/workers).

Verification:

  • MessageBusProvider interface defined with Publish, Subscribe, Request, Close methods
  • NATS implementation passes unit tests for pub/sub, request-reply, and JetStream replay
  • Embedded NATS server starts within control plane for local dev (no external NATS required)
  • arpi-agent sidecar runs inside sandbox alongside execd without conflicts
  • arpi-agent connects to NATS via gateway-approved egress (not bypassing Wall 2)
  • Worker cannot connect to NATS without valid NKey/JWT credential
  • Worker cannot publish to another worker’s outbox (subject ACL enforced)
  • Message.SenderUID is set by bus from authenticated connection, not self-reported
  • A2A Agent Card served from sandbox at /.well-known/agent.json, discoverable via system.discovery
  • A2A task delegation works: worker A sends task to worker B, receives result
  • input-required state triggers message to system.alerts
  • auau bridge forwards system.alerts to WhatsApp via auau REST API
  • Human WhatsApp reply (via auau webhook) arrives at target worker’s NATS inbox
  • arpi chat #project-foo shows real-time worker activity in terminal
  • arpi chat @codex-1 sends direct message to worker, receives reply
  • arpi status reflects worker presence from NATS status subjects
  • Worker restart replays missed JetStream messages (no message loss)
  • NATS traffic from sandbox is visible in gateway audit log
  • All NATS connections use TLS (no cleartext bus traffic)
  • Dead-letter subject receives messages after max delivery attempts exceeded

NATS JetStream + A2A protocol: NATS provides the message transport (pub/sub, request-reply, persistence via JetStream). A2A provides the worker interop protocol (Agent Cards, task lifecycle, discovery). Combined, they give both infrastructure and semantics.

  • Good, because NATS is a single Go binary, embeddable in arpi control plane (no separate server for dev).
  • Good, because JetStream provides at-least-once delivery and message replay — workers that restart don’t lose context.
  • Good, because NATS subjects (worker.<uid>.inbox) map naturally to IRC-style channels for human UX.
  • Good, because A2A is backed by Google + 50 enterprise partners (spec) — strongest interop trajectory.
  • Good, because A2A’s input-required task state maps perfectly to “worker needs human help” notifications.
  • Good, because NATS leaf nodes enable future cross-org federation without architecture changes.
  • Bad, because A2A spec is still v0.2.x — breaking changes possible. Mitigated by adapter isolation.
  • Bad, because no official A2A Go SDK — must define types from spec.
  • Neutral, because adds NATS as a runtime dependency (but embeddable, so no ops burden for local dev).

Matrix homeserver: a federated messaging protocol with rooms, E2EE, presence, and custom event types. The spec is mature (Matrix 1.x).

  • Good, because built-in federation — workers across organizations can communicate.
  • Good, because E2EE for sensitive worker communication.
  • Good, because DAG-based event model provides strong auditability.
  • Good, because bridges to IRC, Slack, Discord exist.
  • Bad, because heavyweight — Synapse/Dendrite is a full server, not embeddable in a CLI.
  • Bad, because overkill for intra-system worker messaging (arpi workers talking to each other).
  • Bad, because no Go homeserver SDK — would need to run Dendrite (Go) as a separate process.
  • Bad, because the DAG model adds latency and complexity unnecessary for simple task delegation.

Custom HTTP/WebSocket point-to-point: build a bespoke messaging layer using HTTP REST + WebSocket, with no standard protocol.

  • Good, because full control over the protocol and implementation.
  • Good, because no external dependencies.
  • Bad, because reinventing pub/sub, persistence, replay, presence, discovery.
  • Bad, because no interoperability with external workers (vendor lock-in to arpi’s custom protocol).
  • Bad, because significant engineering effort for table-stakes messaging features.

SSH + terminal only (status quo): keep the current model, with one terminal per worker, stdout streaming, and no worker-to-worker communication.

  • Good, because zero additional complexity.
  • Good, because SSH is universal and works today.
  • Bad, because no worker-to-worker communication — workers are isolated.
  • Bad, because human must be at terminal to interact — no async notifications.
  • Bad, because doesn’t scale beyond 2-3 workers (too many terminal tabs).
  • Bad, because no message persistence — if you miss stdout, it’s gone.

Related ADRs:

  • two-wall-security.md: NATS traffic from sandboxes must route through Wall 2 egress. The bus does not create a new network path that bypasses sandbox isolation. See “Security model” above for specifics.
  • sandbox-strategy.md: arpi-agent is a sidecar alongside OpenSandbox’s execd. It uses execd’s API for file/command operations and adds messaging on top.
  • toolchain.md: MessageBusProvider follows the same provider interface pattern as IAMProvider, GatewayProvider, SandboxProvider.
  • template-schema.md: arpi spawn (via control plane POST /v1/workers) gains responsibility for registering the worker with the bus and publishing its Agent Card. arpi-agent lives in the Compute domain. The bus service lives in the Connectivity domain.

MCP and A2A are complementary:

| | MCP | A2A |
|---|---|---|
| What | Worker-to-tool integration | Worker-to-worker delegation |
| Direction | Worker calls tool | Worker delegates to worker |
| Transport | stdio or HTTP+SSE | HTTP + JSON-RPC (over NATS in arpi) |
| Example | Worker calls Sentry MCP to fetch errors | Worker asks review-bot to check its PR |

Both coexist in arpi. MCP servers are configured via the registry. A2A tasks flow over the NATS bus.

  • If A2A spec reaches v1.0 with breaking changes from v0.2.x, update the adapter (isolated by design).
  • If an official A2A Go SDK is released, replace local type definitions.
  • If cross-org federation becomes a requirement, evaluate NATS leaf nodes vs. A2A’s native HTTP transport for cross-boundary communication.
  • If OpenSandbox ships their Go SDK with built-in messaging hooks, evaluate whether arpi-agent can delegate bus connectivity to execd.