Introduction
Multi-agent LLM systems — where multiple AI models collaborate to complete complex tasks — have moved from research demos to production systems at companies like Cognition (the company behind Devin) and numerous enterprise software providers. Building these systems reliably at scale requires careful architectural thinking.
This post covers the core patterns, failure modes, and engineering practices for multi-agent systems in production.
Why Multi-Agent?
Single-agent architectures hit limits when tasks require:
- Context beyond a single model's window: A legal review spanning thousands of documents
- Parallelism: Running simultaneous research threads
- Specialization: Different models optimized for different subtasks (coding vs. analysis vs. writing)
- Verification: Independent models checking each other's work
- Long-horizon tasks: Multi-step plans where early decisions affect later ones
Multi-agent architectures address these by decomposing work across multiple models.
Core Architectural Patterns
1. Orchestrator-Subagent
The most common pattern: one central orchestrator agent plans and delegates to specialized subagents.
User → [Orchestrator]
           ├── [Research Agent]
           ├── [Code Agent]
           └── [Verification Agent]
Orchestrator responsibilities:
- Decompose the task into subtasks
- Assign subtasks to appropriate agents
- Track progress and handle failures
- Synthesize results into a coherent output
Tradeoffs:
- Single point of failure (orchestrator)
- Bottleneck if orchestrator is slow
- Excellent for tasks with clear hierarchical decomposition
Example: A software engineering agent that delegates to a code writer, a test writer, a debugger, and a documentation writer.
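The decompose-delegate-synthesize loop above can be sketched in a few lines. This is an illustrative skeleton, not a framework API: `run_agent` is a hypothetical stand-in for a call to a specialized model.

```python
from typing import Callable, List, Tuple

def run_agent(role: str, subtask: str) -> str:
    # Hypothetical stand-in for an LLM call to a specialized subagent.
    return f"[{role}] completed: {subtask}"

def orchestrate(task: str, decompose: Callable[[str], List[Tuple[str, str]]]) -> str:
    plan = decompose(task)                                  # 1. decompose into (role, subtask)
    results = [run_agent(role, sub) for role, sub in plan]  # 2. delegate to agents
    return "\n".join(results)                               # 3. synthesize a single output

# Toy decomposition for a software engineering task.
plan = lambda t: [("code", f"implement {t}"), ("test", f"write tests for {t}")]
```

A real orchestrator would also track progress and handle failures between steps 2 and 3; this sketch shows only the happy path.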
2. Peer-to-Peer (Pipeline)
Agents are arranged as a pipeline, each processing and passing output to the next:
[Agent A: Research] → [Agent B: Draft] → [Agent C: Review] → [Agent D: Final]
Advantages:
- Simple data flow
- Easy to reason about state
- Natural fit for sequential refinement workflows
Disadvantages:
- No parallelism
- Error propagation (A's mistake becomes B's input)
- Latency adds up
Best for: document processing pipelines, code generation + review, content creation workflows.
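Structurally, a pipeline is just function composition: each stage's output becomes the next stage's input, which is also why errors propagate forward. A sketch with stub stages in place of model calls:

```python
from functools import reduce

# Stub stages; a real system would invoke a model in each.
def research(text: str) -> str: return text + " | researched"
def draft(text: str) -> str:    return text + " | drafted"
def review(text: str) -> str:   return text + " | reviewed"

pipeline = [research, draft, review]

# Thread the input through each stage in order.
result = reduce(lambda acc, stage: stage(acc), pipeline, "topic")
```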
3. Debate / Adversarial
Multiple agents with competing objectives check each other:
[Agent A: Propose solution]
[Agent B: Critique Agent A's solution]
[Agent A: Defend or revise]
[Judge Agent: Evaluate and decide]
Used for: factual verification, risk assessment, legal/financial analysis. Forces the system to surface assumptions and weaknesses.
Production note: This pattern is expensive (2-3x compute per task). Reserve for high-stakes decisions.
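A minimal debate loop under these assumptions might look like the following; the propose/critique/judge stubs stand in for model calls, and the round cap bounds the compute cost noted above.

```python
def debate(propose, critique, judge, max_rounds=2):
    # Agent A proposes; Agent B critiques; A revises; a judge decides.
    solution = propose(None)
    for _ in range(max_rounds):
        objection = critique(solution)
        if objection is None:          # critic has nothing left to object to
            break
        solution = propose(objection)  # revise in light of the critique
    return judge(solution)

# Stub agents standing in for model calls.
propose = lambda crit: "v2" if crit else "v1"
critique = lambda sol: "too vague" if sol == "v1" else None
judge = lambda sol: f"accepted: {sol}"
```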
4. Parallel Execution + Aggregation
Independent agents work simultaneously, results are aggregated:
          ┌── [Agent A: Approach 1] ──┐
[Input] ──┼── [Agent B: Approach 2] ──┼── [Aggregator] → Output
          └── [Agent C: Approach 3] ──┘
Natural fit for: best-of-N generation, ensemble methods, research tasks with multiple dimensions.
Engineering consideration: The aggregator itself is a complexity sink — it needs to handle partial failures, disagreements, and synthesis from heterogeneous outputs.
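A sketch of fan-out with partial-failure handling, using Python threads and a deliberately naive aggregator (the stub agents are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(agents, task):
    # Fan the same input out to independent agents; collect successes
    # and failures separately so one crash doesn't sink the whole task.
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {pool.submit(fn, task): name for name, fn in agents.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results.append((name, fut.result()))
            except Exception as exc:
                errors.append((name, exc))
    return results, errors

def aggregate(results):
    # Deliberately naive: pick the longest answer. Real aggregators must
    # reconcile disagreements across heterogeneous outputs.
    return max((out for _, out in results), key=len)

agents = {
    "upper": lambda t: t.upper(),
    "double": lambda t: t * 2,
    "broken": lambda t: 1 / 0,  # simulates a failing agent
}
results, errors = run_parallel(agents, "go")
```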
5. Hierarchical Decomposition
Recursive orchestration for complex tasks:
[Top Orchestrator]
├── [Sub-Orchestrator 1]
│   ├── [Worker A]
│   └── [Worker B]
└── [Sub-Orchestrator 2]
    ├── [Worker C]
    └── [Worker D]
Scales to arbitrarily complex tasks but adds significant coordination overhead. Works best with strong task decomposition primitives.
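The recursion can be sketched with plain dicts as the task tree — leaves are worker subtasks, inner nodes are sub-orchestrators. Real systems would carry richer task objects, but the shape is the same:

```python
def execute(node):
    # A node is either a leaf subtask (string) handled by a worker,
    # or a dict of children handled by a sub-orchestrator that recurses.
    if isinstance(node, str):
        return f"done: {node}"
    return {name: execute(child) for name, child in node.items()}

plan = {
    "sub1": {"A": "fetch data", "B": "clean data"},
    "sub2": {"C": "train", "D": "evaluate"},
}
```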
Communication Protocols
Message Format Standards
All agents should communicate via a structured format:
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentMessage:
    task_id: str            # unique identifier for the task
    sender: str             # agent ID
    recipient: str          # agent ID or "orchestrator"
    message_type: str       # "request" | "result" | "error" | "update"
    content: dict           # task-specific payload
    metadata: dict = field(default_factory=dict)  # latency, cost, confidence, etc.
    parent_message_id: Optional[str] = None       # for tracing
Standardized formats enable:
- Logging and observability
- Replay and debugging
- Protocol evolution without breaking changes
Shared Memory vs. Message Passing
Shared memory (e.g., a vector store all agents can read/write):
- Easy to implement
- Risk of concurrent writes
- Stale reads
- Good for: reference data, long-term knowledge
Message passing (each agent only sees its own context):
- Explicit data flow
- Better isolation
- Harder to share large artifacts
- Good for: task coordination, status updates
Production systems often combine both: message passing for coordination, shared memory for large artifacts.
State Management
The State Problem
Long-running multi-agent tasks accumulate state that must be:
- Persisted (for recovery from failures)
- Accessible to the right agents
- Consistent (no stale reads leading to duplicate work)
State Hierarchy
Task state (hours-long)
└── Subtask state (minutes)
    └── Agent turn state (seconds)
Use different storage backends for each:
- Task state: database (Postgres, DynamoDB)
- Subtask state: Redis or in-memory with checkpointing
- Agent turn state: in-context (LLM context window)
Checkpoint and Resume
For tasks that may take hours, agents must be able to resume from failure:
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TaskCheckpoint:
    task_id: str
    completed_subtasks: List[SubtaskResult]
    pending_subtasks: List[Subtask]
    context_snapshot: str   # compressed context
    created_at: datetime
On agent restart, load the latest checkpoint and resume. This requires idempotent operations — re-running a subtask shouldn't cause side effects.
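A resume loop under these assumptions might look like this. To keep the sketch self-contained, checkpoints are plain dicts rather than the `TaskCheckpoint` schema; the key point is the skip-if-done check that makes re-running idempotent.

```python
def resume(checkpoint, run_subtask):
    # Skip subtasks that already have results, so re-running after a
    # crash causes no duplicate side effects.
    done_ids = {r["subtask_id"] for r in checkpoint["completed_subtasks"]}
    for sub in checkpoint["pending_subtasks"]:
        if sub["id"] in done_ids:
            continue
        result = run_subtask(sub)
        checkpoint["completed_subtasks"].append(
            {"subtask_id": sub["id"], "result": result}
        )
    checkpoint["pending_subtasks"] = []
    return checkpoint
```

A production version would also persist the checkpoint after each subtask, so a second crash mid-resume loses at most one unit of work.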
Failure Handling
Types of Failures
- Agent failure: Model returns an error or malformed output
- Timeout: Agent takes too long
- Deadlock: Agents waiting on each other circularly
- Semantic failure: Agent returns valid output that's wrong
- Context overflow: Accumulated context exceeds model limits
Retry Strategies
import random
import time

def retry_with_backoff(agent_call, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = agent_call()
            if is_valid(result):
                return result
        except AgentError as e:
            if not is_retryable(e):
                raise
        # Exponential backoff with jitter before the next attempt
        # (applies to both invalid results and retryable errors)
        wait = 2 ** attempt + random.random()
        time.sleep(wait)
    # Fallback: simpler agent or human escalation
    return fallback_handler(agent_call)
Deadlock Detection
In orchestrator-subagent systems, deadlocks occur when:
- Agent A is waiting for Agent B's result
- Agent B is waiting for Agent A's result
Prevention: maintain a dependency graph and detect cycles before dispatching. Most hierarchical systems prevent deadlocks by design (parent always waits on children, never vice versa).
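Cycle detection over that dependency graph is a standard depth-first search for back edges. A sketch, where `deps` maps each agent to the agents whose results it is waiting on:

```python
def has_cycle(deps):
    # Depth-first search: a node revisited while still on the current
    # path (a back edge) means a circular wait.
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:        # back edge: a wait cycle
            return True
        visiting.add(node)
        if any(visit(n) for n in deps.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)
```

Run this check before dispatching; refuse to dispatch any batch whose dependency graph contains a cycle.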
Semantic Failure Detection
The hardest failure mode to catch. Strategies:
- Output schema validation: Reject malformed outputs early
- Confidence scoring: Model estimates its own uncertainty
- Critic agents: Dedicated verification agents review outputs
- Automated testing: For code tasks, run tests and check output
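The first of these strategies can be sketched as a gate on raw agent output. The `answer`/`confidence` fields here are illustrative, not a standard schema:

```python
import json

SCHEMA = {"answer": str, "confidence": float}

def validate_output(raw, schema=SCHEMA):
    # Reject malformed output before it becomes another agent's trusted input.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if all(isinstance(data.get(k), t) for k, t in schema.items()):
        return data
    return None
```

Schema validation catches structural problems only; wrong-but-well-formed answers still require critics, confidence scoring, or tests.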
Observability
What to Log
Every agent interaction should emit:
- Input prompt (or hash for large inputs)
- Output (or hash)
- Latency
- Token count (input + output)
- Cost
- Model version
- Success/failure status
- Task and parent task IDs
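A sketch of such a log record, hashing large payloads so log volume stays bounded while calls remain comparable (field names are illustrative):

```python
import hashlib
import json

def log_agent_call(task_id, parent_task_id, model_version, prompt, output,
                   success, tokens_in, tokens_out, cost_usd, latency_s):
    # Hash prompt and output rather than storing them inline; full
    # payloads can live in a separate artifact store keyed by hash.
    record = {
        "task_id": task_id,
        "parent_task_id": parent_task_id,
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        "tokens": {"input": tokens_in, "output": tokens_out},
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "status": "success" if success else "failure",
    }
    return json.dumps(record)
```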
Traces, Not Just Logs
Distributed tracing (OpenTelemetry) across agent calls gives you:
- End-to-end latency breakdown
- Which agents are bottlenecks
- Where failures cascade
- Full replay of any task
Cost Attribution
Multi-agent systems can be expensive to operate. Track cost per:
- Task type
- Customer/user
- Agent role (orchestrator is typically cheap, worker agents expensive)
- Failure mode (retries cost money)
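If per-call log events carry those dimensions, cost attribution is a simple aggregation. A sketch (field names are illustrative):

```python
from collections import defaultdict

def attribute_costs(events):
    # Sum per-call spend along each attribution axis.
    totals = defaultdict(float)
    for e in events:
        for axis in ("task_type", "customer", "agent_role"):
            totals[(axis, e[axis])] += e["cost_usd"]
        if e.get("is_retry"):                 # retries cost money too
            totals[("failure", "retries")] += e["cost_usd"]
    return dict(totals)
```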
Production Lessons
1. Simpler architectures first
The orchestrator-subagent pattern solves 80% of use cases. Only add complexity when simpler architectures demonstrably fail.
2. Context window management is critical
Each agent's context window is finite. Design your information architecture so agents receive only what they need. Use summarization liberally.
3. Human escalation paths are essential
For high-stakes tasks, always provide a path to escalate to a human when agent confidence is low or retries are exhausted.
4. Test with adversarial inputs
Multi-agent systems can amplify prompt injection attacks — one agent's malformed output becomes another's trusted input. Test your systems for injection vulnerabilities.
5. Async everything
Long-running agent tasks should be async by default. Synchronous multi-agent calls lead to timeouts, connection drops, and poor user experience.
Conclusion
Multi-agent LLM systems are powerful but add substantial engineering complexity. The most successful production deployments start simple — often a single orchestrator with 2-3 specialized subagents — and add complexity only when it's clearly warranted.
Invest heavily in observability, structured communication, and failure handling before scaling up the number of agents or the task complexity.