
How we built a multi-agent system that manages itself

6 AI agents. 1 team. Each with a role, a budget, and a chain of command. Architecture and lessons learned building an autonomous AI team.

March 2026 · 8 min read

A single AI agent can answer questions, classify tickets, or draft emails. But when the task requires coordinating multiple specialised capabilities — research, analysis, writing, code execution, and human oversight — a single agent hits its limits fast. That is where multi-agent systems come in.

This post covers how we architect production multi-agent systems: the orchestration patterns, model selection for each role, a real code example, and why we build custom rather than using off-the-shelf frameworks.


Why Multiple Agents Instead of One?

The core insight is the same one that makes engineering teams effective: specialisation beats generalism at scale.

A single mega-prompt that tries to handle research, analysis, writing, fact-checking, and formatting will:

  • Hit context-window limits quickly as you stuff more instructions and tool definitions in.
  • Produce inconsistent quality because the model is juggling too many objectives.
  • Be impossible to debug — when the output is wrong, you cannot tell which "phase" failed.
  • Cost more, because every task runs through the most capable (and most expensive) model, even when a simpler model would suffice.

A multi-agent system splits the work into specialised roles. Each agent has a focused prompt, a narrow set of tools, and a model chosen to match the complexity of its task. An orchestrator coordinates the workflow, passing outputs from one agent as inputs to the next.
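In code, the pattern reduces to a coordinator passing each agent's output to the next. A minimal sketch with stand-in functions (in production, each would be a separate LLM call with its own prompt, tools, and model):

```python
# Minimal orchestrator-worker sketch. Each "agent" here is a
# placeholder function; in a real system each is an LLM call with
# a focused prompt and a narrow tool set.

def researcher(brief: str) -> str:
    return f"research notes for: {brief}"

def writer(notes: str) -> str:
    return f"draft based on [{notes}]"

def fact_checker(draft: str) -> str:
    return draft + " (verified)"

def orchestrate(brief: str) -> str:
    """Coordinates the workflow, passing each output to the next agent."""
    notes = researcher(brief)
    draft = writer(notes)
    return fact_checker(draft)
```

The orchestrator only sees summaries and routing decisions; the raw content flows between workers.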


Architecture: Orchestrator and Workers

Our standard multi-agent architecture follows an orchestrator-worker pattern:

The Orchestrator

The orchestrator is the "project manager" of the system. It receives the initial request, breaks it into subtasks, assigns each subtask to the appropriate worker agent, collects results, handles errors and retries, and assembles the final output.

Model choice: Claude Opus 4.6. The orchestrator needs strong reasoning, planning, and error-handling capabilities. It makes judgment calls — should a failed subtask be retried, reassigned, or escalated? Does the output from Worker A need to be reformatted before Worker B can use it? These decisions require the most capable model available. The cost is justified because the orchestrator processes far fewer tokens than the workers (it deals in summaries and routing decisions, not raw content).

Worker Agents

Each worker agent handles one type of task. Examples from a real content-generation pipeline:

  • Researcher (Claude Sonnet 4.6): searches knowledge bases, extracts relevant data, summarises sources. Needs good comprehension and tool use; does not need Opus-level reasoning.
  • Writer (Claude Sonnet 4.6): produces draft content from research summaries and outlines. Strong writing quality at lower cost than Opus.
  • Fact-Checker (Claude Sonnet 4.6): verifies claims against source documents, flags unsupported statements. Needs careful reading comprehension; Sonnet handles this well.
  • Formatter (Claude Haiku 4.5): applies brand guidelines, fixes markdown, generates metadata. Simple, rule-based transformations; Haiku is 60x cheaper than Opus.
  • Classifier (Claude Haiku 4.5): routes incoming requests, tags content, triages priorities. Fast, cheap, and accurate for classification tasks.
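In code, this mapping can be a simple routing table. (The model strings below are the display names from the list above, used for illustration; the actual API model identifiers differ.)

```python
# Role-to-model routing table. Model strings are display names for
# illustration, not real API identifiers.
MODEL_FOR_ROLE = {
    "orchestrator": "Claude Opus 4.6",
    "researcher": "Claude Sonnet 4.6",
    "writer": "Claude Sonnet 4.6",
    "fact_checker": "Claude Sonnet 4.6",
    "formatter": "Claude Haiku 4.5",
    "classifier": "Claude Haiku 4.5",
}

def model_for(role: str) -> str:
    """Routes a role to its model, defaulting to the mid-tier model."""
    return MODEL_FOR_ROLE.get(role, "Claude Sonnet 4.6")
```

Keeping the routing in one table makes it trivial to audit which roles run on which model, and to adjust when pricing or capabilities change.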

Why Model Selection Matters

Using the right model for each role is not just about saving money (though that matters). It is about reliability. Haiku responds in 200-400ms; Opus can take 5-15 seconds for complex reasoning. If your classifier agent uses Opus, every incoming request waits 10 seconds for a task that Haiku could handle in 300 milliseconds. In a system processing thousands of requests per day, that latency compounds into a terrible user experience and wasted compute.

Cost comparison for 100,000 tasks per month:

  • All Opus: approximately $1,200 in API costs.
  • Mixed models (Opus orchestrator, Sonnet workers, Haiku for simple tasks): approximately $180 in API costs.
  • That is an 85% cost reduction with no loss in output quality, because each model is matched to the complexity it actually needs to handle.
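The 85% figure follows directly from the two monthly totals:

```python
all_opus = 1200  # monthly API cost, every task on Opus (USD)
mixed = 180      # monthly API cost with per-role model routing (USD)

reduction = (all_opus - mixed) / all_opus
print(f"{reduction:.0%}")
```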

The Heartbeat Loop: Keeping Agents Alive

In production, agents can stall — an API call times out, a worker returns malformed output, or a subtask takes longer than expected. A heartbeat loop ensures the orchestrator stays aware of each worker's status and can intervene when something goes wrong.

Here is a simplified version of the pattern we use:

import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class WorkerStatus:
    worker_id: str
    last_heartbeat: float = field(default_factory=time.time)
    status: str = "idle"  # idle, working, stalled, failed
    current_task: str = ""
    retries: int = 0

class Orchestrator:
    def __init__(self, max_retries=3, heartbeat_timeout=30):
        self.workers: dict[str, WorkerStatus] = {}
        self.max_retries = max_retries
        self.heartbeat_timeout = heartbeat_timeout

    async def heartbeat_loop(self):
        """Continuously monitors worker health."""
        while True:
            now = time.time()
            for worker_id, status in self.workers.items():
                if status.status == "working":
                    elapsed = now - status.last_heartbeat
                    if elapsed > self.heartbeat_timeout:
                        if status.retries < self.max_retries:
                            status.retries += 1
                            status.status = "working"
                            status.last_heartbeat = now
                            await self.retry_task(
                                worker_id, status.current_task
                            )
                        else:
                            status.status = "failed"
                            await self.escalate(
                                worker_id, status.current_task
                            )
            await asyncio.sleep(5)

    async def retry_task(self, worker_id, task):
        """Re-dispatches a stalled task to the same or different worker."""
        print(f"Retrying {task} on {worker_id} "
              f"(attempt {self.workers[worker_id].retries})")
        # Re-dispatch logic here

    async def escalate(self, worker_id, task):
        """Escalates a persistently failing task for human review."""
        print(f"Escalating {task} from {worker_id} - "
              f"max retries exceeded")
        # Send alert to Slack, email, or monitoring dashboard

    def record_heartbeat(self, worker_id):
        """Called by workers to signal they are still making progress."""
        self.workers[worker_id].last_heartbeat = time.time()

The key elements:

  • Timeout detection: If a worker has not sent a heartbeat in 30 seconds, assume it stalled.
  • Automatic retries: Retry up to 3 times before giving up. Each retry resets the heartbeat timer.
  • Escalation: After max retries, alert a human rather than silently failing. In production, this sends a Slack message with the task details and failure context.
  • Non-blocking: The heartbeat loop runs asynchronously alongside the main workflow, so monitoring does not slow down task execution.
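The retry-or-escalate decision inside the loop reduces to a pure function, which makes it easy to unit-test in isolation. A minimal sketch (the function name and defaults are ours, mirroring the values above):

```python
def heartbeat_action(elapsed: float, retries: int,
                     timeout: float = 30, max_retries: int = 3) -> str:
    """Decides what the orchestrator should do for one worker check."""
    if elapsed <= timeout:
        return "ok"        # heartbeat is fresh; leave the worker alone
    if retries < max_retries:
        return "retry"     # stalled, but retry budget remains
    return "escalate"      # stalled and out of retries: alert a human
```

Separating the decision from the loop means the timeout policy can be tested without spinning up async workers.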

Why We Build Custom (Not CrewAI, AutoGen, or LangGraph)

There are popular open-source frameworks for multi-agent systems. We evaluated all of them extensively and chose to build custom orchestration for production deployments. Here is why:

CrewAI

Strengths: Clean API, good for prototyping, nice abstraction of "crews" and "tasks."

Limitations: Opinionated about agent-to-agent communication patterns. Limited control over retry logic, error handling, and model routing. Adds a dependency layer that obscures what prompts are actually being sent. Production monitoring requires bolting on external tools. When something breaks at 3 AM, you need to debug through CrewAI's abstractions before you can see the actual API calls.

AutoGen (Microsoft)

Strengths: Strong conversational patterns, good research backing.

Limitations: Heavyweight setup, designed primarily for research scenarios. The "conversation between agents" paradigm works beautifully in demos but becomes hard to control in production, where you need deterministic workflows with strict error budgets and latency requirements.

LangGraph

Strengths: Graph-based workflow definition, good state management, integrates with the LangChain ecosystem.

Limitations: Tightly coupled to LangChain, which adds significant complexity and abstraction layers. Version churn is high — breaking changes between releases are common. The graph abstraction is powerful but over-engineered for many real-world workflows that are fundamentally sequential with conditional branching.

Why Custom Wins in Production

Our custom orchestration layer is roughly 800 lines of Python. It handles:

  • Dynamic model routing per agent based on task complexity.
  • Structured retry logic with exponential backoff and circuit breakers.
  • Token budget tracking and cost allocation per agent.
  • Heartbeat monitoring with Slack escalation.
  • Structured logging that captures every prompt, response, and decision for debugging and evaluation.
  • Graceful degradation — if Opus is unavailable, the orchestrator can fall back to Sonnet with adjusted prompts.
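The structured retry logic in that list follows a standard exponential-backoff shape. A minimal sketch with illustrative base and cap values (not our production code):

```python
def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 5):
    """Returns the wait time before each retry: base * 2^n, capped.

    The cap keeps late retries from waiting arbitrarily long; a circuit
    breaker would additionally stop dispatching once failures cluster.
    """
    return [min(base * 2 ** n, cap) for n in range(attempts)]
```

Production versions usually add random jitter to each delay so that many failing tasks do not retry in lockstep.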

The frameworks add 10,000+ lines of dependency code to achieve similar functionality, but with less control over the details that matter most in production: error handling, cost management, and observability. When a client's workflow breaks at scale, we need to see exactly what happened at the API level, not navigate through framework abstractions.

That said, if you are prototyping or building an internal tool where reliability requirements are lower, CrewAI or LangGraph can save you time getting started. The custom approach pays off when you need production-grade reliability and cost control.

The Anthropic Agent SDK: a new middle ground

Since we originally built our custom orchestration, Anthropic has released the Agent SDK — a Python framework specifically for building Claude-powered agents with built-in tool management, guardrails, and multi-agent orchestration. It occupies a useful middle ground between the heavyweight frameworks above and a fully custom build. The Agent SDK is opinionated where it matters (safety, tool execution, handoffs between agents) but minimal where it does not, giving you more control than CrewAI while requiring less boilerplate than building from scratch. For new projects, we now evaluate it alongside our custom orchestration.

Another development worth noting: the Model Context Protocol (MCP) has become the industry standard for connecting agents to external tools and data sources. With backing from the Linux Foundation and support from Anthropic, OpenAI, Google, Microsoft, and AWS, MCP provides a universal interface that any agent — whether in a multi-agent system or standalone — can use to discover and invoke tools. This is particularly valuable in multi-agent architectures where different worker agents need access to different tool sets.



Results in Production

Here are real metrics from multi-agent systems we have deployed:

Content Production Pipeline (Media Client)

  • Architecture: Orchestrator + 4 workers (researcher, writer, fact-checker, formatter).
  • Volume: 120 articles per week, each 1,500-3,000 words.
  • Quality: 94% of articles published with zero human edits. The remaining 6% needed minor adjustments, not rewrites.
  • Cost per article: $0.42 in API costs (blended across all agents and models).
  • Time per article: 4.2 minutes average from brief to final draft, compared to 3-4 hours with a human writer.
  • Human role: Editorial review and strategic direction. The team shifted from "producing content" to "directing content strategy."

Support Triage System (SaaS Client)

  • Architecture: Orchestrator + 3 workers (classifier, resolver, escalation-drafter).
  • Volume: 4,200 tickets per month.
  • Auto-resolution rate: 67% (up from 12% with their previous rule-based system).
  • Average resolution time: 22 seconds for auto-resolved tickets.
  • Cost: $320 per month in API costs for the entire system.
  • False positive rate: 1.2% (tickets incorrectly auto-resolved that required human follow-up).

Data Analysis Pipeline (Fintech Client)

  • Architecture: Orchestrator + 3 workers (query-writer, analyst, report-formatter).
  • Volume: 45 reports per week across 6 departments.
  • Accuracy: 99.1% on numerical outputs (validated against manual calculations).
  • Time savings: 28 analyst-hours per week freed up for strategic work.
  • Cost: $89 per month in API costs.

When to Use a Multi-Agent System

Not every AI task needs multiple agents. Use a multi-agent architecture when:

  • The workflow has distinct phases that benefit from different prompts, tools, or models.
  • Quality requires checks and balances — a writer agent and a fact-checker agent produce better results than a single agent trying to do both.
  • Scale demands cost optimisation — routing simple tasks to cheap models while reserving expensive models for complex reasoning.
  • Reliability is critical — retry logic, heartbeat monitoring, and graceful degradation are easier to implement when each agent is an isolated unit.

For simpler use cases — a single classification task, a straightforward Q&A bot, a one-step data transformation — a single agent is simpler, cheaper, and just as effective.

If you are new to autonomous AI agents, it is worth understanding how a single agent works before scaling up to a team of them. And if you are ready to explore what a multi-agent system could do for your specific workflow, our AI agents service covers the full design-to-deployment process.

Toni Soriano
Principal AI Engineer at Cloudstudio. 18+ years building production systems. Creator of Ollama Laravel (87K+ downloads).
