Anthropic reports 90.2% gains from multi-agent. Cognition publishes 'Don't Build Multi-Agents'. A side-by-side reading of the two camps' primary sources, plus a decision framework for single vs multi.
Within a single year, “AI agents are the answer” was taken in two diametrically opposite directions, a rare event. In 2025, Anthropic published that its multi-agent system outperformed a single-agent baseline by 90.2% on an internal evaluation (Anthropic, 2025-06). The same year, Cognition Labs published a post titled “Don’t Build Multi-Agents” (Cognition, 2025).
The two articles cite the same primary sources — Anthropic’s own operating cases, OpenAI’s guides, ReAct-family papers — and arrive at opposite recommendations. Neither is wrong. The problems they’re solving are different. This article places both camps’ primary sources side by side and distills a decision framework for single vs multi-agent.
“Agent” Refers to Too Many Things
Before comparing, the definitions need narrowing. From 2024 to 2025, search traffic for “agentic AI” increased over 600% (US News, 2025). The word now refers to anything from a looped LLM call to a fully autonomous system. As long as the same word names such different systems, the “single or multi?” debate breaks down before it starts.
Anthropic’s “Building Effective Agents” (2024-12) guide resolves this ambiguity as follows.
- Workflow: “LLMs and tools are orchestrated through predefined code paths.” — LLMs and tools execute sequentially within predefined code paths
- Agent: “systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” — the LLM dynamically decides its own procedure and tool use
- Multi-Agent System (MAS): multiple LLMs (or multiple instances of the same LLM) collaborate via distributed decision-making and delegation
The difference between the three systems is not “how many agents” but who decides the next action. In a Workflow, code decides. In a Single Agent, one LLM decides. In a MAS, multiple LLMs decide via delegation and coordination.
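The distinction is easiest to see as code. Below is a minimal sketch under stated assumptions: `call_llm` is a hypothetical helper, the `TOOL`/`DONE` protocol is invented for illustration, and neither function is any vendor’s API. A MAS would be the `single_agent` loop spawning further copies of itself as subagents.

```python
# Minimal sketch of "who decides the next action". call_llm is a
# hypothetical stand-in for any provider API, not a real library call.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns the model's text response."""
    raise NotImplementedError  # wire up a real provider here

# Workflow: the CODE decides the steps; the LLM only fills in content.
def workflow(document: str) -> str:
    summary = call_llm(f"Summarize:\n{document}")
    return call_llm(f"Translate to French:\n{summary}")  # fixed two-step path

# Single agent: the LLM decides the next tool and when to stop.
def single_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = call_llm(
            "\n".join(history)
            + f"\nTools: {list(tools)}. Reply 'TOOL <name> <input>' or 'DONE <answer>'."
        )
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()
        _, name, arg = decision.split(" ", 2)  # parse 'TOOL <name> <input>'
        history.append(f"Observation: {tools[name](arg)}")
    return "step budget exhausted"
```

The workflow’s shape is fixed at write time; the agent’s shape emerges at run time, which is exactly what makes it harder to debug.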
| Dimension | Workflow (Chain) | Single Agent | Multi-Agent System |
|---|---|---|---|
| Decides next action | Predefined code | A single LLM | Multiple LLMs (delegation and coordination) |
| Number of steps | Fixed | Variable (open-ended) | Variable + delegation depth |
| Debugging difficulty | Low | Medium | High (non-deterministic) |
| Token cost (vs chat, Anthropic measure) | 1-2x | ~4x | ~15x |
| Examples | Prompt chaining, Routing | ReAct loop, Claude Code | Anthropic Research System, AutoGen |
| Best for | Well-defined procedures | Open-ended single task | Parallel exploration, domain separation |
Only after agreeing on these definitions can “why does Anthropic recommend multi” and “why does Cognition say don’t build multi” be compared on the same plane.
Anthropic’s 90.2% — The Case for Multi-Agent
In June 2025, Anthropic published the backend architecture behind its Research feature (“How we built our multi-agent research system”, Anthropic Engineering, 2025-06). The article’s single most-cited number sits in one line:
“A multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” (Anthropic)
The same article specifies the cost as follows.
- Agents (single) = ~4x the tokens of a chat baseline
- Multi-agent = ~15x the tokens of a chat baseline
- “Token usage alone explains 80% of performance variance.” — token usage alone explains 80% of the variance in performance
The 90.2% number alone makes “go multi” feel like the obvious conclusion, but the same article immediately pins down the cost trade-off. The gain is meaningful only on tasks that can absorb a 15x token budget.
Eight Prompt Engineering Principles
The eight principles Anthropic distilled from operating the system can be read as a checklist for multi-agent design (Anthropic, 2025-06).
- Mental modeling — the lead agent forms a mental model of the user’s intent before delegating
- Delegation teaching — explicitly teaches subagents how to delegate (role, scope, stop conditions)
- Effort scaling — match the number of calls to task difficulty. Simple = 1 agent / 3-10 tool calls, complex = 10+ subagents
- Tool design criticality — the tool interface determines result quality. A poorly designed tool defeats any prompt tuning
- Self-improvement — agents evaluate and retry their own outputs
- Search strategy — broad → narrow. Explore widely first, then converge
- Extended thinking — expand reasoning budget for complex tasks
- Parallelization — 3-5 subagents in parallel + 3+ tool calls in parallel → research time cut by up to 90%
Among these, the most operationally weighted are #3 (Effort scaling) and #8 (Parallelization). The most common antipattern, per Anthropic’s own report, is throwing multi at simple queries.
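Principle #8 is largely an engineering property: subagent calls are independent I/O, so they can be fanned out concurrently instead of awaited one by one. A minimal asyncio sketch, where `run_subagent` is a hypothetical stand-in for one subagent’s entire LLM loop:

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Hypothetical subagent: one LLM instance working one subtask."""
    await asyncio.sleep(1)  # stands in for a multi-second LLM + tool loop
    return f"findings for {subtask!r}"

async def research(plan: list[str]) -> list[str]:
    # Fan out all subagents at once; asyncio.gather awaits them together,
    # so wall-clock time tracks the slowest subagent, not the sum of all.
    return await asyncio.gather(*(run_subagent(t) for t in plan))

results = asyncio.run(research([
    "prior work on agent orchestration",
    "token cost benchmarks",
    "production failure reports",
]))
```

Latency approaches the slowest branch rather than the sum, which is where the up-to-90% reduction in research time comes from; token cost, of course, still sums across branches.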
Self-Reported Failure Modes
The same article enumerates four failure modes Anthropic observed in production (Anthropic).
- Spawning 50+ subagents for simple queries
- Endless web searches for nonexistent information
- Vague task descriptions causing subagents to duplicate work
- Selecting SEO-abusive sites (content farms) over authoritative sources
All four occur when the lead agent’s delegation design is weak. A mistake that a single-agent setup would avoid entirely, or absorb at 1x cost, compounds in multi: 15x the cost, multiplied by how often it recurs.
Operating Trade-offs
The trade-offs Anthropic itself acknowledges:
| Trade-off | Description |
|---|---|
| Cost explosion | 15x tokens vs chat. Per-token pricing transfers directly to operating cost |
| Sync bottleneck | The lead agent waits synchronously on every subagent’s result |
| Rainbow deployment | Minor changes cascade into large behavioral shifts → gradual rollout required |
| Non-deterministic debugging | Same input, different outputs. “Full production tracing” and “high-level observability of decision patterns” become required |
| Cascade failure | “Minor failures cascade into large behavioral changes; requires durable execution and error recovery without expensive restarts.” |
Anthropic’s own conclusion is not “multi is the answer” but “multi is the answer for research-style open-ended tasks; otherwise, start from the simplest solution.” The same company’s “Building Effective Agents” guide states this more strongly.
“Find the simplest solution possible, and only increase complexity when needed… Many applications benefit most from optimizing single LLM calls with retrieval and in-context examples.” (Anthropic)
The 90.2% number and “start simple” appear in the same company’s two articles simultaneously. Not a contradiction — a signal that the answer depends on task profile.
Cognition’s Counter — Don’t Build Multi-Agents
The same year, Cognition Labs (the company behind Devin) published the opposing position (“Don’t Build Multi-Agents”, Cognition, 2025). The title is the recommendation, and the rationale compresses into two principles.
- “Share context, and share full agent traces, not just individual messages.”
- “Actions carry implicit decisions, and conflicting decisions carry bad results.” (Cognition)
The first principle says full traces — not message snippets — must be shared. The second says every action carries implicit assumptions, and when assumptions conflict, results break. Violate these two principles and multi-agent breaks. That is Cognition’s position.
The Flappy Bird Example
The shortest, clearest example Cognition gives is building a Flappy Bird clone.
Suppose a user asks “build me a Flappy Bird clone.” The lead agent splits the work between two subagents:
- Subagent 1: generate the background → draws Mario-style pipe scenery
- Subagent 2: generate the bird character → draws a pixel-style bird that does not match Mario
Each subagent sees only its own work and makes its own style assumption. When the lead agent assembles the two outputs, the assumptions conflict and consistency breaks. The integration step inherits both miscommunications at once.
The lesson is simple. In tasks where each subagent’s “implicit decision” is likely to conflict with another subagent’s decision, the multi structure itself creates the problem. The effect is especially pronounced for outputs requiring single-source consistency: codebases, games, design systems.
Cognition’s Recommendation
Following the principles and the example, Cognition’s alternative is the single-threaded linear agent.
- Prefer a single agent where context flows continuously in one thread
- When context grows too long, a dedicated LLM compresses history into “key details, events, and decisions” (see the sketch after this list)
- Cognition acknowledges: “this approach is hard to get right” — the compression LLM’s accuracy is itself a challenge
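What that looks like in practice, as a minimal sketch: one continuous thread of execution, with a second LLM pass triggered when history exceeds a budget. The threshold, the prompt, and `call_llm` are assumptions for illustration, not Cognition’s implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; stands in for any provider API."""
    raise NotImplementedError

MAX_CONTEXT_CHARS = 40_000  # assumed budget; real systems count tokens

def compress(history: list[str]) -> list[str]:
    """Dedicated compression pass: summarize older history, keeping the
    'key details, events, and decisions' (Cognition's phrasing)."""
    summary = call_llm(
        "Compress this agent history into key details, events, and decisions:\n"
        + "\n".join(history[:-5])
    )
    return [f"[compressed history] {summary}", *history[-5:]]  # recent turns stay verbatim

def single_threaded_agent(task: str, steps: int = 50) -> list[str]:
    history = [f"Task: {task}"]
    for _ in range(steps):
        if sum(map(len, history)) > MAX_CONTEXT_CHARS:
            history = compress(history)      # lossy: this is the hard part
        action = call_llm("\n".join(history) + "\nNext action?")
        history.append(f"Action: {action}")  # one thread, continuous context
    return history
```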
Cognition’s conclusion is not “multi is never acceptable.” It is closer to “the trade-offs of multi are hard to absorb at the current state of the art, and getting single right is already hard enough.”
What’s Different Between the Two Camps
The reason two articles citing the same primary sources — Anthropic Building Effective Agents, OpenAI Practical Guide, ReAct-family papers — arrive at opposite conclusions is that the task profiles they’re solving are different.
| Dimension | Anthropic Research System | Cognition Devin |
|---|---|---|
| Representative task | Open-ended web research, broad exploration | Long-running coding, codebase consistency changes |
| Sub-task assumption conflict | Low (each subagent explores a different source) | High (each subagent edits a different part of the same codebase) |
| Value of parallel exploration | Very high (research time -90%) | Low (parallel changes generate conflicts) |
| Result integration | Lead agent synthesizes sources | Codebase consistency check |
| Recommendation | Multi-agent | Single-threaded |
In short, both recommendations are correct for their own task profiles. Read together, the two articles suggest that the question itself, “is multi-agent good?”, is the wrong one.
The right question is “in the task profile I’m solving, does multi create value, or does it create conflict?”
The Decision — When Single, When Multi
Combining both camps’ primary sources, the decision criteria distill into four dimensions, charted in the flowchart below and explained in the list that follows; a code sketch after the list puts the four together.
```mermaid
---
config:
  look: handDrawn
  theme: neutral
---
flowchart TD
    A[Task definition] --> B{Procedure predefined?}
    B -->|Yes| C[Workflow / Chain]
    B -->|No| D{Sub-task assumption conflict likely?}
    D -->|High| E[Single Agent]
    D -->|Low| F{Parallel exploration creates value?}
    F -->|No| E
    F -->|Yes| H["Multi-Agent System<br/>+ model mapping strategy"]
```
Four Decision Dimensions
- Dimension 1 — Procedure predefined: If steps can be defined in code, Workflow is the simplest and safest choice, and debugging stays tractable. The moment dynamic decision-making is required, you cross into Agent territory.
- Dimension 2 — Sub-task assumption conflict: Tasks where one subagent’s decision can conflict with another’s (coding, design, single artifact) are safer with single. Tasks with low conflict potential (parallel web research, domain separation) extract value from multi.
- Dimension 3 — Value of parallel exploration: Is this a task where parallelism cuts time by 90%, or one where sequential is natural? Anthropic’s research time -90% is an upper bound for tasks where parallel exploration genuinely matters.
- Dimension 4 — Model mapping strategy: Can you map different models to lead and subagent (e.g., lead=Opus, subagents=Sonnet)? If yes, the raw token explosion of multi does not translate directly into cost explosion. If no — every call must be the SOTA model — single is safer on cost.
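Composed, the four dimensions reduce to a short decision function. This is a toy encoding of the flowchart above, not a substitute for judgment; the boolean inputs are exactly the calls you still have to make yourself.

```python
def choose_architecture(
    procedure_predefined: bool,
    assumption_conflict_high: bool,
    parallel_exploration_valuable: bool,
    model_mapping_possible: bool,
) -> str:
    """Toy encoding of the single-vs-multi decision tree above."""
    if procedure_predefined:
        return "workflow"          # Dimension 1: code decides the steps
    if assumption_conflict_high:
        return "single agent"      # Dimension 2: Cognition's case
    if not parallel_exploration_valuable:
        return "single agent"      # Dimension 3: nothing to parallelize
    if not model_mapping_possible:
        return "single agent"      # Dimension 4: multi at SOTA-only prices rarely pays
    return "multi-agent + model mapping"

# Three of the five cases from the table further down:
print(choose_architecture(False, True, False, True))    # Devin-style coding -> single agent
print(choose_architecture(False, False, True, True))    # open-ended research -> multi-agent
print(choose_architecture(True, False, False, False))   # defined SOP -> workflow
```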
Cost — Model Mapping, Not Raw Tokens
The “15x tokens vs chat” Anthropic reports holds when every call is the same model. Real production differs. In the same announcement, Anthropic specifies the model mapping for its Multi-Agent Research System:
- Lead agent = Claude Opus 4
- Subagents = Claude Sonnet 4
The lead handles decision-making and synthesis; subagents handle high-volume implementation work. Sonnet is a fraction of Opus’s price (input pricing is roughly one-fifth as of May 2026), so handling the same task with a single agent + Opus throughout can actually cost more. AI Gateway patterns (Haiku → Sonnet → Opus cascade) applied in production have reported LLM cost reductions of 40-70%.
The reverse hidden cost is more dangerous. The common default of running a single agent on one SOTA model — say, Claude Opus 4.7 for every subtask — uses the most expensive model even on simple subtasks. Cost comparison must move from raw tokens to (model price) × (call count).
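A back-of-the-envelope version of that comparison. Every number below is an assumption chosen to keep the arithmetic legible (the 5:1 price ratio mirrors the rough one-fifth note above), not a measured value:

```python
# Illustrative prices in USD per million input tokens (assumed; 5:1 ratio).
OPUS_PER_MTOK = 15.0
SONNET_PER_MTOK = 3.0

# Multi-agent run: ~15 MTok total (Anthropic's multiplier), assumed split of
# 1 MTok for the Opus lead and 14 MTok across the Sonnet subagents.
multi = 1 * OPUS_PER_MTOK + 14 * SONNET_PER_MTOK   # $15 + $42 = $57

# Single-agent run: ~4 MTok (Anthropic's multiplier), all on Opus.
single = 4 * OPUS_PER_MTOK                          # $60

print(f"multi + mapping: ${multi:.0f} vs single + Opus: ${single:.0f}")
# Tokens say 15x vs 4x; dollars say $57 vs $60. Under these assumptions,
# model mapping absorbs nearly the entire raw token explosion.
```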
So in the decision, “can we absorb 15x tokens?” is the wrong question. Two questions replace it:
- Can different models be mapped to lead and subagent for this task?
- When using a single agent, does the SOTA model really need to handle every subtask?
If mapping is possible and simple subtasks dominate, multi-agent + model mapping can be cheaper than single + SOTA. Cost is not an absolute — it depends on mapping feasibility.
Five Cases Applied
| Case | Recommendation | Rationale |
|---|---|---|
| Consistent codebase changes (Devin) | Single | Cognition case. High sub-task assumption conflict |
| Web research, broad exploration | Multi | Anthropic case. High value of parallel exploration |
| Accounting voucher processing (defined SOP) | Workflow | Procedure predefined. Agent unnecessary |
| Single-domain RAG chatbot | Single | Single task, no domain separation needed |
| CRM + payment + analytics integrated workflow | Multi | Domain and permission separation. Each domain operates on different assumptions |
Effort Scaling — Anthropic’s Operating Rule
After the decision tree, the next rule to apply is Anthropic’s effort scaling. Even within the same multi-agent structure, the number of calls must match task difficulty.
- Simple task: 1 agent / 3-10 tool calls
- Medium task: lead agent + 2-3 subagents
- Complex task: lead agent + 10+ subagents
Ignoring this rule and deciding “since it is multi, every task gets 10+ subagents” reproduces Anthropic’s first self-reported failure mode: spawning 50+ subagents for simple queries.
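One way to keep the rule from being ignored is to take the budget out of the lead agent’s hands entirely. A minimal sketch: the subagent counts follow Anthropic’s published tiers, while `classify_difficulty` and the tool-call caps for medium and complex are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EffortBudget:
    subagents: int
    max_tool_calls: int

# Budgets per difficulty tier (subagent counts per Anthropic's rule of thumb;
# the 30/100 tool-call caps are illustrative assumptions).
BUDGETS = {
    "simple":  EffortBudget(subagents=0,  max_tool_calls=10),
    "medium":  EffortBudget(subagents=3,  max_tool_calls=30),
    "complex": EffortBudget(subagents=10, max_tool_calls=100),
}

def classify_difficulty(task: str) -> str:
    """Hypothetical classifier (e.g. one cheap LLM call) returning a tier."""
    raise NotImplementedError

def plan_effort(task: str) -> EffortBudget:
    # Unknown tiers fall back to the cheapest budget. Enforcing the cap at
    # spawn time, outside the lead agent's control, is the guard against
    # "50+ subagents for a simple query".
    return BUDGETS.get(classify_difficulty(task), BUDGETS["simple"])
```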
What Both Camps Agree On
The recommendations point in opposite directions, but both camps agree on one thing: start simple.
- Anthropic Building Effective Agents: “find the simplest solution possible, and only increase complexity when needed.”
- OpenAI A Practical Guide to Building Agents: “maximize a single agent’s capabilities first… use orchestration patterns that match your complexity level, starting with a single agent.”
- Cognition Don’t Build Multi-Agents: “Single-threaded linear agents where the context is continuous.”
Three articles draw different conclusions, but the starting point is the same. Don’t use multi for tasks that can be solved with single. Multi becomes warranted only for tasks where (a) the value of parallel exploration is clear, (b) sub-task assumption conflict is low, and (c) the cost — accounting for model mapping — is acceptable.
When all three conditions hold, multi-agent delivers the 90.2%. When any one is missing, the same multi structure produces conflict and cost overruns.
Next Areas
This article limited itself to the single vs multi decision. Two larger areas adjacent to the decision are left as separate topics.
- Multi-Agent Workflow patterns and examples — Supervisor / Sequential / Hierarchical / Swarm / Map-Reduce / Group Chat and other topology and coordination combinations. While this article’s comparison focused on dynamic-delegation cases (Anthropic Research), this area covers the broader hybrid structures where multiple agents are placed within a predefined flow.
- How to evaluate multi-agent systems — Evaluation methodology for non-deterministic systems. The limits of LLM-as-judge, human eval, and benchmarks (UC Berkeley CRDI in April 2026 reported that SWE-bench, GAIA, and AgentBench are vulnerable to reward hacking).
These two areas are each their own large article and are meaningful only for teams that have passed the decision stage covered here. This article’s scope is limited to that decision itself.
Sources
- How we built our multi-agent research system (Anthropic Engineering, 2025-06) — 90.2% performance gain, 15x tokens, 8 principles, self-reported failure modes
- Building effective agents (Anthropic Engineering, 2024-12) — Workflow vs Agent definitions, “start simple” recommendation, 6 patterns
- Don’t Build Multi-Agents (Cognition Labs, 2025) — two principles (share context, conflicting decisions), Flappy Bird example, single-threaded recommendation
- A Practical Guide to Building Agents (OpenAI, 2025) — single agent first, when to split
- What does ‘agentic’ AI mean? (US News, 2025-11) — agentic search traffic up 600%
- How We Broke Top AI Agent Benchmarks (UC Berkeley CRDI, 2026-04-12) — benchmark reward hacking warning