Anthropic reports 90.2% gains from multi-agent. Cognition publishes 'Don't Build Multi-Agents'. A side-by-side reading of the two camps' primary sources, plus a decision framework for single vs multi.
Within a single year, “AI agents are the answer” was taken in two diametrically opposite directions, a rare event. In 2025, Anthropic published that its multi-agent system outperformed a single-agent baseline by 90.2% on an internal evaluation (Anthropic, 2025-06). The same year, Cognition Labs published a post titled “Don’t Build Multi-Agents” (Cognition, 2025).
The two articles cite the same primary sources — Anthropic’s own operating cases, OpenAI’s guides, ReAct-family papers — and arrive at opposite recommendations. Neither is wrong. The problems they’re solving are different. This article places both camps’ primary sources side by side and distills a decision framework for single vs multi-agent.
“Agent” Refers to Too Many Things
Before comparing, the definitions need narrowing. From 2024 to 2025, search traffic for “agentic AI” increased over 600% (US News, 2025). The word now refers to anything from a looped LLM call to a fully autonomous system. As long as the same word names such different systems, the “single or multi?” debate breaks down before it starts.
Anthropic’s “Building Effective Agents” (2024-12) guide resolves this ambiguity as follows.
- Workflow: “LLMs and tools are orchestrated through predefined code paths.” — LLMs and tools execute sequentially within predefined code paths
- Agent: “systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” — the LLM dynamically decides its own procedure and tool use
- Multi-Agent System (MAS): multiple LLMs (or multiple instances of the same LLM) collaborate via distributed decision-making and delegation
The difference between the three systems is not “how many agents” but who decides the next action. In a Workflow, code decides. In a Single Agent, one LLM decides. In a MAS, multiple LLMs decide via delegation and coordination.
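The distinction is easiest to see as code. Below is a minimal sketch under stated assumptions: `call_llm` is a hypothetical helper, the `TOOL`/`DONE` protocol is invented for illustration, and neither function is any vendor’s API. A MAS would be the `single_agent` loop spawning further copies of itself as subagents.

```python
# Minimal sketch of "who decides the next action". call_llm is a
# hypothetical stand-in for any provider API, not a real library call.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns the model's text response."""
    raise NotImplementedError  # wire up a real provider here

# Workflow: the CODE decides the steps; the LLM only fills in content.
def workflow(document: str) -> str:
    summary = call_llm(f"Summarize:\n{document}")
    return call_llm(f"Translate to French:\n{summary}")  # fixed two-step path

# Single agent: the LLM decides the next tool and when to stop.
def single_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = call_llm(
            "\n".join(history)
            + f"\nTools: {list(tools)}. Reply 'TOOL <name> <input>' or 'DONE <answer>'."
        )
        if decision.startswith("DONE"):
            return decision.removeprefix("DONE").strip()
        _, name, arg = decision.split(" ", 2)  # parse 'TOOL <name> <input>'
        history.append(f"Observation: {tools[name](arg)}")
    return "step budget exhausted"
```

The workflow’s shape is fixed at write time; the agent’s shape emerges at run time, which is exactly what makes it harder to debug.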
| Dimension | Workflow (Chain) | Single Agent | Multi-Agent System |
|---|---|---|---|
| Decides next action | Predefined code | A single LLM | Multiple LLMs (delegation and coordination) |
| Number of steps | Fixed | Variable (open-ended) | Variable + delegation depth |
| Debugging difficulty | Low | Medium | High (non-deterministic) |
| Token cost (vs chat, Anthropic measure) | 1-2x | ~4x | ~15x |
| Examples | Prompt chaining, Routing | ReAct loop, Claude Code | Anthropic Research System, AutoGen |
| Best for | Well-defined procedures | Open-ended single task | Parallel exploration, domain separation |
Only after agreeing on these definitions can “why does Anthropic recommend multi” and “why does Cognition say don’t build multi” be compared on the same plane.
Anthropic’s 90.2% — The Case for Multi-Agent
In June 2025, Anthropic published the backend architecture behind its Research feature (“How we built our multi-agent research system”, Anthropic Engineering, 2025-06). The article’s single most-cited number sits in one line:
“A multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” (Anthropic)
The same article specifies the cost as follows.
- Agents (single) = ~4x the tokens of a chat baseline
- Multi-agent = ~15x the tokens of a chat baseline
- “Token usage alone explains 80% of performance variance.” — token usage alone explains 80% of the variance in performance
The 90.2% number alone makes “go multi” feel like the obvious conclusion, but the same article immediately pins down the cost trade-off. The gain is meaningful only on tasks that can absorb a 15x token budget.
Eight Prompt Engineering Principles
The eight principles Anthropic distilled from operating the system can be read as a checklist for multi-agent design (Anthropic, 2025-06).
- Mental modeling — the lead agent forms a mental model of the user’s intent before delegating
- Delegation teaching — explicitly teaches subagents how to delegate (role, scope, stop conditions)
- Effort scaling — match the number of calls to task difficulty. Simple = 1 agent / 3-10 tool calls, complex = 10+ subagents
- Tool design criticality — the tool interface determines result quality. A poorly designed tool defeats any prompt tuning
- Self-improvement — agents evaluate and retry their own outputs
- Search strategy — broad → narrow. Explore widely first, then converge
- Extended thinking — expand reasoning budget for complex tasks
- Parallelization — 3-5 subagents in parallel + 3+ tool calls in parallel → research time cut by up to 90%
Among these, the most operationally weighted are #3 (Effort scaling) and #8 (Parallelization). The most common antipattern, per Anthropic’s own report, is throwing multi at simple queries.
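Principle #8 is largely an engineering property: subagent calls are independent I/O, so they can be fanned out concurrently instead of awaited one by one. A minimal asyncio sketch, where `run_subagent` is a hypothetical stand-in for one subagent’s entire LLM loop:

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Hypothetical subagent: one LLM instance working one subtask."""
    await asyncio.sleep(1)  # stands in for a multi-second LLM + tool loop
    return f"findings for {subtask!r}"

async def research(plan: list[str]) -> list[str]:
    # Fan out all subagents at once; asyncio.gather awaits them together,
    # so wall-clock time tracks the slowest subagent, not the sum of all.
    return await asyncio.gather(*(run_subagent(t) for t in plan))

results = asyncio.run(research([
    "prior work on agent orchestration",
    "token cost benchmarks",
    "production failure reports",
]))
```

Latency approaches the slowest branch rather than the sum, which is where the up-to-90% reduction in research time comes from; token cost, of course, still sums across branches.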
Self-Reported Failure Modes
The same article enumerates four failure modes Anthropic observed in production (Anthropic).
- Spawning 50+ subagents for simple queries
- Endless web searches for nonexistent information
- Vague task descriptions causing subagents to duplicate work
- Selecting SEO-abusive sites (content farms) over authoritative sources
All four occur when the lead agent’s delegation design is weak. A mistake that a single-agent setup would avoid entirely, or absorb at 1x cost, compounds in multi: 15x the cost, multiplied by how often it recurs.
Operating Trade-offs
The trade-offs Anthropic itself acknowledges:
| Trade-off | Description |
|---|---|
| Cost explosion | 15x tokens vs chat. Per-token pricing transfers directly to operating cost |
| Sync bottleneck | The lead agent waits synchronously on every subagent’s result |
| Rainbow deployment | Minor changes cascade into large behavioral shifts → gradual rollout required |
| Non-deterministic debugging | Same input, different outputs. “Full production tracing” and “high-level observability of decision patterns” become required |
| Cascade failure | “Minor failures cascade into large behavioral changes; requires durable execution and error recovery without expensive restarts.” |
Anthropic’s own conclusion is not “multi is the answer” but “multi is the answer for research-style open-ended tasks; otherwise, start from the simplest solution.” The same company’s “Building Effective Agents” guide states this more strongly.
“Find the simplest solution possible, and only increase complexity when needed… Many applications benefit most from optimizing single LLM calls with retrieval and in-context examples.” (Anthropic)
The 90.2% number and “start simple” appear in the same company’s two articles simultaneously. Not a contradiction — a signal that the answer depends on task profile.
Cognition’s Counter — Don’t Build Multi-Agents
The same year, Cognition Labs (the company behind Devin) published the opposing position (“Don’t Build Multi-Agents”, Cognition, 2025). The title is the recommendation, and the rationale compresses into two principles.
- “Share context, and share full agent traces, not just individual messages.”
- “Actions carry implicit decisions, and conflicting decisions carry bad results.” (Cognition)
The first principle says full traces — not message snippets — must be shared. The second says every action carries implicit assumptions, and when assumptions conflict, results break. Violate these two principles and multi-agent breaks. That is Cognition’s position.
The Flappy Bird Example
The shortest, clearest example Cognition gives is building a Flappy Bird clone.
Suppose a user asks “build me a Flappy Bird clone.” The lead agent splits the work between two subagents:
- Subagent 1: generate the background → draws Mario-style pipe scenery
- Subagent 2: generate the bird character → draws a pixel-style bird that does not match Mario
Each subagent sees only its own work and makes its own style assumption. When the lead agent assembles the two outputs, the assumptions conflict and consistency breaks. The integration step inherits both miscommunications at once.
The lesson is simple. In tasks where each subagent’s “implicit decision” is likely to conflict with another subagent’s decision, the multi structure itself creates the problem. The effect is especially pronounced for outputs requiring single-source consistency: codebases, games, design systems.
Cognition’s Recommendation
Following the principles and the example, Cognition’s alternative is the single-threaded linear agent.
- Prefer a single agent where context flows continuously in one thread
- When context grows too long, a dedicated LLM compresses history into “key details, events, and decisions” (see the sketch after this list)
- Cognition acknowledges: “this approach is hard to get right” — the compression LLM’s accuracy is itself a challenge
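What that looks like in practice, as a minimal sketch: one continuous thread of execution, with a second LLM pass triggered when history exceeds a budget. The threshold, the prompt, and `call_llm` are assumptions for illustration, not Cognition’s implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; stands in for any provider API."""
    raise NotImplementedError

MAX_CONTEXT_CHARS = 40_000  # assumed budget; real systems count tokens

def compress(history: list[str]) -> list[str]:
    """Dedicated compression pass: summarize older history, keeping the
    'key details, events, and decisions' (Cognition's phrasing)."""
    summary = call_llm(
        "Compress this agent history into key details, events, and decisions:\n"
        + "\n".join(history[:-5])
    )
    return [f"[compressed history] {summary}", *history[-5:]]  # recent turns stay verbatim

def single_threaded_agent(task: str, steps: int = 50) -> list[str]:
    history = [f"Task: {task}"]
    for _ in range(steps):
        if sum(map(len, history)) > MAX_CONTEXT_CHARS:
            history = compress(history)      # lossy: this is the hard part
        action = call_llm("\n".join(history) + "\nNext action?")
        history.append(f"Action: {action}")  # one thread, continuous context
    return history
```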
Cognition’s conclusion is not “multi is never acceptable.” It is closer to “the trade-offs of multi are hard to absorb at the current state of the art, and getting single right is already hard enough.”
What’s Different Between the Two Camps
The reason two articles citing the same primary sources — Anthropic Building Effective Agents, OpenAI Practical Guide, ReAct-family papers — arrive at opposite conclusions is that the task profiles they’re solving are different.
| Dimension | Anthropic Research System | Cognition Devin |
|---|---|---|
| Representative task | Open-ended web research, broad exploration | Long-running coding, codebase consistency changes |
| Sub-task assumption conflict | Low (each subagent explores a different source) | High (each subagent edits a different part of the same codebase) |
| Value of parallel exploration | Very high (research time -90%) | Low (parallel changes generate conflicts) |
| Result integration | Lead agent synthesizes sources | Codebase consistency check |
| Recommendation | Multi-agent | Single-threaded |
In short, both recommendations are correct for their own task profiles. Read together, the two articles suggest that the question itself, “is multi-agent good?”, is the wrong one.
The right question is “in the task profile I’m solving, does multi create value, or does it create conflict?”
The Decision — When Single, When Multi
Combining both camps’ primary sources, the decision criteria distill into four dimensions, charted in the flowchart below and explained in the list that follows; a code sketch after the list puts the four together.
```mermaid
---
config:
  look: handDrawn
  theme: neutral
---
flowchart TD
    A[Task definition] --> B{Procedure predefined?}
    B -->|Yes| C[Workflow / Chain]
    B -->|No| D{Sub-task assumption conflict likely?}
    D -->|High| E[Single Agent]
    D -->|Low| F{Parallel exploration creates value?}
    F -->|No| E
    F -->|Yes| H["Multi-Agent System<br/>+ model mapping strategy"]
```
Four Decision Dimensions
- Dimension 1 — Procedure predefined: If steps can be defined in code, Workflow is the simplest and safest choice, and debugging stays tractable. The moment dynamic decision-making is required, you cross into Agent territory.
- Dimension 2 — Sub-task assumption conflict: Tasks where one subagent’s decision can conflict with another’s (coding, design, single artifact) are safer with single. Tasks with low conflict potential (parallel web research, domain separation) extract value from multi.
- Dimension 3 — Value of parallel exploration: Is this a task where parallelism cuts time by 90%, or one where sequential is natural? Anthropic’s research time -90% is an upper bound for tasks where parallel exploration genuinely matters.
- Dimension 4 — Model mapping strategy: Can you map different models to lead and subagent (e.g., lead=Opus, subagents=Sonnet)? If yes, the raw token explosion of multi does not translate directly into cost explosion. If no — every call must be the SOTA model — single is safer on cost.
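Composed, the four dimensions reduce to a short decision function. This is a toy encoding of the flowchart above, not a substitute for judgment; the boolean inputs are exactly the calls you still have to make yourself.

```python
def choose_architecture(
    procedure_predefined: bool,
    assumption_conflict_high: bool,
    parallel_exploration_valuable: bool,
    model_mapping_possible: bool,
) -> str:
    """Toy encoding of the single-vs-multi decision tree above."""
    if procedure_predefined:
        return "workflow"          # Dimension 1: code decides the steps
    if assumption_conflict_high:
        return "single agent"      # Dimension 2: Cognition's case
    if not parallel_exploration_valuable:
        return "single agent"      # Dimension 3: nothing to parallelize
    if not model_mapping_possible:
        return "single agent"      # Dimension 4: multi at SOTA-only prices rarely pays
    return "multi-agent + model mapping"

# Three of the five cases from the table further down:
print(choose_architecture(False, True, False, True))    # Devin-style coding -> single agent
print(choose_architecture(False, False, True, True))    # open-ended research -> multi-agent
print(choose_architecture(True, False, False, False))   # defined SOP -> workflow
```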
Cost — Model Mapping, Not Raw Tokens
The “15x tokens vs chat” Anthropic reports holds when every call is the same model. Real production differs. In the same announcement, Anthropic specifies the model mapping for its Multi-Agent Research System:
- Lead agent = Claude Opus 4
- Subagents = Claude Sonnet 4
The lead handles decision-making and synthesis; subagents handle high-volume implementation work. Sonnet is a fraction of Opus’s price (input pricing is roughly one-fifth as of May 2026), so handling the same task with a single agent + Opus throughout can actually cost more. AI Gateway patterns (Haiku → Sonnet → Opus cascade) applied in production have reported LLM cost reductions of 40-70%.
The reverse hidden cost is more dangerous. The common default of running a single agent on one SOTA model — say, Claude Opus 4.7 for every subtask — uses the most expensive model even on simple subtasks. Cost comparison must move from raw tokens to (model price) × (call count).
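A back-of-the-envelope version of that comparison. Every number below is an assumption chosen to keep the arithmetic legible (the 5:1 price ratio mirrors the rough one-fifth note above), not a measured value:

```python
# Illustrative prices in USD per million input tokens (assumed; 5:1 ratio).
OPUS_PER_MTOK = 15.0
SONNET_PER_MTOK = 3.0

# Multi-agent run: ~15 MTok total (Anthropic's multiplier), assumed split of
# 1 MTok for the Opus lead and 14 MTok across the Sonnet subagents.
multi = 1 * OPUS_PER_MTOK + 14 * SONNET_PER_MTOK   # $15 + $42 = $57

# Single-agent run: ~4 MTok (Anthropic's multiplier), all on Opus.
single = 4 * OPUS_PER_MTOK                          # $60

print(f"multi + mapping: ${multi:.0f} vs single + Opus: ${single:.0f}")
# Tokens say 15x vs 4x; dollars say $57 vs $60. Under these assumptions,
# model mapping absorbs nearly the entire raw token explosion.
```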
So in the decision, “can we absorb 15x tokens?” is the wrong question. Two questions replace it:
- Can different models be mapped to lead and subagent for this task?
- When using a single agent, does the SOTA model really need to handle every subtask?
If mapping is possible and simple subtasks dominate, multi-agent + model mapping can be cheaper than single + SOTA. Cost is not an absolute — it depends on mapping feasibility.
Five Cases Applied
| Case | Recommendation | Rationale |
|---|---|---|
| Consistent codebase changes (Devin) | Single | Cognition case. High sub-task assumption conflict |
| Web research, broad exploration | Multi | Anthropic case. High value of parallel exploration |
| Accounting voucher processing (defined SOP) | Workflow | Procedure predefined. Agent unnecessary |
| Single-domain RAG chatbot | Single | Single task, no domain separation needed |
| CRM + payment + analytics integrated workflow | Multi | Domain and permission separation. Each domain operates on different assumptions |
Effort Scaling — Anthropic’s Operating Rule
After the decision tree, the next rule to apply is Anthropic’s effort scaling. Even within the same multi-agent structure, the number of calls must match task difficulty.
- Simple task: 1 agent / 3-10 tool calls
- Medium task: lead agent + 2-3 subagents
- Complex task: lead agent + 10+ subagents
Ignoring this rule and deciding “since it is multi, every task gets 10+ subagents” reproduces Anthropic’s first self-reported failure mode: spawning 50+ subagents for simple queries.
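One way to keep the rule from being ignored is to take the budget out of the lead agent’s hands entirely. A minimal sketch: the subagent counts follow Anthropic’s published tiers, while `classify_difficulty` and the tool-call caps for medium and complex are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EffortBudget:
    subagents: int
    max_tool_calls: int

# Budgets per difficulty tier (subagent counts per Anthropic's rule of thumb;
# the 30/100 tool-call caps are illustrative assumptions).
BUDGETS = {
    "simple":  EffortBudget(subagents=0,  max_tool_calls=10),
    "medium":  EffortBudget(subagents=3,  max_tool_calls=30),
    "complex": EffortBudget(subagents=10, max_tool_calls=100),
}

def classify_difficulty(task: str) -> str:
    """Hypothetical classifier (e.g. one cheap LLM call) returning a tier."""
    raise NotImplementedError

def plan_effort(task: str) -> EffortBudget:
    # Unknown tiers fall back to the cheapest budget. Enforcing the cap at
    # spawn time, outside the lead agent's control, is the guard against
    # "50+ subagents for a simple query".
    return BUDGETS.get(classify_difficulty(task), BUDGETS["simple"])
```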
What Both Camps Agree On
The recommendations point in opposite directions, but both camps agree on one thing: start simple.
- Anthropic Building Effective Agents: “find the simplest solution possible, and only increase complexity when needed.”
- OpenAI A Practical Guide to Building Agents: “maximize a single agent’s capabilities first… use orchestration patterns that match your complexity level, starting with a single agent.”
- Cognition Don’t Build Multi-Agents: “Single-threaded linear agents where the context is continuous.”
Three articles draw different conclusions, but the starting point is the same. Don’t use multi for tasks that can be solved with single. Multi becomes warranted only for tasks where (a) the value of parallel exploration is clear, (b) sub-task assumption conflict is low, and (c) the cost — accounting for model mapping — is acceptable.
When all three conditions hold, multi-agent delivers the 90.2%. When any one is missing, the same multi structure produces conflict and cost overruns.
Next Areas
This article limited itself to the single vs multi decision. Two larger areas adjacent to the decision are left as separate topics.
- Multi-Agent Workflow patterns and examples — Supervisor / Sequential / Hierarchical / Swarm / Map-Reduce / Group Chat and other topology and coordination combinations. While this article’s comparison focused on dynamic-delegation cases (Anthropic Research), this area covers the broader hybrid structures where multiple agents are placed within a predefined flow.
- How to evaluate multi-agent systems — Evaluation methodology for non-deterministic systems. The limits of LLM-as-judge, human eval, and benchmarks (UC Berkeley CRDI in April 2026 reported that SWE-bench, GAIA, and AgentBench are vulnerable to reward hacking).
These two areas are each their own large article and are meaningful only for teams that have passed the decision stage covered here. This article’s scope is limited to that decision itself.
Sources
- How we built our multi-agent research system (Anthropic Engineering, 2025-06) — 90.2% performance gain, 15x tokens, 8 principles, self-reported failure modes
- Building effective agents (Anthropic Engineering, 2024-12) — Workflow vs Agent definitions, “start simple” recommendation, 6 patterns
- Don’t Build Multi-Agents (Cognition Labs, 2025) — two principles (share context, conflicting decisions), Flappy Bird example, single-threaded recommendation
- A Practical Guide to Building Agents (OpenAI, 2025) — single agent first, when to split
- What does ‘agentic’ AI mean? (US News, 2025-11) — agentic search traffic up 600%
- How We Broke Top AI Agent Benchmarks (UC Berkeley CRDI, 2026-04-12) — benchmark reward hacking warning