Where Did the Agent Go Wrong: From Answer Accuracy to Process Evaluation

M. · · 9 min read

The agent with the highest answer accuracy ranked last on the evaluation.

When an evaluation framework called TRACE lined up Deep Research agents (DR agents from here on), DeepSeek-V3.1-671B had the highest one-shot answer rate (Pass@1) at 65.8%, yet scored 0.65 on trajectory utility, the lowest among the top tier. AgentFounder-30B, with a lower answer rate of 60.1%, took first place with a utility of 0.81. The model that answered correctly more often received the lower score.

This is not a scoring mistake. It is what happens when the scoring criterion changes. It is a signal that the era of looking only at the answer cell is ending. Seven papers posted to arXiv between 2025 and 2026 share one question. Not whether the agent got it right, but where and how it went wrong. And to answer that, they keep slicing the unit of measurement finer: from answer match to trajectory utility, then to the critical step, the span, and the claim.

%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart TD
    O["answer match (outcome)"] --> T["trajectory utility"]
    T --> ST["critical step"]
    ST --> SP["span"]
    SP --> C["claim"]

This is Part 1 of the series “Locating Agent Failure.” Before going deep on each of the seven papers one by one, it first lays out why answer-based evaluation breaks down and maps the whole landscape. The method each paper uses comes from Part 2 on. The order in time tracks the slicing almost exactly: the March 2025 multi-agent failure taxonomy (MAST) laid the floor, followed by single-agent span evaluation (TRAIL, May 2025), DR agent utility and hallucination evaluation in early 2026 (TRACE, DeepHalluBench), and the June 2026 claim-level audit (DRIFT).

Why a Correct Answer Can Still Rank Last

TRACE’s utility score does not end at the answer. It folds in three things: process efficiency, cognitive quality (how well evidence is grounded and how robust the reasoning stays), and accuracy. DeepSeek ranked first on accuracy but last on utility because its process efficiency was low, at 0.68. It wandered more to reach the same answer.

| Model | Answer rate (Pass@1) | Trajectory utility | Where it diverged |
| --- | --- | --- | --- |
| DeepSeek-V3.1-671B | 65.8% | 0.65 | Process efficiency 0.68, inefficient trajectory |
| AgentFounder-30B | 60.1% | 0.81 | Balance of efficiency and grounding |

What matters here is not the reshuffle itself. Answer matching gives the same 100 points to a path that wandered 30 times and a path that arrived in 5. To the user it may be the same answer, but to the builder it is not. A trajectory that burned six times the tokens costs six times as much, and an answer that landed by luck offers no guarantee it will land again on the same input. The utility score pulls up the axes that the single answer cell hides: efficiency, grounding, and reproducibility.
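
The gap between the two views can be put in a few lines. This is a minimal sketch, not TRACE's formula: `Trajectory`, `answer_match`, `step_efficiency`, and `token_cost` are hypothetical names, the per-step token count is an assumed constant, and efficiency is taken here as the ratio of an assumed reference path length to the steps actually used.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: int            # steps the agent actually took
    tokens_per_step: int  # assumed constant, for illustration only
    correct: bool         # whether the final answer matched


def answer_match(t: Trajectory) -> float:
    """Outcome-only score: 100 if the final answer matches, else 0."""
    return 100.0 if t.correct else 0.0


def step_efficiency(t: Trajectory, reference_steps: int) -> float:
    """Hypothetical efficiency: reference path length over actual length, capped at 1."""
    return min(1.0, reference_steps / t.steps)


def token_cost(t: Trajectory) -> int:
    """Total tokens spent under the constant-per-step assumption."""
    return t.steps * t.tokens_per_step


wandering = Trajectory(steps=30, tokens_per_step=1_000, correct=True)
direct = Trajectory(steps=5, tokens_per_step=1_000, correct=True)

# Same outcome score, very different cost and efficiency.
print(answer_match(wandering), answer_match(direct))
print(token_cost(wandering) / token_cost(direct))
print(step_efficiency(wandering, reference_steps=5),
      step_efficiency(direct, reference_steps=5))
```

Both trajectories earn the same outcome score, while the cost ratio and the efficiency value separate them. Any real utility score would need a defensible reference path, which is the role the oracle trajectory plays in TRACE.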

The way TRACE measures this also differs from answer-based evaluation. It attaches an oracle trajectory to each task and measures how little guidance the agent needed to reach the answer. It also splits tasks into groups rather than treating them as one: a group for overall performance, a group seeded with traps to see whether the agent self-corrects, and a group that gives minimal cues to draw out latent ability. Looking only at the answer rate stops at the first group, but looking at utility separates the models that collapse in front of traps from the ones that hold. The design is meant to show that the same answer rate can hide very different robustness.
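
As a rough illustration of why the grouping matters, the sketch below compares two made-up models whose pooled pass rate is identical but whose per-group rates are not. The group names (`general`, `trap`, `minimal_cue`), the model names, and every result are invented for the example and do not come from TRACE.

```python
from collections import defaultdict

# (group, passed) pairs per model. All values are invented for illustration.
results = {
    "model_a": [("general", True), ("general", True),
                ("trap", False), ("trap", False),
                ("minimal_cue", True), ("minimal_cue", False)],
    "model_b": [("general", True), ("general", False),
                ("trap", True), ("trap", True),
                ("minimal_cue", False), ("minimal_cue", False)],
}


def pass_rate(records):
    """Pooled pass rate, ignoring which group a task belongs to."""
    return sum(passed for _, passed in records) / len(records)


def pass_rate_by_group(records):
    """Pass rate computed separately for each task group."""
    buckets = defaultdict(list)
    for group, passed in records:
        buckets[group].append(passed)
    return {group: sum(v) / len(v) for group, v in buckets.items()}


for name, records in results.items():
    print(name, pass_rate(records), pass_rate_by_group(records))
```

The pooled number is the same for both models, yet one fails every trap task and the other passes them all. A single answer-rate line would report them as equals.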

One Early Mistake and the Rest Topples

There is one more thing answer-based evaluation misses: where the error began. DeepHalluBench views a DR agent’s work as a long trajectory running plan to search to summarize, and tracks where a hallucination starts and how it spreads.

The numbers give a clear picture. In proprietary agents (Gemini, OpenAI), more than 57% of root errors occur in the early stage. A single wrong premise picked up during planning or the start of search rides all the way through summarizing and into the final answer. The open-source Salesforce agent is the opposite: more than 40% of its errors occur late, as context collapses while writing the conclusion. The same hallucination, but a different starting point, and so a different place to fix.

%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart LR
    P["plan"] --> S["search"] --> M["summarize"] --> A["answer"]
    P -. "57%+ early errors" .-> S
    S -. "propagation" .-> M
    M -. "propagation" .-> A
    A --> J{"score only the answer and the path is invisible"}

This is the domino. Score only the final answer and you know the answer is wrong, but not where it started going wrong. A wrong premise at the planning stage and a stray bit of noise at the summarizing stage get lumped together into the same “wrong answer” at the end. DeepHalluBench splits hallucination into four kinds to undo that lumping: propagation built on an earlier error, intent misaligned with the user’s goal, noise that failed to filter for relevant evidence, and grounding that strayed from the retrieved material. The four initials make the name PING.

And it attaches verification to each stage. Claim verification checks whether a claim matches its source, noise detection checks evidence prioritization, action verification checks whether the plan matches intent, and restriction checking sees whether user-imposed constraints were respected. Rather than holding one ruler to one final answer, it holds a different ruler to each stage. That is what lets you say not “it is wrong” but “the planning stage misread the intent.”
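
The idea of holding a different ruler to each stage can be sketched as follows. This is a toy under stated assumptions, not DeepHalluBench's pipeline: the stage list follows the plan, search, summarize, answer order used above, `first_failing_stage` is a hypothetical helper, and the lambda checks are string-matching stand-ins for real verifiers.

```python
from typing import Callable, Optional

STAGES = ["plan", "search", "summarize", "answer"]


def first_failing_stage(
    trajectory: dict[str, str],
    checks: dict[str, Callable[[str], bool]],
) -> Optional[str]:
    """Return the earliest stage whose own check fails, or None if all pass."""
    for stage in STAGES:
        if not checks[stage](trajectory[stage]):
            return stage
    return None


# Assumed scenario: the user asked about the 2024 rule, the plan targeted 2023.
trajectory = {
    "plan": "look up the 2023 rule",
    "search": "retrieved text of the 2023 rule",
    "summarize": "the 2023 rule requires X",
    "answer": "X is required",
}

# Stand-in checks, one per stage (all hypothetical).
checks = {
    "plan": lambda out: "2024" in out,         # does the plan match the intent?
    "search": lambda out: "retrieved" in out,  # did search return evidence?
    "summarize": lambda out: "rule" in out,    # does the summary cite the source?
    "answer": lambda out: len(out) > 0,        # is there an answer at all?
}

print(first_failing_stage(trajectory, checks))
```

Every stage after planning passes its own check here, which is exactly why scoring only the final answer cannot say where the error began.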

I saw the same thing while running experiments that split work across multiple agents to raise accuracy. If one sub-agent asks the wrong question early, the stages bolted on after it can be fine and the result still topples, because each later stage takes the previous stage’s output as input. The domino looks the same whether single-agent or multi-agent. The difference is that the more stages there are, the higher the odds that the first tile falls wrong, and the higher the cost of tracing it back later.

Well-Packaged Agents vs Custom Packaging

So is this process evaluation equally urgent for everyone? It is not. The line falls on who packaged the agent.

On one side is the well-packaged agent: Claude Code as a coding agent, or the Deep Research products from OpenAI and Gemini. These are long-horizon, heavily tool-calling jobs that a model company has bundled into a general-purpose product. The back end is well built, so most of the time you get the result you wanted. The company verified and polished the trajectory internally, and the user just takes the result. The responsibility for evaluation sits with the company that packaged it.

On the other side is the agent I package myself. When an individual or a company has a sharp, specific need, the general-purpose package no longer fits, so they wire their own data and their own flow. They make it search internal files and the public web together, call sub-agents in a particular order, or splice in their own source mix. This is exactly where the well-packaged long-horizon agent stops covering the job. And the moment it does, the trajectory, the cost, and the failure tracing all become the builder’s own.

| Dimension | Well-packaged agent | Custom packaging |
| --- | --- | --- |
| Who packages it | Model company (Anthropic, OpenAI, etc.) | The builder (individual, company) |
| Examples | Claude Code, Deep Research | Custom multi-agent, internal source mix |
| Who owns evaluation | The packaging company | The builder |
| Cost control | Company, internally | The builder, directly |
| Failure tracing | Mostly unnecessary | Has to be done by hand |

Take a concrete case. Say a team wires an agent that searches an internal wiki, public regulatory documents, and its own product database to answer “where does our product violate the new regulation.” If the search sub-agent pulls the wrong clause from the regulatory document, the analysis sub-agent reasons plausibly on top of it, and the final report is confidently wrong. What this team needs is not “the report is wrong” but “the third call in the search stage grabbed the wrong clause.” Only then can they fix that one stage. Re-running the whole report costs money, and swapping the entire prompt without knowing where the problem lies is a gamble.

DRBench’s observation in enterprise settings points the same way. Attach an agent built for the public web directly to data mixing internal files, email, and chat, and it barely recovers the scattered ground-truth cues. A generally packaged agent does not carry over to a custom environment as-is. The builder has to fill that gap by hand, and while filling it, has to see where it leaks.

Process evaluation becomes necessary precisely on this custom-packaging side. The more tool calls and sub-agent fan-out, the faster cost swells and the blurrier the failure point. Cost control and quality control fall to the builder directly, and a single answer-rate line cannot tell you which stage is burning tokens or where it went wrong. Evaluation moving to process is less an academic fashion than a practical problem the self-packaging builder has taken on.
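
For the regulatory example above, the two questions a builder asks (which stage burns tokens, and which call went wrong) can be answered from per-call records. The sketch below uses a plain dataclass rather than any tracing library; `Span`, its fields, `tokens_by_stage`, `first_failed_call`, and all the numbers are hypothetical, and the `ok` flag stands in for whatever per-call check the builder runs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    stage: str       # e.g. "search", "analysis", "report"
    call_index: int  # 1-based position of the call within its stage
    tokens: int      # tokens spent on this call
    ok: bool         # outcome of the builder's per-call check


def tokens_by_stage(spans: list[Span]) -> dict[str, int]:
    """Sum token usage per stage to see where cost accumulates."""
    totals: dict[str, int] = defaultdict(int)
    for span in spans:
        totals[span.stage] += span.tokens
    return dict(totals)


def first_failed_call(spans: list[Span]) -> Optional[tuple[str, int]]:
    """Return (stage, call_index) of the earliest failed call, if any."""
    for span in spans:
        if not span.ok:
            return (span.stage, span.call_index)
    return None


# Invented records for one run of the regulatory-compliance agent.
spans = [
    Span("search", 1, 1200, True),
    Span("search", 2, 900, True),
    Span("search", 3, 1500, False),  # wrong clause retrieved
    Span("analysis", 1, 4000, True),
    Span("report", 1, 2500, True),
]

print(tokens_by_stage(spans))
print(first_failed_call(spans))
```

With records like these, the fix targets one call in one stage instead of the whole report. In a production setup the same fields would more likely live on spans emitted by a tracing system than in a hand-built list.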

Seven Ways to Find Where the Agent Went Wrong

The seven papers answer the same question in different ways. What this series looks at is not the score they reached but how they approached it. Laid out by method, they split into five strands.

| Paper (arXiv) | Method in one line | Strand |
| --- | --- | --- |
| MAST (2503.13657) | Six people inductively coded execution logs to build a consensus taxonomy of 14 failure modes | Inductive taxonomy |
| TRAIL (2505.08638) | Puts evaluation on OpenTelemetry spans rather than text | Instrumentation |
| AgentRx (2602.02475) | Synthesizes checks from spec and policy, catches violations step by step, leaves an audit log | Verification layer |
| DeepHalluBench (2601.22984) | Breaks the trajectory into claims and applies NLI and LLM checks per stage | Verification layer |
| TRACE (2602.21230) | Drops answer matching for a utility function multiplying efficiency, grounding, and accuracy | Metric formulation |
| DRBench (2510.00172) | Plants ground-truth cues and traps in enterprise data, designing the evaluation environment itself | Environment design |
| DRIFT (2606.02060) | Checks the agent’s claims against trajectory evidence and marks the spans that affected the answer | Verification layer |

Seen this way, finding “where the agent went wrong” gets approached along five lines. Lifting a taxonomy out of data (MAST), standing evaluation on production telemetry (TRAIL), inserting a verification layer before the answer goes out (AgentRx, DeepHalluBench, DRIFT), building the evaluation environment itself (DRBench), and rewriting the definition of the score (TRACE). Faced with the same problem, one builds a taxonomy, another a measurement pipeline, another a set of checks.

These five are not competing alternatives. Looked at from running self-packaged agents, it is closer to stacking layers. You need a taxonomy that names failures to decide what to look for, instrumentation that records the trajectory by span to catch them, a verification layer to filter before the answer goes out, an environment with planted answers and traps to test against, and finally a rewritten score definition to compare the results in one number. No single builder will make all five, but they work as a checklist for asking which of the five your own pipeline is missing.

These seven also overlap with the packaging split above. TRACE, DeepHalluBench, and TRAIL lean toward showing that even well-packaged DR agents and coding agents go wrong this much, while MAST, DRBench, and AgentRx lean toward diagnosing the multi-agent and enterprise-source environments a builder packaged by hand. DRIFT is the most recent work, pulling the unit of measurement from the span down to the claim.

This series weighs method over numbers. Part 2 looks at how the three papers evaluating well-packaged agents (TRACE, DeepHalluBench, TRAIL) each formulate utility, verify claims, and instrument spans. Part 3 covers the three on the custom-packaging side (MAST, AgentRx, DRBench), and Part 4 covers DRIFT, which took the unit from span to claim.

Closing

One thing to flag. A good share of the places pushing this trend sell observability tools. TRAIL came from Patronus AI, related follow-up work from Deepchecks, DRBench from ServiceNow. So “process evaluation is a must” is at the same time their product pitch. You should not take it in without knowing that.

But being a product pitch and being real demand can run together. The one test that separates them is whether the demand exists independent of the pitch. And as more builders move to packaging their own agents, the demand appears whether or not anyone is selling it. Even if the tool companies stop advertising, a builder who fanned out sub-agents still needs someone to trace the failure point.

The number of such builders will grow. People keep trying to hand agents more complex tasks. The more complex the task, the more often it crosses the boundary of the general-purpose package. The builder then wires their own flow and, from that moment, has to measure cost and quality directly. So I think this market grows.

There is one honest doubt that remains. These graders are themselves still weak. Even the strongest, Gemini-2.5-Pro, localizes failures at the span level with only 18.3% accuracy on GAIA and 5.0% on SWE-bench in TRAIL. It amounts to scoring a failing agent with a similarly failing agent. For process evaluation to become a tool you can trust, the grader has to get better first. How that is possible, and the method each of the seven uses to attack this problem, is what the next pieces look at one by one.


Sources

  • MAST, Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657)
  • TRAIL, Trace Reasoning and Agentic Issue Localization (arXiv:2505.08638)
  • AgentRx, Diagnosing AI Agent Failures from Execution Trajectories (arXiv:2602.02475)
  • TRACE, Trajectory-Aware Comprehensive Evaluation for Deep Research Agents (arXiv:2602.21230)
  • DeepHalluBench, Why Your Deep Research Agent Fails? (arXiv:2601.22984)
  • DRBench, A Realistic Benchmark for Enterprise Deep Research (arXiv:2510.00172)
  • DRIFT / TELBench, Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories (arXiv:2606.02060)