Where Did the Agent First Go Wrong: Evaluation Drops to the Claim Level

M. · · 10 min read

Say you have to bring a new business idea to the boss. The plan starts with market research, and you handed that to a junior. The junior brings back the research, and it is wrong. The starting point is off, so the plan, the projections, the conclusion built on top all topple in a row. Or the research is fine, but it gets twisted in the middle, while someone interprets it or pulls out and combines the parts they need. Either way, once one tile tips, everything behind it tips too. A domino.

An agent dividing up the work comes down to this same division of labor. Search, then inspect sources, then reason, then assemble the answer, all in a line, where each step’s output is the next step’s input. So the question that decides everything is less “did it get it right” than “where did it first go wrong.” That is why evaluation, which once watched only the answer rate, has moved down into the process.

This series has followed that “where” down, finer and finer. From whether the answer was right to the utility of the whole trajectory, then to one critical step and the span. Now one rung is left. The claim, smaller than a span. DRIFT, posted in June 2026, drops that far. It checks each claim the agent makes against the evidence, and when one is wrong, it traces how it spread to the answer. The series finale, and the most recent work.

To say it up front: dropping to the claim level is progress, but it is not the destination. It looks more like the start of a long road that has to be cut finer as the division of labor deepens. Why I read it that way, I save for the end.

```mermaid
%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart LR
    A["Junior: market research"] --> B["Middle: collate & interpret"]
    B --> C["Report to the boss: conclusion"]
    A -. "wrong here" .-> X["everything after goes off"]
    B -. "or wrong here" .-> X
```

What counts as a ‘harmful’ span in a long trajectory

The trajectory of a deep-research agent (DR agent from here on) is long. Search, tool calls, inspecting sources, assembling the answer: dozens of steps in a row. And most of what is in there is not a mistake. A search that hits a dead end and tries again, a hypothesis put up to test, the small clutter a tool spits out, all of it is normal process. In the same way, a junior going down one dead end while researching is not a mistake.

What DRIFT does first is draw that line. It splits the spans of a trajectory into five kinds.

| Span type | Nature |
| --- | --- |
| Normal exploration | Legitimate search toward the answer |
| Failed search | A retrieval attempt that came up empty |
| Tentative hypothesis | An exploratory claim, not yet committed |
| Harmless noise | Clutter left by tools and the framework |
| Harmful error span | The real problem that ruins the answer (what to find) |

Here is how the paper pins down a harmful error span: a judgment that is wrong, unsupported, contradicted by the evidence, or committed too early, and that sits on the path to the answer. Whether the span first introduced that judgment, leaned on it, amplified it, or nailed it into the conclusion, all of it counts. The test is one thing: did it affect the answer? A wrong statement that never flowed into the answer is not a target.
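That test can be written as a two-condition predicate. This is a minimal sketch for illustration, not the paper's code: the names `Span` and `is_harmful` and the two boolean fields are assumptions, and in practice deciding each boolean is the hard part.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    judgment_is_faulty: bool  # wrong, unsupported, contradicted, or committed too early
    reaches_answer: bool      # sits on the path to the final answer


def is_harmful(span: Span) -> bool:
    """A faulty judgment only counts if it also affected the answer."""
    return span.judgment_is_faulty and span.reaches_answer


# Hypothetical spans, written for this example only.
dead_end = Span("searched the wrong registry, then retried", False, False)
stray_claim = Span("misread a date that was never used again", True, False)
bad_figure = Span("copied the wrong market size into the estimate", True, True)

for span in (dead_end, stray_claim, bad_figure):
    print(span.text, "->", is_harmful(span))
```

Only the third span is flagged. The second is wrong but never flows into the answer, so it falls outside the target.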

This line is hard to draw, in a way that mirrors the reality of divided labor. A junior taking one dead-end search and a junior pulling bad data into the conclusion look similar from outside, but they weigh nothing alike. Telling a normal misstep apart from an error that ruined the answer is where evaluation starts.

The data comes from 2,790 real trajectories, mixing two agent frameworks (MiroFlow, OAgent), three backbone models (GPT-5, Gemini-2.5-Pro, Claude-Sonnet-4.5), and three benchmarks (GAIA-val, XBench, BrowseComp). A diagnosis that only works on one framework or one model is narrow, so the authors deliberately picked sources of different character and mixed them. The 1,000 trajectories that passed review became a benchmark called TELBench: 600 easy and 400 hard, with an average of 11.95 spans per trajectory.

The answer key was built by splitting the work between machine and human. Two LLMs read the trajectories and pulled out as many suspect spans as they could. Then each trajectory went to two of seven experts, who judged against the evidence whether each candidate was a real error and reconciled wherever the two disagreed. The expert time alone runs over 300 hours. That is how much human effort went into 1,000 answer-key items, and that burden comes back when we get to the builder.

The labels do not stop at “this is wrong.” Each error also gets which stage it happened at and what kind it was. The stages are eight: planning, search, source verification, extraction, computation, decision, recovery, finalization. The error kinds are eighteen, gathered into six groups: constraint handling, search and retrieval, evidence grounding, entity mapping, information processing, process control. This is not a taxonomy fixed in advance and forced to fit, but one lifted bottom-up from the rationales the labelers wrote. It is the same move MAST made in Part 3, sorting multi-agent failures into fourteen kinds, except this time it is rebuilt over the long log of a single DR agent.
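As a data structure, one label might look like the sketch below. The stage and group values come from the lists above; the class names, the field names, and the sample record are assumptions for illustration. The eighteen individual error kinds are not listed here, so the sketch stops at the group level.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PLANNING = "planning"
    SEARCH = "search"
    SOURCE_VERIFICATION = "source verification"
    EXTRACTION = "extraction"
    COMPUTATION = "computation"
    DECISION = "decision"
    RECOVERY = "recovery"
    FINALIZATION = "finalization"


class ErrorGroup(Enum):
    CONSTRAINT_HANDLING = "constraint handling"
    SEARCH_AND_RETRIEVAL = "search and retrieval"
    EVIDENCE_GROUNDING = "evidence grounding"
    ENTITY_MAPPING = "entity mapping"
    INFORMATION_PROCESSING = "information processing"
    PROCESS_CONTROL = "process control"


@dataclass(frozen=True)
class ErrorLabel:
    span_index: int   # position of the span in the trajectory (hypothetical field)
    stage: Stage
    group: ErrorGroup
    rationale: str    # the labeler's free-text reason


# Hypothetical record, not taken from TELBench.
label = ErrorLabel(7, Stage.EXTRACTION, ErrorGroup.EVIDENCE_GROUNDING,
                   "figure differs from the cited source")
print(label.stage.value, "/", label.group.value)
```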

One more thing worth flagging. This answer key was made LLM-first, human-second: the machine proposes candidates, the human filters. Which means an error the machine never proposed rarely catches the human eye either. The range of what gets caught is tied, to a degree, to the machine’s first pass. On top of that, the paper reports how many hours the experts spent but does not put a number on how much the two of them agreed. Even at the desk where you evaluate the evaluator, the question of “how much do you trust the answer key itself” stays one layer down.
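One standard way to put a number on that agreement is Cohen's kappa, which corrects raw agreement for what two labelers would match on by chance. The sketch below is a swapped-in illustration, not something the paper reports, and the two label lists are invented toy data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Chance-corrected agreement between two annotators on binary labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in (True, False)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: True means "this candidate span is a real error".
expert_1 = [True, True, False, False, True, False, True, False, False, False]
expert_2 = [True, False, False, False, True, False, True, True, False, False]

kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

On this toy data the two experts agree on 8 of 10 candidates, but the chance-corrected figure is lower because half of that agreement would be expected anyway.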

DRIFT: three steps that follow the claims

If TRAIL from Part 2 cut the trajectory into spans and judged those, DRIFT goes one rung inside the span. A single span can hold several claims mixed together. The way a junior says a true thing and a false thing in the same paragraph. At the span level, the most you get is “this paragraph looks off,” but at the claim level you reach “this one line in this paragraph has no support.” This is the last rung of the descent the series has followed.

DRIFT is a claim-centered auditing method. Not whether the answer was right, but whether the claims propping up the answer actually attach to evidence. It runs in three steps.

```mermaid
%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart TD
    CK["Claim Keeper: claim ledger"] --> SS["Support Seeker: check evidence"]
    SS --> DT["Dependency Tracer: trace the spread"]
    DT --> M["flag spans that affect the answer"]
```

First, the Claim Keeper builds a claim ledger. For each claim the agent makes, it records when it first appeared, where it was first used consequentially, what later depends on it, and whether it is still tentative or already committed. Like writing the junior’s statements from the report into a ledger, line by line.
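A ledger with those four fields can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DRIFT's implementation: `ClaimEntry`, `ClaimLedger`, their methods, and the step numbers in the example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClaimEntry:
    claim_id: str
    text: str
    first_seen_step: int
    first_used_step: Optional[int] = None                # first consequential use, if any
    dependents: list[str] = field(default_factory=list)  # what later builds on this claim
    committed: bool = False                              # False means still tentative


class ClaimLedger:
    def __init__(self) -> None:
        self.entries: dict[str, ClaimEntry] = {}

    def record(self, claim_id: str, text: str, step: int) -> ClaimEntry:
        # Keep the earliest appearance if the claim is repeated later.
        return self.entries.setdefault(claim_id, ClaimEntry(claim_id, text, step))

    def mark_used(self, claim_id: str, step: int, dependent_id: str,
                  commit: bool = False) -> None:
        entry = self.entries[claim_id]
        if entry.first_used_step is None:
            entry.first_used_step = step
        if dependent_id not in entry.dependents:
            entry.dependents.append(dependent_id)
        entry.committed = entry.committed or commit


ledger = ClaimLedger()
ledger.record("market_size", "this market was 5 trillion won last year", step=4)
ledger.mark_used("market_size", step=9, dependent_id="growth_estimate")
ledger.mark_used("market_size", step=14, dependent_id="conclusion", commit=True)

entry = ledger.entries["market_size"]
print(entry.first_seen_step, entry.first_used_step, entry.dependents, entry.committed)
```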

Second, the Support Seeker weighs the evidence behind those claims. For each consequentially used claim, it sorts the support into four grades: directly supported, weakly supported, no support, contradicted by evidence. If the junior wrote “this market grows 20% a year,” it checks line by line whether that figure is actually in the source they pulled.
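For numeric claims, the four grades can be approximated with a cheap rule. The sketch below is a toy stand-in, not the Support Seeker itself: the function name, the idea of grading by numeric closeness, and the 5% tolerance for "weakly supported" are all assumptions made for the example.

```python
import math
from enum import Enum


class SupportGrade(Enum):
    DIRECTLY_SUPPORTED = "directly supported"
    WEAKLY_SUPPORTED = "weakly supported"
    NO_SUPPORT = "no support"
    CONTRADICTED = "contradicted by evidence"


def grade_numeric_claim(claimed: float, source_values: list[float],
                        rel_tol: float = 0.05) -> SupportGrade:
    """Grade a claimed figure against the values the source gives for the same quantity."""
    if not source_values:
        return SupportGrade.NO_SUPPORT          # the source is silent on this quantity
    if any(math.isclose(claimed, v) for v in source_values):
        return SupportGrade.DIRECTLY_SUPPORTED  # the figure is in the source
    if any(math.isclose(claimed, v, rel_tol=rel_tol) for v in source_values):
        return SupportGrade.WEAKLY_SUPPORTED    # close, e.g. a rounded figure (assumed rule)
    return SupportGrade.CONTRADICTED            # the source gives a different figure


print(grade_numeric_claim(20.0, [20.0]).value)
print(grade_numeric_claim(20.0, [19.5]).value)
print(grade_numeric_claim(20.0, []).value)
print(grade_numeric_claim(5.0, [3.0]).value)
```

A rule like this only covers figures. Claims about entities, dates, or causal links need a reader, human or model, that can judge meaning rather than match numbers.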

Third, the Dependency Tracer follows the spread. It maps the path a weakly grounded claim took into later reasoning, and flags the spans that pulled it in, amplified it, or nailed it into the conclusion. It is retracing how one wrong line rolled all the way to the end of the report.
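If the dependencies are written down as a graph, tracing the spread is a reachability problem. The sketch below assumes a hypothetical `depends_on` map from each node to the claims it builds on, and keeps only nodes that are both downstream of a weak claim and upstream of the answer. It illustrates the idea and is not the paper's algorithm.

```python
from collections import deque


def reachable(starts: set[str], edges: dict[str, list[str]]) -> set[str]:
    """Breadth-first search: every node reachable from the start set."""
    seen, queue = set(starts), deque(starts)
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def trace_spread(depends_on: dict[str, list[str]], weak_claims: set[str],
                 answer: str) -> set[str]:
    """Weak claims, and what they tainted, that lie on a path to the answer."""
    # Invert "depends on" into "feeds into" so we can walk forward.
    feeds: dict[str, list[str]] = {}
    for node, parents in depends_on.items():
        for parent in parents:
            feeds.setdefault(parent, []).append(node)
    downstream = reachable(weak_claims, feeds)            # tainted by a weak claim
    upstream_of_answer = reachable({answer}, depends_on)  # what the answer rests on
    return downstream & upstream_of_answer


# Hypothetical dependency map for illustration.
depends_on = {
    "growth_estimate": ["market_size"],
    "conclusion": ["growth_estimate", "competitor_count"],
    "side_note": ["market_size"],
    "unused_guess": [],
}
flagged = trace_spread(depends_on, {"market_size", "unused_guess"}, "conclusion")
print(sorted(flagged))
```

The side note and the unused guess drop out: both are tainted or weak, but neither reaches the conclusion.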

Threaded into one scene it looks like this. Mid-search, the agent makes the claim “this market was 5 trillion won last year.” The Claim Keeper enters it in the ledger and notes it was used in the later growth estimate and conclusion. The Support Seeker checks the evidence and finds the source actually says 3 trillion won, not 5. It gets sorted as contradicted by evidence. The Dependency Tracer follows where that 5 trillion went, and the growth math and the final conclusion both stand on it. So that line and the later spans leaning on it get bundled as harmful error. The junior copied one number wrong, and the path it rolled to the end of the report comes out in full.

Put the three steps together and the structure is identical to an audit in a human organization. Write down who said what (ledger), check whether it attaches to a source (evidence), follow whether the wrong statement spread to the conclusion (spread). What DRIFT did was run this procedure automatically over a DR agent’s long log.

It catches half, but misses the first tile

Performance is read on two yardsticks: how well it picked out the harmful spans (F1), and whether it hit the point that first went off (first-error accuracy, FEA from here on). The table below compares bare prompting (Bare) against DRIFT for each auditor model. The auditor models here are separate from the backbones that produced the data.

| Auditor model | F1 (Bare → DRIFT) | FEA (Bare → DRIFT) |
| --- | --- | --- |
| DeepSeek-V3.2 | 22.46% → 50.51% | 10.30% → 23.70% |
| GPT-5.4 | 33.93% → 52.48% | 14.90% → 20.80% |
| Claude-Sonnet-4.6 | 21.89% → 54.91% | 11.30% → 24.10% |
| Gemini-2.5-Pro | 31.01% → 48.41% | 15.70% → 19.90% |

Slot DRIFT in and F1 on harmful spans rises by roughly 17 to 33 points over bare prompting. The best pairing (Claude-Sonnet-4.6) reaches F1 54.91%, just past halfway. That a single added layer of checking claims against evidence lifts it this much is the method’s achievement.

The trouble is the second column. FEA, that is “where did it first go wrong,” is stuck between roughly 20% and 24%. On hard trajectories it falls further: for DeepSeek-V3.2 the easy side is 34.5%, the hard side 7.5%. Flagging the harmful spans broadly and pinpointing the very first one among them are different jobs, and the paper leaves the latter as an open problem.
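The difference between the two jobs shows up when both metrics are computed on the same prediction. The sketch below uses assumed definitions, since the exact ones are not given here: F1 over sets of span indices, and a first-error hit when the earliest predicted span equals the earliest gold span. The indices are invented.

```python
def span_f1(predicted: set[int], gold: set[int]) -> float:
    """Harmonic mean of precision and recall over flagged span indices."""
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def first_error_hit(predicted: set[int], gold: set[int]) -> bool:
    """True only if the earliest flagged span is the earliest harmful span."""
    return bool(predicted) and bool(gold) and min(predicted) == min(gold)


gold = {4, 9, 14}        # span 4 is where the trajectory first went wrong
predicted = {9, 14, 15}  # the auditor catches the fallout but not the onset

print(round(span_f1(predicted, gold), 3), first_error_hit(predicted, gold))
```

Two of three harmful spans are caught, so F1 looks respectable, yet the first-error check fails because the onset at span 4 was missed.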

The reason is not hard to guess. Once one tile tips, everything behind it looks tipped, so the auditor sees a floor strewn with off-looking spans. Telling which is the cause and which is the consequence is the hard part. Many symptoms, one point of onset. The same way it takes work for a person to read a report backward asking “so where did this actually start.”

The gap between easy and hard work is wide too. On the same DeepSeek, harmful-span F1 climbs to 57.81% on easy trajectories but drops to 39.57% on hard ones. DRIFT leads bare on both sides, but as the job gets long and complex both methods crumble together. The research that actually needs diagnosis is the complex one the junior spent days on, and that is exactly where the evaluation weakens.
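The easy and hard figures line up with the overall scores in the table. A quick check, assuming the overall score is the sample-weighted mean over TELBench's 600 easy and 400 hard trajectories (an assumption about how the numbers were aggregated; `weighted_mean` is a helper written for this example):

```python
def weighted_mean(easy: float, hard: float,
                  n_easy: int = 600, n_hard: int = 400) -> float:
    """Sample-weighted mean of per-split scores, in percent."""
    return (easy * n_easy + hard * n_hard) / (n_easy + n_hard)


f1_overall = weighted_mean(57.81, 39.57)  # DeepSeek-V3.2 with DRIFT, harmful-span F1
fea_overall = weighted_mean(34.5, 7.5)    # DeepSeek-V3.2 with DRIFT, first-error accuracy

print(f"{f1_overall:.2f} {fea_overall:.2f}")
```

Both results match the DeepSeek-V3.2 row (50.51% and 23.70%), which also shows how much of the overall score is carried by the easy trajectories.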

The limits the paper notes point the same way. Scaling the model alone does not improve diagnosis. Within the same family, a bigger model did not reliably diagnose trajectories better, and the auditing structure mattered more than model size. The longer the trajectory, the more both methods drop, and how much they catch varies by error type. Subtle kinds, like a judgment committed too early without support, stay weak.

One thing worth noting

Back to the domino, the sorest spot is this second column. The biggest loss is when the junior’s research was wrong from the start, because everything stacked on it turns to nothing. And the thing DRIFT is worst at finding is exactly that first slip. Flagging the answer-ruining spans after the fact is halfway there, but pinpointing the single tile that, if touched, brings everything after it back to life, is still far off. For evaluation to move from “showing what looks off” to “pointing at what to fix,” it has to clear this column.

What the builder takes

A solo builder is never going to replicate a 1,000-item benchmark and 300 hours of labeling. What you take is the idea of a claim ledger. Write down the claims your agent makes along the way, together with where each first appeared and where it flows. On top of that, attach a cheap check asking “does this claim actually attach to the source it pulled,” and you can look at the spots with thin support first, without reading every log by hand. Smaller scale, same principle.

That the grading is done by an LLM is worth pausing on. The direction itself is natural. Finding failures is also a task, and a task improves with how well you guide it. Still, it will not resolve as simply as bolting on a few examples. The grading LLM is itself an agent reading and reasoning over a long trajectory. Unlike classification that pins an answer onto a short input, it has to follow a branching process and judge where it went off. The grader’s failure modes start to resemble the failures it grades. Part of this gives way to better guidance, and part gets harder as the division of labor deepens. FEA stuck at around 20% is the evidence.

So what a builder can do right now is put a second pair of eyes on the seams of the division. At each boundary where a sub-agent hands its result to the next step, the spot where the junior hands research over as a report, place a checkpoint that asks “does the support for what came in this far hold.” When the research sub-agent hands over a summary, even a machine skim of whether its numbers and citations are actually in the original is enough to start. It is far cheaper than auditing the whole trajectory, and it filters the costliest first-step slip right there. Until the tool can catch the first slip, having a person watch the most expensive seam pays for itself.
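A machine skim of that kind can be very small. The sketch below checks only whether each number in a handed-over summary appears anywhere in the source text. The function name, the regular expression, and the sample texts are assumptions for illustration, and it does not check that a number is attached to the right quantity.

```python
import re

# Integers and decimals, with optional thousands separators (e.g. 2,790 or 11.95).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalize(number: str) -> str:
    """Drop thousands separators so 2,790 and 2790 compare equal."""
    return number.replace(",", "")


def unsupported_numbers(summary: str, source: str) -> list[str]:
    """Return numbers in the summary that never appear in the source text."""
    source_numbers = {normalize(n) for n in NUMBER.findall(source)}
    return [n for n in NUMBER.findall(summary)
            if normalize(n) not in source_numbers]


# Hypothetical hand-off between a research sub-agent and the next step.
source = "The market was 3 trillion won in 2024 and grew 20% a year."
summary = "The market was 5 trillion won in 2024, growing 20% annually."

print(unsupported_numbers(summary, source))
```

Anything this returns goes to a person before the next step runs. An empty list does not prove the summary is right, only that its figures exist somewhere in the source.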

Closing

We have come through four parts. Part 1 was the landscape of answer evaluation breaking down, Part 2 three altitudes for evaluating well-packaged agents, Part 3 the tools for evaluating the agent you wired yourself. And Part 4’s DRIFT pulled the unit of measurement down to the claim. From answer to trajectory, to span, to claim, one rung at a time.

Why go all the way down? Because the agent divides the work. In a division of labor, as in a human organization, which step first went off decides everything, and to find that one tile you have no choice but to cut the steps finer and look. That is the direction the seven papers in this series share. Evaluation keeps dropping finer not out of academic taste, but to follow the domino that division of labor sets up.

So DRIFT is not the destination. “The agent divides up the work” is one sentence, and it cannot hold every job and process in the world. As the division gets more complex, evaluation will split into yet another unit, and how far that goes nobody knows yet. The claim is only the current bottom rung: not the end, but a branch that has just split off.

So Part 1’s conclusion stands again. The moment you step outside the territory where the vendor slots evaluation in, evaluation stops being something you buy and becomes something you handle by hand. Because the tool is still early, unable to catch even the first slip, the reason the builder has to take evaluation on only gets clearer.


Sources

  • DRIFT / TELBench, Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories (arXiv:2606.02060)