Three Altitudes for Evaluating a Well-Built Agent: Score, Claim, Substrate

Part 1 laid out how the unit of evaluation is sliding from the answer down to the process, and noted the place this matters most is the agent a builder wired by hand. But look closely at the well-packaged agents too, the ones a company polished and shipped like Claude Code or Deep Research, and they go wrong more than you would think. The three papers that show this are TRACE, DeepHalluBench, and TRAIL.

Set them side by side and one thing surfaces. They evaluate the same target, but the altitude at which they intervene is different. TRACE changes the definition of the score, DeepHalluBench verifies claims one by one before the answer lands, and TRAIL changes what you measure on in the first place. Output, claim, substrate. Part 2 reads these three altitudes by method. The point is not what score they got but how they approached it.

These three did not pick well-built agents by accident. The fact that even the most polished thing leaks once you look this closely is the very reason evaluation has to come down to the process.

%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart TD
    Q["well-packaged agent (Claude Code, Deep Research)"]
    Q --> A1["TRACE: rewrite the score (output altitude)"]
    Q --> A2["DeepHalluBench: verify the claims (claim altitude)"]
    Q --> A3["TRAIL: change the substrate (span altitude)"]

TRACE: Rewrite the Score

TRACE’s method compresses into a single scoring formula. Instead of answer matching, it multiplies the correctness indicator by process efficiency and cognitive quality. Unfolded: a final-answer indicator that is 1 if correct and 0 if wrong, multiplied by a weighted power of the efficiency score and a weighted power of the cognitive-quality score. The multiplication is the point. If any one of the three is low, the whole thing gets dragged down. In the paper’s phrasing, a research process is only as strong as its weakest link.

Efficiency is a complexity reward divided by trajectory cost. Harder tasks get a bigger reward, and repeating similar actions with no new information raises the cost through a cosine-similarity penalty. Cognitive quality blends two things. Evidence grounding measures whether each claim attaches to a cited source, taken as the geometric mean of Natural Language Inference (NLI) probabilities. Because it is a geometric mean, a single ungrounded claim tanks the score. Reasoning robustness measures how fast the agent recovers from deliberately planted traps, via exponential decay.
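The pieces in this paragraph can be sketched in a few lines. This is a minimal illustration, not TRACE's published code: the function names, the per-step cost of 1, `penalty_weight`, and `decay_rate` are all assumptions made for the example, and the embeddings are toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def trajectory_cost(step_embeddings, penalty_weight=1.0):
    """One unit per step, plus a penalty for resembling any earlier step (assumed form)."""
    cost = 0.0
    for i, emb in enumerate(step_embeddings):
        most_similar = max((cosine(emb, prev) for prev in step_embeddings[:i]), default=0.0)
        cost += 1.0 + penalty_weight * max(most_similar, 0.0)
    return cost

def efficiency(complexity_reward, step_embeddings):
    """Complexity reward divided by trajectory cost."""
    cost = trajectory_cost(step_embeddings)
    return complexity_reward / cost if cost else 0.0

def grounding_score(nli_probs):
    """Geometric mean of per-claim entailment probabilities."""
    if not nli_probs or any(p <= 0.0 for p in nli_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in nli_probs) / len(nli_probs))

def robustness_score(steps_to_recover, decay_rate=0.5):
    """Exponential decay in the steps spent before recovering from a trap."""
    return math.exp(-decay_rate * steps_to_recover)

# Toy data: three distinct steps versus the same step repeated three times.
distinct_steps = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
repeated_steps = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(efficiency(3.0, distinct_steps), efficiency(3.0, repeated_steps))

# One ungrounded claim among four pulls the geometric mean far below the others.
well_grounded = [0.95, 0.92, 0.97, 0.94]
one_bad_claim = [0.95, 0.92, 0.97, 0.01]
print(grounding_score(well_grounded), grounding_score(one_bad_claim))
```

An arithmetic mean of `one_bad_claim` would still look respectable; the geometric mean does not, which is the property the paragraph above relies on.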

The effect of the multiplicative structure is intuitive. Get the answer right but let efficiency fall near zero and the utility converges to zero too; keep efficiency and accuracy high but leave grounding empty and cognitive quality drags it down. Had it been addition, a high score on one axis would have masked the empty spot on another, but multiplication does not allow that.
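A small sketch makes the contrast with addition concrete. The exponents `alpha` and `beta` stand in for the paper's weights, whose actual values are not given here, so the defaults of 1.0 are an assumption; `additive_baseline` is a hypothetical comparison, not something TRACE proposes.

```python
def trace_utility(correct, efficiency, cognitive, alpha=1.0, beta=1.0):
    """Correctness indicator times weighted powers of efficiency and cognitive quality."""
    indicator = 1.0 if correct else 0.0
    return indicator * (efficiency ** alpha) * (cognitive ** beta)

def additive_baseline(correct, efficiency, cognitive):
    """Hypothetical additive score, shown only for contrast."""
    return ((1.0 if correct else 0.0) + efficiency + cognitive) / 3.0

# A correct answer reached through a very wasteful trajectory.
wasteful = dict(correct=True, efficiency=0.05, cognitive=0.9)
print(trace_utility(**wasteful))      # multiplication exposes the weak axis
print(additive_baseline(**wasteful))  # addition hides it behind the other two

# A wrong answer zeroes the multiplicative score no matter how clean the process.
print(trace_utility(False, 0.9, 0.9))
```

With these illustrative inputs the additive score stays above 0.6 while the multiplicative one falls below 0.05.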

Apply this formula to Part 1’s high-score illusion and the numbers line up.

Component	What it checks	DeepSeek-V3.1-671B
Answer correctness	1 if right, 0 if wrong (zeroes the whole thing)	1 (Pass@1 65.8%)
Efficiency (E)	Trajectory cost vs complexity, redundant-exploration penalty	0.68
Grounding	Do claims attach to evidence (geometric mean of NLI)	0.90
Robustness	Does it recover from traps	0.80
Cognitive quality (C)	Grounding and robustness combined	0.85
Final utility (U)	The above, multiplied	0.65

The culprit behind the top-accuracy model (65.8%) sinking to 0.65 utility is the efficiency of 0.68. Grounding and robustness are both high; in a multiplicative structure, efficiency alone tripped it.

TRACE also splits the measurement environment into three: Core (500 tasks, 20% traps) for overall performance, Robustness (100 tasks, 100% traps) to see self-correction, and Scaffolding (50 tasks) to draw out latent ability with minimal cues. For latent ability it gives only the first fraction of an oracle trajectory as a hint and measures the minimum hint fraction needed for reliable success. The lower that fraction, the stronger the agent’s own problem-solving power. AgentFounder-30B needed 0.22; DeepSeek-V3.1 needed 0.35. DeepSeek led on answer rate, but the power to solve with fewer hints belonged to AgentFounder.
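The minimum-hint measurement can be pictured as a scan over hint fractions. The sketch below is an assumption-laden illustration: the 0.8 threshold for "reliable success", the grid of fractions, and the success rates are all made up, and the paper's actual search procedure may differ.

```python
def minimum_hint_fraction(success_rate_at, fractions, threshold=0.8):
    """Smallest hint fraction whose measured success rate reaches the threshold.

    success_rate_at: callable mapping a hint fraction to a success rate in [0, 1].
    Returns None when no fraction on the grid is enough.
    """
    for fraction in sorted(fractions):
        if success_rate_at(fraction) >= threshold:
            return fraction
    return None

# Hypothetical measurements: share of runs that succeed at each hint fraction.
measured = {0.0: 0.30, 0.1: 0.50, 0.2: 0.70, 0.3: 0.85, 0.4: 0.90}

print(minimum_hint_fraction(measured.get, measured))        # smallest sufficient hint
print(minimum_hint_fraction(measured.get, measured, 0.95))  # None: never reliable here
```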

It goes one step further. TRACE runs the same task several times to measure whether the strategy stays consistent (reproducibility), and how efficiently the agent converts each new piece of information into reduced uncertainty (adaptability). Braid the two and you get not a score but a personality. AgentFounder-30B came out “systematic and efficient,” with reproducibility 0.89 and adaptability 0.82. Rewriting the score is not just reshuffling the ranking; it surfaces the texture of the strategy hidden behind the same answer.

What this method touches

TRACE looks at the trajectory but does not pinpoint “which step went wrong.” Instead it changes the definition that compresses the whole process into one number. So it works at the highest altitude, the output score. The intervention is to keep the leaderboard that lined agents up by a single Pass@1, yet pull efficiency, grounding, and reproducibility into the score. That is its strength and its limit. It is light, since you barely change how you run the leaderboard, but it does not tell you which stage caused the inefficiency. That is the next two methods’ job.

DeepHalluBench: Verify the Claims Before the Answer

DeepHalluBench drops one altitude. Instead of the final answer, it holds each claim that makes up the answer against its evidence. To do that it first has to slice the trajectory finely. Summary-stage output becomes atomic claims with citations preserved, planning-stage output becomes individual search actions, and the user query becomes atomic restrictions (sub-queries).

The core is running verification as a two-stage cascade. First it narrows the evidence for each claim: an embedding-similarity filter (cutoff 0.4) makes the first cut, then a reranker pulls the top five chunks, each a 15-sentence unit sized to balance context and cost. Then an NLI model (DeBERTa family) gives a first verdict; if confidence exceeds 0.99 it settles there, and if ambiguous it escalates to an LLM (DeepSeek family) for the final call. Filtering with a cheap model and finishing with an expensive one keeps the cost from exploding even while verifying thousands of claims.
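The cascade reads naturally as a short function. This is a sketch under stated assumptions, not DeepHalluBench's implementation: `similarity`, `rerank`, `nli`, and `llm_judge` are hypothetical callables you would back with real models, the chunks are assumed to be pre-sliced, and how the verdicts of the five chunks are combined is a guess made for the example. Only the three constants come from the text.

```python
SIM_CUTOFF = 0.4       # embedding-similarity cut
TOP_K = 5              # chunks kept after reranking
NLI_CONFIDENCE = 0.99  # above this, the NLI verdict is final

def verify_claim(claim, chunks, similarity, rerank, nli, llm_judge):
    """Return (verdict, decided_by) for one atomic claim."""
    candidates = [c for c in chunks if similarity(claim, c) >= SIM_CUTOFF]
    top = sorted(candidates, key=lambda c: rerank(claim, c), reverse=True)[:TOP_K]
    if not top:
        return "unsupported", "retrieval"

    verdicts = [nli(claim, chunk) for chunk in top]
    confident = [(label, conf) for label, conf in verdicts if conf > NLI_CONFIDENCE]
    if any(label == "entailment" for label, _ in confident):
        return "supported", "nli"       # cheap model settles it
    if len(confident) == len(verdicts):
        return "unsupported", "nli"     # confidently not entailed by any chunk
    return llm_judge(claim, top), "llm"  # ambiguous: escalate to the expensive model

# Toy stand-ins so the sketch runs without any model.
def toy_similarity(claim, chunk):
    claim_words = set(claim.lower().split())
    return len(claim_words & set(chunk.lower().split())) / len(claim_words)

def confident_nli(claim, chunk):
    return "entailment", 0.995

def hedging_nli(claim, chunk):
    return "neutral", 0.70

def toy_llm(claim, chunks):
    return "supported"

claim = "the eiffel tower is in paris"
chunks = ["the eiffel tower is in paris france", "bananas are yellow"]
print(verify_claim(claim, chunks, toy_similarity, toy_similarity, confident_nli, toy_llm))
print(verify_claim(claim, chunks, toy_similarity, toy_similarity, hedging_nli, toy_llm))
```

The second element of the returned pair records which stage decided, which is what you would count to see how much traffic the expensive model actually receives.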

When the first pass finds no support, the second pass branches. A claim that carries a citation gets its scope widened to the full document set and re-checked. If support turns up there, it was a misattribution, a wrong citation; if still none, it was fabrication. An intermediate claim with no citation gets compared against earlier findings (a reflection check), and if it does not cohere, it is taken as fabrication. The reason for separating misattribution from fabrication is that they are fixed differently. Misattribution only needs the citation link repaired, while fabrication means lifting the claim out entirely. Same “unsupported,” different prescription.
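The second-pass branching is essentially a small decision tree. In the sketch below the two checker callables and the label strings are hypothetical; the branching itself follows the paragraph above, and the remedy table restates its prescription.

```python
def classify_unsupported(claim, has_citation, supported_in_full_corpus, coheres_with_findings):
    """Second pass for a claim the first pass could not support.

    supported_in_full_corpus / coheres_with_findings are hypothetical callables
    that take the claim and return True or False.
    """
    if has_citation:
        # Support exists somewhere else in the document set: the citation was wrong.
        return "misattribution" if supported_in_full_corpus(claim) else "fabrication"
    # No citation: reflection check against earlier findings.
    return "coherent" if coheres_with_findings(claim) else "fabrication"

REMEDY = {
    "misattribution": "repair the citation link",
    "fabrication": "remove the claim",
    "coherent": "keep the claim",
}

label = classify_unsupported(
    "the study enrolled 400 patients",
    has_citation=True,
    supported_in_full_corpus=lambda claim: True,
    coheres_with_findings=lambda claim: False,
)
print(label, "->", REMEDY[label])
```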

%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart LR
    C["one atomic claim"] --> R["retrieve evidence (0.4 cut, top-5 rerank)"]
    R --> N["NLI model first pass"]
    N --> V["confidence over 0.99: settled"]
    N --> L["ambiguous: LLM re-judges"]
    L --> V

Verification splits into four modules, each owning one strand of hallucination. Claim verification covers grounding faults (summary stage, explicit), noise detection covers noise the agent failed to filter (summary stage, implicit), action verification covers propagation built on an earlier hallucination plus intent drift (planning stage, explicit), and restriction checking covers constraints the agent neglected (planning stage, implicit). The four initials make PING (Propagation, Intent, Noise, Grounding).

Module	Hallucination it catches	Stage
Claim verification	Grounding fault (fabrication, misattribution)	Summary, explicit
Noise detection	Noise the agent failed to filter out	Summary, implicit
Action verification	Propagation of an earlier hallucination, intent drift	Planning, explicit
Restriction checking	Neglected user constraint	Planning, implicit

Among the four, noise detection works differently. It clusters retrieved evidence by meaning, ranks the clusters by relevance to the sub-queries, and penalizes ignoring a high-value cluster, normalizing against the worst case where the most important evidence is systematically dropped. On top of it sits a separate metric for retrieval quality itself. That lets it tell whether a hallucination came because retrieval pulled irrelevant material in the first place, or because retrieval was fine and the summary stage failed to prioritize. The same wrong answer points to different places to fix: a retrieval fix versus a summarization fix.
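One way to picture the worst-case normalization is below. To be loud about it: this is not the paper's formula. The relevance scores, the notion of a cluster being "used", and the normalization rule (divide by the relevance you would lose by ignoring the same number of clusters taken from the top of the ranking) are all assumptions chosen to show the idea.

```python
def noise_penalty(cluster_relevance, used_clusters):
    """Relevance lost to ignored clusters, normalized by the worst case.

    cluster_relevance: dict of cluster id -> relevance to the sub-queries (hypothetical).
    used_clusters: set of cluster ids the summary actually drew on.
    Returns a value in [0, 1]; 1 means the most relevant clusters were the ones dropped.
    """
    ignored = [cid for cid in cluster_relevance if cid not in used_clusters]
    if not ignored:
        return 0.0
    missed = sum(cluster_relevance[cid] for cid in ignored)
    # Worst case: the same number of clusters ignored, but the top-ranked ones.
    worst = sum(sorted(cluster_relevance.values(), reverse=True)[:len(ignored)])
    return missed / worst if worst > 0 else 0.0

relevance = {"a": 0.9, "b": 0.6, "c": 0.2, "d": 0.1}
print(noise_penalty(relevance, {"a", "b"}))  # dropped only low-value clusters
print(noise_penalty(relevance, {"c", "d"}))  # dropped the two most relevant clusters
```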

The four module scores are averaged with equal weight into a composite hallucination score. Sliced this way by stage, you get the picture cited in Part 1. Proprietary agents have over 57% of their root errors in planning and early search, propagating forward; Salesforce has over 40% blow up at the conclusion stage. Look only at the final answer and both are just “an answer with hallucination.” Drop to the claim level and the starting points separate.
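The equal-weight composite is a plain mean, and a tiny example shows why the per-module breakdown is worth keeping next to it. The sketch assumes each module score lies in [0, 1] with higher meaning more hallucination; the two profiles are invented for illustration.

```python
import math

def composite_hallucination_score(claim, noise, action, restriction):
    """Equal-weight average of the four module scores."""
    return (claim + noise + action + restriction) / 4

# Two hypothetical agents with the same composite but different failure profiles.
planning_heavy = {"claim": 0.1, "noise": 0.1, "action": 0.5, "restriction": 0.3}
summary_heavy = {"claim": 0.5, "noise": 0.3, "action": 0.1, "restriction": 0.1}

print(composite_hallucination_score(**planning_heavy))
print(composite_hallucination_score(**summary_heavy))
```

Both profiles average to about 0.25, yet one fails mostly in planning and the other mostly in summarization.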

Is this method trustworthy?

If the verifier itself is weak, all of this is meaningless. So DeepHalluBench ran the claim verifier on public fact-checking sets first. It confirmed about 95% accuracy on FEVER and over 85% on SciFact-Open before attaching it to real agent trajectories. A method that inserts a verification layer only holds up if it proves that layer’s reliability first.

It is also worth noting that the reliability comes from the NLI-and-LLM cascade, not the NLI model alone. The cheap model handles the high-confidence majority, and only the genuinely ambiguous minority goes to the expensive one. Without this structure, which secures accuracy and cost at once, verifying the hundreds of claims that pour out of a single trajectory entirely with an LLM would have been impractical.

TRAIL: Change What You Measure On

The third altitude is the lowest, the substrate. TRAIL’s starting point is one line. Trace analysis so far assumed records written out as text, but real agent frameworks emit structured records in standard formats like OpenTelemetry. And LLMs are weak with that structured data. So text-parsing approaches diverge from genuine production observability. TRAIL puts evaluation on OpenTelemetry spans, labeling errors by span rather than by document. This choice has a practical implication for builders. If you already observe your agent in production with OpenTelemetry, evaluation is not a separate apparatus running on the side; it sits right on the traces you are already collecting. Observation and evaluation use the same substrate.

On top of that it lays a three-category error taxonomy: reasoning, system execution, and planning-and-coordination.

Category	Sub-types (partial)
Reasoning	Hallucination (text, tool), information processing failure, tool-selection error, formatting and instruction non-compliance
System execution	Tool config error, API errors (429, 401, 500, 404), resource exhaustion and timeout
Planning and coordination	Context-management failure, goal deviation, task-orchestration error
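Span-level labeling can be sketched with plain data classes. These are simplified stand-ins, not the OpenTelemetry SDK's classes and not TRAIL's annotation schema: the field names mirror common span fields, and the span names, ids, and the HTTP status attribute are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ErrorCategory(Enum):
    REASONING = "reasoning"
    SYSTEM_EXECUTION = "system execution"
    PLANNING_COORDINATION = "planning and coordination"

@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class ErrorLabel:
    span_id: str
    category: ErrorCategory
    subtype: str

def labels_by_span(labels):
    """Group error labels by the span they are attached to."""
    grouped = {}
    for label in labels:
        grouped.setdefault(label.span_id, []).append(label)
    return grouped

def path_to_span(spans, span_id):
    """Names from the root span down to the given span."""
    by_id = {s.span_id: s for s in spans}
    path, current = [], by_id.get(span_id)
    while current is not None:
        path.append(current.name)
        current = by_id.get(current.parent_span_id) if current.parent_span_id else None
    return list(reversed(path))

trace = [
    Span("s1", None, "manager.run"),
    Span("s2", "s1", "search_agent.step"),
    Span("s3", "s2", "tool.web_search", {"http.response.status_code": 429}),
]
labels = [ErrorLabel("s3", ErrorCategory.SYSTEM_EXECUTION, "API error (429)")]

for span_id, found in labels_by_span(labels).items():
    print(" > ".join(path_to_span(trace, span_id)), [l.subtype for l in found])
```

Because the label hangs on a span id rather than on the document, the path from the root to the failing call comes for free.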

What stands out is the system-execution category. Items like API error codes, resource exhaustion, and timeouts are the kind of thing an operations engineer would debug rather than a model researcher. They entered naturally because evaluation was placed on the same unit as production observability.

The traces were made to fail on purpose. GAIA tasks ran on Hugging Face OpenDeepResearch’s manager-and-search-agent hierarchy (backbone o3-mini), while SWE-bench tasks ran on a single-agent CodeAct (backbone Claude 3.7 Sonnet), with constraints like output-length limits and forced exploration to induce errors. That collected 148 traces and 575 error spans across 1,987 spans.

Why does the span unit matter? Handing over a whole text record and saying “something here is wrong” is a different debugging resolution from labeling which of several hundred spans holds which error. The former is document-level and needs a human to reread; the latter is span-level and goes straight to the spot. The deliberate choice of two different trace sources follows the same logic. GAIA is a multi-agent setup where a manager directs search agents, so coordination and orchestration errors arise; SWE-bench is a single agent fixing code, so tool-call and execution errors arise. Look at only one kind of agent and half the taxonomy would have been empty.

What this method exposed

The result is as written in Part 1. Even the strongest, Gemini-2.5-Pro, localized errors at the span level with only 18.3% accuracy on GAIA and 5.0% on SWE-bench. TRAIL lowered the substrate so that production observation and evaluation meet at the same unit, but the LLM doing the scoring on top still cannot pin spans well. TRAIL shows that lowering the substrate and scoring well on top of it are two different problems.
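To see why span-level scoring is unforgiving, consider a sketch of how a judge's output might be compared with human labels. The two measures below ("location match" and "joint match") are hypothetical definitions written for this example; TRAIL's exact metric definitions are not reproduced here, and the span ids are invented.

```python
def span_localization_scores(gold, predicted):
    """Compare predicted (span_id, category) pairs with annotated ones for a trace.

    Returns (location_match, joint_match):
      location_match: share of annotated error spans the judge flagged at all.
      joint_match: share of annotated pairs matched on both span and category.
    """
    if not gold:
        perfect = 1.0 if not predicted else 0.0
        return perfect, perfect
    gold_spans = {span_id for span_id, _ in gold}
    predicted_spans = {span_id for span_id, _ in predicted}
    location_match = len(gold_spans & predicted_spans) / len(gold_spans)
    joint_match = len(gold & predicted) / len(gold)
    return location_match, joint_match

gold = {
    ("s3", "system execution"),
    ("s7", "reasoning"),
    ("s9", "planning and coordination"),
}
predicted = {
    ("s3", "system execution"),           # right span, right category
    ("s7", "planning and coordination"),  # right span, wrong category
    ("s12", "reasoning"),                 # wrong span
}
location, joint = span_localization_scores(gold, predicted)
print(round(location, 3), round(joint, 3))
```

A judge that senses roughly where things went wrong still loses credit twice: once for the wrong span and once for the wrong category on a right span.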

Closing

The three touched different places to evaluate the same target. TRACE changed the definition of the score to pull in efficiency and grounding, DeepHalluBench verified claims stage by stage before the answer landed, and TRAIL moved what you measure on from text down to spans. Output, claim, substrate. No one of them is the right answer; the choice is closer to picking the altitude at which your own problem’s failures leak.

One more thing to flag. The three stumble in the same spot despite their different altitudes. TRACE’s grounding score, DeepHalluBench’s claim verification, and TRAIL’s span scoring all ultimately get judged by an LLM. That DeepHalluBench confirmed its verifier to 95% on public sets before attaching it, and that even the strongest model managed only 18.3% on TRAIL’s span scoring, both point at the same unease. Whichever altitude you pick, you first have to weigh the reliability of the tool that scores at that altitude. The “the grader is still weak” problem from Part 1 runs straight through all three methods in Part 2. What a builder should ask first when choosing an evaluation tool is not the flash of the score but the verification record of the grader.

If a score swinging on luck is the problem, start with the definition of the score; if you want to know which stage seeds the hallucination, you need claim-level verification; if telemetry is already piling up in production, putting evaluation on those spans is the fit. But all three aim at well-packaged agents. They are methods for scoring, from the outside, something a company refined and shipped. An agent I wired myself, mixing my own sources and stringing together sub-agents, has a different grain. The evaluation environment does not exist, so it has to be built, and the locus of failure is blurred, so it has to be pinned separately. The three papers on that side come in Part 3.

Sources

TRACE, Trajectory-Aware Comprehensive Evaluation for Deep Research Agents (arXiv:2602.21230)
DeepHalluBench, Why Your Deep Research Agent Fails? (arXiv:2601.22984)
TRAIL, Trace Reasoning and Agentic Issue Localization (arXiv:2505.08638)