Locating Agent Failure

Evaluation of LLM agents is moving from final-answer accuracy down to trajectory-, span-, and claim-level failure localization. A method-first reading of seven arXiv papers from 2025-2026.

About this series

Agent evaluation that only checked whether the answer was right has hit a wall. The longer the task and the more tools it touches, the more the path to the same answer decides cost and trust. Seven papers answer one question in different ways: where did the agent go wrong.

This series is directly useful for builders wiring their own multi-agent flows, consultants advising on agent adoption, and PMs scrutinizing the reliability of evaluation metrics. We read for method, not benchmark scores.

Part 1 lays out why outcome evaluation breaks and maps the seven papers. From Part 2 onward, we look at how each paper localizes failure, one at a time. The approaches split into five strands: inductive taxonomy, instrumentation, verification layers, environment design, and metric formulation.

3 episodes

Where Did the Agent Go Wrong: From Answer Accuracy to Process Evaluation

Three Altitudes for Evaluating a Well-Built Agent: Score, Claim, Substrate

An Agent You Wired Yourself Does Not Come with Its Own Evaluation