Locating Agent Failure
Evaluation of LLM agents is moving from final-answer accuracy down to trajectory-, span-, and claim-level failure localization. A method-first reading of seven arXiv papers from 2025-2026.
About this series
Agent evaluation that checks only whether the final answer is right has hit a wall. The longer the task and the more tools it touches, the more the path taken to an answer determines cost and trust. The seven papers answer one question in different ways: where did the agent go wrong?
This series is directly useful for builders wiring their own multi-agent flows, consultants advising on agent adoption, and PMs scrutinizing the reliability of evaluation metrics. We read for method, not benchmark scores.
Part 1 lays out why outcome evaluation breaks and maps the seven papers. From Part 2 onward, we look at how each paper localizes failure, one paper at a time. The approaches split into five strands: inductive taxonomy, instrumentation, verification layers, environment design, and metric formulation.
3 episodes
- 01
- 02
- 03