Minbook
An Agent You Wired Yourself Does Not Come with Its Own Evaluation

M. · · 10 min read

Part 1 and Part 2 looked at methods for evaluating well-built agents from the outside, the ones a company refined and shipped. Part 3 is the other side: the agent you wired yourself, mixing your own sources and stringing together sub-agents.

Here, the builder means whoever assembles an agent for their own purpose on top of an LLM: not the model company, but the developer, team, or founder one layer below who wires a custom agent with their own data and their own flow. The “team building a regulatory-compliance agent on internal data” from Part 1 is the type. Once such a team deploys the agent, they hit a wall fast. The answer seems wrong, but what to call the failure, which stage it crept in at, and how to measure whether a fix helped are all murky. These are the problems a vendor would already have solved internally for its own product.

The decisive difference between a vendor agent and a custom one is whether evaluation comes with it. With a vendor product such as Claude Code, the maker finishes quality checks internally before shipping. An agent you wired on internal data has none of that: from the start there is no answer key, no name for the failure, and no environment to test in. Three papers fill this gap: MAST, AgentRx, and DRBench. Each builds a different piece: a vocabulary for failure, a way to pinpoint where it broke, a testbed shaped like your data.

%%{init: {"look":"handDrawn","theme":"neutral"}}%%
flowchart TD
    Q["an agent you wired yourself (custom packaging)"]
    Q --> M["MAST: a vocabulary for failure"]
    Q --> A["AgentRx: auto-pinpoint where it broke"]
    Q --> D["DRBench: a testbed shaped like your data"]

One thing to flag up front. These three were built by researchers, not builders. Teams at UC Berkeley, ServiceNow, and others made them at scale. A solo builder will not reproduce that as-is. So for each paper we look at both what the researchers built and how, and what a builder can borrow at their own scale.

The three pieces are not separate; they run in a line. You need a vocabulary to name what went wrong (MAST), the naming to pinpoint which stage it happened at (AgentRx), and the pinpointing to measure on a testbed whether the fix helped (DRBench). Name, locate, measure. Evaluating a custom agent, in the end, is assembling these three by hand.

MAST: A Vocabulary for Failure

The first thing you hit is that there are no words. When a multi-agent system failed and you asked “what went wrong,” there was no shared language to answer with. MAST pulled that language up out of the data.

The method is inductive. Rather than fixing a taxonomy first and fitting traces into it, they went the other way. Six experts opened 150 execution traces from five frameworks and named every visible failure behavior. They grouped and split, repeating until no new failure types emerged. The work took each of them over twenty hours. Then three other annotators labeled subsets independently to measure agreement and reached κ 0.88, a strong consensus.

Why pull the taxonomy from data instead of fixing it in advance? Multi-agent failure was a territory no one had mapped whole. Fix “these failures should exist” up top and fit traces in, and any failure outside the author’s head stays invisible forever. So in reverse, they only named what rose out of real records. The taxonomy built this way became the common reference point for nearly every failure-diagnosis study that followed.

The result is 14 failure modes in three categories. Across the 1,642 traces collected, nearly half are design problems.

| Category | Share | Representative modes |
| --- | --- | --- |
| System Design | ~44% | Step repetition 15.7%, unaware of termination 12.4%, disobey task spec 11.8% |
| Inter-Agent Misalignment | ~31% | Reasoning-action mismatch 13.2%, task derailment 7.4% |
| Task Verification | ~24% | Incorrect verification 9.1%, no verification 8.2% |

The distribution itself carries a message. Most multi-agent failures come not from the model being dumb but from the structure being misaligned. Repeating the same step endlessly (15.7%) or circling because it does not know when to stop (12.4%) is not solved by a smarter model. It is an orchestrator and memory design problem.

The second-largest category, inter-agent misalignment (~31%), is a coordination problem. One agent reasons one way and acts another (13.2%), the conversation drifts off the original task as it lengthens (7.4%), or an agent flounders alone instead of asking when stuck. Failures that do not exist in a single agent appear once you string several together. The hope was that multi-agent systems would be smarter; what this distribution says is that the room to misalign grew more than the capability did.

They did not hand-label all 1,642. After building the taxonomy on 150, they used o1 as an LLM judge to auto-label the rest. The judge reached 94% agreement and κ 0.77 against humans, which is what made the larger scale feasible. It also held κ 0.79 on frameworks and benchmarks not used in development, indicating the taxonomy is not fitted to one framework only.
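To make the agreement numbers concrete, here is a minimal sketch of chance-corrected agreement between a human labeler and an LLM judge. It assumes the κ quoted above is Cohen's kappa computed on a per-mode yes/no label; the label lists are made up for illustration and are not data from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: does this trace show step repetition?
human = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
judge = ["yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "no"]
kappa = cohen_kappa(human, judge)
```

On these ten made-up labels the raw agreement is 90% while kappa comes out near 0.78, which is why the two figures are reported side by side: raw agreement alone overstates how much the judge and the human agree beyond chance.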

What the builder takes

A builder will not gather six people to relabel 1,642 traces. What you take is the taxonomy itself. The 14 modes become a checklist for naming “what went wrong” when your multi-agent fails. Drop your traces into these fourteen slots and you see which fill up. If step repetition and termination problems exceed 20% of traces, that is a signal to redesign the orchestrator, not tweak the prompt.
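The checklist use can be sketched in a few lines. This assumes you have already assigned modes to your traces by hand; the trace ids and labels below are hypothetical, and the mode names are only the ones quoted in this article, not the full published list of fourteen.

```python
from collections import Counter

# Hypothetical hand-assigned labels: trace id -> MAST modes seen in that trace.
labeled_traces = {
    "t01": ["step repetition"],
    "t02": ["unaware of termination", "no verification"],
    "t03": ["reasoning-action mismatch"],
    "t04": ["step repetition", "task derailment"],
    "t05": ["incorrect verification"],
}

LOOP_MODES = {"step repetition", "unaware of termination"}
REDESIGN_THRESHOLD = 0.20  # the 20% rule of thumb from the text

def mode_shares(traces):
    """Share of traces in which each mode appears at least once."""
    counts = Counter(m for modes in traces.values() for m in set(modes))
    return {mode: c / len(traces) for mode, c in counts.items()}

def loop_share(traces):
    """Share of traces showing step repetition or a termination problem."""
    hit = sum(1 for modes in traces.values() if LOOP_MODES & set(modes))
    return hit / len(traces)

shares = mode_shares(labeled_traces)
needs_redesign = loop_share(labeled_traces) > REDESIGN_THRESHOLD
```

Here three of five traces fall into the loop modes, so the flag is set. The share is counted per trace rather than per label so that one trace with several loop symptoms does not count twice.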

Say a research bot stringing three sub-agents keeps giving off answers. Held against the fourteen slots, it splits into whether it is a reasoning-action mismatch (planned right, called the wrong tool), task derailment (the topic drifts midway), or missing verification (a wrong intermediate result passed through unfiltered). The three are fixed in different places. A name narrows down where to suspect. The taxonomy was made in a lab, but holding it against your own logs is something a builder can do right now.

If the scale feels heavy, borrow the procedure MAST used directly. Hand-label a few of your own traces to build a template, then hand the rest to an LLM judge to drop into the fourteen slots automatically. What the lab did on 1,642, the builder can do on a few dozen with the same procedure. And since the taxonomy is published, you do not even have to build the template from scratch.
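A rough sketch of that procedure, under stated assumptions: the prompt format, the function names, and the comma-separated reply format are all invented for illustration, and the judge is passed in as a plain callable so that any LLM client can be swapped in. This is not the prompt or pipeline the MAST authors used.

```python
from typing import Callable

# Subset of MAST modes named in this article; the published taxonomy has 14.
MODES = [
    "step repetition", "unaware of termination", "disobey task spec",
    "reasoning-action mismatch", "task derailment",
    "incorrect verification", "no verification",
]

def build_prompt(trace: str, examples: list[tuple[str, list[str]]]) -> str:
    """Few-shot prompt: mode list, hand-labeled examples, then the new trace."""
    lines = ["Label the trace with zero or more of these failure modes:"]
    lines += [f"- {m}" for m in MODES]
    for ex_trace, ex_modes in examples:
        lines += ["", f"Trace: {ex_trace}", f"Modes: {', '.join(ex_modes) or 'none'}"]
    lines += ["", f"Trace: {trace}", "Modes:"]
    return "\n".join(lines)

def label_trace(trace: str, examples, judge: Callable[[str], str]) -> list[str]:
    """Ask the judge, then keep only answers that are real taxonomy entries."""
    reply = judge(build_prompt(trace, examples))
    picked = [part.strip().lower() for part in reply.split(",")]
    return [m for m in MODES if m in picked]

# Stand-in judge so the sketch runs offline; replace with a real LLM call.
def fake_judge(prompt: str) -> str:
    return "Step repetition, made-up mode"

hand_labeled = [("planner re-ran search 9 times", ["step repetition"])]
result = label_trace("agent called fetch_page 12 times in a row", hand_labeled, fake_judge)
```

Filtering the reply against the known mode list matters more than it looks: a judge that invents a category would otherwise quietly add a fifteenth slot to your tallies.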

AgentRx: Auto-Pinpoint Where It Broke

Even with a taxonomy, “which step in this trajectory was decisively wrong” is a separate problem. Reading a long trajectory by hand to pinpoint it is expensive. AgentRx automates the pinpointing.

The core is that it does not ask the LLM judge “where is it wrong” directly. Instead it inserts a verification layer in between. It first normalizes heterogeneous logs into a common form, then synthesizes executable constraints from the tool schema, the domain policy, and the observed prefix. It runs these constraints at each step to catch violations and leaves an evidence log of “which step broke which rule.” The LLM judge, at the end, takes this evidence log and pinpoints the decisive failure step and its category.
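The normalization step might look like the following at builder scale. Both input formats and the `Step` fields are hypothetical, chosen only to show two differently shaped logs landing in one common form; AgentRx's actual representation may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """Common form for one trajectory step (a hypothetical schema)."""
    index: int
    actor: str
    tool: str
    args: dict = field(default_factory=dict)

def from_framework_a(events):
    """Hypothetical format A: nested dicts with 'agent' and 'call' keys."""
    return [Step(i, e["agent"], e["call"]["name"], e["call"].get("arguments", {}))
            for i, e in enumerate(events)]

def from_framework_b(rows):
    """Hypothetical format B: flat (role, tool_name, kwargs) tuples."""
    return [Step(i, role, tool, dict(kwargs))
            for i, (role, tool, kwargs) in enumerate(rows)]

log_a = [{"agent": "planner", "call": {"name": "search", "arguments": {"q": "policy"}}}]
log_b = [("executor", "search", {"q": "policy"})]
steps_a, steps_b = from_framework_a(log_a), from_framework_b(log_b)
```

Once both logs are lists of `Step`, any check written against `tool` and `args` applies to either source without knowing where the log came from.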

A constraint is not some grand thing. It is a rule pulled into a machine-checkable form, like “this tool requires these arguments” from the schema or “this task must follow this order” from the policy. Run these against each step and a call that broke the spec or an action that skipped the order gets caught. Normalizing the heterogeneous logs into a common form first is for the same reason: the same yardstick has to apply even when each domain’s log looks different.
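As an illustration, two such rules can be written by hand: a required-argument check derived from a tool schema and an ordering check derived from a policy. The tool names, schema, and policy below are hypothetical, and hand-writing the rules is a simplification; the paper describes synthesizing them rather than coding them manually.

```python
# Hypothetical tool schema and ordering policy for illustration.
TOOL_SCHEMA = {
    "lookup_customer": {"required": ["customer_id"]},
    "issue_refund": {"required": ["customer_id", "amount"]},
}
MUST_PRECEDE = [("lookup_customer", "issue_refund")]  # (earlier, later)

def check_trajectory(steps):
    """Run schema and ordering constraints per step; return an evidence log."""
    evidence, seen = [], set()
    for i, step in enumerate(steps):
        tool, args = step["tool"], step.get("args", {})
        spec = TOOL_SCHEMA.get(tool)
        if spec is None:
            evidence.append({"step": i, "rule": "known_tool", "detail": tool})
        else:
            for name in spec["required"]:
                if name not in args:
                    evidence.append({"step": i, "rule": "required_arg",
                                     "detail": f"{tool} missing {name}"})
        for earlier, later in MUST_PRECEDE:
            if tool == later and earlier not in seen:
                evidence.append({"step": i, "rule": "order",
                                 "detail": f"{later} before {earlier}"})
        seen.add(tool)
    return evidence

trajectory = [
    {"tool": "issue_refund", "args": {"customer_id": "c1"}},
    {"tool": "lookup_customer", "args": {"customer_id": "c1"}},
]
evidence_log = check_trajectory(trajectory)
```

Both violations in this example land on step 0: the refund call is missing `amount` and it ran before the lookup. Each entry records the step, the rule, and the detail, which is the "which step broke which rule" record a judge can later reason over.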

It also matters that the target is one decisive step, not every failure. A long trajectory mixes in many minor stumbles, and the first step that made it unrecoverable is the real place to fix. Underlining every error does not help debugging. The point is narrowing down where to start to a single spot.

This in-between layer makes the difference. The judge does not stamp “feels like here” wholesale; it judges on the evidence of a rule violation. So the result can be retraced, because why this step was deemed decisive stays in the evidence log. Where Part 2’s TRAIL or its follow-up handed the trace to an LLM judge whole, AgentRx puts one more layer, rule verification, in between. It stands the judge’s call on a list of violations rather than intuition.

The data is 115 failed trajectories across three domains: API workflows, incident response, and open-ended web-and-file tasks, deliberately three of different character. Following a fixed procedure, responding under time pressure, and open-ended exploration all fail in different ways. A diagnosis that only works in one domain is narrow, so they checked whether it carries across all three. On this, AgentRx improved both decisive-step localization and root-cause diagnosis over existing baselines.

What the builder takes

What you take is the idea of constraint synthesis. Your agent already has its tool schemas and policies. From those you can make cheap automatic checks that ask “did this step break this rule.” Instead of reading every trace by hand, you look first where a rule violation shows up. And the pattern of leaving “why I suspect here” as evidence becomes an asset you can reuse the next time the same failure shows up. You do not have to bring in a full-scale framework; these two can start small. Even without a grand diagnosis engine, hanging a few assertions on the tool-call log that ask “did this call break the schema” catches half of it. The point is the order: rather than handing every judgment to the judge, filter the rule violations the machine knows for sure first.

DRBench: A Testbed Shaped like Your Data

Even with a vocabulary and a way to diagnose, if there is no environment to test in, evaluation never starts. Public benchmarks mostly assume clean web search, but the builder’s reality is a messy search space mixing an internal wiki, PDFs, email, and chat. DRBench builds that environment whole.

Enterprise search is hard because the answer is not in one place. The cues are scattered across a line in the wiki, an email, a cell in a spreadsheet, with plausible-but-wrong traps in between. DRBench’s traps are not random noise but bait designed to be confused with the answer. So an agent that just scrapes a lot picks up the traps too and loses points. It tests the ability to gather answers and to filter traps at once.

The core of the method is planting the answers. Across 100 tasks and 10 domains (Sales, Cybersecurity, Compliance, and so on), it scatters ground-truth cues (injected insights) and traps (distractors) through Nextcloud files, Mattermost chat, email, assorted documents, and the public web. Because the setter knows where the answer is hidden, you can measure how much of it the agent recovered (insight recall) and how well it avoided the traps. To that it adds whether citations attach to their sources (factuality) and report quality (six dimensions). The three yardsticks each catch a different failure. Recall asks “did it find everything needed,” factuality asks “does what it wrote actually attach to a source” (verified separately by retrieval), and report quality asks “did it weave it readably.” You can find all the answers yet cite them wrong, or be factually right yet write a terrible report, so rather than collapsing into one number, it keeps the three apart. The agent side runs in five stages: research planning, action planning, a research loop, adaptive execution, and report writing.

The results show the environment’s difficulty. Attach a generic agent built for the public web (GPT-4.1) to this environment and insight recall comes to 1.11%, factuality to 6.67%, and report quality to 33.07%. A browser alone is nearly hopeless for enterprise search. Even the purpose-built DRBA (GPT-4o backbone) had low answer recovery, at 13.18. It avoided traps well (95.76%) but was weak at gathering the scattered answers. Avoiding traps and gathering scattered answers are different abilities, and the hard one in enterprise data is the latter. The benchmark’s own quality was confirmed by human evaluation: on task relevance and grounding, 72 of 75 votes (96%) approved.

What the builder takes

A builder does not need to synthesize 100 enterprise tasks. What you take is the design of making an answer key by planting answers and traps. Plant a few facts you know and a few plausible traps in your own test data, and your answer-key-less agent suddenly has a measurable number called recall. Smaller scale, same principle. Concretely: pick ten questions your agent must answer, plant each question’s answer somewhere in your data yourself, and slip in a few wrong-but-similar traps. Then how many answers the agent recovered and how many traps it fell for becomes the score directly. The moment the setter knows the answer, scoring becomes possible. If the evaluation environment is missing, do not leave it missing; build a small one.
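A small version of that answer key could be scored like this. The questions, answers, and traps are invented, the two metric names are my own, and substring matching is a crude stand-in for whatever matcher you trust; DRBench's own scoring is more involved.

```python
# Hypothetical planted test set: the setter knows every answer and every trap.
PLANTED = [
    {"question": "What is the Q3 retention target?",
     "answer": "92%", "traps": ["89%"]},
    {"question": "Who approves vendor contracts over $50k?",
     "answer": "the CFO", "traps": ["the COO"]},
]

def score(reports):
    """Score agent reports, where reports[i] answers PLANTED[i]."""
    recovered = fell_for = total_traps = 0
    for item, text in zip(PLANTED, reports):
        low = text.lower()
        # Substring match is a simplification; swap in a stricter matcher.
        recovered += item["answer"].lower() in low
        total_traps += len(item["traps"])
        fell_for += sum(t.lower() in low for t in item["traps"])
    return {
        "recall": recovered / len(PLANTED),
        "trap_avoidance": 1 - fell_for / total_traps,
    }

agent_reports = [
    "The wiki sets the Q3 retention target at 92%.",
    "Per the email thread, the COO approves these contracts.",
]
result = score(agent_reports)
```

In this made-up run the agent recovers one of two planted answers and falls for one of two traps, so both numbers come out at 0.5. Keeping them as two separate numbers follows the text's point that finding answers and avoiding bait are different abilities.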

Closing

A vendor-refined agent comes with evaluation. An agent you wired does not. So evaluation becomes not a part the vendor slots in but a thing you have to handle yourself. These three papers cut the pieces so you do not start from bare ground: a vocabulary for what can go wrong (MAST), a way to auto-pinpoint where it broke (AgentRx), a testbed shaped like your data (DRBench).

But all three were made at lab scale. Six people’s labeling, a three-domain diagnosis framework, a hundred enterprise tasks are not things a solo builder replicates. So the builder borrows selectively rather than copying whole. The taxonomy as a checklist, constraint synthesis as cheap automatic checks, environment design as a small answer key planted in your own data. It is cutting a big tool down to size.

This is the real shape of “the builder takes on evaluation” from Part 1. The moment you step outside the territory where the vendor supplies evaluation along with the product, evaluation stops being something you buy and becomes something you handle by hand. Luckily, not bare-handed. Research has already paved what to name, where to locate, and how to measure, and the builder takes it at their own size.

Part 1 was the landscape, Part 2 three altitudes for evaluating what others built, Part 3 the tools for evaluating what you built. But the grader these tools all lean on still has a gap. It shows up when you drop the unit one more notch, to the claim level, smaller than a span, and follow where an error started and how it spread to the answer. Part 4, the series finale, covers the most recent work, DRIFT.


Sources

  • MAST, Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657)
  • AgentRx, Diagnosing AI Agent Failures from Execution Trajectories (arXiv:2602.02475)
  • DRBench, A Realistic Benchmark for Enterprise Deep Research (arXiv:2510.00172)