
Where Did the Agent First Go Wrong: Evaluation Drops to the Claim Level
A read of DRIFT and TELBench, which drop evaluation to the claim level to flag the error spans that affect the final answer. Finding harmful spans climbs past the halfway mark, but pinpointing the first error stalls around 20%. Part 4 and finale of the series.

