Your AI agent looks busy.
A third of its work is wasted.
Agents take dozens of steps per task. Many are dead ends or, worse, confidently-wrong work that looks finished but isn't. Normal observability can't see it. DeadBranchBench measures it honestly, as a bracket with an objective floor and an objective ceiling.
What you see vs. what actually happened
Agent dashboards show you a tidy success. DeadBranchBench shows you the work that didn't contribute — the rewrites, the abandoned versions, the confidently-wrong finishes.
What the dashboard shows
Agent run · "debug + fix"
✓ wrote code
✓ ran tests
✓ finished task
status: success ✓
Looks productive. Tokens counted, task closed. Nothing flagged.
What DeadBranchBench shows
Agent run · "debug + fix"
✓ wrote code
↻ rewrote the code
✗ abandoned the first version
✗ failed external verification
provenance waste floor: 1.71%
failed-run spend: 31.81%
Across the 15-run cohort: ~30% of spend went to work that didn't contribute — invisible to token/latency dashboards.
An agent confidently wastes a task, live
Provenance waste stays at 0% the whole time — the agent looks productive. Then the external verifier checks what was actually asked.
It passed its own tests. It failed reality.
Provenance-only view
Every file it wrote was used by a later step. Nothing was thrown away. Judged by "did it produce clean output?", it looks perfect, which is exactly what normal agent observability reports.
External ground truth
An immutable verifier the agent can't see or game checked the actual requirement. The task did not deliver. The whole run was confidently-wrong spend.
Objective events in, honest bracket out
Record events
Wrap your agent or attach the callback. Objective execution events only — no labels, no claims.
Resolve data-flow
Which step's output did a later step actually use? That's the dead-vs-used signal.
External truth
An immutable verifier the agent can't game says whether the task actually delivered.
Report a range
Provenance floor ≤ human-reviewed waste ≤ failed-task ceiling. Never a hype number.
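The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the DeadBranchBench implementation: the `Step` and `Run` types, their field names, and the `is_deliverable` flag are all hypothetical, and "used" here means only that a later step directly consumed the output.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    step_id: str
    cost: int                                         # spend for this step (hypothetical unit, e.g. tokens)
    inputs: list[str] = field(default_factory=list)   # ids of earlier steps whose output this step used
    is_deliverable: bool = False                      # assumption: the final output is never counted as dead

@dataclass
class Run:
    steps: list[Step]
    verified: bool                                    # verdict of the external verifier

def dead_spend(run: Run) -> int:
    """Spend on steps whose output no later step consumed."""
    used = {src for step in run.steps for src in step.inputs}
    return sum(
        s.cost for s in run.steps
        if s.step_id not in used and not s.is_deliverable
    )

def bracket(runs: list[Run]) -> tuple[float, float]:
    """Return (provenance floor, failed-run ceiling) as fractions of total spend."""
    total = sum(s.cost for r in runs for s in r.steps)
    floor = sum(dead_spend(r) for r in runs) / total
    ceiling = sum(s.cost for r in runs if not r.verified for s in r.steps) / total
    return floor, ceiling

# Two toy runs: one passes but contains a dead step, one fails verification.
runs = [
    Run(verified=True, steps=[
        Step("a1", 10),
        Step("a2", 5),                                # output never used: dead
        Step("a3", 20, inputs=["a1"], is_deliverable=True),
    ]),
    Run(verified=False, steps=[
        Step("b1", 10),
        Step("b2", 5, inputs=["b1"], is_deliverable=True),
    ]),
]

floor, ceiling = bracket(runs)
print(f"floor={floor:.0%} ceiling={ceiling:.0%}")     # floor=10% ceiling=30%
```

The second run has no dead steps at all, so it adds nothing to the floor. It only shows up in the ceiling, because the external verdict marks the whole run as failed.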
It sees the waste the dashboards miss
| Capability | Typical agent observability | DeadBranchBench |
|---|---|---|
| Tokens, latency, traces | Yes | Yes |
| Flags thrown-away work | Partial | Yes |
| Catches confidently-wrong work | No | Yes (external verify) |
| Honest bracket, not a hype number | No | Yes |
| Reviewer agreement (κ) on labels | No | Yes |
15-run real-agent debugging cohort
gpt-4o-mini, externally verified. Reported as a bracket — the middle stays pending until independent human labels narrow it. By design.
12/15 externally verified passes (Wilson 95% CI 54.8–93.0%). ~30% of spend was confidently-wrong work invisible to provenance-only DBR. Early data: one model, 15 tasks, one day.
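For readers who want to check the interval, here is a short sketch of the standard Wilson score formula applied to 12 passes out of 15. It assumes z = 1.96 for a 95% interval; the function name is illustrative and not part of the repo.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

lo, hi = wilson_interval(12, 15)
print(f"{lo:.1%} to {hi:.1%}")  # 54.8% to 93.0%
```

The width of that interval is the reason the page calls this early data: with 15 tasks, the true pass rate could plausibly sit anywhere from just over half to above 90%.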
Paste a trace, see your waste
No install, no API key, nothing leaves your browser. Paste a DeadBranchBench trace (or an array of them) and it computes your floor / ceiling bracket client-side.
| run | branch | action | used? | verdict |
|---|---|---|---|---|
Straight answers
Isn't a failed step just wasted work?
No — that's the core insight. A failed step can be support: its error told the agent what to fix. A step that succeeded can be dead: the agent built something it never used. Execution status ≠ contribution.
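A small sketch makes the distinction concrete. The step records and the `classify` helper below are hypothetical, and the last step is assumed to be the deliverable. Labels come from data-flow alone, and execution status is deliberately ignored.

```python
# Hypothetical step records: "inputs" lists the earlier steps whose output was consumed.
steps = [
    {"id": "write_v1",     "status": "ok",    "inputs": []},
    {"id": "run_tests",    "status": "error", "inputs": ["write_v1"]},
    {"id": "write_helper", "status": "ok",    "inputs": []},           # succeeded, never used
    {"id": "fix",          "status": "ok",    "inputs": ["write_v1", "run_tests"]},
]

def classify(steps):
    """Label each step 'used' or 'dead' from data-flow only."""
    consumed = {src for step in steps for src in step["inputs"]}
    final_id = steps[-1]["id"]  # assumption: the last step is the deliverable
    return {
        s["id"]: "used" if s["id"] in consumed or s["id"] == final_id else "dead"
        for s in steps
    }

labels = classify(steps)
for s in steps:
    print(f'{s["id"]:<13} status={s["status"]:<6} contribution={labels[s["id"]]}')
```

Here `run_tests` failed but is labeled used, because its error fed the fix. `write_helper` succeeded but is labeled dead, because nothing consumed it.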
Why a bracket instead of one number?
Because an honest waste number needs human review, and we refuse to fake it. The floor (provenance) and ceiling (failed tasks) are objective; the middle is filled only by real reviewers, with agreement (κ) reported.
Does my data leave the browser?
No. The analyzer on this page runs entirely client-side. Nothing is uploaded.
Is the "~30% wasted" number real?
It's the failed-task spend ceiling on a 15-task, one-model, one-day cohort — early data, reported as a bracket, not a universal claim. Reproduce it from the repo.
What agents does it work with?
Any. There's a generic event recorder, a drop-in LangGraph callback, and an importer. If it takes steps and calls tools, it can be measured.
Measure your agents' waste
Get the repo, the benchmark, and early access to the hosted version.