Benchmark for wasted agent work

Your AI agent looks busy.
A third of its work is wasted.

Agents take dozens of steps per task. Many are dead ends — or worse, confidently-wrong work that looks finished but isn't. Normal observability can't see it. DeadBranchBench measures it, honestly, as a bracket.

0%
spend wasted, externally failed runs
0%
what normal tools reported
0
externally-verified runs
The same run, two views

What you see vs. what actually happened

Agent dashboards show you a tidy success. DeadBranchBench shows you the work that didn't contribute — the rewrites, the abandoned versions, the confidently-wrong finishes.

What the dashboard shows

Agent run · "debug + fix"
  ✓ wrote code
  ✓ ran tests
  ✓ finished task

status: success ✓

Looks productive. Tokens counted, task closed. Nothing flagged.

What DeadBranchBench shows

Agent run · "debug + fix"
  ✓ wrote code
  ↻ rewrote the code
  ✗ abandoned the first version
  ✗ failed external verification

provenance waste floor : 1.71%
failed-run spend       : 31.81%

Across the 15-run cohort: ~30% of spend went to work that didn't contribute — invisible to token/latency dashboards.

It's not measuring token cost. It's measuring work that didn't contribute. That's the insight.
Watch it happen — replay of a real captured run

An agent confidently wastes a task, live

Provenance waste stays at 0% the whole time — the agent looks productive. Then the external verifier checks what was actually asked.

agent · case window-r1
Provenance waste (what normal tools see)
0.0%
External ground truth
awaiting verifier…
Why both numbers are needed

It passed its own tests. It failed reality.

Provenance-only view

0.0% waste

Every file it wrote was used by a later step. Nothing thrown away. By "did it produce clean output?" it looks perfect — which is what normal agent observability reports.

External ground truth

100% failed

An immutable verifier the agent can't see or game checked the actual requirement. The task did not deliver. The whole run was confidently-wrong spend.

Some costly wrong work is structurally invisible to provenance-only observability. That gap is the whole reason this benchmark exists.
How it works

Objective events in, honest bracket out

01 · capture

Record events

Wrap your agent or attach the callback. Objective execution events only — no labels, no claims.

02 · provenance

Resolve data-flow

Which step's output did a later step actually use? That's the dead-vs-used signal.

03 · verify

External truth

An immutable verifier the agent can't game says whether the task actually delivered.

04 · bracket

Report a range

Provenance floor ≤ human-reviewed waste ≤ failed-task ceiling. Never a hype number.

Versus normal agent observability

It sees the waste the dashboards miss

CapabilityTypical agent observabilityDeadBranchBench
Tokens, latency, tracesYesYes
Flags thrown-away workPartialYes
Catches confidently-wrong workNoYes (external verify)
Honest bracket, not a hype numberNoYes
Reviewer agreement (κ) on labelsNoYes
The honest result

15-run real-agent debugging cohort

gpt-4o-mini, externally verified. Reported as a bracket — the middle stays pending until independent human labels narrow it. By design.

1.71%floor
31.81%ceiling
provenance floor  ≤  human-reviewed waste (pending)  ≤  failed-task ceiling

12/15 externally verified passes (Wilson 95% CI 54.8–93.0%). ~30% of spend was confidently-wrong work invisible to provenance-only DBR. Early data: one model, 15 tasks, one day.

Run it on your own agent — live, in your browser

Paste a trace, see your waste

No install, no API key, nothing leaves your browser. Paste a DeadBranchBench trace (or an array of them) and it computes your floor / ceiling bracket client-side.

provenance waste floor
pending review
human-reviewed (label to fill)
failed-task ceiling
runbranchactionused?verdict
Questions

Straight answers

Isn't a failed step just wasted work?

No — that's the core insight. A failed step can be support: its error told the agent what to fix. A step that succeeded can be dead: the agent built something it never used. Execution status ≠ contribution.

Why a bracket instead of one number?

Because an honest waste number needs human review, and we refuse to fake it. The floor (provenance) and ceiling (failed tasks) are objective; the middle is filled only by real reviewers, with agreement (κ) reported.

Does my data leave the browser?

No. The analyzer on this page runs entirely client-side. Nothing is uploaded.

Is the "~30% wasted" number real?

It's the failed-task spend ceiling on a 15-task, one-model, one-day cohort — early data, reported as a bracket, not a universal claim. Reproduce it from the repo.

What agents does it work with?

Any. There's a generic event recorder, a drop-in LangGraph callback, and an importer. If it takes steps and calls tools, it can be measured.

Measure your agents' waste

Get the repo, the benchmark, and early access to the hosted version.

✓ You're on the list. Check your inbox.