Benchmark for wasted agent work

Your AI agent looks busy.
A third of its work is wasted.

Agents take dozens of steps per task. Many are dead ends — or worse, confidently-wrong work that looks finished but isn't. Normal observability can't see it. DeadBranchBench measures it, honestly, as a bracket.

Watch an agent waste work → Analyze your own trace

spend wasted, externally failed runs

what normal tools reported

externally-verified runs

The same run, two views

What you see vs. what actually happened

Agent dashboards show you a tidy success. DeadBranchBench shows you the work that didn't contribute — the rewrites, the abandoned versions, the confidently-wrong finishes.

What the dashboard shows

Agent run · "debug + fix"
  ✓ wrote code
  ✓ ran tests
  ✓ finished task

status: success ✓

Looks productive. Tokens counted, task closed. Nothing flagged.

What DeadBranchBench shows

Agent run · "debug + fix"
  ✓ wrote code
  ↻ rewrote the code
  ✗ abandoned the first version
  ✗ failed external verification

provenance waste floor : 1.71%
failed-run spend       : 31.81%

Across the 15-run cohort: ~30% of spend went to work that didn't contribute — invisible to token/latency dashboards.

It's not measuring token cost. It's measuring work that didn't contribute. That's the insight.

Watch it happen — replay of a real captured run

An agent confidently wastes a task, live

Provenance waste stays at 0% the whole time — the agent looks productive. Then the external verifier checks what was actually asked.

agent · case window-r1

Provenance waste (what normal tools see)

0.0%

External ground truth

awaiting verifier…

Why both numbers are needed

It passed its own tests. It failed reality.

Provenance-only view

0.0% waste

Every file it wrote was used by a later step. Nothing thrown away. By "did it produce clean output?" it looks perfect — which is what normal agent observability reports.

External ground truth

100% failed

An immutable verifier the agent can't see or game checked the actual requirement. The task did not deliver. The whole run was confidently-wrong spend.

Some costly wrong work is structurally invisible to provenance-only observability. That gap is the whole reason this benchmark exists.

How it works

Objective events in, honest bracket out

01 · capture

Record events

Wrap your agent or attach the callback. Objective execution events only — no labels, no claims.

02 · provenance

Resolve data-flow

Which step's output did a later step actually use? That's the dead-vs-used signal.

03 · verify

External truth

An immutable verifier the agent can't game says whether the task actually delivered.

04 · bracket

Report a range

Provenance floor ≤ human-reviewed waste ≤ failed-task ceiling. Never a hype number.

Versus normal agent observability

It sees the waste the dashboards miss

Capability	Typical agent observability	DeadBranchBench
Tokens, latency, traces	Yes	Yes
Flags thrown-away work	Partial	Yes
Catches confidently-wrong work	No	Yes (external verify)
Honest bracket, not a hype number	No	Yes
Reviewer agreement (κ) on labels	No	Yes

The honest result

15-run real-agent debugging cohort

gpt-4o-mini, externally verified. Reported as a bracket — the middle stays pending until independent human labels narrow it. By design.

1.71%floor

31.81%ceiling

provenance floor ≤ human-reviewed waste (pending) ≤ failed-task ceiling

12/15 externally verified passes (Wilson 95% CI 54.8–93.0%). ~30% of spend was confidently-wrong work invisible to provenance-only DBR. Early data: one model, 15 tasks, one day.

Run it on your own agent — live, in your browser

Paste a trace, see your waste

No install, no API key, nothing leaves your browser. Paste a DeadBranchBench trace (or an array of them) and it computes your floor / ceiling bracket client-side.

—

provenance waste floor

pending review

human-reviewed (label to fill)

—

failed-task ceiling

run	branch	action	used?	verdict

Questions

Straight answers

Isn't a failed step just wasted work?

No — that's the core insight. A failed step can be support: its error told the agent what to fix. A step that succeeded can be dead: the agent built something it never used. Execution status ≠ contribution.

Why a bracket instead of one number?

Because an honest waste number needs human review, and we refuse to fake it. The floor (provenance) and ceiling (failed tasks) are objective; the middle is filled only by real reviewers, with agreement (κ) reported.

Does my data leave the browser?

No. The analyzer on this page runs entirely client-side. Nothing is uploaded.

Is the "~30% wasted" number real?

It's the failed-task spend ceiling on a 15-task, one-model, one-day cohort — early data, reported as a bracket, not a universal claim. Reproduce it from the repo.

What agents does it work with?

Any. There's a generic event recorder, a drop-in LangGraph callback, and an importer. If it takes steps and calls tools, it can be measured.

Measure your agents' waste

Get the repo, the benchmark, and early access to the hosted version.

✓ You're on the list. Check your inbox.