# Why AI Agents Burn Tokens: The Four Runtime Waste Mechanisms


URL: https://getunblocked.com/blog/why-ai-agents-burn-tokens/
Published: 2026-06-15T10:00:00Z
Author: Dennis Pilarinos
Categories: Engineering Insights, AI Agent Autonomy

AI agent token waste shows up in four ways at runtime. In one study, input tokens made up 53.9% of agentic token consumption. Here is the signal that catches each mechanism, and the fix for each.

---
AI agents burn tokens because they spend most of their budget figuring out what they're doing instead of doing it. The work itself is cheap. The reconstruction around it is not. In one study of agentic software tasks, input tokens were the single largest share of consumption, 53.9% on average, and the iterative review-and-refinement stage ate 59.4% of all tokens ([arXiv 2601.14470](https://arxiv.org/abs/2601.14470)). Writing the code itself took just 8.6%. Most of the bill is the agent talking to itself to reassemble context it should have been handed. AI agent token waste isn't one leak. It's four, and each one leaves a measurable fingerprint.

This post is the instrumentation reference. The [token yield thesis](https://getunblocked.com/blog/token-yield-context-problem) sketches these mechanisms; here you get the definition, the research, the exact metric, and the fix for each. It's runtime waste, what happens once the agent is already working, not the [tool-definition overhead](https://getunblocked.com/blog/mcp-token-budget-autopsy) you pay before the first prompt. And it matters more now that the meter is running: GitHub Copilot's June 1 move to token-based billing pushed some agentic bills 10 to 50 times higher ([Tech Times](https://www.techtimes.com/articles/317536/20260601/github-copilot-pricing-change-drives-backlash-agentic-bills-jump-10x-50x-power-users.htm)), so waste that used to hide inside a flat rate now lands on the invoice.

## Why does broad search burn so many tokens?

Broad search is the default failure mode, and it's the most expensive one. With no authoritative source to consult, an agent queries everything: it reads, re-reads, and pattern-matches across half your repo to rebuild context it should have been given. The research is blunt about where the money goes. Input tokens averaged 53.9% of consumption, and review-and-refinement ate 59.4% of the budget while coding took 8.6% ([arXiv 2601.14470](https://arxiv.org/abs/2601.14470)). That study is a ChatDev simulation using a GPT-5 reasoning model, so treat it as directional, not a production benchmark.

The signal to track is tool calls per task, and it's the first place most AI agent token waste shows up. A grep here, a file read there, another grep, another read. None of it is wrong, exactly. It's the agent doing your onboarding live, on the clock, every single run. The reason broad search dominates is structural: a model with read access to everything and an authoritative answer to nothing will always choose breadth. Access isn't understanding. The fix is to hand the agent a starting point instead of a search space, which is the on-demand retrieval pattern [Unblocked](https://getunblocked.com) is built on: surface the right PR, decision, or thread at the moment of need so the agent spends tokens solving rather than reassembling.
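
Counting tool calls per task can be sketched in a few lines. This is a minimal sketch, assuming your session logs can be flattened into one record per event; the `task_id`, `type`, and `tool` field names are hypothetical, so map them to whatever your agent framework actually emits.

```python
from collections import Counter
from statistics import mean

# Hypothetical session-log events. The schema here is an assumption,
# not the format of any particular agent framework.
events = [
    {"task_id": "t1", "type": "tool_call", "tool": "grep"},
    {"task_id": "t1", "type": "tool_call", "tool": "read_file"},
    {"task_id": "t1", "type": "message"},
    {"task_id": "t1", "type": "tool_call", "tool": "grep"},
    {"task_id": "t2", "type": "tool_call", "tool": "read_file"},
]


def tool_calls_per_task(events):
    """Count tool-call events for each task_id, ignoring other event types."""
    counts = Counter()
    for event in events:
        if event["type"] == "tool_call":
            counts[event["task_id"]] += 1
    return dict(counts)


per_task = tool_calls_per_task(events)
average_calls = mean(per_task.values())
print(per_task, average_calls)
```

Watch the per-task distribution as well as the average: a handful of tasks with very high counts is usually where broad search is hiding.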

## What does re-search look like, and how do you catch it?

Re-search is broad search with amnesia. The agent learns something on turn three, drops it, and relearns it on turn nine, inflating the context window with noise that makes the next decision worse. Anthropic's own work shows how lopsided this gets at scale: multi-agent systems can consume roughly 15x the tokens of a single chat ([Anthropic engineering](https://www.anthropic.com/engineering/multi-agent-research-system)), much of it agents re-deriving facts a sibling already established. The window fills with restated context, and restated context is just expensive noise.

The signals are repeat-read ratio and cache hit rate. This is the quietest form of AI agent token waste. When the agent reads the same file three times in one session, two of those reads are tokens you paid for and got nothing back. The fix is the cleanest win on this list: prompt caching cuts cost on cacheable context by up to 90% ([Anthropic](https://www.anthropic.com/news/prompt-caching)). Cache the stable stuff, the system layout, the conventions, the parts that don't change mid-task, and stop re-billing for them every turn. In our experience the repeat-read ratio is the first metric teams are embarrassed by, because it's the one where the agent is visibly doing the same work twice and nobody noticed.

> Re-search is when an agent relearns what it already knew, inflating its window with noise. Anthropic measured multi-agent systems consuming roughly 15x the tokens of a single chat, and prompt caching cuts cost on repeated context by up to 90% (Anthropic, 2025). The signal: repeat-read ratio and cache hit rate.
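
Both signals reduce to simple ratios. The definitions below are one reasonable reading, not a standard: repeat-read ratio is taken as the share of file reads that hit a path already read in the same session, and cache hit rate as cached input tokens over total input tokens. The function names and sample values are illustrative assumptions.

```python
def repeat_read_ratio(paths_read):
    """Share of file reads that re-read a path already seen in the session."""
    if not paths_read:
        return 0.0
    redundant = len(paths_read) - len(set(paths_read))
    return redundant / len(paths_read)


def cache_hit_rate(cached_input_tokens, total_input_tokens):
    """Share of input tokens that were served from the prompt cache."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


# Hypothetical session: the same file is read three times.
reads = ["src/app.py", "src/db.py", "src/app.py", "src/app.py"]
ratio = repeat_read_ratio(reads)         # 2 redundant reads out of 4
hit_rate = cache_hit_rate(6_000, 8_000)  # 6,000 of 8,000 input tokens cached
print(ratio, hit_rate)
```

A high repeat-read ratio paired with a low cache hit rate is the combination to chase first, since it means the agent is re-reading and you are paying full price each time.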

## How does mis-routing quietly inflate the bill?

Mis-routing is the one that costs you twice: once in tokens, once in a wrong answer you have to unwind. The agent grabs a plausible-looking but stale or wrong source, treats it as ground truth, and confidently builds on top of it. I steal the name for this from radiology: satisfaction of search. A reader finds something that looks like the tumor, stops, says "found it," and misses the second finding that was the actual problem. Agents do this constantly. They retrieve a two-year-old doc, decide it answers the question, and the rest of the session is built on sand.

The signal is wrong-source retrieval rate, and the reason it's hard to catch is that mis-routing looks like progress right up until it doesn't. The deeper cause is context rot: every tested model degrades as the input grows, at every length, not just at the ceiling ([Chroma Research](https://research.trychroma.com/research/context-rot)). A bigger window doesn't fix mis-routing; it feeds it, because more retrieved candidates mean more plausible-but-wrong sources to anchor on. The fix is provenance: give the agent sources it can trust and rank, so "plausible" and "authoritative" stop being the same thing. The [reduce token costs playbook](https://getunblocked.com/blog/reduce-ai-token-costs) covers the retrieval-quality side in detail.
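
To make "trust and rank" concrete, here is a minimal sketch of provenance-aware ranking plus the wrong-source rate. Everything in it is an assumption for illustration: the `Source` fields, the `authoritative` flag (which you might set for merged PRs or signed-off design docs), and the idea that a reviewer or eval harness labels each retrieval as wrong or not.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Source:
    name: str
    last_updated: date
    authoritative: bool  # hypothetical provenance flag


def rank_sources(sources):
    """Order candidates: authoritative first, then most recently updated."""
    return sorted(
        sources,
        key=lambda s: (s.authoritative, s.last_updated),
        reverse=True,
    )


def wrong_source_rate(retrievals):
    """Share of retrievals labeled as stale or wrong."""
    if not retrievals:
        return 0.0
    wrong = sum(1 for r in retrievals if r["wrong"])
    return wrong / len(retrievals)


candidates = [
    Source("old-wiki-page", date(2023, 1, 10), False),
    Source("design-doc", date(2024, 11, 2), True),
    Source("recent-chat-thread", date(2025, 3, 1), False),
]
ranked_names = [s.name for s in rank_sources(candidates)]

labeled = [{"wrong": False}, {"wrong": True}, {"wrong": False}, {"wrong": False}]
rate = wrong_source_rate(labeled)
print(ranked_names, rate)
```

The design choice worth noting is that recency alone does not win: the newest chat thread still ranks below the older but authoritative design doc.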

## Why do runaway loops drain a token budget to zero?

Runaway loops are where the fat tail of your bill lives. Wrong assumption, retry, wrong assumption, retry, all the way to the budget wall. The research here is stark. On the same task, runs can differ by up to 30x in total tokens, accuracy tends to peak at intermediate cost rather than maximum cost, and models predict their own spend only weakly, with correlation up to about 0.39 ([arXiv 2604.22750](https://arxiv.org/abs/2604.22750), Stanford Digital Economy Lab and Microsoft Research, eight frontier models on SWE-bench Verified). The agent can't feel itself thrashing, so it won't stop on its own.

The signals are retry rate and cost-per-session in the fat tail, meaning the sessions that cost 10x the median. This is the most visible AI agent token waste on the invoice, because a single doomed run can dwarf a week of healthy ones. At organization scale that tail compounds fast: Uber burned through its entire 2026 AI coding budget in about four months and capped engineers at $1,500 a month ([Fortune](https://fortune.com/2026/05/26/uber-coo-ai-spending-tokens-claude-code/)). The fix is bounding the trajectory: cutting an agent's accumulated input and retry path reduced cost by 28.6% to 44.1% ([arXiv 2509.23586](https://arxiv.org/abs/2509.23586)). Don't let a doomed run grind to the cap. We dig into this failure mode specifically in the [auto-loop token cost breakdown](https://getunblocked.com/blog/agent-auto-loop-token-cost). The counter-image is a waterfall: when an agent starts with the right data, every choice leads to a better next choice. Compounding runs in both directions, which is the whole argument for fixing context before you fix anything else.
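
Both halves of this, spotting the fat tail and bounding a run, fit in a short sketch. The assumptions: `costs` is a hypothetical mapping of session id to cost, `attempt` is a hypothetical callable that returns `(succeeded, tokens_used)`, and the default limits are placeholders rather than recommendations. This is a simple external guard, not the trajectory-reduction method from the cited paper.

```python
from statistics import median


def fat_tail_sessions(costs, multiple=10):
    """Return ids of sessions costing at least `multiple` times the median."""
    threshold = multiple * median(costs.values())
    return sorted(sid for sid, cost in costs.items() if cost >= threshold)


def run_bounded(attempt, max_retries=3, token_budget=50_000):
    """Call `attempt` until success, the retry cap, or the token budget."""
    spent = 0
    for tries in range(1, max_retries + 1):
        succeeded, tokens_used = attempt()
        spent += tokens_used
        if succeeded:
            return {"status": "done", "tries": tries, "tokens": spent}
        if spent >= token_budget:
            return {"status": "budget_exceeded", "tries": tries, "tokens": spent}
    return {"status": "retries_exhausted", "tries": max_retries, "tokens": spent}


# Hypothetical per-session costs in dollars: one session sits far out in the tail.
costs = {"s1": 1.0, "s2": 1.2, "s3": 0.9, "s4": 1.1, "s5": 14.0}
flagged = fat_tail_sessions(costs)

# An attempt that always fails and burns 5,000 tokens stops at the retry cap.
capped = run_bounded(lambda: (False, 5_000))

# A more expensive failing attempt trips the token budget instead.
over_budget = run_bounded(lambda: (False, 30_000))
print(flagged, capped["status"], over_budget["status"])
```

The guard lives outside the agent on purpose: the stop condition depends only on counters you control, not on the model noticing that it is thrashing.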

## How do the four mechanisms map on one page?

Here's the whole map in one view. Each mechanism, what it looks like in a session, the signal that catches it, and the fix that moves the number.

| Waste mechanism | What it looks like | Signal to track | Fix |
| --- | --- | --- | --- |
| Broad search | Agent greps and re-reads across the repo to rebuild context | Tool calls per task | Hand it an authoritative starting point via on-demand retrieval |
| Re-search | Agent relearns facts it already had, inflating the window | Repeat-read ratio, cache hit rate | Prompt caching on stable context |
| Mis-routing | Agent anchors on a stale or wrong source and builds on it | Wrong-source retrieval rate | Provenance and ranked, trusted sources |
| Runaway loops | Retry on a wrong assumption until the budget wall | Retry rate, cost-per-session fat tail | Bound the trajectory, cap retries |


To see whether AI agent token waste really shares one root cause, we ran the obvious controlled test: same prompt, same model, same codebase, context on versus context off, on a real task in a large service. With context, tool calls dropped from 80 to 29 and turns from 50 to 19. The no-context run ran out of turns and never delivered. One test, one task, so read it as directional, not a benchmark suite. But the direction is the point: tool calls and turns fell together, which is what you'd expect if the mechanisms share a root cause. Instrumenting these four signals is how teams climb the [context-maturity curve](https://getunblocked.com/context-maturity/), from guessing at their bill to knowing exactly which mechanism is bleeding.
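
For scale, the relative drops in that single run work out as follows. The helper name is ours; the inputs are the four numbers quoted above.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Figures from the one controlled run described above.
tool_call_drop = percent_reduction(80, 29)
turn_drop = percent_reduction(50, 19)

print(f"tool calls: {tool_call_drop:.0f}% fewer, turns: {turn_drop:.0f}% fewer")
```

That is roughly 64% fewer tool calls and 62% fewer turns, with the same caveat as before: one run, one task.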

## Frequently asked questions

### Which token waste mechanism should I instrument first?

Start with tool calls per task and repeat-read ratio, because they're the easiest to pull from session logs and the fastest to move. Broad search and re-search both show up as input tokens, and input tokens were 53.9% of consumption in agentic tasks ([arXiv 2601.14470](https://arxiv.org/abs/2601.14470)). Catch those two and you've likely addressed the largest share of the bill before touching loops.

### Is AI agent token waste just a context-window-size problem?

No, and a bigger window often makes it worse. Every tested model degrades as input grows, at every length ([Chroma Research](https://research.trychroma.com/research/context-rot)). More room means more retrieved noise and more plausible-but-wrong sources to anchor on, which feeds mis-routing directly. The fix is better-targeted context, not more of it. The [MCP token budget autopsy](https://getunblocked.com/blog/mcp-token-budget-autopsy) covers the preload side of the same lesson.

### How much can fixing these mechanisms actually save?

Independently, each fix has real numbers. Prompt caching cuts repeated-context cost by up to 90% ([Anthropic](https://www.anthropic.com/news/prompt-caching)), and bounding the agent's trajectory cut total cost 28.6% to 44.1% ([arXiv 2509.23586](https://arxiv.org/abs/2509.23586)). Because the mechanisms share a root cause, fixing context tends to move several signals together rather than one at a time.
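
Note that "up to 90%" applies to the cacheable portion, not the whole bill. A rough worked example, assuming a hypothetical session where 60% of input tokens are stable enough to cache and cached reads get the full 90% discount; it ignores any surcharge for writing to the cache, so treat the result as an upper bound on the saving.

```python
def relative_input_cost(cacheable_share, cache_discount=0.90):
    """Input cost after caching, as a fraction of the uncached input cost."""
    if not 0.0 <= cacheable_share <= 1.0:
        raise ValueError("cacheable_share must be between 0 and 1")
    uncached_part = 1.0 - cacheable_share
    cached_part = cacheable_share * (1.0 - cache_discount)
    return uncached_part + cached_part


# Hypothetical: 60% of input tokens are cacheable context.
remaining = relative_input_cost(cacheable_share=0.60)
savings = 1.0 - remaining
print(f"input cost falls to {remaining:.0%} of baseline, saving {savings:.0%}")
```

Under those assumptions the input bill falls by about 54%, which is why the cacheable share matters as much as the discount itself.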

### Why can't the agent just stop when it's wasting tokens?

Because it can't tell. Models predict their own token usage only weakly, correlation up to about 0.39, and accuracy on a task often peaks at intermediate cost, not maximum ([arXiv 2604.22750](https://arxiv.org/abs/2604.22750)). A thrashing agent has no internal signal that it's thrashing, which is exactly why retry rate and the cost-per-session fat tail have to be instrumented externally.

## What to instrument first

Pick the two cheapest signals to collect and start this week: tool calls per task and repeat-read ratio. Both come straight out of session logs, and both map to the mechanisms that dominate the bill, since input tokens were 53.9% of agentic consumption ([arXiv 2601.14470](https://arxiv.org/abs/2601.14470)). Add wrong-source retrieval rate and the cost-per-session fat tail once the first two are clean. Track the four together, not cost per token, because a token isn't a unit of value and AI agent token waste looks identical to useful spend on the invoice.

The mechanisms share a root cause, so the highest-impact move isn't four separate fixes. It's giving the agent decision-grade context before it starts, so broad search has a starting point, re-search has a cache, mis-routing has provenance, and loops have a reason to converge. If you want to see where your team sits on the curve before you wire up a single dashboard, run the [LLM readiness assessment](https://readiness.getunblocked.com/). It maps your team to the maturity curve in a few minutes and tells you which mechanism to chase first.