# Tokenmaxxing Is Over. Token Yield Is the Scoreboard, and Context Moves It.


URL: https://getunblocked.com/blog/token-yield-context-problem/
Published: 2026-06-15T08:00:00Z
Author: Brandon Waselnuk
Categories: Engineering Insights, AI Agent Autonomy

Token yield is the new AI coding scoreboard. After Copilot's June 1 billing change pushed agentic bills up 10-50x, context, not cheaper tokens, is the lever.

---
Your team probably doesn't have a token problem. It has a context problem wearing a token problem's clothes. The bill is real, but the bill is a symptom. The number that actually predicts whether AI is paying off is token yield: the useful output you get per token you spend. And the biggest lever on token yield isn't a cheaper model or a tighter budget cap. It's whether the agent understood what it was doing before it started spending.

## What actually happened to tokenmaxxing?

Six weeks, give or take. That's how long it took for "burn as many tokens as you can" to go from status symbol to cautionary tale.

The timeline is brutal when you line it up. Meta ran an internal "Claudeonomics" leaderboard ranking engineers by token spend; the top user reportedly burned around 281 billion tokens in a month before the dashboard was [taken down](https://fortune.com/2026/04/09/meta-killed-employee-ai-token-dashboard/). Amazon killed its own internal leaderboard after staff admitted they were running agents on meaningless tasks to climb it, with an SVP telling people it was encouraging exactly the wrong behavior, per [The Decoder](https://the-decoder.com/amazon-kills-internal-ai-leaderboard-after-employees-gamed-it-with-pointless-tasks/). Uber blew through its entire 2026 AI coding budget in about four months and [capped engineers at $1,500 a month](https://fortune.com/2026/05/26/uber-coo-ai-spending-tokens-claude-code/). Then on May 28, Fortune ran the obituary outright: ["Tokenmaxxing is over."](https://fortune.com/2026/05/28/tokenmaxxing-is-dead-companies-didnt-get-the-roi-from-ai-they-wanted-to-see/)

And then the whole industry got repriced. On June 1, GitHub Copilot moved to token-based billing, and agentic users started reporting [bills 10 to 50 times higher](https://www.techtimes.com/articles/317536/20260601/github-copilot-pricing-change-drives-backlash-agentic-bills-jump-10x-50x-power-users.htm) than the flat rate they were used to: $29 a month projected to $750, $50 to $3,000. You can watch it land in real time. The day I wrote this, the top thread on r/ClaudeAI was titled ["Back to the Stone Age? Our company slashed our AI budget and we're back to manual coding."](https://www.reddit.com/r/ClaudeAI/comments/1u6hyki/back_to_the_stone_age_our_company_slashed_our_ai/) The backlash isn't ideological. It's a spreadsheet.

## So what replaces it?

The replacement metric is already settled, and that's a good thing. Use it.

The FinOps Foundation, the standards body for cloud financial management, defined "[token yield rate](https://www.finops.org/insights/token-economics-the-atomic-unit-of-ai-value/)" as "the share of generated tokens that contributed to a downstream business action, after accounting for retries, abandoned sessions, and outputs that failed quality review." Glean's CEO Arvind Jain has been making the same argument: the right question isn't how many tokens a system consumes, it's the value per token, the useful outcome it produces for every token spent. In early June, the Linux Foundation [announced its intent to stand up a Tokenomics Foundation](https://www.linuxfoundation.org/press/linux-foundation-announces-the-intent-to-launch-the-tokenomics-foundation-to-establish-open-standards-for-ai-cost-management) to set open standards for AI cost management. The metric is converging from three directions at once.

I'm not going to pretend we coined token yield. We didn't. But I'll cop to describing the same thing on stage before I had the word for it: the job of a context engine is to take everything your team knows and compress it down, using the fewest tokens to give the most signal to the agent running the job. That's token yield in a sentence. The word is new. The idea is the work.
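
To make the FinOps Foundation definition quoted above concrete, here is a minimal sketch of how a team might compute it. Everything about the data shape is an assumption for illustration: the `Session` record, its field names, and the choice to credit a session's tokens all-or-nothing are mine, not part of the FinOps definition. Retries are modeled as separate sessions that did not lead to an action.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One agent session. All field names here are hypothetical."""
    generated_tokens: int          # output tokens produced in the session
    led_to_business_action: bool   # e.g. the change was merged or shipped
    passed_quality_review: bool    # output survived review
    abandoned: bool = False        # session dropped before finishing


def token_yield_rate(sessions: list[Session]) -> float:
    """Share of generated tokens that contributed to a downstream action."""
    total = sum(s.generated_tokens for s in sessions)
    if total == 0:
        return 0.0
    useful = sum(
        s.generated_tokens
        for s in sessions
        if s.led_to_business_action and s.passed_quality_review and not s.abandoned
    )
    return useful / total


# Made-up sample data: one useful session, one failed review, one abandoned.
sessions = [
    Session(40_000, led_to_business_action=True, passed_quality_review=True),
    Session(25_000, led_to_business_action=False, passed_quality_review=False),
    Session(35_000, led_to_business_action=False, passed_quality_review=True, abandoned=True),
]
print(f"token yield rate: {token_yield_rate(sessions):.2f}")
```

In this toy data only 40,000 of 100,000 generated tokens fed a downstream action, so the yield rate is 0.40. A real pipeline would probably want finer attribution than whole sessions, but the ratio has the same shape.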

## Why does naming the metric miss the point?

Here's where the current discourse stalls. Everyone names the metric. Almost nobody names the cause. "It's an architecture problem" is doing a lot of unexamined work in these posts.

So let me name it. Andrej Karpathy has been making a version of this point for a while: the thing holding agents back is less raw intelligence than the context and environment they operate in (his [conversation with Dwarkesh Patel](https://www.dwarkesh.com/p/andrej-karpathy) is the cleanest version). I'd put it more bluntly: access is not understanding. An agent with read access to your entire codebase and a dozen MCP servers still doesn't know why the code is the way it is. It knows code that compiles. It's a syntactic genius, and underneath the syntax there's intent it can't see. Tokens are how it pays for guessing at that intent.

## Where did the extra ten million tokens go?

That's the question that matters, because waste isn't abstract. It shows up as four mechanisms, and each one has a signal you can actually measure.

**Broad search.** With no authoritative source to go to, an agent queries everything. It reads, re-reads, and pattern-matches across half your repo to reconstruct context it should have been handed. In one [study of agentic software tasks](https://arxiv.org/abs/2601.14470), input tokens were the single largest share of consumption, 53.9% on average, and the iterative review-and-refinement stage ate 59.4% of all tokens, far more than actually generating code. Most of the bill is the agent talking to itself to figure out what's going on. The signal: tool calls per task.

**Re-search.** No memory of what it already learned, so it relearns it next turn. The context window inflates with noise, and noise makes the next decision worse. The signal: repeat-read ratio and cache hit rate.

**Mis-routing.** This is satisfaction of search, and I steal the example from radiology: a reader finds something that looks like the tumor, stops, and says "found it," and misses the second finding that was the actual problem. Agents do this constantly. They retrieve a plausible-looking but stale doc, treat it as ground truth, and confidently build on top of the wrong thing. The signal: wrong-source retrieval rate.

**Runaway loops.** Wrong assumption, retry, wrong assumption, retry, all the way to the budget wall. The research here is stark: on the same task, runs can differ by [up to 30x in total tokens](https://arxiv.org/abs/2604.22750), accuracy tends to peak at intermediate cost rather than maximum cost, and models can't reliably predict their own spend (their estimates correlate with actual usage at only up to about 0.39). The signal: retry rate and cost-per-session in the fat tail.

The counter-image is a waterfall. When an agent has the right data, every choice it makes leads to a better next choice. Compounding runs in both directions. That's the whole argument: bad context doesn't add cost linearly, it compounds it.
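
Three of the signals named above (tool calls per task, repeat-read ratio, retry rate) can be pulled straight from an agent's tool-call trace. The sketch below assumes a hypothetical `ToolCall` record and a tool literally named `read_file`; your agent's trace format will differ. Wrong-source retrieval rate and cache hit rate are left out on purpose: the first needs a human or a source-of-truth label, and the second comes from your model provider's usage data, not from the trace.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation in an agent trace. The schema is hypothetical."""
    task_id: str
    tool: str            # e.g. "read_file", "search", "run_tests"
    target: str          # file path or query string
    is_retry: bool = False


def waste_signals(calls: list[ToolCall]) -> dict[str, float]:
    """Compute tool calls per task, repeat-read ratio, and retry rate."""
    if not calls:
        return {"tool_calls_per_task": 0.0, "repeat_read_ratio": 0.0, "retry_rate": 0.0}

    tasks = {c.task_id for c in calls}

    # A repeat read is any read of a (task, target) pair already read in that task.
    reads = [(c.task_id, c.target) for c in calls if c.tool == "read_file"]
    read_counts = Counter(reads)
    repeat_reads = sum(n - 1 for n in read_counts.values())

    return {
        "tool_calls_per_task": len(calls) / len(tasks),
        "repeat_read_ratio": repeat_reads / len(reads) if reads else 0.0,
        "retry_rate": sum(c.is_retry for c in calls) / len(calls),
    }


# Made-up trace: two tasks, one re-read, one retried test run.
trace = [
    ToolCall("t1", "search", "rate limiter"),
    ToolCall("t1", "read_file", "src/limiter.kt"),
    ToolCall("t1", "read_file", "src/limiter.kt"),
    ToolCall("t1", "run_tests", "limiter", is_retry=True),
    ToolCall("t2", "read_file", "src/config.kt"),
    ToolCall("t2", "read_file", "docs/limits.md"),
]
print(waste_signals(trace))
```

The absolute numbers matter less than the trend. If tool calls per task or the repeat-read ratio climbs week over week, the agent is spending more of its budget reconstructing context.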

## Do self-running loops fix this or feed it?

Right now the loudest idea in AI coding is to stop prompting and start writing loops, agents that prompt other agents and run while you sleep. Boris Cherny, who leads Claude Code at Anthropic, [told Fortune](https://fortune.com/2026/06/11/anthropic-claude-boris-cherny-doesnt-write-code-by-hand-anymore/) that in most of his sessions now "it's actually another Claude that does the prompting." Peter Steinberger's "you should be designing loops that prompt your agents" [post](https://x.com/steipete/status/2063697162748260627) racked up millions of views, and Addy Osmani's ["Loop Engineering"](https://addyosmani.com/blog/loop-engineering/) writeup put a name on the practice.

Notice the crowd isn't sold. On Steinberger's post, the replies split roughly 40% "real shift" to 60% "premature or bad advice," and developers are already shipping circuit breakers: an "Ask HN: what works for cutting AI token costs?" thread and tools like Guardian Runtime, a local firewall that tracks agent token usage and enforces budgets, both surfaced on [Hacker News](https://news.ycombinator.com/item?id=48457585) this month. People sense the trap.

The trap is simple. A loop is an amplifier. Point a context-rich agent at a loop and good decisions compound overnight. Point a context-blind agent at the same loop and you've built a runaway cost center on a timer. The cost of bad context compounds with every step toward autonomy. If the agent doesn't know what it's supposed to be doing, you didn't save engineering time overnight. You just burned a ton of tokens proving it.
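
The circuit-breaker idea is simple enough to sketch. This is a minimal illustration, not how Guardian Runtime or any other tool works: `run_with_ceiling`, `BudgetExceeded`, and the `step` callable that returns `(tokens_spent, done)` are all hypothetical names chosen for the example.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a loop hits its token or step ceiling."""


def run_with_ceiling(
    step: Callable[[], tuple[int, bool]],
    max_tokens: int,
    max_steps: int,
) -> int:
    """Run `step` until it reports done, or stop at the ceiling.

    `step` is a hypothetical callable returning (tokens_spent, done).
    Returns total tokens spent on success.
    """
    spent = 0
    for _ in range(max_steps):
        tokens, done = step()
        spent += tokens
        if done:
            return spent
        if spent >= max_tokens:
            raise BudgetExceeded(f"token ceiling hit after {spent} tokens")
    raise BudgetExceeded(f"step ceiling hit after {max_steps} steps, {spent} tokens")


# A stand-in agent step that never converges, like a context-blind retry loop.
def stuck_step() -> tuple[int, bool]:
    return 50_000, False


try:
    run_with_ceiling(stuck_step, max_tokens=200_000, max_steps=100)
except BudgetExceeded as err:
    print(f"stopped: {err}")
```

The ceiling only bounds the damage. It stops the stuck loop after 200,000 tokens instead of at the budget wall, but it does nothing to make the next attempt converge. That part is still a context problem.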

## What does the controlled test show?

We ran the obvious experiment: same prompt, same model, same codebase, context on versus context off. I'll give you the honest version, because the honest version is more useful than a clean one.

The task was a real change in a large Kotlin service. Without context, the agent spent about 21 million tokens and two and a half hours, and an engineer had to sit there going "that's dumb," correcting it, and correcting it again. It literally broke things. If that output had gone to main, it would have taken down the service. With context, the same prompt on the same model spent about 10 million tokens, finished in 25 minutes, and needed zero corrections. A second benchmark told the same story: tool calls dropped from 80 to 29, turns from 50 to 19, and the no-context run simply ran out of turns and never delivered. (Methodology, full honesty: I took both outputs, handed them to Claude, and asked it to write up the comparison. You can read the [fuller write-up and how to reproduce it](https://getunblocked.com/blog/same-prompt-same-model-different-context).)

I want to be careful about the claim. These are controlled tests on single tasks, not a benchmark suite. The directional result is the point: with good context, you can cut the bill on a real task by something like half, while the output gets better instead of worse. Yield went up on both axes at once.
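
For anyone who wants to check the "something like half" arithmetic, here are the deltas computed from the approximate figures quoted above. The `percent_reduction` helper is just illustrative naming; the inputs are the rounded numbers from the text, with two and a half hours written as 150 minutes.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Approximate figures quoted in the text: (without context, with context).
runs = {
    "tokens (millions)": (21, 10),
    "wall-clock minutes": (150, 25),
    "tool calls (second benchmark)": (80, 29),
    "turns (second benchmark)": (50, 19),
}

for metric, (without_ctx, with_ctx) in runs.items():
    print(f"{metric}: {without_ctx} -> {with_ctx} "
          f"({percent_reduction(without_ctx, with_ctx):.0f}% lower)")
```

That works out to roughly 52% fewer tokens and 83% less wall-clock time on the first task, and about 64% fewer tool calls and 62% fewer turns on the second. These are single-task numbers, with the same caveats as above.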

## What to do this week

You don't need a platform to start. You need to stop measuring the wrong thing.

Instrument the four signals above: tool calls per task, repeat-read ratio, wrong-source retrieval rate, and retry rate with its cost-per-session fat tail. Track [cost per merged PR](https://getunblocked.com/blog/cost-per-merged-pr), not cost per token, because a merged PR is a downstream business action and a token is not. Put a ceiling on your loops before you let them run unattended, the same way the [auto-loop crowd is learning to](https://getunblocked.com/blog/agent-auto-loop-token-cost). And if you want a framework for the whole cost picture, the [AI tokenomics breakdown](https://getunblocked.com/blog/ai-tokenomics-cost-framework) lays it out.
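
A rough sketch of the cost-per-merged-PR calculation, assuming you can tag each agent run with the PR it was working toward. The `AgentRun` record and its fields are hypothetical. The one design choice that matters is that spend on rejected and abandoned runs stays in the numerator, because that waste is exactly what the metric is supposed to expose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    """One agent run tied to a pull request. Field names are hypothetical."""
    pr_id: str
    cost_usd: float
    merged: bool


def cost_per_merged_pr(runs: list[AgentRun]) -> Optional[float]:
    """Total spend, including failed and abandoned runs, per merged PR."""
    merged_prs = {r.pr_id for r in runs if r.merged}
    if not merged_prs:
        return None  # spend with nothing merged has no finite cost per PR
    return sum(r.cost_usd for r in runs) / len(merged_prs)


# Made-up sample data.
runs = [
    AgentRun("pr-101", 12.50, merged=False),  # first attempt, rejected in review
    AgentRun("pr-101", 7.50, merged=True),    # retry that merged
    AgentRun("pr-102", 20.00, merged=False),  # abandoned
    AgentRun("pr-103", 10.00, merged=True),
]
print(cost_per_merged_pr(runs))
```

Here $50 of total spend produced two merged PRs, so the figure is $25 per merged PR, even though the two successful runs cost only $17.50 between them.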

Then measure your own yield delta. Take one real task in your codebase and run it twice, context off and context on, and look at the token count and the number of times a human had to correct it. If you'd rather not run the experiment and just want the result, that's what we do at Unblocked: hand the agent decision-grade context from the systems your team already uses, so it spends tokens solving the problem instead of reassembling it. (A companion piece on the output side of this, measuring whether AI is actually making your team faster, is coming next.)

## The scoreboard

Tokens are the bill. Context is the cause. Yield is the scoreboard.

The teams that win the next year won't be the ones who spent the least or the most. They'll be the ones who got the most useful work out of every token, because their agents started every task already understanding the system. Get that right and the goal stops being "cheaper AI." It becomes something better: every PR looks like your best engineer on that part of your system wrote it.