AI Tokenomics: How to Measure and Control What AI Coding Tools Actually Cost

Dennis Pilarinos · Jun 8, 2026 · Engineering Insights · Context Engineering

TL;DR: Two market camps fight over token price (route down to cheaper models versus burn more tokens for agent autonomy), and both ignore the variable that actually pays off: token yield, the share of spend that becomes shipped, mergeable code. Measure cost-per-useful-output, separate your price wins from your yield wins, and treat context as the real lever. The four spokes below go deep on routing, agent loops, merged-PR cost, and cutting token costs.

The two loudest pieces of AI-cost advice in 2026 contradict each other, and most engineering leaders are following both at once. One camp says route everything down: push tasks to cheaper models and watch the bill shrink. The other camp, led by the labs, ships self-running agent loops that auto-detect data and auto-write their own prompts, deliberately burning more tokens to buy autonomy.

AI tokenomics is the discipline that resolves this fight. It is the economics of how engineering teams spend tokens and what shipped output they get back. Here is the turn both camps miss: they are arguing about token price while ignoring token yield. The cheaper-token camp optimizes the cost of each call. The agent-loop camp accepts higher spend for capability. Neither asks the only question that matters to a CFO: did the spend produce mergeable work? That is what this discipline measures, and that is where the real money hides.

What is AI tokenomics?

AI tokenomics is the economics of how engineering teams spend tokens against the shipped output they get back. The signal is loud: enterprise LLM API spend more than doubled, from $3.5B to $8.4B, in just six months (Menlo Ventures, 2025), and coding accounts for roughly $4B of broader generative AI investment (Menlo Ventures, 2025).

That growth curve forces a discipline that did not exist two years ago. When tokens were a rounding error, nobody tracked them. Now they are a line item your finance team asks about by name. The field splits the problem into two halves: the price side (what each token costs) and the yield side (what fraction of spend turns into shipped code). Most teams have tooling for the first half and nothing for the second. They can tell you the per-million-token rate of every model they call. They cannot tell you how many of those tokens ended up in a merged pull request. That gap is the whole game, and it is why a $4B coding-tool market still struggles to prove its return.

Why are token prices falling while bills keep rising?

Token prices are collapsing, yet your invoice keeps climbing, and the reason is volume. The price to hit a fixed benchmark has fallen between 9x and 900x per year, a median of roughly 200x annually since 2024 (Epoch AI, 2025). Cheaper per token, far more tokens consumed.

This is the paradox that makes the cost side worth measuring at all. Falling unit prices feel like a discount, so teams loosen their guard. Then agentic usage explodes the count. Goldman Sachs Research projects token consumption could rise by over 24x in the coming years, driven mostly by agents that loop, retry, and self-prompt, with companies like Uber and Microsoft already feeling the pinch of tokenized billing (Goldman Sachs via Tom's Hardware, 2026).

Put the two trends side by side and the lesson lands. Price per unit drops fast. Units consumed rise faster. The product of the two, your actual bill, goes up. Optimizing the falling number while the rising number runs unchecked is how teams convince themselves they are saving money while spending more of it every quarter.
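The arithmetic behind that paradox fits in a few lines. This is a minimal sketch with made-up numbers: the 10x price drop and the starting volume are assumptions chosen for illustration, the 24x volume rise loosely echoes the Goldman projection, and `monthly_bill` is a hypothetical helper rather than any billing API.

```python
def monthly_bill(price_per_million_usd: float, tokens_millions: float) -> float:
    """Bill = unit price x units consumed."""
    return price_per_million_usd * tokens_millions


# Illustrative numbers only, not measurements.
before = monthly_bill(price_per_million_usd=10.0, tokens_millions=1_000)
# Assume the price falls 10x while token volume rises 24x.
after = monthly_bill(price_per_million_usd=1.0, tokens_millions=24_000)

print(f"before: ${before:,.0f}  after: ${after:,.0f}  change: {after / before:.1f}x")
```

Under these assumptions the unit price falls by a factor of ten and the bill still grows 2.4x, because volume outruns the discount.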

What is token yield, and why does it matter more than token price?

Token yield is the share of tokens you spend that produce shipped, mergeable output, as opposed to tokens burned on retries, rework, mis-routed tasks, and runaway agent loops. Low token yield means you are paying for motion, not output. The evidence that motion dominates is stark: only 39% of organizations report any EBIT impact from AI, most of it under 5%, with roughly 6% calling it significant (McKinsey, 2025).

Hold that definition still, because the rest of this piece returns to it verbatim. Token yield is the share of tokens you spend that produce shipped, mergeable output. It is the yield-side counterpart to every price-side metric the two market camps obsess over.

Price-side optimization asks: how cheap was that call? Route it down, cache it, trim the prompt. All real, all bounded. Yield-side optimization asks one harder question: did it ship? A team can cut its per-call cost in half and still see token yield crater if half the cheaper calls produce code that gets reverted. Spend is up across the industry. Realized value, the McKinsey numbers tell us, stays thin. That gap lives entirely on the yield side, and you cannot cache your way out of it.
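To make the definition concrete, here is a minimal sketch of how token yield could be computed from per-task records. The `TaskSpend` record and its `merged` flag are assumptions: in practice you would need some way to attribute tokens to tasks and tasks to merged PRs, and the token counts below are invented.

```python
from dataclasses import dataclass


@dataclass
class TaskSpend:
    tokens: int   # tokens consumed by the task, including retries
    merged: bool  # True if the task's output shipped in a merged PR


def token_yield(tasks: list[TaskSpend]) -> float:
    """Share of all tokens spent that went to tasks whose output merged."""
    total = sum(t.tokens for t in tasks)
    if total == 0:
        return 0.0
    shipped = sum(t.tokens for t in tasks if t.merged)
    return shipped / total


# Illustrative records: one task shipped, two burned tokens and never merged.
tasks = [
    TaskSpend(tokens=40_000, merged=True),
    TaskSpend(tokens=25_000, merged=False),
    TaskSpend(tokens=35_000, merged=False),
]
print(f"token yield: {token_yield(tasks):.0%}")
```

Notice that halving the per-token price leaves this number untouched. Yield only moves when a larger share of the tokens ends up in merged work.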

Does routing tasks to cheaper models actually save money?

The first camp swears by routing to cheaper models, and for the right tasks the move genuinely pays off. RouteLLM demonstrated that you can route about 85% of calls to a cheaper model while preserving roughly 95% of frontier-model quality (RouteLLM, ICLR, 2025). For classification, boilerplate, and well-scoped edits, that is real savings.

But routing only ever touches price per call, and that distinction bites hard. A mis-routed task tells the story. Send a gnarly refactor to a cheap model, get back code that does not compile, then retry it on the frontier model anyway. You have now paid for the cheap call, the failed attempt, the developer's review of broken output, and the expensive call you should have made first. The "savings" went negative. Routing is a price tool. It earns its place in your stack for tasks that genuinely tolerate a smaller model. The trap is treating it as a yield strategy, because the moment a route misses, the retry erases the discount and then some. We unpack the routing math in model routing for coding agents.
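The mis-route story can be written as an expected-cost calculation. This is a sketch under stated assumptions: every miss on the cheap model triggers a developer review of the broken output plus a frontier retry, and all dollar figures are invented. The function names are hypothetical.

```python
def expected_cost_with_routing(
    cheap_cost: float, frontier_cost: float, review_cost: float, miss_rate: float
) -> float:
    """Expected cost per task when a cheap-model miss falls back to the frontier model.

    The cheap call is always paid; a miss adds the review of the broken
    output and the frontier call that should have been made first.
    """
    return cheap_cost + miss_rate * (review_cost + frontier_cost)


def break_even_miss_rate(
    cheap_cost: float, frontier_cost: float, review_cost: float
) -> float:
    """Miss rate above which routing costs more than going straight to frontier."""
    return (frontier_cost - cheap_cost) / (review_cost + frontier_cost)


# Illustrative per-task costs in dollars (assumed, not measured).
cheap, frontier, review = 0.10, 1.00, 5.00

for miss_rate in (0.05, 0.30):
    cost = expected_cost_with_routing(cheap, frontier, review, miss_rate)
    print(f"miss rate {miss_rate:.0%}: ${cost:.2f} per task vs ${frontier:.2f} frontier-only")

print(f"break-even miss rate: {break_even_miss_rate(cheap, frontier, review):.0%}")
```

With these numbers routing wins comfortably at a 5% miss rate and loses at 30%. Once human review time is counted, the break-even point sits at a fairly low miss rate, which is why routing belongs on tasks that genuinely tolerate a smaller model.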

Do self-running agent loops burn tokens you can't see?

The second camp bets on self-running agent loops, which burn tokens at a rate most dashboards never surface. The hard data: multi-agent systems use roughly 15x more tokens than a chat interaction, single agents about 4x, and token usage alone explains around 80% of the variance in performance (Anthropic, 2025). Autonomy has a price, and it is paid in tokens.

That trade can be worth it. An agent that auto-detects the right files and writes its own prompts can solve problems a single chat turn never could. The autonomy is real. The risk is that the loop has no natural stopping point. An agent stuck on a flaky test will keep retrying, keep reading files, keep spending, and nothing in the default setup says "stop, this is not converging." That is the invisible burn: tokens consumed inside a loop you are not watching, on a task that may never ship. The discipline treats this as a budget problem with a hard cap, not a capability to admire. Set a ceiling per task. We cover loop-level controls in agent auto-loop token cost.
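A per-task ceiling can be as simple as a counter that the loop must charge on every step. The sketch below is one possible shape, not any agent framework's API: `TokenBudget`, `BudgetExceeded`, and `run_agent_loop` are hypothetical names, and `step` stands in for whatever function runs one agent iteration and reports the tokens it used.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a task spends past its token ceiling."""


class TokenBudget:
    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record usage and stop the task once the ceiling is crossed."""
        self.spent += tokens
        if self.spent > self.ceiling:
            raise BudgetExceeded(f"spent {self.spent} tokens, ceiling {self.ceiling}")


def run_agent_loop(
    step: Callable[[], tuple[bool, int]], budget: TokenBudget, max_steps: int = 50
) -> bool:
    """Run `step` until it reports done, the step limit hits, or the budget trips.

    `step` is assumed to return (done, tokens_used) for one iteration.
    """
    for _ in range(max_steps):
        done, tokens_used = step()
        budget.charge(tokens_used)
        if done:
            return True
    return False


# A stand-in for an agent stuck on a flaky test: never done, 1,000 tokens per try.
budget = TokenBudget(ceiling=5_000)
try:
    run_agent_loop(lambda: (False, 1_000), budget)
except BudgetExceeded as exc:
    print(f"stopped: {exc}")
```

Because usage is charged after each step, a task can overshoot the ceiling by at most one step's worth of tokens. Checking the remaining budget before each step would tighten that at the cost of needing a usage estimate up front.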

How do you measure cost-per-useful-output?

Cost-per-useful-output is the core AI tokenomics metric, and the formula is deliberately blunt: total token spend divided by shipped-and-merged units, such as PRs or fixes. It is not cost per call and not cost per token. Anthropic's own finding that token usage explains about 80% of performance variance (Anthropic, 2025) is exactly why you must anchor the denominator to shipped work, not activity: when spending more tokens reliably buys more apparent performance, activity metrics reward the spend itself, and only shipped work shows whether it paid off.

Here is why the three metrics diverge. Cost per token tells you the rate. Cost per call tells you how much each interaction cost. Neither tells you whether anything useful came out the other end.

A worked example, illustrative only. Suppose two teams each spend $10,000 in tokens this month.

| Team | Token spend | Merged PRs | Cost per merged PR |
| --- | --- | --- | --- |
| Team A | $10,000 | 200 | $50 |
| Team B | $10,000 | 50 | $200 |

Same spend. Same per-token rate. Team B's cost-per-useful-output is four times higher because it merged a quarter as many PRs for the same money; measured against Team A's rate, three quarters of Team B's spend went to motion that never merged. A cost-per-call dashboard would show the two teams as identical. Cost-per-useful-output makes the yield gap visible. That is the number to instrument first. We go deeper in cost per merged PR.
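The same worked example as code, using the illustrative figures for Team A and Team B. `cost_per_merged_pr` is a hypothetical helper; where the spend and merge counts come from (billing exports, your source-control system) is left as an assumption.

```python
def cost_per_merged_pr(token_spend_usd: float, merged_prs: int) -> float:
    """Total token spend divided by shipped-and-merged units."""
    if merged_prs == 0:
        # Spend with nothing merged: the cost per useful unit is unbounded.
        return float("inf")
    return token_spend_usd / merged_prs


# Illustrative figures from the worked example, not real data.
teams = {"Team A": (10_000, 200), "Team B": (10_000, 50)}

for name, (spend, merged) in teams.items():
    print(f"{name}: ${cost_per_merged_pr(spend, merged):,.0f} per merged PR")
```

Returning infinity for zero merged PRs is a design choice: it keeps a team that shipped nothing from silently dropping out of the report.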

What's the cheapest token optimization most teams skip?

Of all the legitimate price wins, prompt caching is the cheapest, yet a startling number of teams never turn it on. The payoff is large: caching can cut costs by up to 90% and latency by up to 85% on repeated context (Anthropic, 2025). If your agents re-send the same system prompt and codebase context on every call, you are paying full freight for tokens the model has already seen.

So yes, do the easy price wins. Cache aggressively. Route the tasks that tolerate a smaller model. Trim prompts that have gone bloated. These are real, and skipping them is leaving money on the table.

But notice the ceiling. Caching tops out near 90% on the repeated portion of your context, and routing tops out at the share of tasks a cheaper model can safely handle. Both are bounded by definition. Once you have captured them, the price side is largely tapped out, and your bill is still governed by how many tokens it takes to produce a merged PR. That number is token yield seen from the other side, and it has no comparable ceiling, which is why it is the larger lever. The price wins are the warm-up. We catalog them in how to reduce AI token costs.
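The caching ceiling is easy to see in a simplified model. The sketch assumes cached input tokens are billed at a flat 90% discount (the "up to 90%" figure above) and ignores cache-write premiums, cache expiry, and output tokens, all of which vary by provider. `remaining_input_cost` is a hypothetical helper.

```python
def remaining_input_cost(repeated_share: float, cache_discount: float = 0.9) -> float:
    """Fraction of the uncached input cost still paid once caching is on.

    repeated_share: share of input tokens that repeat across calls (0..1).
    cache_discount: assumed discount on cached tokens (0.9 = 90% cheaper).
    """
    if not 0.0 <= repeated_share <= 1.0:
        raise ValueError("repeated_share must be between 0 and 1")
    return 1.0 - repeated_share * cache_discount


for share in (0.0, 0.7, 1.0):
    print(f"{share:.0%} repeated context -> pay {remaining_input_cost(share):.0%} of input cost")
```

Even if every input token repeated, this model bottoms out at 10% of the original input cost. The saving is real, and it cannot grow past the share of context that actually repeats.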

Why is context the real lever on token yield?

Context is the variable that moves token yield, and the mechanism is now well documented. Anthropic frames context as a finite "attention budget" and describes context rot, the degradation that sets in as the token count climbs and relevant signal gets diluted (Anthropic, 2025). Feed an agent the wrong context and it retries, re-reads, and loops, which is precisely how token yield collapses.

We have measured the other direction first-hand. In a controlled test holding the prompt and the model fixed and changing only the context, better context cut tokens by 42%, ran 27% faster, and required 64% fewer tool calls (Unblocked, 2026). Same question, same model, decision-grade context, far less waste.

This is where the "context tax" connects to AI tokenomics. The overhead of loading bloated or irrelevant context into every call is a real cost, and it compounds across every interaction. Reduce it and yield rises without touching the model or the prompt. The proof and the tax math live in same prompt, same model, different context and the MCP token budget autopsy.

What does low token yield look like in delivery data?

Low token yield shows up as motion without delivery, and the clearest measurement comes from production teams. High-AI-adoption teams merged 98% more pull requests, yet code review time rose 91% and there was no gain in org-level delivery (Faros AI, 2025). More PRs, more review load, flat output: that is low yield made visible in the delivery pipeline.

Read that pattern carefully, because it is the same cost paradox expressed in human time rather than dollars. The tokens produced 98% more PRs. The organization shipped no more value. The extra PRs did not vanish; they piled into review queues and consumed 91% more reviewer attention. Every one of those PRs cost tokens to generate and human hours to evaluate, and the net to the business was zero.

That is what paying for motion looks like at the team level. The fix is not generating more PRs faster. It is raising the share of generated work that is genuinely mergeable, which is token yield by another name. We connect this delivery signal to a fuller measurement model in context-adjusted productivity and how to measure AI productivity.

A practical AI tokenomics starter framework

You can stand up a working token-economics practice this quarter, and the supporting numbers make the sequence obvious. With enterprise LLM API spend more than doubling to $8.4B in six months (Menlo Ventures, 2025) and only 39% of orgs seeing any EBIT impact (McKinsey, 2025), the priority is proving yield, not chasing price.

Four moves, in order.

Instrument cost-per-useful-output first

Wire up total token spend divided by shipped-and-merged units before you change anything else. Without this number, every other decision is guesswork. It is the baseline the whole practice hangs on.

Separate price wins from yield wins

Put caching and routing in one column (price) and context quality in another (yield). Capture the price wins fast, because they are easy, then turn your attention to yield, where the bigger and uncapped gains live.

Set a runaway-loop budget cap

Give every agent task a hard token ceiling. Multi-agent systems already run 15x hotter than chat (Anthropic, 2025), so an uncapped loop on a flaky task can quietly drain a budget.

Baseline before you switch models

Lock your cost-per-useful-output number before swapping models. Otherwise you cannot tell whether a model change helped yield or just moved the price around.
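One way to tell the two effects apart is to split the change in cost per merged PR into a price factor and a tokens-per-PR factor, since cost per PR equals price per token times tokens per PR. This is a sketch with invented before-and-after figures; `Period` and `decompose` are hypothetical names, and real data would need the same measurement window on both sides of the switch.

```python
from dataclasses import dataclass


@dataclass
class Period:
    spend_usd: float
    tokens: int
    merged_prs: int


def decompose(before: Period, after: Period) -> tuple[float, float, float]:
    """Return (price_factor, tokens_per_pr_factor, cost_per_pr_factor).

    cost per PR = (spend / tokens) * (tokens / merged PRs), so the overall
    change is the product of the two factors.
    """
    price_factor = (after.spend_usd / after.tokens) / (before.spend_usd / before.tokens)
    tokens_per_pr_factor = (after.tokens / after.merged_prs) / (
        before.tokens / before.merged_prs
    )
    return price_factor, tokens_per_pr_factor, price_factor * tokens_per_pr_factor


# Illustrative only: a switch to a cheaper model that needs more tokens per merged PR.
before = Period(spend_usd=10_000, tokens=1_000_000_000, merged_prs=100)
after = Period(spend_usd=6_000, tokens=1_200_000_000, merged_prs=80)

price, tokens_per_pr, total = decompose(before, after)
print(f"price per token: {price:.2f}x")
print(f"tokens per merged PR: {tokens_per_pr:.2f}x")
print(f"cost per merged PR: {total:.2f}x")
```

In this made-up case the cost per merged PR falls to 0.75x, but only because the price halved. Tokens per merged PR got 50% worse, which is the kind of yield regression a bill-only view would hide.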

Frequently asked questions

What is AI tokenomics?

AI tokenomics is the economics of how engineering teams spend tokens versus the shipped output they get back. It has two sides: the price side (what each token and call costs) and the yield side (what fraction of spend becomes mergeable code). With enterprise LLM API spend at $8.4B and rising (Menlo Ventures, 2025), both sides now matter.

What is token yield?

Token yield is the share of tokens you spend that produce shipped, mergeable output, rather than tokens lost to retries, rework, mis-routes, and runaway loops. Low token yield means you are paying for motion, not output. McKinsey found only 39% of orgs see any EBIT impact from AI (McKinsey, 2025), a classic low-yield signal across the industry.

Is routing to cheaper models the best way to cut AI coding costs?

Routing helps on price per call and works well for tasks a smaller model handles safely. RouteLLM kept roughly 95% of frontier quality while routing 85% of calls down (RouteLLM, ICLR, 2025). But a mis-routed task that needs a frontier retry erases the saving. Yield matters more than per-call price.

How do you calculate cost-per-useful-output?

Divide total token spend by shipped-and-merged units (merged PRs, accepted fixes), not by calls or tokens. Two teams can spend the same and post wildly different cost-per-useful-output if one ships four times the merged work. Because token usage explains about 80% of performance variance (Anthropic, 2025), anchoring the denominator to shipped output is essential.

Where the Spend Should Go

Price and yield are different problems, and only one of them is underpriced. The industry has spent two years getting good at the price side: cheaper models, prompt caching, smarter routing. Those wins are real and bounded, and you should take them. But the falling-price story (down as much as 900x per year, per Epoch AI, 2025) has lulled teams into optimizing the one number that was always going to drop on its own, while the number that decides whether AI pays off, the share of spend that ships, goes unmeasured.

Put the next dollar of effort on yield. Instrument cost-per-useful-output, cap your loops, and treat context as the lever it is. In our controlled test, better context cut tokens 42% on the same prompt and model, which is why we built the context engine for engineering teams to surface the reasoning behind a change, the part raw code never explains. Start with the number that tells you whether any of it shipped.