Key Takeaways
• Prompt caching cuts up to 90% off stable prompt prefixes, so turn it on first.
• Batch non-urgent jobs for a flat 50% discount on a 24-hour window.
• Route easy calls to cheaper models: RouteLLM matched ~95% of GPT-4 quality while sending only ~26% of calls to the frontier model.
• Retrieve curated context instead of dumping it; the Unblocked test cut tokens 42% with 64% fewer tool calls.
• Bound agent loops, because agents burn around 4x and multi-agent systems around 15x the tokens of chat.
• Prune MCP tool definitions before they tax you roughly 42K tokens per call.
• Measure cost-per-useful-output (token yield), not cost per token.
Prompt caching can cut the cost of stable prompt prefixes by up to 90% and latency by 85% (Anthropic, 2025), and that is only the first of seven ways to reduce AI token costs. The goal here is simple: cut waste, not capability. Most advice on how to reduce AI token costs goes straight for the per-token price, cheaper models and discounts. That helps. But the biggest gains come from cutting tokens that never contributed to output: redundant context, runaway loops, tool-definition bloat.
The pressure is real. Enterprise spend on large language models more than doubled in six months, from $3.5 billion to $8.4 billion (Menlo Ventures, 2025), with coding now a roughly $4 billion category (Menlo Ventures, 2025). Here is the twist: per-token prices are falling fast. The price to reach a given capability has dropped between 9x and 900x per year (Epoch AI, 2025), yet bills keep climbing. That gap is waste.
These seven techniques are ranked from quickest win to highest value. The first three lower token price and carry almost no risk. The next three cut LLM token costs by removing waste. The last makes the whole thing measurable. If you only do one thing this week to reduce AI token costs, turn on caching. If you want durable AI cost reduction, fix your context, which is exactly what curated retrieval over context-dumping does.
Why do AI token costs balloon in the first place?#
Token bills rarely balloon because the per-token price is high. They balloon because of volume, and most of that volume never contributes to output. Anthropic's own pricing shows input and output rates that are fractions of a cent per thousand tokens (Anthropic, 2025). The problem is how many tokens you push through, not what each one costs.
So the useful distinction is token price versus token waste. Price is what you pay per token. Waste is tokens that get billed but do no work: a 50,000-token repo pasted into the window when 800 lines mattered, an agent re-reading the same file four times, a stale conversation history dragged along every turn. To reduce AI token costs sustainably, you cut the tokens that don't contribute to output. That single reframe drives the entire ranking below.
The data: Anthropic's published pricing lists input and output token rates at fractions of a cent per thousand tokens (Anthropic, 2025), which means runaway bills are usually a volume-and-waste problem, not a per-token price problem. For the full vocabulary, see the token yield framework.
Which token-cost techniques give the quickest win?#
The fastest savings come from lowering token price, and you can ship all three this week with zero capability loss. Techniques 1 through 3 (caching, batching, routing) cut LLM token costs without touching what your agent can do. Anthropic reports prompt caching alone removes up to 90% of cost on stable prefixes (Anthropic, 2025), which makes it the obvious starting point.
Then comes the harder, higher-value work. Techniques 4 through 6 (retrieval over dumping, bounded loops, MCP trimming) attack waste directly, the tokens that never earned their keep. Technique 7 instruments the whole system so you can prove a "cheaper" change did not quietly cost you quality. Quick wins first, durable wins second.
Citation capsule: Because prompt caching can cut up to 90% off stable prompt prefixes and 85% off latency (Anthropic, 2025), the cheapest, lowest-risk path to token cost optimization is to ship the three price levers before refactoring how context flows.
Technique 1: Turn on prompt caching#
Prompt caching is the highest-return change you can make today. Anthropic reports up to 90% cost reduction and 85% latency reduction on stable prompt prefixes that get reused across calls (Anthropic, 2025). If your agent sends the same system prompt and instructions every turn, you are paying full price to reprocess identical tokens.
How to apply it: cache the stable parts. System prompts, long instructions, document context, and tool definitions all qualify. Order stable content first so the cached prefix stays intact, then append the volatile user input last.
When it backfires: caching volatile prefixes. If the front of your prompt changes every call, you pay cache-write overhead and still miss the cache. On very short prompts the write cost can exceed the savings, so cache long, reused content only.
Per Anthropic's measurement, up to 90% cost reduction and 85% latency reduction follow when stable prompt prefixes are cached and reused across calls (Anthropic, 2025), making prompt caching the quickest, lowest-risk way to reduce AI token costs.
Technique 2: Batch non-urgent work#
Batching is a free 50% discount on anything that does not need an instant answer. OpenAI's Batch API charges a flat 50% less than synchronous calls for jobs returned within a 24-hour window (OpenAI, 2025), and Anthropic offers comparable batch pricing on its platform (Anthropic, 2025).
How to apply it: route offline work to the batch lane. Nightly evals, data backfills, bulk summarization, and offline classification all tolerate latency. Keep your interactive agent paths on real-time endpoints where users are waiting.
When it backfires: anything latency-sensitive. The 24-hour SLA makes batching useless for interactive agents or live chat. Send a user-facing request to the batch queue and you trade a 50% discount for an unusable product.
Citation capsule: OpenAI's Batch API applies a flat 50% discount versus synchronous pricing for jobs completed within a 24-hour window (OpenAI, 2025), so any offline workload is a free half-off lever for AI cost reduction.
How do you cut model cost without losing answer quality?#
You route by difficulty. Not every prompt needs your most expensive model, and a good router keeps quality high while sending most traffic to cheaper models. RouteLLM matched ~95% of GPT-4 quality while sending only ~26% of calls to the frontier model, an ~85% cost reduction on MT-Bench (RouteLLM, ICLR, 2025).
The principle: match model cost to task difficulty. Easy prompts get the cheap model. Hard prompts escalate. The savings come from the fact that most production traffic is genuinely easy.
Per the published benchmark: RouteLLM matched ~95% of GPT-4 quality while sending only ~26% of calls to the frontier model, an ~85% cost reduction on MT-Bench (RouteLLM, ICLR, 2025), turning model selection into a real token cost optimization lever.
Technique 3: Cascade and route to cheaper models#
Routing turns one expensive model into a tiered system. RouteLLM matched ~95% of GPT-4 quality while sending only ~26% of calls to the frontier model, an ~85% cost reduction on MT-Bench (RouteLLM, ICLR, 2025). That is most of your traffic at a fraction of the cost.
How to apply it: classify each prompt by difficulty, default to the cheap model, and escalate only when the prompt looks hard. A cascade is the simplest version: run the cheap model first, then fall through to an expensive one when confidence drops below a threshold. Independent research backs the cascade pattern. A Caltech team showed that adding early abstention to a cascade cut inference cost by 13.0% on average while reducing error by 5.0% (Caltech, arXiv, 2025), so a smarter cascade can spend less and answer better at the same time.
When it backfires: a bad router creates a quality cliff on hard prompts it misclassifies as easy. And on tiny workloads, the routing and classification overhead can eat the savings. See model routing for coding agents for the nuance.
Citation capsule: A well-tuned router can route the bulk of calls to cheaper models while preserving most strong-model quality (RouteLLM, ICLR, 2025), but a mis-tuned one trades savings for a quality cliff on the hardest prompts.
What is the single biggest source of wasted tokens?#
Context. The most expensive habit in agent design is treating the context window as free storage and dumping everything into it. Anthropic's context engineering guidance reframes context as a finite attention budget, not a warehouse, and argues that curated context beats stuffing the window (Anthropic, 2025).
This is the pivot from price to waste. Caching, batching, and routing lower what you pay per token. Context discipline removes tokens entirely. When you stop pasting whole repositories and start retrieving the relevant slice, the bill drops and quality often improves, because the model is not distracted by noise.
Citation capsule: Anthropic's engineering team describes context as a finite attention budget and recommends curating what enters the window rather than stuffing it (Anthropic, 2025), which makes context the single biggest source of wasted tokens to attack.
Technique 4: Retrieve curated context instead of dumping it#
Context is a finite attention budget, so retrieve the relevant slice rather than pasting the whole repo. In an Unblocked controlled test, an AI agent given curated context used 42% fewer tokens and made 64% fewer tool calls than the same agent dumping context (Unblocked, 2026). Same prompt, same model, different context.
How to apply it: retrieve the relevant slice with curated retrieval over context-dumping. This is the core idea behind Unblocked, the context engine that curates what agents retrieve. Prune stale conversation turns, compress verbose tool outputs, and pass the model what the task needs instead of everything you have. UserTesting found agents run 20-30% less productive without the right context (Unblocked, 2026), so curation protects quality as much as cost.
When it backfires: over-pruning. Drop context the task actually needed and the agent re-asks, and a re-ask can cost more tokens than the pruning saved.
Citation capsule: The Unblocked same-prompt test found curated context cut tokens 42% and tool calls 64% versus dumping (Unblocked, 2026), which is why retrieval is the highest-value waste-cutting lever for AI cost reduction.
Why do agentic workflows quietly multiply your bill?#
Because agents loop. A chat turn is one call, but an agent reads, plans, acts, and re-reads, often many times per task. Anthropic reports that agents use about 4x the tokens of chat interactions, and multi-agent systems use about 15x (Anthropic, 2025). The multiplier is invisible until the invoice arrives.
That multiplier is fine when each iteration earns its tokens. It becomes pure waste when a loop spins without converging: re-reading the same file, retrying a failed step with no new information, or wandering past the point of useful work. Bounding the loop is how you keep agentic power without the runaway bill.
Source: Anthropic measured agents consuming roughly 4x the tokens of chat and multi-agent systems roughly 15x (Anthropic, 2025), which is why unbounded agent loops quietly multiply token bills.
Technique 5: Bound your agent loops#
Agent loops need hard stops. With agents burning around 4x the tokens of chat and multi-agent systems around 15x (Anthropic, 2025), an unbounded loop is the fastest way to a surprise bill. The fix is to make the loop terminate on purpose, not by accident. This multiplier is about to get worse: Goldman Sachs projects agent token demand could rise as much as 24x in the next few years, and companies like Uber and Microsoft are already feeling the bite of tokenized billing (Goldman Sachs via Tom's Hardware, 2025). Bound the loop now, before the volume scales.
How to apply it: cap iterations, set a token or step budget per task, and add explicit stop conditions. Insert a verify step before the agent re-loops so it only continues when there is genuine new work to do, not just because it can.
When it backfires: caps set too low. If you truncate legitimate multi-step work, the agent returns incomplete output and the user triggers a costly retry that consumes more tokens than the cap saved. See agent auto-loop token cost for sizing budgets.
Citation capsule: Because agents use about 4x and multi-agent systems about 15x the tokens of chat (Anthropic, 2025), bounding loops with iteration caps and stop conditions is essential to reduce AI token costs in agentic workflows.
Technique 6: Trim the MCP tool-definition tax#
Tool definitions are tokens you pay before any real work starts. Connecting GitHub's MCP server can inject roughly 42,000 tokens of tool definitions into the context window on every call, before the agent does anything useful (Unblocked, 2026). That tax repeats every single request.
How to apply it: prune the tool list to what the task actually needs. Lazy-load tools instead of registering all of them upfront, and collapse verbose JSON schemas into the minimum the model needs to call them correctly. A focused toolset is cheaper and easier for the agent to reason about.
When it backfires: pruning a tool the agent needs later. Remove something mid-task and the agent has to re-plan, which costs tokens and time. See the MCP token budget autopsy for what to keep.
Citation capsule: Unblocked measured GitHub's MCP server injecting around 42,000 tokens of tool definitions per call before any work begins (Unblocked, 2026), making MCP tool trimming a concrete way to cut LLM token costs.
How do you know any of this is actually working?#
You measure cost-per-useful-output, not cost-per-token. Cheaper tokens mean nothing if quality drops and your agents produce fewer merged PRs. This is where most token cost optimization efforts go blind: they track price and never check whether the output still ships. The fix is to instrument outcomes.
We call the metric token yield: useful output per token spent. Lowering price moves the denominator. Raising yield moves the ratio that actually matters. A 42% token cut that produces the same merged PRs counts as a yield win, and yield is what compounds over time.
Citation capsule: Token yield, defined as useful output per token, reframes AI cost reduction around the ratio that compounds; the same-prompt test showing a 42% token cut at equal output (Unblocked, 2026) is a pure yield gain, not a price discount.
Technique 7: Measure cost-per-useful-output (token yield)#
The highest-value technique is the one that makes the other six honest. Token yield, useful output per token spent, is defined in the token yield framework, and it is the only metric that catches a "cheaper" change quietly lowering quality.
How to apply it: tag spend to outcomes, not calls. Attribute tokens to merged PRs, resolved tickets, or shipped features, then track cost-per-useful-output over time. When yield drops after a change, you caught a regression that a cost-per-token dashboard would have hidden.
When it backfires: a vanity yield metric. If you define useful output loosely you can show yield gains while shipping nothing, so tie it to merged PRs or resolved tickets, not raw call counts.
Why it ranks highest: raising yield beats lowering price. A 42% token cut that ships the same number of merged PRs is pure yield (Unblocked, 2026). Start measuring cost per merged PR and the right priorities become obvious.
Citation capsule: Instrumenting token yield, useful output per token (Unblocked, 2026), is the only one of the seven techniques that reveals when a cheaper change has quietly cut quality, which is why it ranks highest.
Start With the Cheapest Win#
Start with the levers that cost nothing in capability. Turn on prompt caching today for up to 90% off stable prefixes (Anthropic, 2025), move offline jobs to batch for a flat 50% discount (OpenAI, 2025), and route easy prompts to cheaper models. None of those throttle your agents. All three ship this week.
Then climb to the durable gains. Curated retrieval over context-dumping, bounded loops, and trimmed tool definitions cut the tokens that never contributed to output, and the same-prompt test puts that at 42% fewer tokens with 64% fewer tool calls. Finally, instrument token yield so you can prove it. Unblocked, the context engine for engineering, exists for exactly that fourth lever: feeding agents the right slice instead of the whole repo. Cut waste first. Capability stays.
Frequently asked questions#
How much can I realistically reduce AI token costs?#
Most teams can reach 50-90% depending on the lever. Prompt caching alone removes up to 90% on stable prefixes (Anthropic, 2025), batching adds a flat 50% on offline work (OpenAI, 2025), and curated context cut tokens 42% in the Unblocked test (Unblocked, 2026). Stack the levers for compounding savings.
Does cutting tokens hurt my agent's quality?#
Not if you cut waste instead of capability. Curated retrieval cut tokens 42% with no loss in output in the Unblocked controlled test, and UserTesting found agents run 20-30% less productive without the right context (Unblocked, 2026). Removing noise often improves quality, because the model attends to what matters.
Which technique should I implement first?#
Prompt caching. It delivers up to 90% cost reduction on stable prefixes and 85% lower latency (Anthropic, 2025), carries almost no risk, and ships in hours. Batching and routing follow as the next two quick wins before you take on the harder context-efficiency work.
What is token yield and why does it matter?#
Token yield is useful output per token spent, defined in the token yield framework. It matters because cheaper tokens are worthless if quality drops. A 42% token cut that ships the same merged PRs (Unblocked, 2026) raises yield; that is the ratio worth tracking.



