Key Takeaways
• Routing saves real money on simple, well-scoped tasks where a cheaper model one-shots the work.
• On complex tasks, a smaller model's "almost right" output triggers retries and rework that can cost more than one frontier call.
• A router needs context to classify task complexity correctly, or it mis-routes hard work to weak models.
• The cheap worker model also needs context to one-shot the task, because smaller models degrade faster without it.
• Judge routing by token yield, the share of tokens that ship. The monthly invoice falls before you know whether the savings are real, so it misleads.
A team installs model routing for coding agents: an LLM router that sits in front of their fleet. The next monthly invoice lands 30-40% lower, and leadership celebrates the model routing cost savings. Then, over the following weeks, something shifts. Iteration count climbs. Pull requests bounce back from review. The agent quietly re-opens tasks it had marked "finished," and engineers spend afternoons untangling output that looked right but wasn't.
So why did the bill fall while the savings disappeared? Cheaper models look irresistible right now because per-token prices are collapsing: the inference price to reach a given capability fell anywhere from 9x to 900x per year, depending on the task (Epoch AI, 2025). When prices drop that fast, routing to the cheap tier feels like free money. The catch is that the per-call price is not the bill that matters.
Model routing for coding agents optimizes the price of each individual call, but it can quietly destroy your token yield: the share of tokens that produce shipped, mergeable output versus tokens burned on retries, rework, and mis-routes. The invoice drops the moment you route cheap. Whether you actually saved money depends on what those cheaper calls produced, and that is a different question entirely. (We define token yield in full in our AI tokenomics framework.)
What is model routing for coding agents?
Model routing for coding agents is a dispatch pattern: a router classifies an incoming task and sends it to the cheapest model judged capable of handling it. Easy work routes to a cheaper model; hard work escalates to a frontier model. The best evidence that this works comes from RouteLLM, which trained a router that kept roughly 95% of GPT-4 quality while sending only about 14-26% of calls to the strong model. That leaves up to ~85% of calls on the cheap model, for a 75-85% cost cut versus always running the frontier model (RouteLLM, Ong et al., ICLR 2025).
That is the best case, and it is genuinely impressive. But notice the framing distinction. An LLM router classifies the task up front and dispatches once. Model cascading does something different: it tries the cheap model first and escalates only when the output fails a quality check. Routing gambles on a single up-front guess. Cascading pays for a second attempt when the first falls short. Both aim at the same prize, and both depend on a variable the marketing rarely mentions. The rest of this piece complicates that best case.
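To make the routing-versus-cascading distinction concrete, here is a minimal sketch. Everything in it is hypothetical: the `Model` callable type, the toy `cheap_model` and `frontier_model` stand-ins, and the `looks_hard` and `quality_check` heuristics are illustrations, not any real router's API.

```python
from typing import Callable

# Hypothetical model interface: takes a task string, returns output text.
Model = Callable[[str], str]


def route(task: str, is_hard: Callable[[str], bool],
          cheap: Model, frontier: Model) -> str:
    """Routing: classify once up front, then call exactly one model."""
    return frontier(task) if is_hard(task) else cheap(task)


def cascade(task: str, passes_check: Callable[[str], bool],
            cheap: Model, frontier: Model) -> str:
    """Cascading: try the cheap model first, escalate only if the check fails."""
    draft = cheap(task)
    return draft if passes_check(draft) else frontier(task)


# Toy stand-ins so the sketch runs without any API access.
calls: list[str] = []


def cheap_model(task: str) -> str:
    calls.append("cheap")
    # Pretend the cheap model cannot finish refactors.
    return "TODO" if "refactor" in task else f"done: {task}"


def frontier_model(task: str) -> str:
    calls.append("frontier")
    return f"done: {task}"


def looks_hard(task: str) -> bool:
    return "refactor" in task


def quality_check(output: str) -> bool:
    return output.startswith("done")
```

On a hard task, `route` makes one frontier call if the classifier is right, while `cascade` always pays for the cheap attempt before escalating. The trade is a classification risk against a guaranteed extra call.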
Why does routing save money on simple tasks?
Routing saves real money on simple tasks because the price spread between tiers is large and the work is genuinely easy. Anthropic's published rates run from Haiku 4.5 at $1 input / $5 output per million tokens, to Sonnet 4.6 at $3/$15, to Opus at $5/$25 (Anthropic pricing). OpenAI's per-model pricing shows a comparable tier spread (OpenAI pricing). Routing a trivial task to the bottom tier captures that gap directly.
Consider a boilerplate rename, a one-line config edit, or a straightforward unit test. A cheaper AI model handles these in a single pass, and the frontier premium buys nothing. Paying Opus output rates for a variable rename is pure waste; routing it to Haiku is the obvious move. This is the legitimate core of model routing for coding agents, and on a high-volume stream of small, well-scoped tasks the savings compound fast. The discipline of choosing to route to a cheaper model when the task truly is cheap is the whole point. The danger starts when "simple-looking" and "simple" diverge, which is exactly what happens next. For more on this, see how to reduce AI token costs.
Why can a cheaper model cost more on complex tasks?
A cheaper model costs more on complex tasks through the "almost right" trap. The model returns output that compiles but misses an edge case or breaks a team convention. The agent retries, a human debugs, the PR bounces, and each loop re-pays input tokens. The pain is well documented: in the Stack Overflow 2025 Developer Survey, 66% of developers named "AI solutions that are almost right, but not quite" a top frustration, and 45% said debugging AI-generated code is more time-consuming than they expected (Stack Overflow 2025).
Here is the mechanism in token terms. The cheap call consumes three or four passes plus human review instead of one clean frontier pass, so a call that should cost a fraction of frontier ends up costing a multiple once you count the rework. Worse, you often cannot predict which task will balloon. On SWE-bench Verified, agentic runs on the same task have differed by up to 30x in total tokens, and models underestimate their own cost, with self-prediction correlation no higher than 0.39 (How Do AI Agents Spend Your Money?, arXiv 2026). A cheap mis-route is a gamble on a number nobody can see in advance.
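A rough sketch of that mechanism, using the Haiku and Opus rates quoted earlier in this piece. The `Tier` class, the `task_cost` helper, and the token counts are illustrative assumptions, and the model deliberately ignores human review time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Rates as quoted earlier in this article (Anthropic pricing).
HAIKU = Tier("haiku", 1.0, 5.0)
OPUS = Tier("opus", 5.0, 25.0)


def task_cost(tier: Tier, passes: int, input_tokens: int, output_tokens: int) -> float:
    """Model spend for a task that takes `passes` attempts.

    Assumes every pass re-sends the same input and emits a similar amount of
    output. Real retries usually carry extra context, so treat this as a floor.
    """
    per_pass = (input_tokens * tier.input_per_mtok
                + output_tokens * tier.output_per_mtok) / 1_000_000
    return passes * per_pass


# Illustrative token counts, not measurements.
IN_TOK, OUT_TOK = 40_000, 4_000

one_opus_pass = task_cost(OPUS, 1, IN_TOK, OUT_TOK)
four_haiku_passes = task_cost(HAIKU, 4, IN_TOK, OUT_TOK)
six_haiku_passes = task_cost(HAIKU, 6, IN_TOK, OUT_TOK)
```

With these counts, four Haiku passes come to about $0.24 against $0.30 for one Opus pass, and model spend alone only overtakes Opus at the sixth pass. The human review and debugging time sits on top of that, which is why the crossover tends to arrive sooner in practice.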
Why isn't cheaper always lower quality, or higher cost?
Cheaper is not a clean proxy for weaker, because the cost-versus-quality curve is non-linear. Models sitting at the same intelligence tier vary widely in price, and models at very different prices sometimes land close in quality (Artificial Analysis). "Cheapest" and "weakest" are not the same axis. A naive LLM router that sorts purely by sticker price will mis-route in both directions: overpaying for easy work and starving hard work.
The lesson is to route by task fit. Price rank alone makes a poor policy. A mid-tier model may one-shot a task that a slightly cheaper model would botch, making the mid-tier choice the actual cost-saving move once you count yield. Picking the cheapest model that fits the task beats picking the cheapest model on the list. That distinction sounds obvious, yet most routing setups bake price ordering directly into the policy. Fit depends on what the task actually requires, and the router can only judge that if it can see the task clearly, which raises the question of what the router is actually looking at.
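One way to express "cheapest model that fits" is to rank candidates by expected cost to ship rather than by price per attempt. The sketch below assumes you can estimate a one-shot success probability per model and task, which is itself the hard part. The `Candidate` class and all numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_attempt: float  # USD for one attempt at this task
    p_one_shot: float        # estimated chance an attempt ships, in (0, 1]


def expected_cost_to_ship(c: Candidate) -> float:
    """Expected spend until the first shipped attempt.

    Assumes independent retries with a fixed success chance, so the expected
    number of attempts is 1 / p (geometric distribution).
    """
    return c.cost_per_attempt / c.p_one_shot


def pick_by_price(candidates: list[Candidate]) -> Candidate:
    """Naive policy: lowest sticker price wins."""
    return min(candidates, key=lambda c: c.cost_per_attempt)


def pick_by_fit(candidates: list[Candidate]) -> Candidate:
    """Fit-aware policy: lowest expected cost to ship wins."""
    return min(candidates, key=expected_cost_to_ship)


# Illustrative numbers for one cross-cutting task, not benchmarks.
candidates = [
    Candidate("small", 0.06, 0.20),
    Candidate("mid", 0.18, 0.80),
    Candidate("frontier", 0.30, 0.95),
]
```

Here `pick_by_price` selects the small model, while `pick_by_fit` selects the mid-tier one: at a 20% one-shot rate the small model's expected cost to ship is about $0.30, against roughly $0.23 for the mid tier.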
What does a router need to classify task complexity correctly?
A router needs context to classify task complexity correctly. A router that sees only the raw prompt mis-grades difficulty constantly. A "small bug fix" that quietly touches three services looks trivial in the prompt text alone. The surrounding code, prior architectural decisions, and team conventions are what reveal the real complexity. Anthropic's engineering team frames context as a finite attention budget and notes that without curated context, models work from a thin, degraded signal (Anthropic, context engineering, 2025).
When the classification signal is thin, routing-by-price errors compound. The router grades a cross-cutting change as easy, sends it to a cheaper model, and the "almost right" trap springs shut. This is the first missed variable. The router itself is a model making a judgment, and like any model it produces better judgments with better inputs. Feeding the classifier real context, the institutional knowledge of why the code looks the way it does, is what lets it tell a genuine one-liner from a one-liner that detonates across the codebase. Curated context here is not a nicety; it is the difference between routing and guessing. See our MCP token budget autopsy for how context delivery itself can leak tokens.
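A toy illustration of the same point: the grader below gives a different answer once it can see signals from outside the prompt. `TaskContext` and its fields are hypothetical placeholders for whatever context a real router would receive, and the keyword heuristic is intentionally crude.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    """Hypothetical context signals a router could be given beyond the prompt."""
    services_touched: set[str] = field(default_factory=set)
    convention_sensitive: bool = False  # e.g. the change hits a team convention


def grade_from_prompt(prompt: str) -> str:
    """Prompt-only heuristic: 'small'-sounding requests look easy."""
    easy_words = ("small", "quick", "one-line", "rename")
    return "easy" if any(w in prompt.lower() for w in easy_words) else "hard"


def grade_with_context(prompt: str, ctx: TaskContext) -> str:
    """Same heuristic, but context can override an 'easy' grade."""
    if len(ctx.services_touched) > 1 or ctx.convention_sensitive:
        return "hard"
    return grade_from_prompt(prompt)


prompt = "Small bug fix: null check in the invoice total"
ctx = TaskContext(services_touched={"billing", "ledger", "notifications"})
```

The prompt alone grades as easy. With the three touched services visible, the same request grades as hard and would escalate instead of landing on the cheap tier.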
Why does the cheap worker model still need context to one-shot the task?
Even a correctly routed task fails if the cheap worker model lacks context, because smaller models degrade faster than frontier models when context is missing or noisy. The cheap model is precisely the one that most needs curated context to land the work in a single pass. In an Unblocked controlled test, the same prompt on the same model with curated context used 42% fewer tokens, ran 27% faster, and made 64% fewer tool calls (Unblocked controlled test). Good context turns a model that would have flailed into one that one-shots the task.
This is the second missed variable, and it pairs with the first. The router needs context to grade the task; the worker needs context to complete it. Skip either and model routing for coding agents optimizes the per-call price while the savings leak out through retries. There is encouraging research on the cascading side, too. Adding early abstention to a cascade cut inference cost 13.0% on average while also reducing error 5.0% across six benchmarks (Cost-Saving LLM Cascades with Early Abstention, Caltech, arXiv 2025). Knowing when a model should decline rather than guess protects yield. Context is what makes that judgment, on both the router and the worker side, reliable. Unblocked, the context engine for engineering, exists to feed both decisions.
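As a loose illustration of declining rather than guessing, here is a confidence-gated cascade with an early-abstain branch. This is not the method from the cited paper. The `ScoredModel` interface, the thresholds, and the toy models are assumptions, and where the confidence score comes from (tests, a verifier, a calibrated estimate) is left open.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: a model returns (output, confidence in [0, 1]).
# Raw self-reported confidence is often unreliable; a real system would need
# a calibrated signal here.
ScoredModel = Callable[[str], Tuple[str, float]]


def cascade_with_abstention(
    task: str,
    cheap: ScoredModel,
    frontier: ScoredModel,
    accept_at: float = 0.8,
    abstain_below: float = 0.2,
) -> Optional[str]:
    """Return an output, or None to hand the task to a human."""
    out, conf = cheap(task)
    if conf >= accept_at:
        return out                # cheap model is confident: ship it
    if conf < abstain_below:
        return None               # abstain early: skip the frontier call
    out, conf = frontier(task)    # middle band: escalate once
    return out if conf >= accept_at else None


# Toy stand-ins so the sketch runs without any API access.
calls: list[str] = []
_CHEAP_SCORES = {"rename flag": 0.95, "fix flaky test": 0.5, "redesign sharding": 0.05}


def cheap_model(task: str) -> Tuple[str, float]:
    calls.append("cheap")
    return f"patch for {task}", _CHEAP_SCORES.get(task, 0.5)


def frontier_model(task: str) -> Tuple[str, float]:
    calls.append("frontier")
    return f"patch for {task}", 0.9
```

The design choice worth noting is the lowest band: when the cheap model is nearly certain it cannot do the task, the sketch stops instead of spending a frontier call or shipping a guess.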
How do you tell real savings from vanishing ones?
You tell real savings from vanishing ones by measuring token yield, not the invoice. The invoice falls the instant you route cheap, so it is the wrong signal. Recall the spread: Haiku at $1/$5 versus Opus at $5/$25 per million tokens (Anthropic pricing). For the same token counts, a Haiku call costs roughly a fifth as much as an Opus call, so three or four retries plus a debugging session can quietly exceed one clean Opus pass. The cheaper call only wins if it ships.
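The break-even can be written down under simple assumptions: every pass uses the same token counts, one frontier pass costs c, one cheap pass costs c/5 at the rates above, the cheap model needs n passes in total, and R is the rework cost (human debugging time priced in the same currency). The symbols are introduced here for illustration only.

```latex
% Cheap routing loses when cheap passes plus rework exceed one frontier pass:
\[
  n \cdot \frac{c}{5} + R \;>\; c
  \quad\Longleftrightarrow\quad
  R \;>\; c\left(1 - \frac{n}{5}\right)
\]
% n = 4 (one attempt plus three retries): loses once R > c/5,
%        i.e. rework worth more than a single cheap pass.
% n = 5 (one attempt plus four retries): loses for any R > 0.
```

So by the third or fourth retry, almost any amount of human debugging is enough to erase the price gap.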
Here is the decision rule. Track retries and rework per routed task, then compare shipped-output tokens to total tokens before and after you enable routing. If shipped-token share holds or rises while the bill falls, the savings are real. If shipped-token share drops, the bill fell because work moved into a rework loop rather than because you got more efficient. Yield is the only number that tells you whether routing actually helped. We walk through the full token-yield definition and the surrounding economics in our AI tokenomics framework, and we connect it to delivery measurement in context-adjusted productivity.
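The decision rule above can be sketched in a few lines. The `TaskRecord` fields are illustrative rather than taken from any real tool, and "shipped tokens" is assumed here to mean the tokens of the attempt that actually merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """One routed task. Field names are illustrative, not from a real tool."""
    total_tokens: int    # every token spent, including retries and rework
    shipped_tokens: int  # tokens from the attempt that merged (0 if none did)
    cost_usd: float


def token_yield(records: list[TaskRecord]) -> float:
    """Share of all tokens that ended up in shipped output."""
    total = sum(r.total_tokens for r in records)
    return sum(r.shipped_tokens for r in records) / total if total else 0.0


def verdict(before: list[TaskRecord], after: list[TaskRecord]) -> str:
    """Compare the bill and the yield before and after enabling routing."""
    bill_fell = sum(r.cost_usd for r in after) < sum(r.cost_usd for r in before)
    yield_held = token_yield(after) >= token_yield(before)
    if bill_fell and yield_held:
        return "real savings"
    if bill_fell:
        return "bill fell, yield dropped: work moved into rework"
    return "no savings"


# Illustrative data, not measurements.
before = [TaskRecord(50_000, 40_000, 0.30), TaskRecord(60_000, 45_000, 0.36)]
after = [TaskRecord(150_000, 40_000, 0.18), TaskRecord(200_000, 0, 0.24)]
```

In this made-up sample the bill falls from $0.66 to $0.42 while yield drops from about 77% to about 11%, so `verdict(before, after)` reports the rework case rather than a saving.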
Frequently asked questions
How does model routing work?
Model routing for coding agents is a dispatch pattern where a router classifies each incoming task and sends it to the cheapest model judged capable of handling it. Simple work routes to a cheaper AI model, while hard tasks escalate to frontier models. RouteLLM showed a trained router can hold ~95% of frontier quality while routing most calls cheap (RouteLLM, ICLR 2025).
Does routing to a cheaper model always save money?
No. On complex tasks, a smaller model can produce "almost right" output that triggers retries and rework, which can exceed the cost of one frontier call. In the Stack Overflow 2025 survey, 66% of developers cited near-miss AI output as a top frustration and 45% said debugging AI-generated code is more time-consuming than they expected (Stack Overflow 2025). The per-call price drops, but token yield can collapse.
What is the difference between model routing and model cascading?
Routing classifies the task up front and dispatches once to a single chosen model. Model cascading tries the cheap model first and escalates to a stronger one only when the output fails a quality check. Cascading can pair with early abstention to cut inference cost 13.0% and reduce error 5.0% at once (Caltech, arXiv 2025). Routing gambles on one guess; cascading pays for a retry.
How do I measure whether model routing actually saved money?
Track token yield rather than the per-call price or the monthly bill. Compare shipped-output tokens to total tokens, including every retry and rework loop, before and after enabling routing. The invoice falls the moment you route cheap, but pricing spreads as wide as $1/$5 to $5/$25 per million tokens mean a few retries can erase the gap (Anthropic pricing).
How to Route Without Losing the Savings
Routing works when you respect what it actually optimizes. Route by task fit, not by sticker price, because the cost-versus-quality curve is non-linear and the cheapest model is not always the right one. Feed the router context so it grades complexity correctly instead of mistaking a cross-cutting change for a one-liner. Feed the worker context so it one-shots the task instead of leaking the savings back through retries. And measure token yield instead of the invoice, because the bill drops before you know whether anything shipped.
The thread running through all four is context. A router starved of context guesses; a cheap worker starved of context flails; and a team watching only the invoice celebrates a saving that rework is busy erasing. Unblocked feeds coding agents decision-grade context, the kind that lets both the router and the worker make cheaper, more accurate calls on the same prompt and the same model. Get the context right, and model routing for coding agents stops being a gamble and starts being a genuine, measurable saving.



