
Where Your MCP Token Budget Actually Goes: A 2026 Autopsy

Dennis Pilarinos · May 14, 2026 · Context Engines · Engineering Insights

Open Claude Code in a typical engineering setup. Seven MCP servers installed: GitHub, Jira, Slack, Notion, Confluence, Playwright, Sentry. A CLAUDE.md with team conventions. Two project rules files. Before you type a single character, your context window already holds an itemized bill.

| Line item | Tokens (approx) |
|---|---|
| GitHub MCP tool definitions | ~42,000 |
| Playwright MCP tool definitions | ~13,600 |
| Other MCP tool definitions (sum) | ~9,800 |
| System prompt + Claude Code preamble | ~3,500 |
| CLAUDE.md + project rules | ~2,400 |
| Subtotal — context tax | ~71,300 |

That is the bill before your prompt, and it is the MCP token cost few teams audit. Type "fix the failing test in auth.spec.ts" and you have already spent about 35% of a 200K context window on overhead. Scott Spence, writing about his own Claude Code stack in 2025, described burning through a third of the window just loading tools. Joe Njenga's production session measured 51K tokens of MCP bloat before Anthropic shipped the tool-search subagent, then cut it to 8.5K with one configuration change.

That fixed overhead has a name worth using: the context tax. It is the tokens you pay just to have tools registered, before solving anything. This autopsy breaks the tax into four line items, prices it against the cheaper CLI alternative, names four reduction patterns, and closes with three numbers you can measure on Monday.

The bill, before the work: A Claude Code session with 5-10 MCPs installed typically burns 50,000-67,000 tokens before the user types a first prompt, per community measurements across r/mcp, dev.to, and engineering blogs in 2025-2026. The GitHub MCP alone accounts for ~42,000 of those tokens in tool-definition schemas. That fixed overhead is the context tax: tokens paid just to have tools registered, before solving anything. OnlyCLI's 2026 benchmark put MCP at 4-32x the per-operation token cost of equivalent CLI tools. Joe Njenga measured a 46.9% main-thread bloat reduction with Claude Code 2.0's new tool-search subagent.

What is the context tax?

The context tax is the fixed token overhead an AI coding agent pays loading tool definitions, system prompts, and context preambles before any user prompt is processed. It compounds per session and per turn, because every tool definition reloads on every model call inside an agent loop. In 2025-2026 it is one of the largest drivers of unexplained token spend across MCP-heavy stacks, per Joe Njenga's 51K-to-8.5K production measurement and similar community autopsies on dev.to and Medium.

The compounding mechanic is worth naming because most teams underestimate it. Tool definitions are not paid once at session start. They re-enter context on every model call, because the model needs the full schema to reason about which tool to call next. A 50-turn session with a 50K-token MCP preload is paying that 50K many times over in attention cost, even though the dollar bill is metered separately.
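The compounding arithmetic is easy to sketch. A minimal back-of-envelope calculation, with the preload size and turn count as illustrative assumptions:

```python
# Back-of-envelope: a fixed preload re-enters context on every model call.
# Both numbers are illustrative assumptions, not measurements.

PRELOAD_TOKENS = 50_000   # MCP tool definitions + system prompt + rules
TURNS = 50                # model calls in one long agent session

def cumulative_preload_tokens(preload: int, turns: int) -> int:
    """Total input tokens attributable to the preload alone across a session."""
    return preload * turns

total = cumulative_preload_tokens(PRELOAD_TOKENS, TURNS)
print(f"{total:,} preload tokens processed across {TURNS} turns")
# A 50K preload is processed as input ~50 times per session, not once.
```

Prompt caching can discount the dollar cost of the repeated preload, but the attention cost inside the model is paid in full on every call, which is the effect described above.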

Think of it as a parallel to context debt, the term Unblocked coined earlier this year alongside the trial-and-error tax framing. Context debt is the cost of missing context: wrong assumptions an agent makes because it lacks the institutional knowledge to know better. Context tax is the cost of loading context: tokens spent on schemas and preambles before any real work begins.

Both compound silently. Both stay invisible until the monthly bill arrives. And the architectural fix is the same in both cases: load less, fetch on demand.

The two coined concepts solve from opposite directions but converge on the same answer. Reducing context tax means stripping preloaded tool definitions. Reducing context debt means surfacing the right answer at the moment of need. Both point toward on-demand retrieval and away from upfront registration.

Where does the MCP token budget actually go?

A typical MCP-heavy session distributes its preload tokens across four sources, in order of size: tool definitions, system prompt and preamble, accumulated tool results, and memory-MCP overhead. Nebulagg's dev.to autopsy measured GitHub's MCP at roughly 42,000 tokens in schemas alone in 2026. Amzani at dev.to called the phenomenon "your MCP server eating your context window," and demiliani framed it as the "too-many-tools" problem. The autopsy below splits the overhead into four addressable line items.

Tool definitions are the biggest line item

Tool definitions dominate. The GitHub MCP server alone charges roughly 42,000 tokens in JSON Schema definitions, argument types, and example payloads, per Nebulagg's dev.to autopsy. Piotr Hajdas's more recent measurement put the figure at 55,000 tokens across 93 distinct tool definitions, suggesting the schema has expanded as GitHub MCP added features. Playwright's roughly 21 tools take around 13,600 tokens. The mcp-omnisearch server packs about 20 tools into 14,000 tokens, per Scott Spence's measurement.

The spec is the reason. MCP requires verbose JSON Schema with examples and error formats, and every definition reloads on every model call inside a turn. Five servers feel cheap. Ten compound. Fifteen and you are paying a serious tax before the agent reasons about anything.
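To see why definitions dominate, price one. The sketch below builds a single MCP-style tool definition and estimates its token weight with the common four-characters-per-token heuristic; the tool name and fields are invented for illustration, not GitHub's actual schema:

```python
import json

# A hypothetical MCP-style tool definition (invented for illustration).
tool_def = {
    "name": "create_issue",
    "description": "Create a new issue in a repository. Returns the created issue.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "title": {"type": "string", "description": "Issue title"},
            "body": {"type": "string", "description": "Issue body in Markdown"},
            "labels": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Labels to apply",
            },
        },
        "required": ["owner", "repo", "title"],
    },
}

def estimate_tokens(obj) -> int:
    """Rough token estimate: ~4 characters per token for English/JSON."""
    return len(json.dumps(obj)) // 4

per_tool = estimate_tokens(tool_def)
print(f"~{per_tool} tokens for one lean tool; x93 tools -> ~{per_tool * 93:,}")
```

Real server schemas run several times heavier than this lean example because the spec encourages examples and error formats per tool, which is how 93 tools can reach the 55,000-token figure cited above.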

The system prompt and preamble keep the lights on

The base Claude Code system prompt sits near 3,500 tokens, consistent with Anthropic's published costs guidance and community estimates from 2025-2026. A typical CLAUDE.md plus project rules adds another 1,500-3,000 tokens.

Both reload every turn. Multiply by 30-50 turns in a long session and the preamble alone is a meaningful slice of the bill, even before MCP enters the picture.

Tool-result accumulation is the tail tax

The preload is only half the story. Every tool call returns content back into the context window, and over a 50-call session those returns can balloon past 500,000 tokens, per patterns documented in the doobidoo/mcp-memory-service README and reinforced across r/mcp discussions in 2025. Anthropic's context-engineering guidance recommends compaction precisely because of this pattern.

Distinguish the two halves carefully. Preload tax is fixed and predictable. Accumulation tax is dynamic and grows with tool-call volume. Mert Köseoğlu's Context Mode write-up puts hard numbers on the accumulation side: a Playwright snapshot weighs 56 KB, twenty GitHub issues weigh 59 KB, and Context Mode cut a 315 KB session payload to 5.4 KB in his test workload.
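The split between the two halves can be modeled directly. A sketch, where the preload size, per-call return sizes, and call mix are all illustrative assumptions; note that each tool return stays in context for the remainder of the session:

```python
# Fixed preload tax vs. dynamic accumulation tax over one session.
# All numbers are illustrative assumptions.

PRELOAD = 55_000  # tokens: tool definitions + preamble, reloaded per turn
# 10 heavy calls (e.g. page snapshots) followed by 40 light ones:
TOOL_RETURNS = [14_000] * 10 + [2_000] * 40

def session_tax(preload: int, returns: list[int]) -> tuple[int, int]:
    turns = len(returns)
    fixed = preload * turns  # preload re-sent on every model call
    accumulated = sum(
        # each return is re-sent on every remaining turn of the session
        r * (turns - i) for i, r in enumerate(returns)
    )
    return fixed, accumulated

fixed, accumulated = session_tax(PRELOAD, TOOL_RETURNS)
print(f"fixed preload tax: {fixed:,}  accumulation tax: {accumulated:,}")
```

Under these assumptions the accumulation side outgrows the preload side, which is why compaction and scoped tool returns become the higher-priority fix in long, tool-heavy sessions.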

Memory MCPs add their own overhead

When teams install memory servers like Stash, MemPalace, Hindsight, or agentmemory to fix the "the agent forgot my stack" problem, each one adds another 2,000-5,000 tokens in preload depending on stored memory volume, per the doobidoo/mcp-memory-service README and parallel measurements across r/mcp threads in 2025-2026. The paradox is hard to miss. Tools added to reduce context debt levy their own context tax.

That tradeoff is the bridge to every reduction pattern below. If the cure costs almost as much as the disease, the architecture is wrong.

How is context tax different from context debt?

Two costs, opposite directions. Context debt is what you pay when context is missing: wrong assumptions an agent makes because it lacks the institutional knowledge to know better. Context tax is what you pay to load context: tokens spent on tool schemas and preambles before any real work begins. Both compound. Both stay invisible until the bill arrives. The architectural fix is the same in both cases: load less, fetch on demand.

| Dimension | Context tax | Context debt |
|---|---|---|
| What it is | Fixed preload cost (tool defs, system prompts) | Dynamic cost of wrong or missing context |
| When you pay | Every session start, every turn | When the agent rebuilds rejected work |
| What it looks like | Token count high before first prompt | Trial-and-error loops, regenerated decisions |
| Architectural fix | Progressive disclosure, tool-search subagent | On-demand retrieval of PRs, decisions, threads |
| Coined where | Unblocked, "MCP Token Budget Autopsy" (May 2026) | Unblocked, "The 12-Line PR That Took 5 Days" (2025) |

The parallel matters because most engineering teams optimize for one and ignore the other. A team that prunes MCPs to cut the tax can still pay heavy context debt if the agent never sees the right Slack thread. A team that wires up perfect institutional memory but installs 15 MCPs pays the tax anyway. The two coined concepts belong on the same dashboard.

How much does the context tax actually cost?

OnlyCLI's 2026 benchmark put a per-operation number on the MCP token cost, measuring it at 4-32x the equivalent CLI tools. Scalekit's independent replication put it on real money: at 10,000 operations a month and Sonnet 4 pricing, the CLI path runs about $3.20 while the MCP path runs about $55.20. Anthropic's first-party costs guidance anchors the per-developer math: average enterprise spend is roughly $13 per developer per active day, with a typical monthly band of $150-250.

The dollar conversion lands differently depending on stack size. A lean three-server setup is cheap. A heavy fifteen-plus stack adds up across a 200-engineer org.

| MCP setup | Approximate context tax | Per-developer per-day overhead (Sonnet 4.5 input rate) |
|---|---|---|
| 3-4 MCPs (lean) | ~25K tokens | ~$0.40 |
| 7-10 MCPs (typical) | ~55-70K tokens | ~$1.10 |
| 15+ MCPs (heavy) | ~120K+ tokens | ~$2.00+ |

Numbers are illustrative, anchored on community-reported preload measurements. Actual sessions vary by turn count, tool-result accumulation, and model. At a 200-engineer team paying the "typical" rate, the context tax alone is roughly $220 per working day, or about $55,000 per year, before any productive work tokens.
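The dollar column above follows from simple arithmetic. A sketch, assuming Sonnet 4.5's list input rate of $3 per million tokens and roughly six fresh sessions per developer per day; both are assumptions, and your session count will differ:

```python
INPUT_RATE_PER_MTOK = 3.00  # USD; assumed Sonnet 4.5 input list price
SESSIONS_PER_DAY = 6        # assumed fresh sessions per developer per day

def daily_tax_usd(preload_tokens: int) -> float:
    """Dollar cost of the preload alone, per developer per day."""
    return preload_tokens * SESSIONS_PER_DAY * INPUT_RATE_PER_MTOK / 1_000_000

for label, tokens in [("lean", 25_000), ("typical", 62_500), ("heavy", 120_000)]:
    print(f"{label:>8}: ${daily_tax_usd(tokens):.2f}/dev/day")
```

Under these assumptions the typical band lands near $1.12 per developer per day, which at 200 engineers is roughly the $220-per-working-day figure quoted above.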

The caveat matters too. CLI is not strictly cheaper across every workload. The 4-32x range is one comparison axis, and tool surface, agent type, and task complexity all shift the equation. Treat the benchmark numbers as a directional signal, not a verdict.

Why doesn't a bigger context window help?

Bigger windows make more room for the tax, but they do not make the tax cheaper, and they make retrieval accuracy worse. Anthropic's MRCR v2 benchmark at 1M tokens scored Opus 4.6 at 76% and Sonnet 4.5 at 18.5% on 8-needle retrieval, which means roughly 80% of multi-needle questions return wrong answers at full window on the cheaper model. Chroma's 2025 context-rot study found all 18 tested models degrade at every input length, not only at the ceiling.

The two effects stack. Tax compounds on the input side. Rot compounds on the retrieval side. A 1M window with 70K of tax loaded looks comfortable on paper and then quietly returns the wrong needle when you ask. The retrieval failure is documented in detail in context rot at scale.

The takeaway is unintuitive. The fix for a tight context budget is not a bigger window. The fix is a smaller load. Anthropic's own progressive disclosure pattern in their context-engineering guidance makes the same architectural argument: do not register what you will not use.

Frequently asked questions

How many tokens does the GitHub MCP server cost?

Roughly 42,000 tokens in tool definitions alone, per Nebulagg's dev.to autopsy and corroborating Medium engineering posts; Piotr Hajdas's later count put the figure at 55,000 tokens across 93 tool definitions as the server added features. That figure is the preload schema cost only. It does not include any token returns from actual GitHub API calls, which accumulate as additional tool-result tax across a session.

Is MCP always more expensive than a CLI agent?

No, but for equivalent operations OnlyCLI's 2026 benchmark measured 4-32x higher per-operation token cost on MCP. Scalekit's replication put the monthly gap at $3.20 (CLI) versus $55.20 (MCP) at 10,000 operations on Sonnet 4. The gap widens as installed MCP count grows. CLI wins on raw token efficiency for well-scoped operations. MCP wins when you need cross-tool reasoning. Pick by workload, not by ideology.

Does Claude Code's tool-search subagent eliminate the context tax?

It reduces the tax materially without eliminating it. Joe Njenga's production measurement saw the MCP preload drop from 51K to 8.5K tokens with the new subagent, a 46.9% reduction in overall main-thread bloat. The subagent itself carries overhead, and definitions still load when invoked, so the tax shifts from preload to on-demand rather than vanishing. See Anthropic's sub-agents documentation for configuration details.

What is the difference between context tax and context debt?

Context tax is the fixed cost of loading tool definitions, system prompts, and preambles before any work begins. Context debt is the cost of missing context, namely wrong assumptions an agent makes when it lacks PRs, decisions, or Slack threads. Both compound silently. The architectural fix, load less and fetch on demand, addresses both.

How do I measure my own context tax?

Three numbers, in order: tokens used before your first prompt, the percentage attributable to MCP tool definitions versus system prompt, and per-call accumulation rate across a 20-call session. Anything over 50K on the first number is an audit signal. The "What to Measure This Week" section below has the exact technique.

What patterns reduce the context tax?

Four patterns, ranked by ROI in 2025-2026 community measurements. Joe Njenga's 46.9% reduction with Claude Code 2.0's tool-search subagent is the canonical data point. Scott Spence's pruning approach reclaimed roughly a quarter of his context budget by retiring low-value servers. Anthropic's progressive disclosure pattern generalizes the principle.

  1. Tool-search subagent (Claude Code 2.0.6+). Anthropic's tool_search feature loads tool definitions on demand rather than preloading every schema. Joe Njenga's production session dropped its MCP preload from 51K to 8.5K tokens, cutting main-thread bloat by 46.9%. Configuration is small. Payoff is large. Start here.
  2. Progressive disclosure. Anthropic's context-engineering blog lays out the pattern. Do not load all tools upfront. Surface them by user intent. Pair with compaction strategies on the accumulation side.
  3. MCP server pruning. Keep three to five high-value servers, retire the rest. Scott Spence's audit showed 25-30% context budget reclaimed by this alone. The hardest part is political, not technical: every server you remove had a champion.
  4. On-demand retrieval over tool-definition preload. The most structural fix. Surface answers (PRs, decisions, Slack threads, Confluence docs) at the moment of need rather than registering schemas that preload on every session.
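Pattern 2's core move can be sketched in a few lines: register only names and one-line descriptions upfront, and hand the model a full schema only when it selects a tool. Everything here is a hypothetical illustration, not Anthropic's implementation; the tool names and schemas are invented:

```python
# Progressive disclosure sketch: cheap catalog upfront, full schema on demand.
# Tool names and schemas are hypothetical.

FULL_SCHEMAS = {
    "create_issue": {
        "description": "Create a new issue in a repository.",
        "inputSchema": {"type": "object", "properties": {"title": {"type": "string"}}},
    },
    "run_snapshot": {
        "description": "Capture an accessibility snapshot of the current page.",
        "inputSchema": {"type": "object", "properties": {"url": {"type": "string"}}},
    },
}

def catalog() -> list[dict]:
    """What the model sees at session start: names and one-liners only."""
    return [{"name": n, "description": s["description"]} for n, s in FULL_SCHEMAS.items()]

def expand(name: str) -> dict:
    """Loaded into context only when the model selects this tool."""
    return {"name": name, **FULL_SCHEMAS[name]}

print(catalog())            # session start pays for the catalog...
print(expand("create_issue"))  # ...and one schema, only when needed
```

The design choice is the whole point: the per-session cost scales with the number of tools the agent actually uses, not the number of servers installed.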

Sam Younger, Engineering Manager at UserTesting, described the pattern this way:

"Unblocked is the first MCP queried for everything we look up. It's not just checking the code, the code could be wrong. It pulls the Confluence docs, the feature planning documents, the Slack conversations."

That description captures the on-demand retrieval pattern in practice. Unblocked is the institutional context layer underneath MCP, CLI, and Skills, unifying PRs, Slack, Jira, Notion, Confluence, S3, and code repos so coding agents can fetch the right answer instead of preloading every schema. MCP servers alone aren't enough for enterprise context, because MCP is a transport protocol, not a context engine.

Combined, the four patterns map to a 25-50% reduction band in published community measurements: Scott Spence's MCP pruning audit reclaimed roughly 25-30% of context budget on its own, and Joe Njenga's tool-search subagent experiment reached 46.9% on the main thread. Stacking pattern 1 (tool-search subagent) with pattern 3 (pruning) is the highest-impact starting move because they target different halves of the bill: subagent reduces what every server costs you, pruning reduces how many servers you have.

When does an MCP-heavy stack actually make sense?

Three thresholds map to three strategies, based on the autopsy data above and corroborated by OnlyCLI's benchmark range. Below five servers the tax is manageable and the value is usually clear. Between five and ten the tax becomes meaningful and structural fixes pay for themselves. Above ten the tax dominates the budget and either aggressive pruning or a shift to CLI and on-demand retrieval is required.

| MCP count | Context tax band | Recommended strategy |
|---|---|---|
| Under 5 | Low (~25K tokens) | Stay MCP-native; tax is acceptable |
| 5-10 | Medium (~55-70K) | Install tool-search subagent; audit quarterly |
| 10+ | High (~120K+) | Prune aggressively or migrate equivalents to CLI/Skills/on-demand retrieval |

The rubric is directional. Workload shape matters more than raw count: a single chatty server can outspend five quiet ones. Playwright is the canonical example. One snapshot call returns roughly 56KB of DOM serialization back into context, per Mert Köseoğlu's Context Mode measurements. A static analysis server that returns "no issues found" for the same call costs almost nothing. Two MCPs, identical count, very different tax footprint.

Pair the count check with the three measurements in the closing section to get the real picture, and consult our MCP vs CLI decision rubric for the workload-by-workload comparison.

What to measure this week

Three numbers, instrumented Monday, will tell you whether your context tax is in the green or running hot. Track them per developer for a week. Compare to the bands above.

  1. Tokens before first prompt. Fire up a fresh session. Screenshot the token counter before typing. Anything over 50K is an audit signal. Anything over 80K means the tax is bigger than most productive prompts.
  2. MCP tool-definition share. Total session tokens versus MCP-attributed tokens. If MCP exceeds 30% of total spend, the tax is doing too much work. Aim for under 20% after applying patterns 1 and 3 above.
  3. Per-call accumulation rate. Tokens added per tool call across a 20-call session. If the average call returns more than 2K tokens of content into your window, the tail tax is bigger than the preload, and compaction or scoped tool calls become the higher-priority fix.

Each metric has a different surfacing technique. Tokens before first prompt appears in Claude Code's status line when you open a session; the /cost command surfaces it explicitly on demand. MCP-attributable share requires reading session JSONL logs under ~/.claude/projects/... and totaling tokens by tool source. Per-call accumulation rate is the same JSONL, totaled across a 20-call window. None of the three requires custom tooling, only a willingness to look.
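The JSONL totaling behind metrics two and three can be scripted. A sketch under loud assumptions: the field names (`tool_source`, `usage.input_tokens`) are hypothetical stand-ins, so inspect your actual session log's schema before trusting the totals:

```python
import json
from collections import Counter
from pathlib import Path

def tally_session(jsonl_path: str) -> Counter:
    """Sum token counts by source across one session log.

    Field names are hypothetical -- inspect your own log's schema first.
    """
    totals: Counter = Counter()
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        source = event.get("tool_source", "system")             # assumed field
        tokens = event.get("usage", {}).get("input_tokens", 0)  # assumed field
        totals[source] += tokens
    return totals

# Usage sketch:
# totals = tally_session("session.jsonl")
# mcp_tokens = sum(v for k, v in totals.items() if k != "system")
# mcp_share = mcp_tokens / max(sum(totals.values()), 1)
```

Run it once per session and the per-source totals give you metric two directly; dividing the tool-attributed total by the call count in the same log gives metric three.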

Run the three numbers next Monday. Run them again the Monday after applying the tool-search subagent. The delta is the bill you stop paying.