MCP Tool Overload: A Measured Guide to Context-Window Bloat

Dennis Pilarinos · May 16, 2026 · Context Engines · Engineering Insights

Key Takeaways

Three MCP servers burned 143,000 of a 200,000-token context window, 72% of the budget, before the agent saw a user message (Stephanie Goodman, AgentPMT, 2026).

The RAG-MCP project measured tool-selection accuracy falling from 43% to under 14% as tool count grew, a threefold degradation that happens silently before any bad answer ships.

Platform vendors have converged on hard limits: Cursor caps at 40 tools, Claude Code's output quality degrades past 50, the OpenAI Tools API maxes at 128, Claude's tool-list capacity tops out near 120.

Per-tool schema averages 300 to 600 tokens. A 10-server, 15-tools-each setup burns roughly 75,000 tokens of context tax before turn one (Piotr Hajdas, DEV.to, 2026).

The fix is not simply running fewer servers. It is allow-listing for hot-path tools, Anthropic's Tool Search subagent for the rest, and the architectural pattern Cloudflare proved with Code Mode (1.17M tokens compressed to 1,000).

How much context does MCP tool overload actually cost?

A three-server, vanilla MCP setup burned through 72% of a 200K context window before the agent read its first user message. Stephanie Goodman's measurement at AgentPMT in February 2026 found that GitHub, Playwright, and an IDE integration consumed 143,000 of 200,000 tokens on tool schema alone, not on actual work (AgentPMT, 2026).

Here is what the receipt looks like across the most-cited 2026 measurements:

| Setup | Tool schema (tokens) | Window consumed |
| --- | --- | --- |
| GitHub MCP alone (35 tools, AgentPMT) | ~26,000 | 13% of 200K |
| Slack MCP alone (11 tools, AgentPMT) | ~21,000 | 10.5% of 200K |
| 3 servers: GitHub + Playwright + IDE (AgentPMT) | ~143,000 | 72% of 200K |
| 5-server modest config (Hajdas, DEV.to) | ~55,000 | 27.5% of 200K |
| 10-server power user (Hajdas estimate) | ~75,000 | 37.5% of 200K |
| Cloudflare full native MCP, pre-Code-Mode | ~1,170,000 | exceeds any standard window |

Goodman calls this the "bloat tax." Across the P8 cluster, we call it the context tax: the fixed token overhead an agent pays loading tool definitions before any user prompt is processed. The AgentPMT figures are direct measurements; the Hajdas rows in the table are measured at the low end and extrapolated at the 10-server end, per his own DEV.to methodology note.

Piotr Hajdas, writing on DEV.to in January 2026, ran a parallel measurement on a more modest five-server setup and arrived at around 55,000 tokens at conversation start, with extrapolation to roughly 75,000 for a 10-server power-user config (Piotr Hajdas, DEV.to, 2026). Either number ends the same way. You pay schema rent before you say hello.

The most striking 2026 data point sits at the high end of this scale. Cloudflare's engineering team disclosed in their February 2026 Code Mode write-up that their native MCP exposure totaled roughly 1.17 million tokens of tool definitions (Cloudflare, 2026), well beyond any production context window. That number set the ceiling for what "too many tools" can mean when nobody is gating the surface.

At what point does adding tools degrade accuracy, not just budget?

Around 50 tools, output quality starts to slip. Past roughly 120, accuracy collapses. The RAG-MCP research project measured tool-selection accuracy falling from a 43% baseline to under 14% as tool count grew, a threefold degradation reported by Stephanie Goodman (AgentPMT, 2026). The cost is not just tokens. It is wrong answers, confidently delivered.

Three independent 2026 findings line up on the same curve:

| Source | Threshold | What was measured |
| --- | --- | --- |
| RAG-MCP, via AgentPMT | tool count ramp | 43% to <14% selection accuracy |
| Cursor (product decision) | 40 tools, hard cap | Telemetry on degraded outputs |
| Hajdas, DEV.to | 50+ tools | Visible quality drop, tangent chasing |

Piotr Hajdas described the degradation pattern plainly. With Claude Code loaded past 50 tools, the model started chasing tangents and referencing tools instead of answering actual questions (Hajdas, DEV.to, 2026). That is not a token budget failure. It is an attention failure that token budgets cannot fix.

The Cursor cap is informative. Per the 2026 platform-limits roundup from AgentPMT and Hajdas, the number is not arbitrary; it's the point above which adding tools degraded outputs in production telemetry. The cap is a product-level admission that the failure is structural.

Why does adding tools degrade accuracy in the first place?

Three mechanisms compound. Every tool schema sits in the model's working memory, competing with the actual task. Similar tool names create selection ambiguity: common verbs like search and list appear in dozens of MCP servers and force the model to disambiguate every call. And the schema cost is paid on every turn, not once: the longer the conversation, the more times you pay the tax.

The first mechanism is straightforward attention economics. Longer context windows mean the model has more to process, which slows responses and increases the chance the relevant signal gets buried. The pattern is described as tool-space interference in Lunar.dev's 2026 analysis (Lunar.dev, 2026), which frames working memory as a fixed resource that tool schema competes with the prompt to occupy.

The second mechanism is name collision. Lunar.dev's analysis of typical five-server setups measured per-turn tool metadata at 30,000 to 60,000 tokens, 25% to 30% of a 200K window, purely on overlapping schema (Lunar.dev, 2026). When github.search and notion.search and playwright.search all sit in the same surface, the model has to compare and choose every time.

The third mechanism is reload. MCP tool definitions are not cached across model calls inside the agent loop. Every turn, the full schema list reloads. A 20-turn agentic session with a 30K-token tool surface pays the 30K tax 20 times, even when the tools never get called.

Combine the three and the failure mode is structural, not configurational. You cannot prompt your way out of attention economics.

What are the empirical platform limits, and why do they exist?

Every major platform converged on a tool-count ceiling, and the numbers are not arbitrary. Cursor's hard cap at 40 tools, Claude Code's quality degradation point at 50, the OpenAI Tools API limit at 128, and Claude's tool-list capacity near 120 are the values each vendor's telemetry surfaced when output quality started to fall apart. Four teams. One conclusion.

| Platform | Tool limit | What it represents |
| --- | --- | --- |
| Cursor | 40 (hard cap) | Product-enforced ceiling |
| Claude Code | 50+ tools | Quality degradation onset |
| OpenAI Tools API | 128 | API-level maximum |
| Claude (tool-list capacity) | ~120 | Effective tool registration |

The pattern is the part to internalize. Teams shipping agents at production scale, with telemetry on actual user outcomes, have independently arrived at the same range. The free-for-all of "more tools is more capability" turned out to have a sharp ceiling once anyone measured it.

The limits exist because attention is finite. They are not implementation bugs that newer model versions will fix. Anthropic's own engineering guidance on effective context engineering is consistent with this: tool surfaces should be progressive, not preloaded.

What does the cost compounding look like over a session?

Tool definitions are reloaded into every model turn. At 300 to 600 tokens per definition, a 50-tool setup carries roughly 15,000 to 30,000 tokens of context tax per turn before any conversation content. Multiply that across a 20-turn agentic session and you have spent up to 600,000 tokens in schema rent for a task that may have needed 50,000 tokens of real work. That is the per-turn nature of the tax, and it is what makes it so easy to miss.

Here is the math on the canonical anchor:

  • 50 tools loaded, 500 tokens average per definition: 25,000 tokens per turn
  • 20-turn agentic session: 500,000 tokens just in tool schema
  • Useful work in the same session: typically 30K to 60K tokens
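
The same arithmetic as a minimal sketch you can point at your own numbers; the 500-token average, 20-turn session, and 50K useful-work figure are the illustrative assumptions above, not measured constants:

```python
# Back-of-envelope schema tax for an agentic session.
# Assumptions (from the worked example above, not measurements):
# 50 tools loaded, 500 tokens per definition, 20-turn session.

TOOLS_LOADED = 50
TOKENS_PER_DEFINITION = 500   # 2026 measurements put the range at 300-600
TURNS = 20
USEFUL_WORK_TOKENS = 50_000   # midpoint of the typical 30K-60K

per_turn_tax = TOOLS_LOADED * TOKENS_PER_DEFINITION  # 25,000 tokens/turn
session_tax = per_turn_tax * TURNS                   # 500,000 tokens/session

print(f"per-turn schema tax: {per_turn_tax:,} tokens")
print(f"session schema tax:  {session_tax:,} tokens")
print(f"tax vs useful work:  {session_tax / USEFUL_WORK_TOKENS:.0f}x")
```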

The tax is not paid once at session start. It is paid every time the model returns to the loop. Production measurements of Claude Code, taken before the Tool Search subagent shipped and surfaced in the GitHub MCP token-cost piece, recorded a main-thread baseline near 51,000 tokens that reloaded on every call.

What gives the math its bite is that the cost is invisible to most teams. There is no warning when 70% of your window is gone. The agent just starts giving worse answers and you assume the model regressed.

What is the measured fix that actually works?

Three production-grade patterns landed in 2026 with measured results. Anthropic's Tool Search subagent (GA February 2026) preserved 85% of context versus conventional tool loading by gating tool-definition loading on the main thread (AgentPMT, 2026). Allow-listing in mcp.json cuts schema cost roughly 80% for known-narrow workloads. And Cloudflare's Code Mode compressed 1.17M tokens of native MCP definitions down to roughly 1,000, an order-of-magnitude reduction, by exposing tools through a code-execution surface instead of a schema list (Cloudflare, 2026).

The three patterns address the tax at different layers:

  1. Tool Search subagent. The main thread sees a single search tool. Real tool definitions load only when the subagent decides which to consult. Read the Anthropic Tool Search documentation for the configuration shape.
  2. Allow-listing. Most clients (Claude Code, Cursor) accept a tools array per server in mcp.json. Loading 5 of 50 tool definitions pays roughly one tenth the schema cost on every turn; see the config sketch after this list.
  3. Code Mode (Cloudflare). Instead of exposing each MCP tool as a JSON Schema entry, the agent gets a typed code surface and calls tools as functions in a sandboxed runtime. The schema cost goes near zero. See Cloudflare's Code Mode write-up for the architecture.
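
A minimal sketch of pattern 2, assuming a client that honors a per-server tools array in mcp.json as described above; the exact key name and the tool names shown are placeholders to verify against your client's MCP settings reference:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "tools": [
        "search_issues",
        "get_pull_request",
        "create_pull_request",
        "list_commits",
        "get_file_contents"
      ]
    }
  }
}
```

Five definitions loaded instead of the server's full surface is the roughly-one-tenth schema cost the list item above describes.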

For a per-server deep-dive on the heaviest single offender, the GitHub MCP autopsy walks through four ranked fixes with measured numbers from the same 2026 cohort. Don't re-litigate fixes 1 to 4 from that piece. Stack patterns instead.

How do you decide which tools to keep loaded versus deferred?

Use a three-question filter that scales from solo developer to platform team. First, how often is this tool called per session? Sub-10% utilization belongs in Tool Search, not in the always-loaded set. Second, how specific is it to the workload? Generic search and list verbs that overlap with other servers add cost and ambiguity. Third, what is the per-call ROI? Heavy schema is acceptable only when the tool is path-critical for the hot workflow.
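
Here is one way to mechanize the filter, as a hedged sketch; the thresholds and telemetry fields are hypothetical stand-ins for whatever your own session logs capture:

```python
# Hypothetical triage for the three-question filter. Thresholds are
# illustrative; wire in your own session telemetry.

from dataclasses import dataclass

@dataclass
class ToolStats:
    name: str
    sessions_used_pct: float  # Q1: share of sessions that call this tool
    generic_verb: bool        # Q2: overlaps with search/list verbs elsewhere?
    schema_tokens: int        # Q3: per-turn cost of keeping it loaded
    path_critical: bool       # Q3: does the hot workflow break without it?

def keep_loaded(t: ToolStats) -> bool:
    if t.path_critical:
        return True               # heavy schema is acceptable on the hot path
    if t.sessions_used_pct < 0.10:
        return False              # sub-10% utilization -> defer to Tool Search
    if t.generic_verb and t.schema_tokens > 400:
        return False              # ambiguous and expensive -> defer
    return True

tools = [
    ToolStats("github.search_issues", 0.85, True, 350, True),
    ToolStats("notion.search", 0.04, True, 520, False),
]
print("always loaded:", [t.name for t in tools if keep_loaded(t)])
print("deferred:     ", [t.name for t in tools if not keep_loaded(t)])
```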

Apply the filter to a canonical seven-server setup (GitHub, Jira, Slack, Notion, Confluence, Playwright, Sentry) and the before-and-after typically looks like this:

| Category | Before filter | After filter |
| --- | --- | --- |
| Always-loaded tools | ~80 (full surface) | ~12 (hot-path only) |
| Tool schema per turn | ~45,000 tokens | ~7,000 tokens |
| Deferred via Tool Search | 0 | ~68 |
| Context window consumed by schema | 22.5% | 3.5% |

The filter often produces uncomfortable conversations. Teams discover their actual hot-path is five GitHub operations, two Slack reads, and one Sentry list. Everything else exists "in case." Deferred-loading via Tool Search is exactly the place for "in case" tools, where the surface is searched on demand.

For the underlying budget framing, see the P8 token-budget autopsy. For why this matters at the agent-behavior layer rather than just the cost layer, Claude Code forgets the codebase walks through what happens when working memory is consumed by schema instead of project state.

Where does this go next, the architectural shift past tool overload?

Tool overload is a symptom of treating MCP as the context layer. MCP is transport. It is not the layer. The architectural answer is to treat retrieval as a service on top of MCP, so the agent is not holding the full tool surface in working memory. The agent calls out to a synthesis layer that already knows what is relevant.

That is the context layer and context engine framing. Unblocked is the context layer we build, but the architectural pattern is bigger than any one product. Anthropic's Tool Search ships the same shape inside Claude Code. Cloudflare's Code Mode ships it at the runtime. The common idea: do not preload schema for tools the agent might need. Make retrieval a callable service and pay only when you ask.

Two adjacent reads complete the picture. Why MCP servers aren't enough walks through the strategic mismatch between transport-layer thinking and context-layer needs. Context engine vs RAG covers why naive retrieval is not the answer either.

The optimization frame is the trap. Tool overload is not a problem to fix in mcp.json. It is a signal that the agent's working memory should not be doing retrieval at all.

Frequently asked questions

How many MCP tools is too many?

For Claude Code, output quality degrades past 50 tools according to Piotr Hajdas's 2026 DEV.to measurement. Cursor caps at 40 by design. The OpenAI Tools API maxes at 128. A safe ceiling for production agentic work is around 30 to 40 always-loaded tools, with everything else accessed through Anthropic's Tool Search subagent or a similar deferred-loading pattern.

Does a larger context window solve tool overload?

No. A 200K context window still loses 72% of capacity to three popular MCP servers per AgentPMT's 2026 measurement. A 1M window does not fix the accuracy problem. The RAG-MCP threefold drop from 43% to under 14% is about tool selection, not context length. Larger windows just delay when you hit the wall.

Is tool count the only issue, or does tool design matter?

Design matters more than count, but neither substitutes for the other: verbose JSON Schema, parameter sprawl, and overlapping tool names amplify the bloat at any tool count. Anthropic's own guidance is that tool definitions should be terse. Every parameter description is paid for on every turn the tool stays loaded.
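
For a concrete sense of terse, here is an illustrative definition in the standard JSON Schema tools format; the tool itself is hypothetical, and the point is that every description string below is re-sent on every turn it stays loaded:

```json
{
  "name": "search_issues",
  "description": "Search open issues by keyword. Returns number, title, state.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Search keywords" },
      "limit": { "type": "integer", "description": "Max results, default 10" }
    },
    "required": ["query"]
  }
}
```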

Why not just disable MCP and use the CLI?

For scoped, repeatable operations, that is often correct. The OnlyCLI 2026 benchmark cited in the GitHub MCP autopsy measured equivalent CLI tools at 4 to 32 times cheaper per operation. For composable multi-step workflows, MCP's batching still wins. The decision rule is workload shape, not blanket preference.

When will the platform limits become irrelevant?

Probably never. They are rooted in attention mechanics, not implementation. Anthropic, Cloudflare, and others are shipping architectural patterns (Tool Search, Code Mode) that route around the limits rather than expand them. Treat the platform tool-count limits as a permanent design constraint.

The takeaway, measured

Three MCP servers can consume 143,000 of 200,000 tokens before the agent reads a prompt. Past 50 tools, output quality degrades. Past 120, it collapses. Every major platform vendor has independently arrived at the same range. None of this is a configuration mistake to clean up. It is an architectural signal.

The pragmatic moves are well-measured. Turn on Anthropic's Tool Search subagent. Allow-list your mcp.json to the tools your team actually uses on hot paths. Audit your context tax against the cost endpoint before and after each change. Stack the patterns and most teams cut MCP schema cost by 85% or more.
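
One way to run that audit, as a sketch assuming the Anthropic SDK's token-counting endpoint and a tools.json export of whatever your client currently loads; the filename and model alias are placeholders:

```python
# Measure the schema tax directly: count input tokens for the same probe
# message with and without the tool surface attached.
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# tools.json: your exported tool definitions; the path is a placeholder.
with open("tools.json") as f:
    tools = json.load(f)

probe = [{"role": "user", "content": "hello"}]

with_tools = client.messages.count_tokens(
    model="claude-sonnet-4-5", messages=probe, tools=tools
)
without = client.messages.count_tokens(model="claude-sonnet-4-5", messages=probe)

print(f"context tax: {with_tools.input_tokens - without.input_tokens:,} tokens/turn")
```

Run it after each allow-list change and the per-turn delta is your before-and-after number.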

The strategic move is to stop treating MCP as the context layer. Tool overload is the symptom that retrieval lives in the wrong place. Move it out of the agent loop into a service designed for it, and the tax disappears with the schema. The P8 token-budget autopsy has the full accounting; the context layer piece has the architecture.