Key Takeaways
• Context rot is the measurable accuracy drop in LLM responses as input length grows, even before the window fills. Chroma's 2025 study evaluated 18 frontier models, including Claude Opus 4, and reported that performance degrades as input length increases, often in non-uniform ways (Chroma, 2025).
• In Claude Code the effect lands harder than in chat, because the agentic loop reloads files, tool results, and prior turns into context every step.
• Five retrieval patterns recover most of the accuracy loss past 200K tokens: /compact at phase boundaries, /clear between unrelated tasks, subagents for side quests, on-demand skills instead of bloated CLAUDE.md, and on-demand team knowledge via MCP.
Anthropic's own MRCR v2 benchmark scores Claude Opus 4.6 at 76% on 8-needle retrieval at 1M tokens, and Sonnet 4.5 at 18.5% on the same test (Anthropic, 2026; InfoQ, 2026). The 1M window exists. Whether your code is correct in it is a different question. Senior engineers running Claude Code daily have been describing the symptoms for months: redo loops eating tokens, models hand-waving at fabricated work, the same grep running for the third time in a session. The community now has a name for what's happening, and the research is catching up to the lived experience.
What is context rot in Claude Code?
Context rot is the measurable degradation in LLM accuracy as input length grows, even when the window is far from full. Chroma's 2025 study evaluated 18 frontier models including Claude Opus 4 and found that "model performance degrades as input length increases, often in surprising and non-uniform ways" (Chroma Research, 2025). Anthropic now uses the same term explicitly: in its Opus 4.6 launch, Anthropic describes context rot as performance degrading "as conversations exceed a certain number of tokens" (Anthropic, 2026).
The first thing to be precise about is that context rot is not the same as hitting your context limit. The limit is a capacity ceiling. Rot is a quality slope that starts well below the ceiling. You can have 700K tokens of headroom and still watch the model lose track of a constraint you stated at turn three.
The second thing to be precise about is that this is not a Claude-specific problem. It's a transformer-attention problem. Every frontier LLM exhibits some version of it. Anthropic's own context-engineering writing names the mechanism explicitly: "as the number of tokens in the context window increases, the model's ability to accurately recall information decreases," and frames context as "a finite resource with diminishing marginal returns" rather than a bucket you fill until it overflows.
What makes Claude Code interesting is that it makes the rot visible. The agentic loop accelerates context fill, so you cross the bad-quality bands faster than you would in a chat-only session, and the symptoms surface in code, where they're concrete and reviewable.
Why does context rot hit Claude Code harder than chat?
In Claude Code the agentic loop reloads files, tool results, and model output into context on every turn. Anthropic's own data on the 8-needle MRCR v2 benchmark at 1M tokens shows Opus 4.6 at 76% and Sonnet 4.5 at 18.5% (Anthropic, 2026), meaning the model many users actually pay for in a Pro session loses roughly 4 in 5 multi-needle retrievals near the top of the window.
Chat sessions accrete tokens slowly. A coding session does not. Every file the agent reads, every tool result, every prior turn it pulls forward, every CLAUDE.md and skill that loaded at session start, is sitting in attention's way when you ask the next question. That's the structural reason rot lands earlier here.
The model-side mitigations Anthropic has shipped acknowledge this. Opus 4.6 introduced Adaptive Thinking and a Compaction API specifically targeted at long-running agents (InfoQ, 2026). The features only exist because the problem is real. Anthropic itself has publicly admitted users were hitting limits "way faster than expected," and the company adjusted its 1-hour cache window in early 2026 in ways that made heavy users' token math worse, not better (DevClass, April 2026; XDA Developers, 2026).
The community has receipts. GitHub issue anthropics/claude-code#35296 tracks community-documented gaps in 1M-context behavior, with the verbatim user complaint that "sessions waste massive tokens on trial-and-error for things already figured out in prior sessions." That sentence captures the failure mode better than any benchmark.
Recall accuracy at 1M tokens (8-needle MRCR v2 benchmark):
| Model | Accuracy at 1M tokens |
| --- | --- |
| Claude Opus 4.6 | 76% |
| Claude Sonnet 4.5 | 18.5% |
| Single-needle baseline (frontier average) | ~90% |
The gap between single-needle and multi-needle is what most users feel as "the 1M window doesn't really work."
How do you spot context rot before it costs you a session?
Three signals show up before the model fully derails: an unprompted offer to compact well below the window ceiling, repeated trial-and-error on patterns the session has already explored, and confident hallucinations the user has to redo. Reading your local Claude Code session logs gives you a clean way to chart the build-up.
The expensive lesson is that you don't notice rot until you've already paid for it. By the time the model is generating broken code, you've spent the tokens. The cheap lesson is that the warning signs are obvious once you know what you're looking for.
The unprompted-compaction warning zone
Claude Code's compaction logic starts engaging well below the window ceiling. Community reports tracked in GitHub issue anthropics/claude-code#35296 put the unprompted-compaction zone roughly in the low-to-mid tens of thousands of tokens: the model begins offering to compact, turn latency rises, and answers get hedgier. That's your first concrete signal. If you're seeing unprompted "should we compact?" prompts, you've already crossed the band where retrieval discipline starts mattering.
Trial-and-error loops, fabricated work, confident hallucinations
The verbatim community description from issue #35296: "Sessions waste massive tokens on trial-and-error for things already figured out in prior sessions." If you've used Claude Code at length, you've seen the pattern. The agent proposes an approach you already rejected. It runs the same grep it ran twenty turns ago. It cites a function signature that doesn't exist in your tree. None of these are bugs in the model, exactly. They're the predictable output of attention smearing across too much context.
Local diagnostic: chart your session token growth
You don't have to take the benchmarks on faith. Claude Code writes session logs to your local disk. A short shell snippet over those logs is enough to see how fast you're actually accreting tokens. Run this in a fresh terminal:
```shell
# Ten heaviest sessions by total tokens in the most recently touched project
cd ~/.claude/projects/$(ls -t ~/.claude/projects/ | head -1)
for f in *.jsonl; do
  total=$(jq -s '[.[] | (.message.usage.input_tokens // 0) + (.message.usage.output_tokens // 0)] | add' "$f" 2>/dev/null)
  printf "%s\t%s\n" "$f" "${total:-0}"
done | sort -k2 -n -r | head -10
```

That gives you the ten heaviest sessions by total tokens. Run it after a long working day. The number that comes back is almost always larger than the engineer using Claude Code expects.
Where session tokens go, by context fill bucket (illustrative):
- 0-25% fill: useful work dominates; small share goes to re-derivation, near-zero to fabricated work.
- 25-50% fill: useful work still leads, but re-derivation rises and fabricated work begins to appear.
- 50-75% fill: roughly even split between useful work and re-derivation; fabricated work climbs visibly.
- 75-100% fill: useful work falls to a minority share; re-derivation and fabricated work account for roughly 60% of tokens.
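The shares above are illustrative, but the same session logs let you approximate your own curve. A minimal sketch, assuming the same `.message.usage` fields the per-session total above reads; the function name and the jq shape are ours, not an Anthropic tool:

```shell
# Print the cumulative token count after each turn of one session log,
# so you can see where the growth curve steepens.
cumulative_tokens() {
  jq -rs '
    [ .[] | (.message.usage.input_tokens // 0) + (.message.usage.output_tokens // 0) ]
    | to_entries
    | reduce .[] as $e ({sum: 0, rows: []};
        .sum += $e.value | .rows += ["turn \($e.key + 1)\t\(.sum)"])
    | .rows[]
  ' "$1"
}
# Example: cumulative_tokens ~/.claude/projects/<project>/<session>.jsonl
```

Point it at one long session and look for the turn where per-turn growth jumps; that inflection usually coincides with the re-derivation behavior described above.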
What five retrieval patterns keep accuracy high past 200K tokens?
Anthropic's context-engineering writing frames context as "a finite resource with diminishing marginal returns," and describes subagents as a pattern where the side-quest work happens in a separate context and only "a condensed, distilled summary (often 1,000-2,000 tokens)" returns to the main conversation. Layering five retrieval patterns on top of normal Claude Code use (/compact at phase boundaries, /clear between unrelated tasks, subagents for side quests, on-demand skills, and on-demand team knowledge via MCP) recovers most of the quality loss without needing a model upgrade.
These patterns aren't a stack you turn on once. They're discipline you apply per session. The framing that helps: every one of these is a bet that loading less, but more relevant, beats loading more. That's the through-line.
1. /compact at phase boundaries, not at degradation
Run /compact when you finish a piece of work, not when the model starts misbehaving. Anthropic's costs documentation is explicit about this: compaction works best as a deliberate phase break, not a panic move. Pass custom instructions to control what survives the summary, for example, /compact Focus on decisions made and the API contracts we settled on. By the time you're compacting reactively because the model is hallucinating, you've already paid for the rotted output.
2. /clear between unrelated tasks
/clear wipes the context entirely and resets the session. It's the cheapest 50%-plus saving on this list, and the one most engineers under-use. The honest counterpoint: you do lose all in-session memory, including useful context. The savings only compound when the next task is genuinely unrelated, which means the discipline is recognizing the moment you've crossed from one concern to another. Switching from a debugging session to a feature session in the same repo is exactly that moment.
3. Subagents for side quests
Anthropic's sub-agent docs describe the pattern plainly: every subagent runs in its own context window, "verbose output stays in the subagent's context while only the relevant summary returns to your main conversation" (Claude Code Sub-agents, 2025). The context-engineering writing puts the typical returned summary at "1,000-2,000 tokens." The concrete shape: if you ask Claude to find every place error handling lives and summarize the pattern, the file reads can add tens of thousands of tokens to a main session, but only the 1-2K summary returns when a subagent does the work. The reason this works isn't that the subagent is smarter. It's that the side quest's noise never enters the main attention surface. That's the retrieval-discipline framing, applied to compute.
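For reference, a subagent is defined as a markdown file with YAML frontmatter under `.claude/agents/`. The sketch below shows the side-quest pattern from the paragraph above; the agent name, description, and prompt wording are illustrative, and the frontmatter fields should be checked against the current sub-agents docs:

```markdown
---
name: error-handling-scout
description: Surveys where error handling lives in the repo and returns a short summary. Use for read-heavy exploration tasks.
tools: Read, Grep, Glob
---
Survey the repository for error-handling patterns. Read whatever files you
need, but return only a condensed summary (target 1,000-2,000 tokens):
the dominant pattern, the exceptions to it, and the file paths for each.
```

The file reads all happen in the subagent's own window; only the summary the prompt asks for comes back to the main session.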
4. Move CLAUDE.md instructions into on-demand skills
CLAUDE.md loads at session start. Every line costs you on every turn, even when irrelevant. Skills load only when invoked. Domain-specific instructions, especially long lists of conventions tied to a particular workflow or service, belong in skills. Keep CLAUDE.md under a hard line count (200 is a reasonable target for most teams) and treat anything above that as a candidate for migration. Anthropic's context-engineering guidance frames this as a baseline practice, not an optimization.
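The ceiling is easy to enforce mechanically. A small sketch you could wire into a pre-commit hook; the function name is ours, and the 200-line default is this post's rule of thumb, not an Anthropic limit:

```shell
# Warn when CLAUDE.md has outgrown the suggested line ceiling.
# The 200-line default is a rule of thumb, not an Anthropic limit.
check_claude_md() {
  file="${1:-CLAUDE.md}"
  limit="${2:-200}"
  lines=$(wc -l < "$file" | tr -d ' ')
  if [ "$lines" -gt "$limit" ]; then
    echo "$file: $lines lines (over $limit); migrate sections to skills"
  else
    echo "$file: $lines lines (within $limit)"
  fi
}
```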
5. Fetch team knowledge on-demand via MCP rather than preloading
The retrieval problem is bigger than file-system files. Most of the context that prevents trial-and-error loops doesn't live in the working tree at all. It lives in PRs, decision threads, incident postmortems, Slack arguments about which retry strategy you settled on six months ago. Stuffing that into CLAUDE.md is the wrong move, because it loads it every session regardless of relevance. The right move is to surface it through MCP at the moment of decision. An institutional-memory layer like Unblocked connects through MCP, IDE, Slack, and CLI, so the same retrieval works whether the agent is in the editor, the terminal, or chat. Past that placement, the principle is what matters: load less, fetch on demand.
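Concretely, project-scoped MCP servers live in a `.mcp.json` at the repo root. A sketch of the shape only, with a placeholder server name and launch command; the actual values for any given knowledge server, Unblocked included, come from that server's own docs:

```json
{
  "mcpServers": {
    "team-knowledge": {
      "command": "npx",
      "args": ["-y", "your-knowledge-server"]
    }
  }
}
```

Checking a file like this into the repo means every engineer's agent gets the same on-demand retrieval path without anyone preloading a word into CLAUDE.md.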
When should you start a fresh session vs. compact vs. clear?
The decision rule that holds up under load: switching task domain, /clear. Same task, finished a phase, /compact. Past 50% context fill with hallucinations, neither helps, start fresh. The Claude memory cookbook frames this the same way: compaction is for continuity, clearing is for separation, and a new session is for recovery.
There's a temptation to treat /compact as the universal tool. It isn't. Compaction preserves continuity at the cost of fidelity, which is good when continuity matters and bad when the underlying context has rotted past the point a summary can rescue. By the time the model is hallucinating function signatures, the summary will inherit the same hallucinations. Recovery means starting fresh.
The decision tree below collapses the call into three branches based on what's actually wrong.
Decision rules: fresh session vs. /clear vs. /compact:
- Past 50% context fill and seeing hallucinations? Start a fresh session. Neither /clear nor /compact will recover quality once it's rotted.
- Switching task domain (debugging to feature, repo A to repo B)? Use /clear to wipe context and reset.
- Same task, finished a phase? Use /compact to summarize and continue. Pass custom instructions to control what survives.
What does the future of Claude Code context management look like?
The model-side roadmap is converging on retrieval. Opus 4.6 shipped a Compaction API and Adaptive Thinking explicitly to extend long-running agents past the rot horizon, and Anthropic's broader context-engineering writing reframes the design problem as deciding what not to load. That's the right direction. It's also a quiet admission that capacity isn't the bottleneck.
The honest read, one the marketing has understated for a while: bigger windows aren't the strategy. Better retrieval is. Every product feature shipping at the model layer in 2026, from Anthropic's Compaction API to the streaming context-edit primitives in the cookbook, is implicitly admitting that loading less, but more relevant, beats loading more. The retrieval discipline you build at the agent layer compounds with whatever ships at the model layer.
The investment that pays off is not waiting for a 10M-token window. It's getting your team's institutional knowledge into a system the agent can fetch on demand, and getting your engineers into the habit of treating each session's context as a budget rather than a free resource. That habit shift is what separates teams whose Claude Code spend is rising while their PR-acceptance rate falls, from teams whose spend is rising while shipping more.
Frequently Asked Questions
What is context rot in Claude Code?
Context rot is the measurable accuracy drop in LLM responses as input length grows, even when the window has room. In Claude Code the agentic loop fills context fast (files, tool results, prior turns reload every step) so the effect lands earlier than in chat. Chroma's 2025 study found that all 18 frontier models tested degrade as input length increases, often in non-uniform ways.
What's the difference between /clear and /compact?
/clear wipes the context entirely and resets the session; use it when switching to unrelated work. /compact summarizes the existing conversation into a structured representation and continues from that summary; use it at phase boundaries within the same task. Anthropic's costs documentation treats them as different tools with different jobs, not interchangeable.
Does the Claude 1M context window actually work for coding?
The window exists, but accuracy degrades inside it. Anthropic's own MRCR v2 8-needle benchmark at 1M tokens shows Opus 4.6 at 76% and Sonnet 4.5 at 18.5% (Anthropic, 2026). Single-needle retrieval is roughly 90% at 1M, which sounds high until you remember that's about 1 in 10 queries wrong, before counting cost and latency.
How do I see how many tokens my Claude Code session is using?
Read the session JSONL files under ~/.claude/projects/. Each turn records token usage, and the jq snippet earlier in this post totals them per file. Anthropic's costs documentation and the Claude Code status line also surface live token usage during a session, which is the easiest way to catch a rot-prone session in flight.
Does prompt caching help with context rot?
Caching reduces cost by reusing prior context cheaply; it doesn't reduce accuracy degradation, because the model still attends over the full reused context. The 1-hour cache window Anthropic shipped in early 2026 made the cost story worse for many heavy users (XDA Developers, 2026). Caching and rot are orthogonal problems, and conflating them leads to spend that grows faster than quality.
Closing
Three things to take away. Context rot is real, measured by Anthropic's own benchmarks and Chroma's research, and present in every frontier model. In Claude Code it surfaces faster than in chat because of the agentic loop. The fix is retrieval discipline at every layer, from the commands you run between tasks, to the skills you load on demand, to where your team's institutional knowledge actually lives.
If you adopt the five patterns this week, the diagnostic snippet earlier in the post is the cheapest way to verify they're working. When you're ready to look at retrieval at the team level, the next read is the context-engineering pillar guide, which covers the four pillars (source curation, retrieval, ranking, feedback loops) that turn institutional memory into something an agent can actually use. After that, how a context engine actually works gets concrete about the architecture, and why your team's rejected approaches keep coming back covers the memory side of the same coin.

