Key Takeaways
• The trial-and-error tax in Claude Code is structural. The agent rebuilds rejected work because the repo only tells it what the code looks like today, not what was tried first or why the team killed it.
• Anthropic's context-engineering cookbook frames compaction, tool-result clearing, and memory as three distinct context strategies. Across all three, the under-served axis in most agent setups today is team-scope on-demand retrieval: the institutional context that should persist across people and projects.
• The retrieval pattern that fixes this surfaces PRs, decision threads, and incident postmortems on demand at the moment of decision, not preloaded into CLAUDE.md. Measure success with re-derivation rate and first-try alignment, not agent self-reports.
GitHub issue anthropics/claude-code#35296, filed in 2026, puts the community complaint verbatim: "Sessions waste massive tokens on trial-and-error for things already figured out in prior sessions." That sentence captures the failure mode better than any benchmark. Engineers running Claude Code daily have all watched it: the agent proposes the retry library you killed in review six months ago, runs the same grep it ran twenty turns earlier, cites a function signature that doesn't exist in your tree. The first instinct is to call it a model bug. It isn't a model bug. It's a memory problem, and the fix doesn't live at the model layer.
What's the trial-and-error tax in Claude Code sessions?#
The trial-and-error tax is the share of session tokens, plus the share of reviewer cycles, your team pays when an AI coding agent re-derives or rebuilds work the team has already done. The pattern is community-documented in GitHub issue anthropics/claude-code#35296, and Anthropic itself has acknowledged users hitting limits "way faster than expected" (DevClass, April 2026). Both costs come out of the same budget.
The two costs are easy to confuse, so it's worth separating them. The token cost is the obvious one: every minute the agent spends exploring an approach the team already explored is a minute paid for in your usage limit, and on heavy days that adds up to real dollars. The DevOps coverage of the spring 2026 token-drain pattern walks through the math at length (DevOps.com, 2026). The harder cost is the reviewer cycle. When the agent's PR contains an approach the team rejected six months ago, somebody has to explain the rejection again, the PR has to go back, and the institutional knowledge that was supposed to be settled gets re-litigated.
This is related to but distinct from context rot in Claude Code, which is a quality-degradation problem at high context fill. Context rot is about attention thinning across too much input. Institutional-memory loss is about missing input in the first place. You can have a perfectly fresh session at 5,000 tokens and still watch the agent reinvent the wheel, because the wheel was killed in a Slack thread three quarters ago and nothing in your repo points to that thread.
A widely shared 2025 Hacker News discussion of Claude Code spend (HN 44879053) and Anthropic's own costs documentation both implicitly acknowledge the same root cause: the agent's effective context is bounded by what the team has put in front of it, and most teams haven't put their decision history in front of it.
Why does Claude reinvent the wheel? Three structural reasons.#
Three things conspire. Code shows the what but not the why. CLAUDE.md captures conventions, not decision history. And every session starts cold, with no carry-forward beyond what auto-memory persists about the user. The cookbook framing matters here: Anthropic's context-engineering cookbook treats memory as "structured note-taking, persistent external storage" so an agent can "track progress across tasks and sessions without keeping everything in active context." That's the per-user case. Team-scope memory, the institutional context across people and projects, is what most agent setups still lack.
Each reason is independently fixable. Stacking them is what produces the symptom.
Code shows the what, not the why#
git blame tells you who touched a line and when. It rarely tells you why the line is that shape rather than the obvious alternative. PR descriptions decay (the engineer who wrote it has moved teams), comments rot (someone refactored around them), and the architectural arguments that produced the current code mostly happened in places the agent can't see. When Claude Code reads a file, it gets fluent at the what. The reasoning behind the choice almost never makes it into the working tree.
CLAUDE.md captures conventions, not decisions#
CLAUDE.md is a useful tool, and it loads at session start, which is exactly why it's the wrong shape for decision history. Conventions belong there: "Use library X for retries." Decisions don't: "We considered Y and killed it because it caused incident Z in March." The first stays true for years; the second is a moment-in-time judgment that loses force as conditions change. Stuffing the second into CLAUDE.md is how files end up at 800 lines, every line costing tokens on every turn whether or not it's relevant.
Each session starts cold#
This is the structural one. Anthropic's context-engineering writing treats memory as one of three context strategies (alongside compaction and tool-result clearing) and frames it as "persistent external storage" the agent reads from across sessions. That covers within-session and per-user memory cleanly. Team-scope memory, the institutional context that should persist across people, sessions, and projects, is not what the cookbook is primarily about, and it's the axis most agent setups still leave unfilled. When the session starts, the agent has access to the codebase, your CLAUDE.md, and whatever the auto-memory file remembers about you personally. It does not have access to the Slack thread where the decision actually happened.
Where engineering decisions actually live (estimated distribution):
- PR threads: roughly 30%. Review discussion is where most decisions get argued and recorded.
- Chat / Slack: roughly 25%. The off-the-cuff debates and the "we tried that, here's why we killed it" moments.
- Tickets / Jira: roughly 20%. The structured asks and the rejection paper trails.
- Docs: roughly 15%. ADRs and runbooks for the decisions that earned a writeup.
- Code comments: roughly 10%. The thinnest layer, and the only one the agent reads by default.
What does "institutional memory" actually mean for an AI coding agent?#
Anthropic's context-engineering cookbook treats memory as "structured note-taking" the agent persists across sessions, one of three context-engineering strategies alongside compaction and tool-result clearing. That covers within-session and per-user memory cleanly. The scope axis it leaves implicit is the team. The 3×3 framing below, scope (session, user, team) on one axis and medium (file, embedding, on-demand retrieval) on the other, exposes which cell most agent setups have left empty.
The word "memory" gets used loosely, and that's part of why the conversation is muddy. Three terms get conflated, and they're all distinct. Prompt caching is a cost-reduction technique: reuse prior context cheaply. It's not memory, it's billing. Per-user auto-memory persists facts about you (your preferences, your role, your past projects). It's memory, but at the wrong scope for institutional knowledge. Team-scope memory is the institutional context that should persist across people, sessions, and projects, and it's the thing most agents do not have.
The 3×3 lays this out cleanly.
Memory at scope × medium:

| Scope | File | Embedding | On-demand retrieval |
|---|---|---|---|
| Session | working context | RAG over turn history | tool calls / subagents |
| User | CLAUDE.md + memory file | user embeddings | per-user auto-memory |
| Team | shared CLAUDE.md | code-only RAG | PRs, chat, tickets, docs on demand (under-served axis) |
The team-scope × on-demand-retrieval cell is the one most agent setups leave empty.
The reason that cell stays empty has nothing to do with technology. It stays empty because filling it requires reaching across system boundaries (the working tree, the PR system, the chat tool, the ticket tracker) that most engineering tools historically don't reach across. MCP changed the cost of reaching, but the design work, deciding which sources count and how they should be ranked, is still on the team adopting it.
What's the retrieval pattern for team-scope memory?#
Three rules. Surface PRs, decision threads, and incident postmortems on demand at the moment of decision, not at session start. Retrieve over team-scope sources separately from file-system search, because the relevance signals are different. And do it through whatever surface the engineer is already in (IDE, terminal, or chat), so the retrieval shows up where the work is happening.
The pattern is structurally similar to the subagent pattern in Claude Code. Anthropic's sub-agents documentation describes how "verbose output stays in the subagent's context while only the relevant summary returns to your main conversation," and Anthropic's context-engineering writing puts the typical returned summary at "1,000-2,000 tokens." On-demand team-knowledge retrieval works the same way: the search over PRs and chat happens outside the main context, and only the relevant decision returns.
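A minimal sketch of that shape in Python. The source types, field names, and character budget below are assumptions for illustration, not any particular tool's API; the point is that the full search result stays outside the main conversation and only a bounded summary returns.

```python
from dataclasses import dataclass


@dataclass
class DecisionHit:
    source: str     # e.g. "pr", "chat", "ticket", "postmortem" (illustrative source types)
    title: str
    rationale: str  # the "why", e.g. "rejected retry library X after incident Z"
    url: str


def summarize_for_main_context(hits: list[DecisionHit], budget_chars: int = 6000) -> str:
    """Collapse a full retrieval result into a bounded summary.

    The search over PRs, chat, and tickets happens outside the main context;
    only this summary string returns to the conversation, the same way a
    subagent returns a short report instead of its raw output.
    """
    lines: list[str] = []
    used = 0
    for hit in hits:
        line = f"- [{hit.source}] {hit.title}: {hit.rationale} ({hit.url})"
        if used + len(line) > budget_chars:  # rough stand-in for a 1,000-2,000 token budget
            break
        lines.append(line)
        used += len(line)
    return "\n".join(lines) or "No prior team decisions found for this area."
```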
What to surface#
PRs that touched the same module. Decision threads in chat tied to the same area. Rejected approaches with the rejection rationale. Incident postmortems if the area has had production incidents. Ticket history if the change is following up on a known issue. Not all of these for every change. The retrieval has to rank.
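As a sketch of what that ranking can look like, here is a minimal Python version; the source weights, field names, and the idea of scoring by path overlap are assumptions for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass, field

# Illustrative weights: review threads usually carry the clearest rationale,
# docs the most slowly updated one. These numbers are assumptions to tune per team.
SOURCE_WEIGHT = {"pr": 1.0, "postmortem": 0.9, "chat": 0.7, "ticket": 0.6, "doc": 0.5}


@dataclass
class MemoryItem:
    source: str                                    # one of SOURCE_WEIGHT's keys
    paths: set[str] = field(default_factory=set)   # files or modules the item touches
    text: str = ""


def rank_for_change(items: list[MemoryItem], changed_paths: set[str], k: int = 5) -> list[MemoryItem]:
    """Rank team-memory items for a proposed change by path overlap times source weight."""
    def score(item: MemoryItem) -> float:
        overlap = len(item.paths & changed_paths) / max(len(changed_paths), 1)
        return overlap * SOURCE_WEIGHT.get(item.source, 0.3)
    return sorted(items, key=score, reverse=True)[:k]
```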
When to surface it#
Three moments. When the agent is planning a change, before it edits files. When the agent is about to propose an approach that contradicts a constraint that's not in the working tree. And when a human reviewer asks "why this approach?" and the agent has to defend the choice.
How to surface it#
Through MCP, exposed at the surface where the engineer is working. An institutional-memory layer like Unblocked connects through MCP, IDE, Slack, and CLI, so the same retrieval works whether the engineer is in the editor, the terminal, or chat. Beyond that placement detail, what matters is the principle: load less, fetch on demand, keep the team-scope signal clean.
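As a sketch of the MCP side, assuming the official MCP Python SDK's FastMCP helper; the server name, tool signature, and the stubbed result are illustrative, not Unblocked's API or any specific vendor integration.

```python
# Minimal MCP server exposing team-decision retrieval as one tool.
# Assumes the official MCP Python SDK ("mcp" package); everything inside the
# tool body is a stub standing in for real queries to PRs, chat, and tickets.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("team-memory")


@mcp.tool()
def search_team_decisions(module: str, question: str) -> str:
    """Return prior PRs, decision threads, and postmortems relevant to a module.

    The agent calls this at decision time instead of loading history up front.
    """
    # Placeholder result; a real server would search the PR system, chat,
    # tickets, and postmortems, then rank and summarize before returning.
    hits: list[str] = []
    return "\n".join(hits) if hits else f"No recorded decisions found for {module}."


if __name__ == "__main__":
    mcp.run()
```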
How do you measure whether institutional memory is working?#
Three signals. Re-derivation rate per session falls. First-try alignment with prior team decisions rises. Review-acceptance rate on AI-assisted PRs climbs. Don't measure agent satisfaction or perceived helpfulness; those are vanity metrics that don't correlate with whether human review cycles are getting cheaper.
Re-derivation rate#
How often does the agent propose work the team already did? Sample from a week of sessions, classify each proposal as novel, partially-overlapping, or full re-derivation, and track the share over time. Most teams that start measuring this find the baseline higher than they expected. The number that matters is the trajectory: the share should fall once on-demand team-knowledge retrieval is wired in, and it should fall faster than any model upgrade alone would deliver.
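The arithmetic is small enough to sketch in a few lines of Python; the labels and the sample week below are illustrative, not a benchmark.

```python
from collections import Counter

# Labels a reviewer assigns when sampling a week of agent proposals.
LABELS = ("novel", "partial_overlap", "full_rederivation")


def rederivation_rate(labels: list[str]) -> float:
    """Share of sampled proposals that fully re-derive work the team already did."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    return counts["full_rederivation"] / total if total else 0.0


# Illustrative sample: 40 labeled proposals from one week of sessions.
week = ["novel"] * 22 + ["partial_overlap"] * 8 + ["full_rederivation"] * 10
print(f"{rederivation_rate(week):.0%}")  # -> 25%
```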
First-try decision alignment#
For changes where there's a prior team decision the proposal should respect (a chosen library, a deprecated pattern, an architectural rule), did the agent's first proposal honor the decision without prompting? This is the signal that retrieval is reaching the agent at the right moment, not just being available somewhere.
Review-acceptance rate on AI-assisted PRs#
The downstream metric. If the first two are improving and review-acceptance isn't, something else is wrong. If review-acceptance is improving while the first two are flat, you may be measuring the wrong thing. The two should move together over a quarter.
When does institutional memory hurt you?#
Three failure modes. Stale decisions outweigh recent evidence. The agent over-cites internal threads instead of the working tree. And the team treats retrieval as ground truth rather than a hypothesis. The cookbook implies the verification pattern by example, instructing agents to "use information from /memories for organisms already covered there; read source documents for any organism not yet captured" (Anthropic Cookbook, 2026), which is a working description of memory-as-hypothesis: read it, then verify against the source.
This is where the article earns its keep. Every retrieval system has failure modes, and pretending otherwise is a vendor move.
Stale decisions that no longer apply#
A rejection from two years ago shouldn't outweigh a refactor from last week. Retrieval has to rank by recency, not just relevance, and the team has to retire decisions explicitly when they stop binding. Without that, the agent will cite a thread that everyone has moved past, and the PR will get bounced in review for a different reason than before.
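One way to encode both rules, sketched in Python; the half-life and the field names are assumptions, not recommended values.

```python
import math
from datetime import datetime, timezone

HALF_LIFE_DAYS = 180  # illustrative: a decision loses half its weight in about six months


def decision_weight(decided_at: datetime, retired: bool, relevance: float) -> float:
    """Combine relevance with recency decay; retired decisions drop out entirely.

    decided_at must be timezone-aware; relevance is whatever score the
    retrieval already produced for this item.
    """
    if retired:
        return 0.0  # explicit retirement beats any relevance score
    age_days = (datetime.now(timezone.utc) - decided_at).days
    recency = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    return relevance * recency
```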
Over-citing internal threads#
The agent should reach for the working tree first when the question is about current behavior. Internal threads belong in the answer when the question is about why the current behavior is the way it is. Conflating these makes the agent sound informed in the wrong direction. Anthropic's context-engineering writing frames this as a relevance-ranking problem, not a memory problem, but the failure surfaces the same way to a reviewer.
Memory is not ground truth#
A remembered constraint is a hypothesis, not a fact. The constraint may have been true when it was written, and false now. The discipline is: retrieve, surface, then verify against current code or current production behavior. Skip the verification and the agent confidently cites things that no longer hold. Anthropic's cookbook example shows this pattern in practice (read memory first, fall back to source documents for what isn't captured), and the more honest practitioner guides walk through the same logic (XDA Developers, 2026). If the retrieval pattern doesn't include a verification step, the team has built a confident-hallucination machine.
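A minimal verification sketch, assuming the remembered constraint can be checked with a plain git grep over the working tree; the pattern in the usage comment is hypothetical.

```python
import subprocess


def constraint_still_holds(pattern: str, repo_root: str = ".") -> bool:
    """Check a remembered constraint against the current tree before citing it.

    git grep exits 0 when the pattern appears somewhere in tracked files and
    1 when it does not, so an empty result means the remembered rule may
    have been refactored away since the decision was recorded.
    """
    result = subprocess.run(
        ["git", "grep", "-l", pattern],
        cwd=repo_root, capture_output=True, text=True,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


# Usage: verify before the agent repeats the remembered rule.
# constraint_still_holds("from tenacity import retry")  # hypothetical pattern
```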
Frequently Asked Questions#
Why does Claude Code keep proposing approaches we already rejected?#
The agent has access to the working tree and your CLAUDE.md, but not to the PR comments, Slack threads, or incident postmortems where most rejections actually live. GitHub issue #35296 tracks this pattern verbatim. The fix isn't a model upgrade, it's making decision history retrievable on demand at the moment the agent is proposing.
What's the difference between Claude memory and team-scope memory?#
Claude's auto-memory (the per-user memory file) persists facts about you, the user, across sessions: your role, preferences, past projects. Team-scope memory persists facts about the team's decisions, conventions, and rejected approaches. Anthropic's context-engineering cookbook covers the per-user case explicitly through its memory-as-note-taking framing; the team scope is the axis most agent setups still leave unfilled.
Isn't this what CLAUDE.md is for?#
CLAUDE.md works for conventions ("use library X"). It doesn't work for decision history ("we considered Y and killed it because of incident Z") because that content is too long, too time-sensitive, and only relevant some of the time. Loading it on every turn wastes tokens and crowds the context. Decision history belongs in on-demand retrieval, not preloaded.
Does the Claude Code CLI work the same way as the IDE for this?#
Yes. MCP-based retrieval surfaces the same team knowledge regardless of which surface the engineer is in. An institutional-memory layer that exposes content through MCP works in the IDE plugin, the terminal CLI, Slack integrations, and chat. The retrieval pattern is what matters; the surface is where it shows up.
How do you avoid feeding Claude stale decisions?#
Rank retrieval by recency, retire decisions explicitly when they stop binding, and treat every recalled constraint as a hypothesis to verify against current code. Anthropic's cookbook example demonstrates the pattern by instructing agents to "use information from /memories for organisms already covered there; read source documents for any organism not yet captured." Teams that skip the verification step end up with an agent that confidently cites superseded rules, which is a worse failure mode than not having the memory at all.
Closing#
Three takeaways. The trial-and-error tax is a memory problem, not a model problem. Team-scope on-demand retrieval is the under-served axis in the memory taxonomy, the cell most agent setups leave empty. And measurement should be on re-derivation rate and first-try decision alignment, not on agent self-reports or perceived helpfulness.
If the symptom side of this resonates, the related read is Claude Code context rot, which covers the quality-degradation problem that compounds with the memory problem. For the broader discipline this fits inside, the context-engineering pillar guide covers source curation, retrieval, ranking, and feedback loops as the four pillars that turn institutional knowledge into something an agent can actually use. For the architecture under the hood, how a context engine works goes one level deeper.

