Three Hard Lessons from Building Context at Scale

Dennis Pilarinos · May 1, 2026 · AI Agent Autonomy · Engineering Insights

Takeaways:

• More tools didn't make the agent smarter; depth above retrieval did

• Hiding source conflicts felt clean but eroded engineer trust

• Caching answers in an evolving system saves tokens and breaks correctness

• Surface-conflicts-AND-live-compute is the only stable resting point; get one wrong and the others amplify the failure

We built a context engine for AI coding agents. Three things almost broke it before we got it right.

This piece is the written retrospective behind a talk I gave at QCon London and AI Engineering London in early 2026: three context engineering lessons learned the expensive way, each from a real architectural mistake we shipped, watched fail, and rebuilt around. The lessons compound. Get any one of them wrong and the others amplify the failure.

What are these lessons, and why share them now?

Most vendors don't publish the architectures they shipped and rolled back. That silence is a problem, because more developers actively distrust the accuracy of AI tools than trust them, 46% versus 33%, according to the Stack Overflow Developer Survey 2025. Engineering leaders evaluating context engines deserve to see the failure modes, not just the demos.

The trust gap isn't abstract. The JetBrains State of Developer Ecosystem 2025 found that 99% of developers express some concern about AI in coding. We hit those concerns from the inside. Our first three production designs each failed in a different direction, and the failures shared a structure: every shortcut we took on context made the agent confidently wrong instead of usefully right. The cost of that compounds with every step toward autonomy; that's the broader argument behind stop babysitting your agents.

So this is a retrospective for the people building or evaluating context infrastructure. Three mistakes, three rebuilds, one composite lesson at the end. The context engineering lessons learned here aren't theoretical; each one cost us a rebuild.

Why didn't more access produce better answers?

Lesson 1: We optimized for access, not understanding. We kept wiring in MCP servers and connectors. Access went up, output quality plateaued. Engineers got more documents and worse answers. Anthropic's research on effective context engineering for AI agents puts it bluntly: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases."

We saw the same curve internally. Adding a sixth retrieval source rarely helped. It usually hurt. The model spent attention budget on near-duplicates and stale wiki pages, and the synthesis layer had less room to actually reason. Chroma's context rot research tested 18 frontier models, including Claude 4, GPT-4.1, Gemini 2.5, and Qwen3, and found that "model performance varies significantly as input length changes." Bigger windows are not free.

The fix was inversion. We stopped asking "what else can we plug in?" and started asking "what does this question actually need?" Fewer sources, ranked harder, with a reasoning layer above retrieval rather than beside it. The Pragmatic Engineer's MCP analysis describes the same trap: the protocol makes it cheap to add tools, which makes it tempting to add the wrong ones.

If you want the long version of this argument, why MCP servers aren't enough is the companion piece. The short version: depth above retrieval. Optimize the synthesis layer, not the index.
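
To make "fewer sources, ranked harder" concrete, here is a minimal sketch of the shape, not our production code. The token-overlap scorer, the cap of three sources, and the prompt format are all illustrative stand-ins; the point is the hard prune before synthesis, not the scorer.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str        # e.g. "migration runbook", "PR comment", "Slack thread"
    content: str
    relevance: float = 0.0

def score_relevance(question: str, content: str) -> float:
    # Stand-in scorer: plain token overlap. In practice this would be an
    # embedding similarity or cross-encoder score; the cap matters more.
    q = set(question.lower().split())
    c = set(content.lower().split())
    return len(q & c) / max(len(q), 1)

def rank_and_prune(question: str, candidates: list[Source], cap: int = 3) -> list[Source]:
    """Score every candidate against the question and keep only the top `cap`."""
    for s in candidates:
        s.relevance = score_relevance(question, s.content)
    return sorted(candidates, key=lambda s: s.relevance, reverse=True)[:cap]

def build_synthesis_prompt(question: str, sources: list[Source]) -> str:
    # The reasoning layer sits above retrieval: the model only ever sees the
    # pruned, ranked slice, so its attention budget goes to synthesis, not triage.
    context = "\n\n".join(f"[{s.name}]\n{s.content}" for s in sources)
    return f"Answer using only these sources.\n\n{context}\n\nQuestion: {question}"
```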

Why does hiding conflicts erode trust?

Lesson 2: We hid conflicts instead of surfacing them. Our first version picked one source as ground truth and discarded the rest. The output was clean, the behavior was awful. Engineers stopped trusting the system because they could see disagreements in their own heads that the system had silently resolved.

The lesson generalizes beyond us. The Stanford Legal RAG Hallucinations study found retrieval-grounded LLM systems still hallucinate on 17-34% of queries in their domain. Retrieval alone does not fix hallucination. If your sources disagree and you pick one quietly, you've built a confident wrong-answer machine on top of correct documents.

We rebuilt around three signals: authority (who wrote it, in what role), freshness (when was it last touched, by whom), and source type (PR comment versus runbook versus Slack thread). When sources agreed within tolerance, the system answered. When they disagreed materially, it surfaced the conflict instead of resolving it for you. That sounds obvious in retrospect. It was not obvious when the demo metric was "shortest answer that looks confident."
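
In code, that resting behavior looks roughly like the sketch below. It is not the real reconciliation logic; the exact-string agreement test stands in for the "within tolerance" check, and the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Claim:
    answer: str
    source_type: str        # "runbook", "pr_comment", "slack_thread", ...
    author_role: str        # authority: who wrote it, in what role
    last_touched: datetime  # freshness: when it was last touched

def reconcile(claims: list[Claim]) -> dict:
    """Answer when sources agree; surface the disagreement when they don't."""
    distinct = {c.answer for c in claims}
    if len(distinct) == 1:
        return {"status": "agreed", "answer": claims[0].answer, "evidence": claims}
    # Material disagreement: do NOT pick a winner. Return every claim with its
    # authority, freshness, and source-type signals so the engineer sees the
    # conflict and the evidence for each side.
    ordered = sorted(claims, key=lambda c: c.last_touched, reverse=True)
    return {"status": "conflict", "claims": ordered}
```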

Trust is a function of transparency, not certainty. An engineer reading an answer that says "the migration runbook says X, but the last two PRs and a Slack decision in February say Y, here's the diff" trusts the system more, not less. That's the heart of decision-grade context: the answer carries the disagreements forward instead of laundering them out.

Why can't you cache an evolving system?

Lesson 3: We cached answers instead of computing them. We cached aggressively to save tokens. It worked, until it didn't. Code changed, decisions changed, conventions evolved. Cached answers became stale answers, and stale answers got delivered with the same confidence as fresh ones.

The DORA 2025 State of AI-Assisted Software Development report frames the same dynamic at the org level: roughly 90% of developers now use AI assistance, only about one in four trust the outputs deeply, and adoption without governance correlates with higher delivery instability and increased rework. Stale context is one of the mechanisms behind that rework number. The agent answers fast, the answer is wrong, the human re-does the work.

We moved to live computation with tight invalidation. The token cost went up. The correctness cost, which is much higher, came down. Anthropic's code execution with MCP describes the architectural shape we landed on, just-in-time computation against the live system, with caching limited to derivations from immutable inputs.
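
A rough sketch of the caching boundary we mean, with the content hash of an immutable input as the cache key. The helper and its names are illustrative, not a real API.

```python
import hashlib
from typing import Callable, TypeVar

T = TypeVar("T")
_derivation_cache: dict[str, object] = {}

def cached_derivation(immutable_bytes: bytes, derive: Callable[[bytes], T]) -> T:
    """Cache a derivation keyed on the content hash of an immutable input.

    Reasonable things to pass as `derive`: parsing the AST of a file pinned to a
    frozen commit, embedding the text of a closed PR. If the underlying bytes can
    still change, don't route them through here; compute the answer live instead.
    """
    key = hashlib.sha256(immutable_bytes).hexdigest()
    if key not in _derivation_cache:
        _derivation_cache[key] = derive(immutable_bytes)
    return _derivation_cache[key]  # type: ignore[return-value]
```

Keying on the content hash makes the cache self-invalidating: the moment the bytes differ, the old entry simply never gets hit again.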

The over-exploration dynamic in satisfaction of search is the human analogue. Once you have an answer that feels right, you stop looking. Caches do the same thing to systems. Don't cache the parts that move.

Why do these three lessons compound?

These three context engineering lessons learned are not independent. Access without understanding wastes tokens on the wrong sources. Hidden conflicts plus cached answers ship wrong code with confidence. Fix any one in isolation and the other two pathologies erode whatever you've gained.

A context engine that surfaces conflicts but caches the resolution will surface yesterday's conflict against today's code. A context engine that computes live but pulls from twelve poorly ranked sources will compute the wrong synthesis faster. A context engine that ranks sources beautifully but hides their disagreement is a polished version of the original trust problem.

Surface-conflicts-AND-live-compute, on top of fewer-deeper sources, is the only stable resting point we found. Anything else is a failure mode waiting for the right input.

This is also why "context-aware" as a marketing claim has gotten cheap. The phrase covers everything from a glorified retrieval index to an actual reasoning layer. The diagnostic question is architectural, not adjectival. What does the system do when sources disagree, and how does it know its own answers are still true?

What would we do differently from day one?

If we were starting the context engine over, three things would change immediately, and they map cleanly to the context engineering lessons learned above. First, lead with reasoning over retrieval. The synthesis layer is the product. The index is plumbing. Optimizing the index until it's perfect and bolting reasoning on top is the path we took, and it's backwards. Anthropic's effective context engineering work makes the same case from the model side.

Second, invest in conflict resolution before scale. Every source you add multiplies the conflict surface rather than adding to it. Two sources have one possible disagreement. Six sources have fifteen. Build the ranking and surfacing logic when you have three sources, not when you have thirty. We learned this expensively. The hallucination range in the Stanford Legal RAG study stays uncomfortably high precisely because most retrieval systems still skip that work.
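
The arithmetic behind that jump is just pairwise combinations, and it's worth seeing how quickly it runs away from you:

```python
from math import comb

# Pairwise conflict surface for n sources: n choose 2.
for n in (2, 3, 6, 30):
    print(f"{n} sources -> {comb(n, 2)} possible pairwise disagreements")
# 2 -> 1, 3 -> 3, 6 -> 15, 30 -> 435
```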

Third, never cache the parts that move. Cache derivations from immutable inputs (a parsed AST of a frozen commit, an embedding of a closed PR). Don't cache answers from evolving sources. The token math feels worse on a spreadsheet. The correctness math is dramatically better in production, especially as agents take on more autonomous work and the cost of a confidently wrong answer compounds across a multi-step task.

That's the full retrospective in three sentences. Reasoning above retrieval. Conflicts surfaced, not resolved silently. Live compute over cached answers. Everything else we built, the institutional context across PRs, Slack, Jira, Notion, Confluence, and code, sits on top of those three commitments.

How do you test for these failures in any context engine?

You don't have to take any of this on faith. The context engineering lessons learned in sections two through four each map to a concrete diagnostic: three tests that work on any system, including ours.

Test the conflict surface. Pick a question whose correct answer requires reconciling two sources you know disagree, a stale runbook against a recent PR comment, for example. Ask the system. If it picks one and never mentions the other, you've found Lesson 2 in production. A trustworthy system surfaces the disagreement and explains the reconciliation.

Test the cache horizon. Change a documented rule in your repo (a lint config, a naming convention, an architectural decision record). Ask a question whose answer depends on the new rule. Time how long until the system reflects the change. Anything beyond a few minutes for a high-signal source means you're reading cached answers from an evolving system, Lesson 3.
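
If you want to script that check, the shape is a simple poll loop. This is a sketch, not a tool we ship; `ask` is a stand-in for whatever call queries the system under test, and the timeout and poll interval are arbitrary.

```python
import time
from typing import Callable, Optional

def cache_horizon(ask: Callable[[str], str], question: str, expected_fragment: str,
                  timeout_s: float = 900, poll_s: float = 30) -> Optional[float]:
    """Poll until the system's answer reflects a rule you just changed.

    Returns the seconds it took for `expected_fragment` to appear in the answer,
    or None if the timeout expired first (i.e. you're still reading a stale cache).
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if expected_fragment.lower() in ask(question).lower():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

Change the lint rule, then call something like `cache_horizon(ask, "What is our maximum line length?", "100")` (an illustrative question and value) and compare the result against your tolerance for stale answers.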

Test the reasoning depth. Ask a "why" question whose answer lives across PRs, Slack threads, and ticket history, not in a single document. A retrieval-first system returns documents. A reasoning-first system returns a synthesized answer that cites the trail. That gap is Lesson 1 made visible, and it's where the DORA 2025 finding, only about one in four developers deeply trusting outputs, shows up in practice.

The broader buyer's checklist is in how to evaluate a context engine, and the controlled-experiment methodology behind these diagnostics is in same prompt, same model, different context.

The composite lesson

Three context engineering lessons learned, one composite shape. Reasoning above retrieval. Conflicts surfaced, not resolved silently. Live compute over cached answers. Each one fixes a different failure. Together, they're the floor a context engine has to clear before "autonomous agent" stops being aspirational.

If you're building or evaluating context infrastructure that compounds with autonomy, the diagnostics in section seven are the cheapest way to find out where any system, ours included, actually stands. The goal isn't a clever index. It's an answer an engineer can act on without re-deriving it themselves. Retrospectives like this one exist because we got there the long way, and the next team shouldn't have to.