Stale data isn't stale: what agent retrieval taught us about pruning context

Across 17,190 user-rated references in the Unblocked context engine, helpfulness rate stays flat from same-day sources back through 10-year-old ones. Old context isn't stale, and that has direct implications for managing context in brownfield codebases.

Rashin Arab | May 25, 2026 | Vision & Perspectives

Customers kept asking us to filter out old data.

An engineering manager at one customer put it bluntly: "I'm getting very tired of cited Slack threads from 2+ years ago as evidence." Another team asked whether old sources could decay in importance over time. Multiple customers flagged stale Notion pages, years-old Jira tickets appearing as if current, and archived docs treated as authoritative. The pattern was consistent across six organizations and more than fifteen separate pieces of feedback: old references were eroding confidence in the answers.

The ask was clear: give us freshness controls. Set a TTL. Decay the index. Stop the engine from anchoring on artifacts that no longer reflect how the system works.

Before writing the code, we pulled the numbers. They told us to put the feature down.

What we looked at

The Unblocked context engine serves references to AI agents and to a question-and-answer surface that engineers use directly. Every time a user marks an answer helpful or not-helpful, we tag the references the engine pulled to produce it. That gives us a ground-truth signal: which sources actually contributed to an answer the user kept, and which ones contributed to one they rejected.

For the last three months of traffic (Feb 23 to May 25, 2026), we pulled every feedback-tagged reference and computed one thing for each: how old was the source document at the moment the agent used it?

The set: 17,190 references across roughly 6,000 user-rated inferences, drawn from 24 source types. Doc ages ranged from same-day to just over 16 years. Of the rated references, 12,925 came from inferences the user marked helpful; 4,265 came from inferences marked not-helpful.

If "old data hurts answer quality" is true, the not-helpful references should skew older. They don't.

Figure: Helpfulness rate by age band. Bars are flat from same-day through 10+ years old, all within a few points of the 75% overall rate.

The helpfulness rate of a same-day source is 77%. The helpfulness rate of a source between five and ten years old is 74%. The 10-year-plus bucket is 80% (small sample, but pointing the wrong way for the prune-the-old hypothesis).
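The overall rate follows directly from the counts quoted earlier, and the per-band rates can be computed with a simple grouping. The sketch below is illustrative only: the `Reference` record, the band edges, and the function names are assumptions, since the article does not publish its schema or its exact band boundaries.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Counts quoted in the text.
HELPFUL, NOT_HELPFUL = 12_925, 4_265
assert HELPFUL + NOT_HELPFUL == 17_190
overall_rate = HELPFUL / (HELPFUL + NOT_HELPFUL)  # about 0.752

# Hypothetical band edges in days; the real analysis may cut differently.
BAND_EDGES = [1, 30, 365, 5 * 365, 10 * 365]
BAND_NAMES = ["same-day", "<1 month", "<1 year", "1-5 years", "5-10 years", "10+ years"]


@dataclass(frozen=True)
class Reference:
    created: date   # when the source document was written
    used: date      # when the agent pulled it into an answer
    helpful: bool   # label inherited from the rated inference


def age_band(ref: Reference) -> str:
    """Map a reference to a band by its age at the moment of use."""
    age_days = (ref.used - ref.created).days
    return BAND_NAMES[bisect_right(BAND_EDGES, age_days)]


def helpfulness_by_band(refs):
    """Return {band: (helpful_count, total, rate)} for bands that have data."""
    counts = defaultdict(lambda: [0, 0])
    for ref in refs:
        bucket = counts[age_band(ref)]
        bucket[0] += ref.helpful  # True counts as 1
        bucket[1] += 1
    return {band: (h, n, h / n) for band, (h, n) in counts.items()}
```

If the prune-the-old hypothesis held, the rate in the older bands would fall well below `overall_rate`. In the reported data it does not.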

The cumulative distributions of helpful and not-helpful references by age sit on top of each other for the full ten-year window. Age is not predicting quality at any threshold we tested.
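One common way to quantify "the curves sit on top of each other" is the largest vertical gap between the two empirical CDFs, which is the two-sample Kolmogorov-Smirnov statistic. The article does not say which comparison was used, so treat this as a sketch of one reasonable check rather than the analysis itself. The function names are hypothetical, and both inputs are assumed to be non-empty lists of ages in days.

```python
from bisect import bisect_right


def ecdf(sorted_ages, x):
    """Fraction of ages that are <= x (ages must be sorted ascending)."""
    return bisect_right(sorted_ages, x) / len(sorted_ages)


def max_cdf_gap(helpful_ages, not_helpful_ages):
    """Largest vertical gap between the two empirical CDFs.

    Near 0 means the curves overlap; near 1 means the two groups
    have very different age distributions.
    """
    a, b = sorted(helpful_ages), sorted(not_helpful_ages)
    # The gap can only change at observed ages, so checking those is enough.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

A gap close to zero at every age is what "age is not predicting quality at any threshold" looks like numerically.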

Figure: Cumulative distribution of helpful vs. not-helpful references by age. The two curves overlap from day one through ten years.

We checked the per-source breakdown to make sure the aggregate wasn't hiding a real effect inside, say, code-versus-docs. It isn't. Code references skew younger (only 29% of helpful code references are more than a year old, which makes sense since code churns). Inside every source category, though, the helpful and not-helpful age distributions still overlap. The pattern survives the cut.

A caveat on the measurement: helpfulness is measured at the inference level, not per-reference. When a user marks an answer helpful, every reference the engine pulled for that answer gets the tag. This is a blunt instrument. But over 17,000 data points, if old references were reliably producing bad answers, the age bands would show divergence even through the noise. They do not.

What's actually going on

Once you sit with the data, the result stops being surprising.

The customer complaints were real. Old Slack threads presented as current evidence. Archived Notion pages treated as authoritative. Jira tickets in backlog described as if the feature had shipped. But the failure mode was not that the documents were old. It was that the engine cited them without distinguishing between "this describes the current state" and "this describes a decision that shaped the current state." Age correlates with that distinction sometimes, so a TTL felt like the right fix. But cutting by age treats the symptom and throws away the signal.

The reflex to prune old data comes from a model of context where every reference is interchangeable with every other one: older means more likely to be wrong, newer means more likely to be right. That model is correct for code on main and a few other claims about the current state. It is wrong for almost everything else.

A four-year-old Slack thread can be the only place a team's API ownership convention was ever written down. The Jira ticket from 2023 explaining why a service uses an unusual retry pattern is still the truth about the system today. The design doc from before the last reorg explains a constraint that everyone who was around at the time remembers and nobody new has been told. Pieces of context like these aren't subject to decay because they were never claims about the current state. They're claims about why the current state is what it is.

In a brownfield codebase running real production load, that second category (the decisions, the why, the context behind the constraint) does more of the work than the freshness of any single document does.

Agents that get this wrong end up in the same place new hires do when they cargo-cult a recent PR: technically current, missing the load-bearing reason the team did it that way to begin with.

The depth of the referenced history

We could run this analysis because the context engine indexes the full history of connected sources. The references surfaced during our three-month sample window (Feb to May 2026) point to documents spanning more than 16 years of organizational history. That range is what makes the age question answerable.

A context engine that only indexes recent history cannot test this hypothesis. If the oldest document in your index is six months old, the feedback loop has no range to measure against. The question gets answered by intuition (usually the intuition that old data must be worse), and the product ships a TTL the data would have argued against.

Depth of indexed history is how you earn this kind of finding.

What this means for managing context for agents

If you're building or operating against a context layer for AI agents in a real codebase (not a greenfield demo, but a production system with years of history), the practical implications are pretty direct.

Don't prune by age. A document's age is not a useful proxy for how relevant or correct it is. If you're sweeping anything older than X months out of the index, you're throwing away signal at the same rate as noise, on this evidence.

Freshness is a weighting input, not a filter. Recency matters in narrow places: what's on main, what shipped this week, what the on-call rotation looks like right now. Use freshness as one signal in ranking, where the agent can still reach the older artifact if the question is about why. Don't use it as a guillotine.
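A minimal sketch of freshness as a weighting input: recency nudges the score but never removes a candidate. Everything here is an assumption for illustration, including the `Candidate` record, the 180-day half-life, and the 0.2 recency weight. None of these values come from the Unblocked engine, and none are tuned.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    relevance: float  # similarity in [0, 1] from a hypothetical retriever
    age_days: float


def recency_boost(age_days: float, half_life_days: float = 180.0) -> float:
    """Exponential decay in (0, 1]: 1.0 for a brand-new doc, 0.5 at one half-life."""
    return math.exp(-math.log(2) * age_days / half_life_days)


def score(c: Candidate, recency_weight: float = 0.2) -> float:
    """Blend relevance with recency.

    However old a document is, it keeps at least (1 - recency_weight)
    of its relevance, so age alone cannot push it out of reach.
    """
    blend = (1 - recency_weight) + recency_weight * recency_boost(c.age_days)
    return c.relevance * blend


def rank(candidates, recency_weight: float = 0.2):
    # No age cutoff anywhere: every candidate stays in the result.
    return sorted(candidates, key=lambda c: score(c, recency_weight), reverse=True)
```

With these illustrative weights, a four-year-old thread with relevance 0.9 still outranks a day-old page with relevance 0.5, while two equally relevant documents are ordered newest first.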

Treat decision logs as first-class. PR discussions, Slack threads on architecture changes, design docs, post-incident write-ups, ticket comments arguing through a tradeoff. These are the artifacts where the why lives. They're older on average than current code. They're also disproportionately load-bearing for any agent working on top of a brownfield system, because the code itself doesn't carry the reasoning.

Resolve conflicts; don't hide them. When the four-year-old design doc disagrees with a recent PR, the answer isn't to delete one. The answer is to surface both with provenance and let the agent (or the human) see the disagreement and decide what to do. The trap is the silent pick: the engine quietly chooses one source and answers confidently, and the answer is wrong because the source it chose was deprecated without ever being deleted.

The solution space that the data does support is multi-signal. Document-type hierarchies (code is current by definition; Slack is ephemeral by nature; design docs sit somewhere between). Status awareness (a Jira ticket in backlog is not a shipped feature; an archived Notion page is not current guidance). Conflict detection (when a recent PR contradicts an old design doc, surface both with dates and let the agent reason about which applies). These are harder to build than a TTL. They are also the only interventions the data supports.
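As a rough sketch of status awareness and conflict surfacing together, the example below attaches provenance to every reference and flags disagreement instead of choosing a winner. The `SourceRef` record, the status values, and the function names are all hypothetical. Comparing `claim` strings for equality is a stand-in: a real system would need a semantic comparison to decide whether two sources actually disagree.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    doc_id: str
    source_type: str        # e.g. "code", "design_doc", "slack", "jira"
    last_modified: date
    status: Optional[str]   # e.g. "backlog", "archived", "merged"; None if unknown
    claim: str              # what the source says about the topic


# Hypothetical statuses meaning "do not present this as the current state".
NOT_CURRENT = {"backlog", "archived", "deprecated"}


def annotate(ref: SourceRef) -> dict:
    """Attach provenance so type, date, and status travel with the claim."""
    return {
        "doc_id": ref.doc_id,
        "source_type": ref.source_type,
        "last_modified": ref.last_modified.isoformat(),
        "status": ref.status,
        "flagged_not_current": ref.status in NOT_CURRENT,
        "claim": ref.claim,
    }


def surface(refs):
    """Return every reference, newest first, and flag disagreement.

    Nothing is dropped: the caller sees both sides and their dates.
    """
    ordered = sorted(refs, key=lambda r: r.last_modified, reverse=True)
    conflict = len({r.claim for r in refs}) > 1  # placeholder comparison
    return {"conflict": conflict, "references": [annotate(r) for r in ordered]}
```

The design choice is that `surface` never resolves the conflict itself. It hands the agent both claims with dates and status, which avoids the silent pick.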

What we shipped instead

We didn't ship the garbage-collection feature. We did ship a stronger conviction about what the context engine is doing when it works.

For agents in brownfield codebases, context isn't a fresh-data problem. It's a synthesis-and-reconciliation problem across artifacts that span the full life of the system. The agent that writes code that feels like it was written by someone who's been on your team for years is the one that can reach back a decade and pull the design doc that explains why, not the one that can only see what shipped this week.

On this evidence: old is not stale.