Context-Adjusted Productivity: Measuring What AI Actually Ships in Large Codebases

AI lifts merged PRs 98% but adds 91% to review time. Context-adjusted productivity is the number that survives that math.

Dennis Pilarinos | Jun 3, 2026 | Engineering Insights | Context Engineering

Key Takeaways

• Context-adjusted productivity subtracts rework, review-time, and churn costs from gross AI output to show the gain that actually ships.

• Faros AI found merged PRs rose 98% while the company-level correlation to delivered value stayed near zero, and review time climbed 91% (Faros AI, 2025).

• In mature 1M+ line repos, experienced developers were 19% slower with AI while feeling 20% faster (METR, 2025).

• The context gap, not model choice, drives most of the penalty; closing it cut tokens 42% and time 27% in our controlled test.

• You can instrument this in weeks, not quarters, by tagging AI-authored PRs and watching a rework window.

Context-adjusted productivity is the real productivity delta from AI after you subtract the rework, review-time, and code-churn costs that pile up when agents act without institutional context. It is the gain that survives once the cleanup is done, not the gross output that shows up the moment a pull request merges.

That distinction matters most when you try to measure AI productivity in a large codebase. The bigger and older the system, the more an agent is missing about why the code looks the way it does, so the gap between gross output and context-adjusted productivity widens. Raw metrics, PRs merged and lines shipped, climb fast and feel like progress. The trouble is that those same metrics climb whether the AI understood your conventions or just guessed from the nearest file.

This piece defines the term, shows why throughput overstates impact in legacy systems, and walks through a formula you can actually run.

What is context-adjusted productivity?

Faros AI studied more than 10,000 developers across 1,255 teams in 2025 and found that AI lifted merged pull requests by 98% and tasks completed by 21%, yet the company-level correlation to actual delivered value sat near zero. Review time rose 91% and per-developer bug rates climbed 9% (Faros AI, 2025).

That split is the whole point. Context-adjusted productivity measures what survives after the hidden costs come out: the reverts, the extra review passes, the churn from code that was almost right. Gross output answers "how much did we produce?" Context-adjusted productivity answers a harder question: "how much of that did we keep without paying it back later?"

Why does this matter now? Because most dashboards still report the gross number. Leaders see PR counts double and assume capacity doubled. The merged work was real, but a chunk of it generated downstream labor that the dashboard never attributes back to its source. To genuinely measure AI productivity in a large codebase, you have to net those costs out. This is the heart of the AI productivity paradox: the output rises and the value does not follow.

Why do raw productivity metrics overstate AI in large codebases?

GitClear analyzed 211 million changed lines of code and found copy/paste rose from 8.3% to 12.3% between 2021 and 2024, while refactoring collapsed from roughly 25% of changes to under 10% (GitClear, 2026). More duplication, less consolidation: the classic signature of code written without context.

Here is the mechanism. The larger the codebase, the more institutional context an agent is missing, and the more it falls back on nearest-file reasoning. It copies the closest pattern instead of finding the shared helper that lives three directories away. In a 50,000-line repo that mistake is cheap. In a multi-million-line system it compounds into duplication, drift, and review friction.

METR makes the size effect concrete. On mature repositories above one million lines, experienced developers were 19% slower with AI assistance, even though they reported feeling about 20% faster (METR, 2025). The perception gap is exactly why raw metrics mislead: the work feels productive while the clock says otherwise, a pattern we explore further in how to measure AI productivity.

How do you actually calculate it?

DX released an AI Measurement Framework in late 2025 built on three axes, utilization, impact, and cost, precisely because single-number dashboards hide the trade-offs (DX, 2025). To measure AI productivity in a large codebase you need that cost axis, and context-adjusted productivity forces it onto the ledger.

The formula is straightforward:

Context-adjusted productivity = (gross output gain) - (rework cost + review-time delta + churn cost)

Walk through an illustrative example. These numbers are hypothetical and round, chosen only to show the arithmetic, not drawn from any study.

Suppose an illustrative team ships 100 AI-assisted PRs in a month, a gross gain of 40 PRs over their pre-AI baseline of 60. That looks like a 67% lift. Now net the costs. Say 12 of those PRs get reverted or substantially reworked within two weeks, AI PRs take an extra 1.5 review hours each across 100 PRs (150 hours), and churn cleanup eats another 8 PR-equivalents of effort.

Convert the costs into PR-equivalents. Assuming a typical PR represents about 15 hours of effort, the 150 extra review hours come to roughly 10 PR-equivalents, so the total is 12 reworked + 10 review overhead + 8 churn = 30. The 40-PR gross gain minus 30 leaves a context-adjusted gain of 10 PRs, a real lift of about 17% over the 60-PR baseline. Same activity, very different story once context costs are on the balance sheet.
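The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the `MonthlyPRStats` fields and the `hours_per_pr` conversion factor are assumptions introduced here, and the inputs are the same hypothetical round numbers used in the example.

```python
from dataclasses import dataclass


@dataclass
class MonthlyPRStats:
    """Hypothetical inputs; costs end up expressed in PR-equivalents."""
    baseline_prs: int            # pre-AI monthly PR count
    ai_era_prs: int              # monthly PR count with AI assistance
    reworked_prs: int            # reverted or substantially reworked in the window
    extra_review_hours: float    # total added review time across AI PRs
    hours_per_pr: float          # assumed effort behind one typical PR
    churn_pr_equivalents: float  # churn cleanup, already in PR-equivalents


def context_adjusted_gain(s: MonthlyPRStats) -> float:
    """Gross output gain minus rework, review-time, and churn costs."""
    gross_gain = s.ai_era_prs - s.baseline_prs
    review_cost = s.extra_review_hours / s.hours_per_pr
    return gross_gain - (s.reworked_prs + review_cost + s.churn_pr_equivalents)


stats = MonthlyPRStats(
    baseline_prs=60,
    ai_era_prs=100,
    reworked_prs=12,
    extra_review_hours=150.0,
    hours_per_pr=15.0,  # assumption: makes 150 hours equal 10 PR-equivalents
    churn_pr_equivalents=8.0,
)

gross_lift = (stats.ai_era_prs - stats.baseline_prs) / stats.baseline_prs
adjusted_gain = context_adjusted_gain(stats)
adjusted_lift = adjusted_gain / stats.baseline_prs

print(f"gross lift: {gross_lift:.0%}")            # 67%
print(f"adjusted gain: {adjusted_gain:.0f} PRs")  # 10 PRs
print(f"adjusted lift: {adjusted_lift:.0%}")      # 17%
```

The conversion factor is the most sensitive input: halve `hours_per_pr` and the review overhead alone doubles to 20 PR-equivalents, which would push the adjusted gain to zero.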

What signals approximate the context penalty?

The 2025 DORA report found AI positively correlated with throughput yet still negatively correlated with delivery stability, the cleanest signal that speed and durability are pulling apart (DORA, 2025). That stability gap is where the context penalty hides, and a few proxy signals expose it.

Track the revert-or-rework rate on AI PRs: what share gets reverted or materially changed within N days of merge. A rising number means agents are shipping work that the team has to walk back. Track the review-time delta between AI-authored and human-authored PRs; Faros saw review time jump 91%, and your own delta localizes the cost (Faros AI, 2025).

Two more signals help. The copy-paste and churn ratio, straight from GitClear's methodology, flags duplication the agent introduced instead of reusing existing code. And the "almost right but not quite" rate, the PRs that pass CI but miss a convention, captures the subtle context misses that automated checks never catch. Together these four approximate a penalty you cannot measure directly, and they map cleanly onto the four keys in our guide to DORA metrics in the AI era.
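Two of these signals, the rework rate and the review-time delta, are simple enough to sketch directly. The snippet below is one possible shape, assuming you can export PR records with an AI-authored flag, review hours, and a reworked-in-window flag; the `PullRequest` type and the sample data are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    ai_authored: bool
    review_hours: float
    reworked_in_window: bool  # reverted or materially changed within N days


def rework_rate(prs: list[PullRequest]) -> float:
    """Share of the given PRs that were reverted or reworked in the window."""
    return sum(pr.reworked_in_window for pr in prs) / len(prs) if prs else 0.0


def review_time_delta(prs: list[PullRequest]) -> float:
    """Mean review hours on AI-authored PRs minus mean on human-authored PRs."""
    ai = [pr.review_hours for pr in prs if pr.ai_authored]
    human = [pr.review_hours for pr in prs if not pr.ai_authored]
    if not ai or not human:
        raise ValueError("need at least one PR in each group")
    return mean(ai) - mean(human)


# Made-up sample data, for illustration only.
sample = [
    PullRequest(ai_authored=True, review_hours=3.0, reworked_in_window=True),
    PullRequest(ai_authored=True, review_hours=2.0, reworked_in_window=False),
    PullRequest(ai_authored=False, review_hours=1.0, reworked_in_window=False),
    PullRequest(ai_authored=False, review_hours=2.0, reworked_in_window=False),
]

ai_prs = [pr for pr in sample if pr.ai_authored]
ai_rework = rework_rate(ai_prs)
delta = review_time_delta(sample)

print(ai_rework)  # 0.5
print(delta)      # 1.0
```

Comparing the AI rework rate against the same rate for human-authored PRs gives a baseline, since some rework happens regardless of who wrote the code.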

What does closing the context gap do to the number?

We ran a controlled A/B test, same prompt, same model, with and without an institutional context layer feeding the agent. With context, the agent used 42% fewer tokens, finished 27% faster, made 64% fewer tool calls, and scored 41 out of 50 against 24 out of 50 from an independent LLM judge (Unblocked, 2025).

Read those numbers as context-adjusted productivity in motion. Fewer tool calls and fewer tokens mean less flailing: the agent stops guessing from the nearest file because it can see the why behind the code, meaning the PRs, decisions, and discussions that explain intent. The quality score rising from 24 to 41 out of 50, roughly a 70% improvement, is the rework cost falling before it ever lands in your repo.
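For readers who want to check the size of the quality change, the two judge scores can be turned into a relative gain. This is only a small arithmetic sketch; `relative_change` is a helper name introduced here, and the two scores are the ones reported above.

```python
def relative_change(baseline: float, treatment: float) -> float:
    """Fractional change of treatment relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline


# Judge scores reported in the text (out of 50).
without_context = 24
with_context = 41

quality_gain = relative_change(without_context, with_context)
print(f"{quality_gain:.0%}")  # 71%
```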

One counterintuitive finding reframes what "context" even means. Across 17,190 user-rated references, helpfulness stayed flat regardless of how old the underlying documents were (Unblocked, 2025). The value of context is not freshness; a two-year-old architecture decision still explains today's code. That is why decision-grade context, the durable reasoning behind a system, moves the number more than chasing the latest docs. A standing context layer for AI agents is what makes that reasoning available at the moment the agent needs it.

How do you instrument this without a six-month project?

Atlassian's State of Developer Experience 2025 captured the paradox in one stat: developers report saving 10 or more hours a week with AI while losing 10 or more hours a week to the friction it creates (Atlassian, 2025). You do not need a year-long study to see your own version of that wash. You need four lightweight steps.

First, tag AI-authored PRs at merge; a label or commit trailer is enough. Second, define a rework window, say 14 days, and measure how many tagged PRs get reverted or substantially changed inside it. Third, survey perception briefly: ask whether the work felt faster, then compare that against the rework data to expose any gap like METR's. Fourth, compare AI-tagged and untagged, human-authored PRs on review time and churn.
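The first two steps can be sketched as follows. The `AI-Assisted: true` trailer is a hypothetical convention chosen for this example, not an established standard, and the 14-day window is the example value from the text; adapt both to whatever your team agrees on.

```python
from datetime import datetime, timedelta
from typing import Optional

AI_TRAILER = "ai-assisted"          # hypothetical trailer key
REWORK_WINDOW = timedelta(days=14)  # the example window from the text


def is_ai_authored(commit_message: str) -> bool:
    """True if the message's last paragraph has an 'AI-Assisted: true' trailer."""
    paragraphs = commit_message.strip().split("\n\n")
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == AI_TRAILER:
            return value.strip().lower() == "true"
    return False


def reworked_in_window(merged_at: datetime, reworked_at: Optional[datetime]) -> bool:
    """True if a revert or substantial change landed within the window after merge."""
    if reworked_at is None:
        return False
    return timedelta(0) <= reworked_at - merged_at <= REWORK_WINDOW


# Made-up commit message and dates, for illustration only.
message = "Add retry to billing client\n\nAI-Assisted: true\nReviewed-by: someone"
merged = datetime(2026, 1, 5)

tagged = is_ai_authored(message)
inside = reworked_in_window(merged, datetime(2026, 1, 12))
outside = reworked_in_window(merged, datetime(2026, 2, 1))

print(tagged, inside, outside)  # True True False
```

What counts as "substantially changed" is left to you: a revert commit is easy to detect, while a material rewrite needs a threshold, such as a share of the PR's lines modified, that this sketch does not attempt to define.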

That is a few weeks of instrumentation, not a quarter, and it is enough to measure AI productivity in a large codebase with real confidence. The point is not a perfect number; it is a directional read on whether your context-adjusted productivity is rising or just your raw output. When the partnership between measurement and context tooling tightens, like the work we are doing with DX, that read gets sharper (Unblocked, 2025). Pair it with decision-grade context so the durable reasoning behind your system feeds every measurement.

Frequently asked questions

What is context-adjusted productivity?

Context-adjusted productivity is the genuine AI gain left after subtracting rework, review-time, and churn costs that arise when agents lack institutional context. It separates output that ships and stays from output that merges then gets reverted. Faros AI found merged PRs rose 98% while the company-level correlation to delivered value stayed near zero; that gap is what this metric exposes (Faros AI, 2025).

Why is AI productivity harder to measure in legacy codebases?

Missing context scales with codebase size and age, so older systems hide more of the penalty. METR found developers were 19% slower on mature 1M+ line repos despite feeling 20% faster (METR, 2025). The agent guesses from nearby files, and those guesses compound across millions of lines you cannot easily audit.

What metrics reveal AI rework?

Watch the revert-or-rework rate within a fixed window, the review-time delta between AI and human PRs, and the copy-paste or churn ratio. GitClear measured copy/paste climbing from 8.3% to 12.3% while refactoring fell below 10% (GitClear, 2026). Rising duplication and review time together signal context-starved code.

Does better context improve measured productivity?

Yes, and measurably. In our controlled test, identical prompt and model, adding an institutional context layer cut tokens 42%, time 27%, and tool calls 64%, while quality scored 41 of 50 versus 24 of 50 (Unblocked, 2025). Closing the context gap raises context-adjusted productivity by reducing rework before it reaches review.

Putting Context on the Balance Sheet

Gross output is the easiest number to celebrate and the easiest to misread. PRs merged nearly doubled in Faros AI's data while the company-level gain hovered near zero, and the 2025 DORA report showed throughput and stability pulling in opposite directions (Faros AI, 2025; DORA, 2025). Context-adjusted productivity puts the cost side back on the ledger so the two stop hiding each other.

The takeaway is not that AI fails to help. It is that the help is conditional on context. Feed an agent the institutional reasoning behind your code and the rework shrinks; starve it and your dashboard inflates while your delivery stability quietly erodes. To genuinely measure AI productivity in a large codebase, net the costs out and instrument the rework window. Start with a fourteen-day tag-and-compare, then read the controlled same-prompt, same-model test to see how much the context gap alone moves the number.