The AI Productivity Paradox: Why Your Devs Feel Faster but Ship the Same

Dennis Pilarinos · Jun 2, 2026 · Engineering Insights

TL;DR: Developers report large AI speedups, but controlled and team-level measurements show flat or negative output on mature codebases. Pair every perceived-speed metric with an outcome counter-metric, and supply institutional context so agents stop generating plausible-but-wrong code.

Your developers will tell you AI made them faster. The most rigorous controlled evidence to date says the opposite, and the gap between those two facts is the AI productivity paradox. In a randomized trial, experienced engineers worked 19% slower with AI tools while genuinely believing they were 20% faster (METR, 2025). Perception and measured output have decoupled. That is the whole story: the feeling of speed is real, the shipped throughput is not. Teams keep buying more tools to chase a number that self-report can't measure, when the missing ingredient is usually institutional context, the reasoning behind the code that tools like Unblocked surface for agents. Before you scale your AI rollout, it's worth understanding why this gap shows up precisely where you'd least expect it, among your most senior people on your most important code.

What is the AI productivity paradox?

The AI productivity paradox is the divergence between how fast developers feel with AI and how much they actually ship. In a randomized controlled trial, 16 experienced open-source developers completing 246 tasks on mature repositories (1M+ lines of code) were 19% slower with AI, despite believing they were roughly 20% faster and expecting a 24% gain beforehand (METR, 2025).

That is a perception gap of 39 points between belief (20% faster) and reality (19% slower). The finding is uncomfortable because it inverts the marketing. These weren't junior developers fumbling through unfamiliar code. They were domain experts working in repositories they knew intimately, exactly the population you'd expect AI to help most.
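
To make the arithmetic explicit, here is a minimal Python sketch. It follows this article's framing of treating the believed speedup and the measured slowdown as signed percentage changes on one scale; the function name `perception_gap` is illustrative and does not come from the METR study.

```python
def perception_gap(perceived_change_pct: float, measured_change_pct: float) -> float:
    """Gap, in percentage points, between believed and measured change in speed.

    Convention (an assumption of this sketch): positive = faster, negative = slower.
    """
    return perceived_change_pct - measured_change_pct


# Figures quoted above: developers believed +20%, the trial measured -19%.
gap = perception_gap(perceived_change_pct=20.0, measured_change_pct=-19.0)
print(gap)  # 39.0
```

The same subtraction works for any team that records both a self-reported estimate and a timed measurement for the same tasks.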

So what's happening? AI genuinely accelerates one slice of the work (generating code, suggesting completions, drafting boilerplate), but that slice is not what gates delivery. Meanwhile, the time spent prompting, reviewing, and correcting near-miss output quietly absorbs the savings. The clock says one thing. The developer's gut says another. The AI productivity paradox lives in that mismatch.

Why do developers feel faster when they're not?

Developers feel faster because AI removes friction from the most visible part of the work, typing and recall, even when total elapsed time rises. In Stack Overflow's 2025 developer survey, 66% of developers said AI solutions are "almost right but not quite." Only about a third trust the accuracy of those tools and 46% actively distrust it, yet 84% use or plan to use them (Stack Overflow, 2025).

"Almost right but not quite" is the phrase that explains the whole illusion. When an agent produces 90% of a solution instantly, the effort you saved is vivid and immediate. The effort you spend hunting the missing 10%, the wrong import, the stale convention, the subtly broken edge case, feels like normal work, so your brain doesn't bill it against the AI.

Effort and elapsed time are different things. Offloading keystrokes lowers cognitive effort, which registers as "faster." But verification, debugging, and reconciling near-miss code can extend the clock well past where you'd have landed writing it yourself. We've found that the more confident the output looks, the less scrutiny it invites, and the more expensive the eventual correction becomes. If you want to track this properly, start by learning how to measure AI productivity with output-based metrics. The feeling is honest. It's just measuring the wrong thing.

Does individual speed translate to team output?

Individual speedups largely evaporate at the team level. Analyzing more than 10,000 developers across 1,255 teams, Faros AI found AI use correlated with 21% more tasks completed and 98% more pull requests merged, but no significant correlation with company-level improvement, while PR review time rose 91% (Faros AI, 2025).

Read that again. Nearly twice the PRs merged, and yet the business outcome didn't move. The bottleneck simply shifted downstream. More code generated means more code to review, and review time nearly doubled, eating the upstream gains before they reached production.

This is the system-level shape of the paradox. Local optimization at the keyboard creates global congestion at the review queue. Each developer is genuinely producing more artifacts, but artifacts are not outcomes. A merged PR that needs a 91% longer review, or a follow-up fix next sprint, hasn't shortened your delivery cycle; it's relocated the work and added handoff cost. This is why we track context-adjusted productivity rather than raw commit volume. When you measure throughput by what one person commits rather than what the team ships and keeps stable, you'll mistake motion for progress every time.
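
The difference between counting artifacts and counting outcomes can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `PullRequest` record, its field names, and the sample data are hypothetical, and what qualifies as a "follow-up fix" (a revert, a hotfix within some window) is a definition your team would need to choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical fields; map them to whatever your tooling actually records.
    author: str
    merged: bool
    deployed: bool
    needed_followup_fix: bool  # e.g. a revert or hotfix inside your chosen window


def artifact_count(prs: list[PullRequest]) -> int:
    """Raw activity: every merged PR counts, whatever happened afterwards."""
    return sum(1 for pr in prs if pr.merged)


def outcome_count(prs: list[PullRequest]) -> int:
    """Outcomes: merged, deployed, and not followed by a fix."""
    return sum(
        1 for pr in prs
        if pr.merged and pr.deployed and not pr.needed_followup_fix
    )


# Made-up sample data, not taken from any study cited in this post.
sample = [
    PullRequest("dev_a", merged=True, deployed=True, needed_followup_fix=False),
    PullRequest("dev_a", merged=True, deployed=True, needed_followup_fix=True),
    PullRequest("dev_b", merged=True, deployed=False, needed_followup_fix=False),
    PullRequest("dev_b", merged=False, deployed=False, needed_followup_fix=False),
]

print(artifact_count(sample))  # 3
print(outcome_count(sample))   # 1
```

A dashboard built on `artifact_count` alone would report three units of progress here; the outcome view reports one.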

Where does the time actually go?

The saved time gets recycled into rework, review, and organizational drag. In a 2025 survey of developer experience, 99% of developers reported time savings from AI and 68% said they save 10 or more hours per week, yet 50% reported also losing 10 or more hours per week to organizational inefficiency (Atlassian, 2025).

Those two numbers describe the same person. AI hands back ten hours; the broken system takes ten back. The net is a wash, which is exactly what the controlled studies show.

Code quality data points the same direction. GitClear found copy-pasted lines rose from 8.3% to 12.3% between 2021 and 2024, while refactoring activity fell from roughly 25% of changes to under 10% (GitClear, 2026). More duplication and less consolidation is a recipe for slow, expensive maintenance later. The time AI saves you in the editor today shows up as churn, review burden, and verification cost tomorrow. The hours don't vanish in this paradox. They migrate to places your dashboards rarely watch.

Wasn't AI supposed to be a huge speedup?

The famous speedup numbers came from a different kind of task. A 2023 randomized controlled trial of GitHub Copilot found developers completed a self-contained JavaScript task 55% faster (GitHub / Peng et al., 2023). That result is real, but note the date and the setup: it predates current agentic tools and used a greenfield, well-bounded problem.

That's the crux. AI shines on isolated, low-context tasks: a fresh function, a known algorithm, a throwaway script. It struggles where most professional work lives, inside large, interdependent, history-laden codebases. The METR trial deliberately tested that harder reality on 1M+ line repositories, and the speedup flipped to a 19% slowdown (METR, 2025). Both findings can be true. The 2023 number isn't wrong; it just doesn't describe your production monorepo. Generalizing a greenfield benchmark to mature engineering is how the headline speedup got mistaken for settled fact.

What causes the AI productivity paradox, and what fixes it?

The root cause is missing institutional context: agents act on what the code says without knowing why it was built that way, producing plausible-but-wrong output that triggers rework. The 2025 DORA report found AI positively correlated with throughput but still negatively correlated with delivery stability, with 90% of teams using AI and 30% reporting little or no trust in it (DORA, 2025).

Our own controlled A/B test makes the mechanism concrete. An AI agent given the surrounding institutional context (the PRs, design docs, decisions, and history behind the code) used 42% fewer tokens, completed tasks 27% faster, made 64% fewer tool calls, and scored 41 out of 50 from an LLM judge, versus 24 out of 50 without that context (Unblocked, 2026).

Here's the insight most teams miss: the paradox isn't an AI capability problem; it's a context-delivery problem. An agent without your team's accumulated reasoning will confidently re-implement a rejected approach or violate a convention nobody wrote down. Unblocked supplies that institutional context, the why, who, and when behind code, at the moment the agent needs it, which is what context-adjusted productivity measures. It's also why teams stop babysitting their agents once the missing context is supplied at the source. The fix is structural: feed context at the source, then measure outcomes, not feelings.

Frequently asked questions

Is the AI productivity paradox real or just anecdotal?

It's measured, not anecdotal. A randomized controlled trial found experienced developers 19% slower with AI (METR, 2025), and a study of 1,255 teams found no significant correlation with company-level improvement despite 98% more PRs merged (Faros AI, 2025). Two independent methods, the same conclusion.

Does AI make experienced developers slower?

Under certain conditions, yes. On mature 1M+ line repositories, the METR trial measured a 19% slowdown for experienced developers despite their belief they'd sped up (METR, 2025). The effect is context-dependent: greenfield, low-context tasks still benefit. Satisfaction-of-search bias helps explain why agents stop at the first plausible answer.

Why do surveys say AI boosts productivity if output doesn't rise?

Because surveys measure self-reported feeling, not shipped output. Atlassian found 99% of developers report time savings, yet 50% also lose 10+ hours weekly to inefficiency (Atlassian, 2025). The perception gap is the paradox: effort drops while elapsed time doesn't.

How do you avoid the paradox on your team?

Pair every speed metric with an outcome counter-metric, and supply context at the source. Our A/B test showed context-rich agents ran 27% faster with 64% fewer tool calls (Unblocked, 2026). Measure delivery stability alongside throughput, since DORA found AI can lift one while hurting the other.

Measuring Past the Feeling

The AI productivity paradox won't be solved by buying more tools or surveying happier developers. It's solved by measurement discipline. In my experience, every team that falls into this trap measures the same thing: artifacts produced, prompts answered, hours felt saved. None of those are outcomes. The fix is to pair each perceived-speed metric with a counter-metric that tracks reality: throughput against delivery stability, PRs merged against review time and rework, code generated against duplication and churn.
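
One way to encode that pairing is sketched below in Python. Treat it as an illustration under stated assumptions, not a prescribed framework: the `MetricPair` type, the `assess` function, its verdict labels, and the 5-point tolerance are all hypothetical. The only real figures are the Faros AI percentages quoted earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPair:
    speed_metric: str
    speed_change_pct: float    # positive = more output
    counter_metric: str
    counter_change_pct: float  # positive = more of the cost, i.e. worse


def assess(pair: MetricPair, tolerance_pct: float = 5.0) -> str:
    """Classify a speed metric in light of its counter-metric.

    tolerance_pct is an arbitrary noise threshold chosen for this sketch.
    """
    speed_up = pair.speed_change_pct > tolerance_pct
    cost_up = pair.counter_change_pct > tolerance_pct
    if speed_up and cost_up:
        return "work relocated"
    if speed_up:
        return "likely real gain"
    return "no speedup"


# Faros AI (2025) figures cited above: PRs merged +98%, PR review time +91%.
faros = MetricPair("PRs merged", 98.0, "PR review time", 91.0)
print(assess(faros))  # work relocated
```

The point of the structure is that a speed number can never be reported alone: constructing a `MetricPair` forces someone to name the cost that would expose a false gain.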

The evidence is consistent across METR, Faros, DORA, and Atlassian: AI moves the local numbers and leaves the global ones flat unless you address what's actually slowing teams down. Often that's missing institutional context, the reasoning behind the code that an agent can't infer and a survey can't capture. Surface that context, then measure what ships and stays stable. Our full framework for how to measure AI productivity walks through the counter-metrics that hold up. Stop trusting the feeling. Start counting the outcome.