DORA Metrics in the AI Era: Necessary but No Longer Sufficient

90% of developers now use AI at work, yet the 2025 DORA report shows delivery stability still drops. Here's how to read the four keys.

Dennis Pilarinos | Jun 4, 2026 | Engineering Insights

Key Takeaways

• DORA metrics still measure delivery speed and stability accurately, but no longer explain why those numbers move in AI-assisted teams.

• The 2025 DORA report found AI is now positively correlated with throughput, yet still negatively correlated with delivery stability (Google Cloud / DORA, 2025).

• Faster, larger AI-generated pull requests increase review load and rework, so change failure rate matters more than ever.

• Pair the four keys with quality and DevEx metrics, using frameworks like DX Core 4.

• Stability degrades most when AI agents lack institutional context about your systems.

Do DORA metrics still work when half your pull requests are AI-generated? They still measure delivery accurately. They no longer explain it. DORA metrics in the AI era remain necessary, but they have become insufficient on their own. The 2025 DORA report found that 90% of developers now use AI at work (Google Cloud / DORA, 2025), and that shift changed what the numbers mean. Deployment frequency and lead time still track delivery faithfully. But they say nothing about whether the code is right, whether reviewers are buried, or whether stability is quietly eroding underneath the speed. To see the full picture now, you have to pair the four keys with quality and developer-experience signals.

What are the four DORA metrics?

The four DORA keys are deployment frequency, lead time for changes, change failure rate, and failed-deployment recovery time (DORA, 2025). Together they balance two forces: how fast a team ships and how reliably it does so. The 2025 report reframed teams into seven archetypes, moving away from the familiar low, medium, high, and elite clusters (Google Cloud / DORA, 2025).

Why does the split matter? Two metrics measure throughput. Deployment frequency counts how often you release. Lead time for changes tracks how long a commit takes to reach production. The other two measure stability. Change failure rate captures the share of deployments that cause problems. Failed-deployment recovery time measures how quickly you bounce back.
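To make the definitions above concrete, here is a minimal Python sketch that derives all four keys from a list of deployment records. The `Deployment` record, its field names, and the choice of medians are assumptions made for illustration; they are not the schema or the exact calculation DORA prescribes, and real pipelines usually pull these fields from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def four_keys(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four keys for deployments observed over `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deployment")

    # Throughput: lead time from commit to production, in hours.
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]

    # Stability: failed deployments and how long each took to recover, in hours.
    failures = [d for d in deploys if d.failed]
    recoveries = [
        (d.recovered_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.recovered_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_recovery_hours": median(recoveries) if recoveries else 0.0,
    }


# Illustrative data only: four deployments over a two-day window.
start = datetime(2025, 1, 1)
sample = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=8), failed=True,
               recovered_at=start + timedelta(hours=10)),
    Deployment(start, start + timedelta(hours=12)),
    Deployment(start, start + timedelta(hours=24)),
]
keys = four_keys(sample, window_days=2)
print(keys)
```

Keeping the two throughput numbers and the two stability numbers in one result makes it harder to report speed without also reporting reliability.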

That pairing is the genius of the framework. Speed without stability is recklessness. Stability without speed is stagnation. The framework forces teams to watch both at once, which is exactly why it has held up across a decade of research and now across the arrival of AI-assisted development.

What did the 2025 DORA report find about AI?

The 2025 DORA report found that 90% of developers use AI at work, and more than 80% report a productivity gain from it (Google Cloud / DORA, 2025). Yet roughly 30% still place little or no trust in AI-generated code, and AI remains negatively correlated with delivery stability even as it now correlates positively with throughput.

That last point is the headline shift. Compare it with the historical baseline: the 2024 DORA report estimated that every 25% rise in AI adoption was associated with roughly a 1.5% drop in throughput and a 7.2% drop in delivery stability (Google Cloud / DORA, 2024). A year later, the sign on throughput flipped to positive. Teams got faster.

But stability did not recover. AI still pulls delivery stability down. Independent surveys echo the tension. Stack Overflow's 2025 survey found 84% of developers use or plan to use AI tools, while 66% say AI output is "almost right, but not quite" (Stack Overflow, 2025). The speed is real. So is the friction it leaves behind.

Why is throughput up but stability still down?

AI generates more code faster, which produces bigger pull requests, more review load, and more places for defects to hide. Faros AI found that AI adoption coincided with PR review time rising 91% and bug rate climbing 9% per developer, alongside a 98% jump in PRs merged, with near-zero company-level correlation between AI use and output (Faros AI, 2025).

That gap explains the stability problem. When developers ship more change per unit of time, each deployment carries more surface area for failure. Reviewers cannot keep pace, so subtle defects slip through. The throughput metrics light up green while change failure rate creeps upward.

Code composition is shifting too. GitClear reported that copy-pasted code rose from 8.3% to 12.3% of commits between 2021 and 2024, while refactoring fell from around 25% to under 10% (GitClear, 2026). More duplication and less cleanup is a recipe for fragility.

The four keys never broke. AI simply decoupled speed from quality in a way the original framework never had to account for, because human typing speed used to be a natural brake on how much risky change entered a system.

A four-keys-in-the-AI-era table

Each DORA key still measures what it always did, but AI distorts the story behind the number. The fix is not to abandon the metric. The fix is to pair each key with a counter-metric that catches what AI hides. The table below maps that pairing across all four keys, drawing on the original definitions (DORA, 2025).

Metric	What it measures	What AI distorts	Counter-metric to pair
Deployment frequency	How often code reaches production	More releases can mask how little of each release was carefully reviewed	Change failure rate per deployment
Lead time for changes	Time from commit to production	Faster commits hide slower, heavier review queues	PR review time and review-to-merge delta
Change failure rate	Share of deployments causing failures	Larger AI PRs concentrate more risk per deploy	Rework rate and code churn within 30 days
Failed-deployment recovery time	Speed of recovery after a failure	Fast recovery can normalize a rising failure count	Trust signal and developer confidence in AI output

In our work with engineering teams, the counter-metrics are where the real conversation starts. The four keys open the meeting. The pairings tell you whether the green dashboard is earned or borrowed.
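One way to check whether a green dashboard is "earned or borrowed" is to compare the direction of each key with the direction of its counter-metric. The sketch below is a rough illustration under stated assumptions: the `Trend` type, the `borrowed_gains` helper, and all sample numbers are hypothetical, and a real check would want a noise threshold rather than a strict before/after comparison.

```python
from dataclasses import dataclass


@dataclass
class Trend:
    """Hypothetical before/after reading for one metric."""
    name: str
    before: float
    after: float
    higher_is_better: bool

    def improved(self) -> bool:
        if self.higher_is_better:
            return self.after > self.before
        return self.after < self.before

    def worsened(self) -> bool:
        if self.higher_is_better:
            return self.after < self.before
        return self.after > self.before


def borrowed_gains(pairs: list[tuple[Trend, Trend]]) -> list[str]:
    """Return the keys that improved while their paired counter-metric worsened."""
    return [key.name for key, counter in pairs if key.improved() and counter.worsened()]


# Illustrative numbers only, not taken from any report.
pairs = [
    (Trend("deployment frequency (per week)", 10, 14, higher_is_better=True),
     Trend("change failure rate", 0.10, 0.16, higher_is_better=False)),
    (Trend("lead time for changes (hours)", 30, 22, higher_is_better=False),
     Trend("PR review time (hours)", 6, 6, higher_is_better=False)),
]
flagged = borrowed_gains(pairs)
print(flagged)
```

In this made-up data, only deployment frequency is flagged: it rose while change failure rate also rose. Lead time improved with review time flat, so that gain is not flagged.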

Why are the four keys necessary but not sufficient now?

DORA metrics are necessary because they measure delivery outcomes better than anything else, but insufficient because they say nothing about whether code is correct or whether developers are drowning. The DX Core 4 framework was built to close that gap, unifying DORA, SPACE, and DevEx into four dimensions: speed, effectiveness, quality, and impact (DX, 2024).

Here is the limit of the four keys. They track flow, not fitness. A team can deploy often, recover fast, and still ship features nobody trusts. The SPACE framework first made this argument, treating productivity as multidimensional rather than a single throughput line (Forsgren et al., ACM Queue, 2021).

DX Core 4 operationalizes that idea. Pair delivery speed with a quality dimension that captures failure and rework, and an experience dimension that captures whether your developers are thriving or burning out. In an AI-assisted world, both additions matter more, because AI inflates speed while quietly taxing review and trust. The four keys still belong at the center. They just need company.

What should you add to DORA for AI-assisted teams?

Add three things: a sharper focus on delivery stability, a rework or churn signal, and a perception check. METR's 2025 study found experienced developers were actually 19% slower with AI tools, even though they felt 20% faster (METR, 2025). Feeling fast is not the same as being fast.

That perception gap is why self-reported productivity alone is dangerous. Weight change failure rate and recovery time more heavily, since these are where AI risk surfaces first. Track rework rate, the share of code rewritten or reverted shortly after merge, to catch the duplication trend GitClear documented. Watch review-time deltas, because the Faros data shows review queues absorbing much of the hidden cost.
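Rework rate, as described above, can be sketched as the share of merged lines that were rewritten or reverted within a fixed window. The `MergedChange` record, its fields, and the 30-day default are assumptions for illustration; how "reworked" lines are detected (for example via blame or churn analysis) is left outside this sketch.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MergedChange:
    """Hypothetical summary of one merged pull request."""
    merged_on: date
    lines_added: int
    lines_reworked: int = 0             # lines later rewritten or reverted
    reworked_on: Optional[date] = None  # when that rework landed, if any


def rework_rate(changes: list[MergedChange], window_days: int = 30) -> float:
    """Share of merged lines reworked within `window_days` of their merge."""
    total = sum(c.lines_added for c in changes)
    if total == 0:
        return 0.0
    window = timedelta(days=window_days)
    reworked = sum(
        c.lines_reworked
        for c in changes
        if c.reworked_on is not None and (c.reworked_on - c.merged_on) <= window
    )
    return reworked / total


# Illustrative data only.
changes = [
    MergedChange(date(2025, 1, 1), lines_added=100, lines_reworked=20,
                 reworked_on=date(2025, 1, 10)),   # reworked inside the window
    MergedChange(date(2025, 1, 5), lines_added=300),
    MergedChange(date(2025, 1, 8), lines_added=100, lines_reworked=50,
                 reworked_on=date(2025, 3, 20)),   # reworked outside the window
]
rate = rework_rate(changes)
print(f"rework rate: {rate:.1%}")
```

Tracking this number alongside change failure rate shows whether faster merges are holding up or being quietly redone.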

How do you keep the perception honest? Combine objective delivery data with a periodic developer survey, then compare the two. When the felt speed and the measured speed diverge, you have found a place to investigate. Measuring AI productivity well means treating both signals as partial and triangulating between them. See the pillar guide on how to measure AI productivity and the AI productivity paradox for the perceived-versus-real-gains analysis.
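The comparison between felt and measured speed can be sketched in a few lines. This is a simplification under stated assumptions: both signals are expressed as signed percentage changes in speed, the team names and the 10-point tolerance are hypothetical, and the second sample row simply reuses the METR figures (felt 20% faster, measured 19% slower) as an illustration.

```python
def perception_gap(felt_pct: float, measured_pct: float) -> float:
    """Felt speed change minus measured speed change, in percentage points."""
    return felt_pct - measured_pct


def teams_to_investigate(
    readings: dict[str, tuple[float, float]],
    tolerance_pts: float = 10.0,
) -> list[str]:
    """Return teams whose felt and measured changes diverge beyond the tolerance."""
    return sorted(
        team
        for team, (felt, measured) in readings.items()
        if abs(perception_gap(felt, measured)) > tolerance_pts
    )


# (felt % change from survey, measured % change from delivery data)
readings = {
    "team-a": (15.0, 12.0),   # hypothetical: perception roughly matches data
    "team-b": (20.0, -19.0),  # METR-style gap: felt faster, measured slower
}
flagged = teams_to_investigate(readings)
print(flagged)
```

A flagged team is a place to start asking questions, not a verdict, since both the survey and the delivery data are partial signals.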

Why do these metrics depend on context quality?

Delivery stability degrades most when AI agents lack institutional context about your systems, conventions, and prior decisions. Unblocked, which surfaces the institutional context behind code from pull requests, design docs, chat, and incidents, ran a controlled A/B test showing that better context produced 42% fewer tokens, 27% faster completion, and 64% fewer tool calls on the same prompt and model (Unblocked, 2025).

The mechanism is straightforward. An agent without context guesses. It duplicates a rejected approach, violates a convention, or misses incident history, and that guesswork lands as a change failure or a rushed revert. Context is the variable that connects raw model capability to stable delivery outcomes.

This is why the same model can help one team and hurt another. The difference is rarely the model. It is what the model knows about your codebase before it writes a line. Context-adjusted productivity reframes the whole measurement problem around that variable, and the context layer for AI agents explains why agents need institutional context in the first place.

Frequently asked questions

Are DORA metrics still relevant in 2025 and 2026?

Yes. DORA metrics remain the most reliable way to measure software delivery speed and stability, and the 2025 DORA report confirmed they still discriminate between teams (Google Cloud / DORA, 2025). The caveat is scope. Pair them with quality and developer-experience metrics so you capture correctness and team health, not just flow.

Does AI improve DORA metrics?

It depends on which key. The 2025 DORA report found AI now correlates positively with throughput but still negatively with delivery stability (Google Cloud / DORA, 2025). So deployment frequency and lead time often improve, while change failure rate can worsen. Faster shipping is real, but it can carry more risk per deployment.

What is the difference between DORA and DX Core 4?

DORA measures four delivery outcomes: deployment frequency, lead time, change failure rate, and recovery time. DX Core 4 is broader, unifying DORA, SPACE, and DevEx into speed, effectiveness, quality, and impact (DX, 2024). Think of DORA as the delivery core and DX Core 4 as the full instrument panel around it.

Which metric catches AI-induced risk?

Change failure rate and delivery stability catch it first, supported by a rework or churn signal. Faros AI found bug rate rising 9% per developer alongside AI adoption (Faros AI, 2025), which shows up as failed deployments and post-merge rework. Watch change failure rate and rework together to see risk before it compounds.

Beyond the Four Keys

DORA metrics in the AI era are necessary but no longer sufficient. They still measure delivery with the same precision they always had, but with 90% of developers now using AI (Google Cloud / DORA, 2025), the numbers stopped explaining themselves. Throughput climbs while stability lags, and a green dashboard can hide rising rework, buried reviewers, and shaky trust. The path forward is not to retire the four keys. It is to surround them with quality and experience signals, weight stability more heavily, and treat felt productivity as a hypothesis to test, not a result to trust. Most of all, invest in the context your AI tools run on, since stability tracks context quality closely. Unblocked partners with DX, which builds the developer-intelligence platform behind much of this research (Unblocked, 2025). Start by pairing one counter-metric with each key this quarter, then read how the same prompt and model produce different results once the context behind them changes.