# Building With AI Effectively: An Engineering Leader's Operating Model


URL: https://getunblocked.com/blog/building-with-ai-effectively/
Published: 2026-06-15T09:00:00Z
Author: Brandon Waselnuk
Categories: Engineering Insights, Context Engineering

Building with AI effectively rests on three levers: token yield, measured output, and calibrated autonomy. Only ~39% of orgs see real EBIT impact from AI.

---
Building with AI effectively means running three levers as one operating system: what each unit of AI work costs, whether that work actually ships and survives, and how much autonomy you hand the agent. Most teams pull one lever and wonder why the other two snap back. The tooling isn't your problem. Your operating model is. McKinsey's State of AI 2025 found that only about 39% of organizations report measurable EBIT impact from AI, even though adoption is nearly universal ([McKinsey](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai), 2025). The gap between "we use AI" and "AI moves the number" is an operating-model gap. This is the playbook for closing it, stage by stage.

## Why doesn't AI adoption turn into business impact?

Adoption is easy. Impact is rare. McKinsey's State of AI 2025 reports that only about 39% of organizations see measurable EBIT impact from AI, even as use becomes standard ([McKinsey](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai), 2025). Meanwhile enterprise LLM spend roughly doubled in six months ([Menlo Ventures](https://menloventures.com/), 2025). Spend went up. Impact mostly didn't.

The reason is that most teams treat AI as a procurement decision. You buy seats, you turn it on, you wait for the lift. But AI coding is a _system_, and the system has a missing variable: the quality of the context your agents work from. Unblocked exists to fill exactly that gap: it's the context engine for engineering that resolves WHY, not just WHAT. Without that context, every agent you spawn is a new hire on day one: technically capable, ignorant of your codebase, and very confident anyway.

So how do you get from "we adopted AI" to "AI ships work that survives"? You stop optimizing one thing and start operating three.

## What are the three levers of building with AI effectively?

The operating model rests on three levers, and the discipline is running them together. Pull the cost lever without watching output and you cut budget on the work that was actually shipping. Chase output without bounding autonomy and you industrialize rework. Treat them as one dashboard, not three separate projects.

Here are the three:

### Token yield: are you paying for useful output or for activity?

Token yield is the share of generated tokens that contribute to a real downstream action, after retries, abandoned sessions, and failed-quality outputs ([FinOps Foundation](https://www.finops.org/insights/token-economics-the-atomic-unit-of-ai-value/), 2025). It's the cost lever, and it's hardening into an industry standard: in mid-2026 the Linux Foundation moved to establish a Tokenomics Foundation for open AI cost-management standards ([Linux Foundation](https://www.linuxfoundation.org/press/linux-foundation-announces-the-intent-to-launch-the-tokenomics-foundation-to-establish-open-standards-for-ai-cost-management), 2026). The question isn't how many tokens you burn. It's how many of them earned their keep.
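As a rough illustration of that definition, here is a minimal sketch of how a team might compute token yield from its own session logs. The `Session` record and its fields are hypothetical, and treating a whole session as either useful or wasted is a simplifying assumption, not something the FinOps Foundation definition prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One agent session. Field names are hypothetical, for illustration only."""
    tokens_generated: int
    reached_downstream_action: bool  # e.g. the work was merged or otherwise acted on


def token_yield(sessions: list[Session]) -> float:
    """Share of generated tokens that belong to sessions ending in a real action."""
    total = sum(s.tokens_generated for s in sessions)
    if total == 0:
        return 0.0
    useful = sum(s.tokens_generated for s in sessions if s.reached_downstream_action)
    return useful / total


# Made-up sample data: one session shipped, two did not.
sessions = [
    Session(tokens_generated=40_000, reached_downstream_action=True),   # shipped
    Session(tokens_generated=25_000, reached_downstream_action=False),  # abandoned
    Session(tokens_generated=35_000, reached_downstream_action=False),  # failed quality check
]
print(f"token yield: {token_yield(sessions):.0%}")  # prints "token yield: 40%"
```

Session-level attribution is the coarsest workable choice. A team with finer-grained logs could attribute tokens per tool call or per retry instead.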

Context-blind agents burn tokens four ways: broad search because no source is authoritative, re-search of ground they already covered, mis-routing to the wrong source confidently, and retry loops on a wrong assumption until the budget wall stops them. Each one is measurable. We ran a controlled test to size it: same prompt, same model, same Kotlin codebase, context off versus on. Off: 21M tokens, 2 hours 32 minutes, with corrections that broke the build. On: 10M tokens, 25 minutes, zero corrections. That's one test on one task, so read it as directional. But the direction is stark. The cost lever is mostly a context-quality lever wearing a finance costume.
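To make the size of that gap concrete, the arithmetic on the reported figures works out as follows. The variable names are ours, and the result carries the same caveat as the test itself: one run, one task.

```python
# Figures reported for the single controlled test (context off vs. on).
off_tokens, on_tokens = 21_000_000, 10_000_000
off_minutes, on_minutes = 2 * 60 + 32, 25  # 2 h 32 min vs. 25 min

token_reduction = 1 - on_tokens / off_tokens  # fraction of tokens saved
speedup = off_minutes / on_minutes            # wall-clock ratio

print(f"tokens saved: {token_reduction:.0%}")   # prints "tokens saved: 52%"
print(f"wall-clock speedup: {speedup:.1f}x")    # prints "wall-clock speedup: 6.1x"
```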

For the full mechanism-by-mechanism breakdown, see our [token yield work](https://getunblocked.com/blog/token-yield-context-problem) and the unit economics in [cost per merged PR](https://getunblocked.com/blog/cost-per-merged-pr).

### Measured output: does AI actually ship work that survives?

Output volume is a trap, and the data is blunt about it. In a 2025 randomized controlled trial, experienced developers were about 19% _slower_ using AI on familiar codebases, while reporting that they felt faster ([METR](https://metr.org/), 2025). That perception-reality gap is the whole problem. Velocity that you feel is not velocity you can measure.

The merge data shows the same split from the other side. Faros AI found that AI lifted merged PRs by roughly 98% but added about 91% to review time ([Faros AI](https://www.faros.ai/), 2025). More PRs, much heavier review. The right unit isn't PRs opened or lines written. It's work that ships and survives review, incidents, and the next quarter. Our companion piece on [measuring AI productivity](https://getunblocked.com/blog/how-to-measure-ai-productivity) carries the full output-side metric stack.
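One way to turn "ships and survives" into a number is a survival rate over AI-authored PRs. The sketch below is an assumption-laden illustration: the `PullRequest` fields, the idea of a fixed observation window, and the choice to count unmerged PRs as non-survivors are all ours, not a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical PR record; adapt the fields to whatever your tooling exports."""
    ai_authored: bool
    merged: bool
    reverted_or_hotfixed: bool  # within an observation window you choose


def survival_rate(prs: list[PullRequest]) -> float:
    """Fraction of AI-authored PRs that merged and were not reverted or hotfixed."""
    ai_prs = [p for p in prs if p.ai_authored]
    if not ai_prs:
        return 0.0
    survived = sum(1 for p in ai_prs if p.merged and not p.reverted_or_hotfixed)
    return survived / len(ai_prs)


# Made-up sample data.
prs = [
    PullRequest(ai_authored=True, merged=True, reverted_or_hotfixed=False),   # survived
    PullRequest(ai_authored=True, merged=True, reverted_or_hotfixed=False),   # survived
    PullRequest(ai_authored=True, merged=True, reverted_or_hotfixed=True),    # reverted
    PullRequest(ai_authored=True, merged=False, reverted_or_hotfixed=False),  # never merged
    PullRequest(ai_authored=False, merged=True, reverted_or_hotfixed=False),  # human, ignored
]
print(f"AI PR survival rate: {survival_rate(prs):.0%}")  # prints "AI PR survival rate: 50%"
```

Counting opened-but-never-merged PRs in the denominator is deliberate here: it keeps raw volume from inflating the metric.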

### Calibrated autonomy: how much should you let the agent run alone?

Autonomy is a dial, not a switch, and most teams turn it too far too fast. The 2025 DORA report frames AI as an amplifier: it magnifies whatever your delivery system already does, good or bad ([DORA](https://dora.dev/), 2025). Grant autonomy on a foundation of weak context and you amplify rework. Grant it on decision-grade context and you compound good decisions.

The right ceiling rises with the other two levers. When token yield is high and output survives review, you can safely let agents run longer, unsupervised, across more of the codebase. When either is shaky, you keep a human in the loop. Calibrated autonomy is the output of the first two levers, not an independent bet. We go deeper in [stop babysitting your agents](https://getunblocked.com/blog/stop-babysitting-your-agents).
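The gating rule in that paragraph can be sketched as a small function. The two autonomy levels and the threshold values below are hypothetical placeholders; every team would need to set its own from its own data.

```python
from enum import Enum


class Autonomy(Enum):
    HUMAN_IN_THE_LOOP = "a human reviews every change"
    UNSUPERVISED = "agent runs longer without review"


def autonomy_ceiling(
    token_yield: float,
    survival_rate: float,
    *,
    min_yield: float = 0.6,     # illustrative threshold, not a recommendation
    min_survival: float = 0.8,  # illustrative threshold, not a recommendation
) -> Autonomy:
    """Raise the ceiling only when both other levers hold; otherwise keep a human in the loop."""
    if token_yield >= min_yield and survival_rate >= min_survival:
        return Autonomy.UNSUPERVISED
    return Autonomy.HUMAN_IN_THE_LOOP


print(autonomy_ceiling(0.75, 0.90).name)  # prints "UNSUPERVISED"
print(autonomy_ceiling(0.75, 0.50).name)  # prints "HUMAN_IN_THE_LOOP"
```

The function takes no independent "desired autonomy" input on purpose: the ceiling is derived from the other two levers rather than set as a separate bet.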

## How do the levers map to a maturity progression?

The levers move together as your team climbs, and the climb has named stages. We map this against [the 8 levels of context maturity](https://getunblocked.com/context-maturity/), the framework that anchors where any team sits today. Rather than re-list those levels here, the point is what _operating_ looks like at each one: the budget you set, the metric you watch, and the autonomy you grant all shift together.

Early on, you're proving value on small, well-scoped tasks. Your budget is tight, your metric is "did this ship at all," and autonomy stays low with a human reviewing everything. Mid-curve, you instrument token yield directly and start trusting agents on familiar surfaces. High on the curve, agents run loops unsupervised against decision-grade context, and your metric is yield and survival, not activity. The cost of bad context compounds with every step toward autonomy, so the higher you climb, the more context quality is the thing that actually gates progress.

Want to know your stage before you set budgets? [Map your team to the curve](https://readiness.getunblocked.com/) with the readiness assessment, a mix of qualitative and quantitative questions that places you on the maturity curve.

## What should an engineering leader actually budget and measure?

Budget and measurement change by stage, but the spine is constant: pay for useful output, watch survival, and cap autonomy. The instinct to budget by token consumption is exactly backwards. Spend roughly doubled in six months across enterprises ([Menlo Ventures](https://menloventures.com/), 2025), yet only about 39% of firms report measurable EBIT impact ([McKinsey](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai), 2025). Volume of spend is not the signal.

Budget for cost per useful output, not cost per token. Measure whether AI-authored work survives review and the next incident, not how many PRs it opens. And set an autonomy ceiling that only rises when the first two metrics hold. The GitHub Octoverse 2025 report shows AI-assisted contribution is now mainstream across the developer population ([GitHub Octoverse](https://github.blog/news-insights/octoverse/), 2025), which means the differentiator is no longer whether you use AI. It's whether your operating model converts that use into surviving work. That conversion is a context problem before it's a tooling problem.
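A short sketch shows why the two budgeting units can disagree. The team names and every number below are invented for illustration, and "surviving PR" stands in for whatever unit of useful output you settle on.

```python
def cost_per_million_tokens(spend_usd: float, tokens: int) -> float:
    """Unit cost of activity."""
    return spend_usd / (tokens / 1_000_000)


def cost_per_useful_output(spend_usd: float, surviving_outputs: int) -> float:
    """Unit cost of work that shipped and survived; infinite if nothing survived."""
    if surviving_outputs == 0:
        return float("inf")
    return spend_usd / surviving_outputs


# Invented figures for two hypothetical teams.
teams = {
    "team_a": {"spend_usd": 1_000.0, "tokens": 100_000_000, "surviving_prs": 5},
    "team_b": {"spend_usd": 1_200.0, "tokens": 60_000_000, "surviving_prs": 20},
}

for name, t in teams.items():
    per_m = cost_per_million_tokens(t["spend_usd"], t["tokens"])
    per_pr = cost_per_useful_output(t["spend_usd"], t["surviving_prs"])
    print(f"{name}: ${per_m:.2f} per 1M tokens, ${per_pr:.2f} per surviving PR")
# prints:
# team_a: $10.00 per 1M tokens, $200.00 per surviving PR
# team_b: $20.00 per 1M tokens, $60.00 per surviving PR
```

In this made-up case, team A looks cheaper per token while team B is far cheaper per surviving PR, which is the sense in which budgeting by token consumption points the wrong way.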

## What unlocks the next stage of maturity?

Context quality unlocks the next stage, every time. Not a bigger model, not more seats. The reason agents waste tokens, ship work that doesn't survive, and can't be trusted with autonomy is the same reason: they have access to information without understanding it. Access is not understanding. An agent that can write code that compiles is syntactically fluent and still missing the intent underneath.

This is the through-line of all three levers, and the practical core of building with AI effectively. Decision-grade context, the WHY behind the code and not just the WHAT, raises token yield because agents start with what they need. It raises survival because the work reflects how your system actually wants to be built. And it raises the safe autonomy ceiling because every choice the agent makes on the right data makes its next choice better. That's the waterfall, and it runs in both directions. Climbing the maturity curve is, in practice, raising context quality until each lever can move to its next setting.

## Frequently asked questions

### What does "building with AI effectively" actually mean for a team?

It means running three levers as one system: token yield, measured output, and calibrated autonomy. Effectiveness isn't a tool you buy. It's whether AI work ships and survives. McKinsey found only ~39% of organizations see measurable EBIT impact from AI despite broad adoption ([McKinsey](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai), 2025).

### Why is my AI spend rising without matching results?

Because spend tracks activity, not yield. Enterprise LLM spend roughly doubled in six months ([Menlo Ventures](https://menloventures.com/), 2025), but tokens lost to retries, abandoned sessions, and failed-quality outputs never reach a useful action ([FinOps Foundation](https://www.finops.org/insights/token-economics-the-atomic-unit-of-ai-value/), 2025). You're paying for context-blind search, not for shipped work.

### Does AI actually make experienced developers faster?

Not automatically. A 2025 RCT found experienced developers were about 19% slower with AI on familiar codebases while feeling faster ([METR](https://metr.org/), 2025). The feeling of speed is real; the measured speed often isn't. That's why you measure surviving output, not perceived velocity.

### How much autonomy should I give my coding agents?

Only as much as your other two levers support. Treat autonomy as a dial that rises when token yield is high and output survives review. DORA's 2025 report calls AI an amplifier of your existing delivery system ([DORA](https://dora.dev/), 2025), so weak context plus high autonomy amplifies rework, not output.

### What's the single highest-impact thing to fix first?

Context quality. It's the shared cause behind low yield, fragile output, and unsafe autonomy. Raising it lets every lever move to its next setting, which is why it gates each stage of the maturity curve. Building with AI effectively starts by finding your stage, then budgeting and measuring to it.

## Where to start this quarter

Start by finding your stage, then operate to it. Pick one team and instrument all three levers on real work: cost per useful output instead of cost per token, survival of AI-authored PRs instead of raw volume, and an autonomy ceiling that only rises when both metrics hold. Don't try to climb three stages at once. Move one lever to its next setting and let the others follow.

The pattern repeats at every level: spend is up everywhere ([Menlo Ventures](https://menloventures.com/), 2025), impact is concentrated in the teams that operate this way ([McKinsey](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai), 2025), and the variable that separates them is context quality. Building with AI effectively is less about which model you run and more about whether your agents understand the system they're working in. Get the WHY in front of them, and every PR starts to look like your best engineer on that part of the system wrote it.

Before you set a single budget, find out where you actually stand. [Map your team to the curve](https://readiness.getunblocked.com/) with the readiness assessment, and operate from there.