The 8 Levels of Agentic Engineering: An AI Maturity Model for Engineering Teams

Dennis Pilarinos · Apr 28, 2026 · AI Agent Autonomy · Engineering Insights

TL;DR:

• Bassim Eledath's 8 Levels of Agentic Engineering map an 8-stage AI maturity curve from Tab Complete to Autonomous Agent Teams.

• Three umbrella zones determine the infrastructure decision: Levels 1-2 (you are the context), Levels 3-4 (curated context), Levels 5+ (you need a real context engine).

• Most teams stall at three predictable points: the curated-context trap, the MCP plateau, and the background-agent wall.

• The cost of a context miss compounds with every step toward autonomy, from "you catch it" at Level 1 to silent failure at Levels 7-8.

• Pick your stage honestly, then pick the infrastructure that matches it. Over-investing or under-investing in context is the most common AI strategy mistake of 2026.

The 8 Levels of Agentic Engineering: An AI Maturity Model for Engineering Teams#

Where are you on the AI engineering maturity curve right now?

Most engineering leaders can't answer that question precisely, and the answer determines whether their AI strategy is over-engineered, under-engineered, or actually right. The 8 Levels of Agentic Engineering, a framework first published by Bassim Eledath, maps the curve. This piece extends it with the infrastructure-decision lens from talks I gave at QCon London and AI Engineering London earlier in 2026: which level you are at determines whether you need nothing fancy, curated context, or a real context engine that delivers decision-grade context for coding agents.

Why does AI maturity vary so much across engineering teams?#

The gap between AI usage and AI trust is now the defining feature of the maturity curve. The 2025 DORA State of AI-Assisted Software Development report found that roughly 90% of developers use AI assistance, but only about 24% trust the outputs deeply, and governance gaps drive instability across teams.

That distrust shows up everywhere. The Stack Overflow Developer Survey 2025 reported 46% of developers actively distrust AI accuracy, against only 33% who say they trust it. Two teams using the same model can sit four stages apart on the maturity curve, because adoption is not the bottleneck. Context, governance, and feedback loops are.

Leaders evaluating AI infrastructure need a stage diagnostic, not another vendor pitch. That is what the next section provides.

What are the 8 Levels of Agentic Engineering?#

The 8 Levels of Agentic Engineering, published by Bassim Eledath, describe a progression from passive autocomplete to fully autonomous multi-agent teams. Bassim's thesis is sharp: "AI's coding ability is outpacing our ability to wield it effectively." The curve exists because the wielding, not the model, is where teams diverge.

Bassim's eight stages, in order:

  1. Tab Complete. Initial autocomplete suggestions.
  2. Agent IDE. Chat connected to the codebase for multi-file edits.
  3. Context Engineering. Maximizing information density per token.
  4. Compounding Engineering. Plan, delegate, assess, codify loop.
  5. MCP and Skills. Extending capability through external tools and APIs.
  6. Harness Engineering and Automated Feedback Loops. Building verifiable environments for autonomous work.
  7. Background Agents. Asynchronous agents executing without human supervision.
  8. Autonomous Agent Teams. Multi-agent coordination without central orchestration.

Bassim groups Levels 3-5 as the building blocks that address context, capability, and tool access. Levels 6-7 are "where the rocket really starts to ship." Level 8 is the active frontier.

Here is the overlay this piece adds. Each stage maps to one of three infrastructure decisions. Levels 1-2: you are the context engine, carrying everything in your head. Levels 3-4: curated context, where teams pin shared rules into a repo. Levels 5+: a real context engine, pulling live institutional context across PRs, Slack, Jira, Notion, Confluence, and code. The model you pick is rarely the constraint. The infrastructure under it is.

Levels 1-2: when you are the context engine#

At Levels 1 and 2, the engineer carries all the context. Tab Complete means hitting tab and moving on, one suggestion at a time. Agent IDE means chatting with the codebase, accepting or rejecting multi-file edits. The infrastructure budget is small: an IDE plugin, a model subscription, and the engineer's own working memory of the system.

This scales fine for solo work and small teams. The volume signal is real. GitHub's 2025 Octoverse reported the Copilot coding agent shipping more than 1 million pull requests in five months. That is mostly Levels 1-2 activity, and it is genuinely useful.

Trust, however, is fragile even here. The JetBrains State of Developer Ecosystem 2025 found that 99% of developers express some concern about AI in coding, ranging from privacy worries to skepticism about output quality. At Levels 1-2 those concerns are manageable, because the engineer is the verifier. One bad completion, hit escape, move on. The cost of a context miss is paid in milliseconds.

That changes the moment you delegate.

Levels 3-5: what are the building blocks for compounded gains?#

Bassim groups Levels 3, 4, and 5 as the building blocks. They are also where most engineering teams sit today, and where the curated-context zone lives.

Context Engineering (Level 3)#

Bassim's framing for Level 3 is precise: "Every token needs to fight for its place in the prompt." Recall degrades as inputs grow, which Anthropic's writeup on effective context engineering for AI agents treats as a core design constraint. The job is not to feed the model more. It is to feed the model the right things.

This is where teams first build a CLAUDE.md, an AGENTS.md, or a .cursorrules file. Day one it feels like gold. The agent suddenly knows your conventions.
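Bassim's "every token fights for its place" constraint can be made concrete with a toy sketch: given a fixed token budget, greedily keep the snippets with the highest value per token. The value scores here are assumed inputs for illustration; a real system would estimate them with a relevance model.

```python
# Toy sketch of Level 3's core constraint: pack the densest context into a
# fixed token budget. Snippet names, values, and token counts are illustrative.

def pack_context(snippets, budget):
    """snippets: list of (text, value, token_count) tuples."""
    kept, used = [], 0
    # Greedy: highest value-per-token first; skip anything that overflows.
    for text, value, tokens in sorted(snippets, key=lambda s: s[1] / s[2], reverse=True):
        if used + tokens <= budget:
            kept.append(text)
            used += tokens
    return kept

snippets = [
    ("full architecture doc", 10, 900),  # valuable but token-hungry
    ("naming conventions",     8, 100),  # dense: 0.08 value per token
    ("error-handling rule",    6, 100),
]
print(pack_context(snippets, budget=300))  # the dense snippets win the budget
```

The point of the sketch is the inversion: the highest-value snippet loses because it is the least dense, which is exactly the "information density per token" framing.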

Compounding Engineering (Level 4)#

Compounding Engineering is the plan, delegate, assess, codify loop. Each successful task feeds back into the rules and skills the agent reuses next time. It is the first stage where the team's collective judgement starts to live outside individual heads.
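The plan, delegate, assess, codify loop can be sketched in a few lines. Everything here is illustrative: `run_agent` stands in for a real coding-agent call, and `RULES` is the shared rules repo in miniature.

```python
# Hypothetical sketch of the Level 4 loop: plan, delegate, assess, codify.
# run_agent and RULES are stand-ins, not a real API.

RULES = ["Use snake_case for Python identifiers"]  # the shared rules repo, in miniature

def run_agent(task, rules):
    # Stand-in for a real agent call; returns (output, passed_review).
    return f"patch for {task!r} under {len(rules)} rules", True

def compounding_loop(tasks):
    codified = []
    for task in tasks:                            # plan: one task at a time
        output, passed = run_agent(task, RULES)   # delegate to the agent
        if passed:                                # assess: human or harness review
            lesson = f"pattern learned from {task!r}"
            RULES.append(lesson)                  # codify: feed judgement back in
            codified.append(lesson)
    return codified

lessons = compounding_loop(["add retry to client", "fix flaky test"])
```

The compounding part is the last step: each reviewed task grows the rule set the next task starts from.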

It is also where the curated-context trap arrives. By day sixty, half the rules are stale, engineers stop updating them, and agents are confidently citing conventions that no longer apply.

MCP and Skills (Level 5)#

Level 5 is where agents reach beyond the IDE. The Pragmatic Engineer's MCP analysis traced the protocol's rapid emergence as the standard interface for agent tooling, and GitHub's Octoverse reported MCP repositories reaching roughly 37,000 stars within eight months of launch. Adoption is not the question. Quality is.

This is the boundary between curated context and a real reasoning layer. Past it, static rules cannot keep up.

Levels 6-8: where the rocket ships, and where stale context kills you#

Bassim describes Levels 6 and 7 as "where the rocket really starts to ship," with Level 8 as the active frontier. This is also the context engine zone, where the cost of bad context becomes silent failure.

Harness Engineering and Automated Feedback Loops (Level 6)#

Level 6 is where teams build verifiable environments: sandboxes, eval harnesses, automated regression checks that an agent can run against itself. Bassim's line on this is the one to internalize: "If you want autonomy, you need backpressure. Otherwise you end up with a slop machine." Without harnesses, autonomy compounds errors instead of work.
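Backpressure at Level 6 reduces to a simple gate: an agent's change lands only if every automated check passes. A minimal sketch, with check names and the change dict as assumptions:

```python
# Minimal sketch of Level 6 backpressure: the harness blocks any change that
# fails a check. Check names and the change structure are illustrative.

def run_harness(change, checks):
    """Return (ok, failed_check_names) for a proposed change."""
    failed = [name for name, check in checks if not check(change)]
    return (not failed, failed)

checks = [
    ("tests_pass", lambda c: c["tests_green"]),
    ("lint_clean", lambda c: c["lint_errors"] == 0),
]

ok, failed = run_harness({"tests_green": True, "lint_errors": 2}, checks)
print(ok, failed)  # a lint regression is enough to block the change
```

The design choice that matters is that the gate is binary and automated: the agent gets its feedback from the harness, not from a human reading the diff.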

Background Agents (Level 7)#

At Level 7, agents run asynchronously, often in the cloud, without human supervision per step. Anthropic's November 2025 work on code execution with MCP describes the patterns that make this viable, and the MCP 2026 roadmap prioritizes stateless Streamable HTTP at scale, agent-task lifecycle, governance, and gateway patterns precisely because Level 7 deployment exposes them.

The retrieval layer matters here in a way it never did before. Stanford's Legal RAG Hallucinations study found that retrieval-grounded systems still hallucinate on roughly 17 to 34% of queries depending on configuration. At Level 7, no one is in the loop to catch it.

Autonomous Agent Teams (Level 8)#

Level 8 is multi-agent coordination without a central orchestrator. The frontier is real, but the prerequisite is not the multi-agent layer. It is the institutional context across PRs, Slack, Jira, Notion, Confluence, and code that every agent on the team has to share. Without that, you have a swarm of confidently wrong specialists.

Why do most teams stall between levels 3 and 6?#

Most teams do not fail on the curve. They stall at three predictable points, and each one has a name.

The curated-context trap#

The shared rules repo decays faster than the team updates it. Chroma's context rot research showed that model performance varies significantly as input length and quality change, and stale rules amplify the problem. Day one, the CLAUDE.md is gold. Day sixty, it is technical debt that the agent quotes back to you with confidence. Engineers stop trusting it, then stop maintaining it, then stop reading it.
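Staleness is detectable mechanically. One crude but illustrative signal: any watched source file modified after the rules file is evidence the rules may no longer match the code. Paths and the CLAUDE.md name are examples, not a prescribed layout.

```python
import os
import tempfile

# Illustrative staleness detector for a curated rules file: flag watched
# files that changed after the rules were last touched.

def stale_against(rules_path, watched_paths):
    rules_mtime = os.path.getmtime(rules_path)
    return [p for p in watched_paths
            if os.path.exists(p) and os.path.getmtime(p) > rules_mtime]

# Demo with two temp files: the "code" changed after the "rules" did.
d = tempfile.mkdtemp()
rules = os.path.join(d, "CLAUDE.md")
code = os.path.join(d, "api.py")
for path in (rules, code):
    with open(path, "w") as f:
        f.write("x")
os.utime(rules, (1_000, 1_000))  # rules last touched long ago
os.utime(code, (2_000, 2_000))   # code touched afterwards
print(stale_against(rules, [code]))  # the rules have gone stale for api.py
```

A check like this does not fix the trap, but it turns silent decay into a visible signal a team can act on.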

The MCP plateau#

Adding more MCP servers does not improve output quality on its own. The agent still has to know when to call them, what to do with the results, and how to reconcile conflicts between sources. Why MCP servers aren't enough walks through the failure mode in detail. The plateau is not a tooling gap. It is a reasoning gap.
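The reconciliation step the plateau demands can be sketched directly: when two tools return conflicting answers, prefer the more authoritative, more recent one instead of whichever tool replied first. The authority weights and values here are assumptions for illustration.

```python
# Hypothetical reconciliation past the MCP plateau: resolve a conflict between
# tool results by authority, then recency. All weights and values are made up.

def reconcile(answers):
    """answers: list of {"value", "authority", "updated_at"}; best wins."""
    best = max(answers, key=lambda a: (a["authority"], a["updated_at"]))
    return best["value"]

answers = [
    {"value": "timeout=30s", "authority": 0.4, "updated_at": 1_700_000_000},  # old Slack thread
    {"value": "timeout=5s",  "authority": 0.9, "updated_at": 1_760_000_000},  # merged PR
]
print(reconcile(answers))
```

The sketch is trivial on purpose: the hard part is not the `max`, it is producing honest authority and freshness signals for every source, which no number of extra MCP servers provides.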

The background-agent wall#

When no human is watching a step, the context has to be right the first time. Background failures are silent failures: wrong code shipped, wrong ticket enriched, wrong conclusion in a PR review. The companion piece stop babysitting your agents frames why this stage exposes everything the earlier stages tolerated. Cost compounds with autonomy. At Level 1, you catch it. At Level 2, you re-prompt three to five times and burn tokens in a doom loop. At Level 4, your seniors spend the day reviewing five parallel PRs instead of building. At Level 7, the wrong code ships and nobody saw it happen. The satisfaction of search dynamic makes it worse, because reviewers stop looking once they find one issue.

What infrastructure does each level actually need?#

The matrix is simpler than most vendor decks make it look. Stage determines spend.

Levels 1-2: nothing fancy#

The IDE plugin is the product. A model subscription, an editor extension, and the engineer's own context. Anything more is over-engineering.

Levels 3-4: curated context#

A shared rules repo, CLAUDE.md or AGENTS.md or .cursorrules, plus a few well-scoped skills. This buys you roughly two months of value before staleness erodes it. If you have one or two senior engineers willing to own the upkeep, this stage can stretch further. Most teams cannot sustain that ownership beyond a quarter.

Levels 5+: a real context engine#

Past Level 5, curated context cannot keep up with the pace of code change. You need live retrieval across PRs, Slack, Jira, Notion, Confluence, and code, ranked by authority and freshness, with conflict resolution and permission inheritance baked in. The architecture is covered in building a context engine on MCP, the conceptual frame in decision-grade context, and the controlled-test data on what changes when context is right in same prompt, same model, different context. The pattern repeats: the model is rarely the constraint, the context is.
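"Ranked by authority and freshness" can be sketched as a score per snippet: the source's authority weight times an exponential recency decay. The weights, the 30-day decay constant, and the fixed timestamp are all assumptions for the example.

```python
import math

# Sketch of authority-and-freshness ranking for a context engine's retrieval
# layer. Weights, decay constant, and source names are illustrative.

AUTHORITY = {"code": 1.0, "pr": 0.9, "jira": 0.6, "slack": 0.4}
NOW = 1_760_000_000  # fixed "now" so the example is deterministic

def score(snippet):
    age_days = (NOW - snippet["updated_at"]) / 86_400
    return AUTHORITY.get(snippet["source"], 0.5) * math.exp(-age_days / 30)

def rank(snippets):
    return sorted(snippets, key=score, reverse=True)

snippets = [
    {"source": "slack", "updated_at": NOW - 86_400},        # fresh, low authority
    {"source": "pr",    "updated_at": NOW - 7 * 86_400},    # a week old, high authority
    {"source": "jira",  "updated_at": NOW - 300 * 86_400},  # stale
]
print([s["source"] for s in rank(snippets)])
```

Note what the decay does: a week-old PR outranks yesterday's Slack message, and a ten-month-old ticket falls off the bottom regardless of its weight.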

Find your level and start there#

Three honest questions diagnose where you are on the 8 levels of agentic engineering.

Are your engineers still hand-feeding context to agents in chat windows? You are at Levels 1-2. Spend on IDE seats, not platforms.

Have you tried a curated rules repo, watched it work for a month, and watched it go stale by month two? You are at Levels 3-4 and stalled in the curated-context trap. Skills and harnesses help. A real reasoning layer helps more.

Are you running background agents and seeing silent failures in PRs, tickets, or shipped code? You are at Levels 6-8 and need the context engine that engineering teams reach for at Levels 5+, not another MCP server.

Whatever level you are at, the wrong move is investing in infrastructure for a level you are not at yet. Over-investing or under-investing in context is the most common AI strategy mistake of 2026. The companion retrospective, three hard lessons from building context at scale, goes deeper on what the Levels 5+ work actually looks like in practice. Find your stage, match the infrastructure, and revisit the diagnostic every quarter.