Multi-Agent Software Development: From Solo Agents to AI Engineering Teams

Written by Herman Lintvelt

Originally posted on Substack

In February 2026, Anthropic demonstrated something remarkable: 16 Claude Code agents, working in parallel, wrote a C compiler in Rust that actually compiles the Linux kernel. Cost: roughly $20,000. Time: a fraction of what a human team would need. The result was a functional, non-trivial piece of systems software built by a coordinated squad of AI agents.

A few weeks earlier, CooperBench, the first benchmark designed specifically for multi-agent coding collaboration, published its results. Agents achieve approximately 50% lower success rates when collaborating versus working solo. The bottleneck isn’t coding ability; it’s what the researchers call “social intelligence”: communication, maintaining commitments, updating mental models of what partners are doing.

These two data points sit in tension, and that tension defines where we are right now. Multi-agent software development is the most significant paradigm shift in how we build software since the move from solo development to team-based agile. Every major platform has shipped some form of it in the last two months. But we’re in the early, messy phase; the engineering patterns that separate success from expensive failure are only now becoming clear.

If you’ve been following this series, you’ll recognise the theme: fundamentals matter more, not less, when the technology is powerful but unpredictable. Multi-agent development is the latest, and perhaps most dramatic, illustration of that principle.

The Shift: From Autocomplete to Agent Teams

Addy Osmani recently framed the evolution of AI-assisted development in three generations, and it’s a useful lens.

The first generation, from roughly 2021 to 2023, was accelerated autocomplete: Copilot, TabNine, and similar tools that predicted the next few lines of code. Useful, but fundamentally a typing speed improvement.

The second generation, from 2023 to 2025, brought synchronous agents: Cursor Agent, Claude Code, and the era of what Karpathy called “vibe coding.” These tools could handle larger tasks, but they worked one at a time, in a single context, with a human always directing.

The third generation, which has arrived in force in 2025 and 2026, is autonomous agent teams: coordinated squads of AI agents that can plan, implement, test, and review code, running in parallel for hours under structured human oversight.

The adoption numbers back this up. The Pragmatic Engineer survey from March 2026 found that 95% of respondents now use AI tools weekly. Seventy-five percent use AI for at least half their work. Fifty-five percent regularly use agents. Claude Code is the most preferred tool at 46%, up from essentially zero eight months prior.

Karpathy formalised this shift with the term “agentic engineering”: the discipline of designing systems where AI agents plan, write, test, and ship code under structured human oversight. As I discussed in a previous article, this represents a fundamentally different relationship between developer and code. You’re not writing software with AI help anymore. You’re orchestrating teams of agents that write software while you focus on architecture, specification, and validation.

Here’s the key insight that applies directly to multi-agent work: traditional software engineering fundamentals (clear requirements, modular design, comprehensive tests) become more important, not less. Vague specifications multiply errors across parallel agent execution. If one agent misunderstands the spec, you get one mess. If sixteen agents misunderstand it, you get sixteen messes that need to be reconciled.

The Convergence: Everyone Shipped the Same Thing

Something striking happened in February and March 2026. Within weeks of each other, virtually every major AI development platform shipped multi-agent capabilities. And despite different interfaces, different underlying models, and different target audiences, they all converged on remarkably similar architectures.

Claude Code Agent Teams introduced teammates with independent context windows, peer-to-peer communication via an inbox system, and shared task lists where agents claim and track work.

Cursor Automations took an event-driven approach: agents triggered by PagerDuty alerts, Slack messages, or timers. Their BugBot can process hundreds of automations per hour.

OpenAI Codex launched parallel agents in cloud sandboxes with git worktree isolation, scaling to over a million monthly users.

Grok Build runs eight parallel agents with built-in conflict resolution.

Warp.dev Oz orchestrates cloud agents that can run in parallel.

The convergence is the interesting part. Despite their different philosophies, all these platforms landed on the same architectural primitives: repository memory, tool use, sub-agents, long-running execution, and role specialisation. When that many independent teams arrive at the same design, it usually means the design is being dictated by the problem rather than by preference.

Agent Teams Mirror Human Teams

All these platforms are converging on similar role patterns, and it’s not a coincidence. There’s a version of Conway’s Law at work here: agent teams are being structured to mirror how effective human teams already operate.

The coordinator or lead role decomposes tasks and manages dependencies. In Claude Code this is the team lead agent; in MetaGPT it’s modelled as a CEO role.

Developer agents handle implementation, writing the actual code. These are the workhorses in Codex, Claude Code, and most other platforms.

Reviewer agents critique code for quality, security, and performance. Devin has a dedicated review agent; Claude Code can spin up reviewer sub-agents.

QA and tester agents generate tests and validate edge cases, catching regressions the developer agents might miss.

Architect or planner agents handle high-level design, often running on more capable (and expensive) models like Opus-tier, while the implementation agents use lighter models.

Researcher agents explore codebases and gather context before implementation begins.
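To make the role split concrete, here is a minimal sketch of how it might be encoded in an orchestration script. The role names, model-tier labels, and responsibilities are illustrative assumptions for this article, not any platform's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    model_tier: str   # "opus" = most capable; "sonnet"/"haiku" = cheaper tiers
    responsibility: str

# Hypothetical role table mirroring the pattern described above.
ROLES = [
    AgentRole("coordinator", "opus",   "decompose tasks, manage dependencies"),
    AgentRole("architect",   "opus",   "high-level design decisions"),
    AgentRole("developer",   "sonnet", "implement well-specified tasks"),
    AgentRole("reviewer",    "sonnet", "critique quality, security, performance"),
    AgentRole("tester",      "haiku",  "generate tests, probe edge cases"),
    AgentRole("researcher",  "haiku",  "explore the codebase, gather context"),
]

def roles_on_tier(tier: str) -> list[str]:
    """Which roles run on a given model tier (used for cost accounting)."""
    return [r.name for r in ROLES if r.model_tier == tier]
```

The point of writing it down, even this crudely, is that the tier assignment becomes an explicit, reviewable decision rather than an accident of whichever model a script defaults to.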

Steve Yegge’s Gas Town system is the most elaborate role hierarchy I’ve seen in practice, managing 20 to 30 agents across seven distinct roles. It’s ambitious, and it illustrates both the potential and the coordination overhead of highly specialised agent teams.

What’s more interesting to me is the emerging shape of the human-AI hybrid team: what some are calling the “centaur pod.” The pattern looks like this: one senior architect (human) providing strategic direction, two AI reliability engineers (humans) handling oversight and specification writing, and an autonomous agent fleet executing tickets, testing, and handling boilerplate.

The junior developer role isn’t disappearing; it’s transforming into something closer to “AI reliability engineer.” The skill set shifts from writing code to specifying, validating, and orchestrating the agents that write code. If you’ve been reading this series, that framing should sound familiar.

Communication Patterns: What Works and What Doesn’t

The most critical factor in multi-agent success isn’t the capability of individual agents; it’s how they communicate. Six communication patterns have emerged in production systems, and understanding which to use when is what separates effective multi-agent setups from expensive chaos.

Hub-and-spoke is the simplest pattern: a central coordinator, with all agents reporting upward. This is what Claude Code sub-agents and the Codex orchestrator use by default. It’s easy to reason about, but it creates a bottleneck at the coordinator.

Peer-to-peer messaging allows direct agent-to-agent communication through inboxes. Claude Code Agent Teams introduced this with their TeammateTool. It reduces coordination overhead but requires agents to maintain awareness of each other’s state.

Shared task lists provide a central board for agents to self-assign work. Claude Code uses a pending-to-in-progress-to-completed flow. This works well when tasks are relatively independent.

Structured documents are perhaps the most underappreciated pattern. Instead of conversational back-and-forth, agents exchange specifications, plans, and structured artifacts. MetaGPT pioneered this approach, and the results are significant: it achieves 85.9% Pass@1, substantially outperforming dialogue-based systems like ChatDev. The lesson is clear: agents exchanging documents and diagrams significantly outperform agents having free-form conversation.

Broadcast sends messages to all agents simultaneously. It’s expensive in terms of context window usage, but necessary for coordination signals that affect everyone.

File reservation leases grant exclusive locks on files before editing. This prevents the merge conflicts that plague parallel agent execution and is one of the simplest yet most impactful coordination mechanisms.

Meta-patterns

Two meta-patterns deserve special attention.

The plan/execute split uses powerful models (Opus-tier) for planning and decomposition, then hands off to cheaper models (Haiku or Sonnet-tier) for execution. This optimises cost while maintaining architectural quality. Databricks reports that tiered model routing reduces inference costs by 45 to 65 percent.
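The routing logic behind this split is simple in principle. Here is a hedged sketch; the tier names are borrowed from the text, while the per-million-token prices are placeholder numbers, not real provider pricing:

```python
# Illustrative per-million-token prices; real pricing varies by provider.
COST_PER_MTOK = {"opus": 15.0, "sonnet": 3.0, "haiku": 0.8}

def route_model(task_kind: str) -> str:
    """Send planning and architecture to the expensive tier, execution to cheaper ones."""
    if task_kind in ("plan", "decompose", "architecture"):
        return "opus"
    if task_kind in ("implement", "refactor"):
        return "sonnet"
    return "haiku"  # tests, classification, boilerplate

def estimated_cost(tasks: list[tuple[str, float]]) -> float:
    """tasks: (kind, expected millions of tokens consumed)."""
    return sum(COST_PER_MTOK[route_model(kind)] * mtok for kind, mtok in tasks)
```

Even with made-up prices, the shape of the saving is visible: planning is a small fraction of total tokens, so routing the bulky implementation work to cheaper tiers dominates the bill.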

The writer/reviewer loop has Agent A write code, Agent B review it, then Agent A incorporate feedback, repeating until approval. Google’s ADK formalises this with LoopAgents. Devin’s data tells a compelling story here: their merge rate went from 34% to 67% after implementing structured review loops. That’s not a marginal improvement; it’s nearly doubling the rate at which agent-produced code is good enough to ship.
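Stripped to its skeleton, the loop is just iterate-until-approved with a cap on rounds. In this sketch the `write` and `review` callables stand in for the actual agent invocations, which will differ per platform:

```python
from typing import Callable

def writer_reviewer_loop(
    write: Callable[[str, str], str],           # (spec, feedback) -> code
    review: Callable[[str], tuple[bool, str]],  # code -> (approved, feedback)
    spec: str,
    max_rounds: int = 5,
) -> tuple[str, bool]:
    """Agent A writes, Agent B reviews; repeat until approval or the round cap."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = write(spec, feedback)
        approved, feedback = review(code)
        if approved:
            return code, True
    return code, False  # cap hit: escalate to human review with the last feedback
```

The round cap is the important design choice: without it, two agents can argue past the point of diminishing returns, burning tokens on style nitpicks.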

The Sobering Reality: CooperBench and the Social Intelligence Gap

I’ve been painting a picture of rapid progress, and it’s real. But the CooperBench results from January 2026 should give us pause.

This was the first benchmark designed specifically for multi-agent coding collaboration. The findings are straightforward and somewhat humbling: agents achieve approximately 50% lower success rates when collaborating versus working solo. The bottleneck isn’t intelligence or coding ability. It’s the inability to communicate effectively, maintain commitments, and update mental models of what partners are doing.

We’re optimising for coordination patterns before solving the fundamental communication problem. More agents doesn’t automatically mean better results. (Sound familiar? The same applies to human teams.)

A cautionary tale from Reddit illustrates this vividly. A developer spent six hours orchestrating what they described as an “orchestra of agents” to build an application. Over 50,000 tokens per request. The result: “a single page crammed with standard buttons and awful UX.” The developer estimated they could have done it in ten minutes with a single agent and a plan.

This matches a pattern I’ve seen repeatedly: teams reach for multi-agent solutions because the technology is available, not because the problem requires it. It’s the same mistake we made with microservices a decade ago: adopting the pattern before understanding when it’s warranted.

So, when does multi-agent actually help?

Cross-layer coordination is a strong use case. When you need to refactor an API and update all its consumers simultaneously, parallel agents working on different layers of the stack, aware of each other’s changes, can save significant time.

Competing debugging hypotheses benefit from parallelism. Spin up multiple agents to investigate different potential causes of a bug, and converge on the one that finds the answer.

Research and implementation pipelines work well when you can parallelise exploration before converging on a solution.

Parallel code review is another natural fit: separate agents checking for security, performance, and architectural concerns simultaneously.

When is it overkill? Simple linear tasks, quick prototypes, cost-sensitive projects, and tasks requiring tight iterative feedback. For these, a single well-directed agent will outperform a team every time.

The Economics: What This Actually Costs

Let’s talk money, because multi-agent economics is not trivial.

Anthropic’s C compiler demonstration cost roughly $20,000 across approximately 2,000 sessions for 100,000 lines of code. Each teammate agent consumes a full context window; this isn’t economical for simple sequential work.

The plan/execute split helps significantly. Using cheaper models for classification and implementation while reserving expensive models for architecture and planning can reduce costs by around 60%. Databricks reports that tiered model routing cuts inference costs by 45 to 65 percent.

Before deploying agent teams, the key question to ask is: “Does parallel exploration add real value, and can teammates operate largely independently?” If the answer to either is no, you’re paying for coordination overhead without getting the benefit of parallelism.

A practical rule of thumb is emerging: aim for five to six tasks per teammate agent. Fewer than that, and you’re not getting enough parallelism to justify the overhead. More than that, and coordination costs start to eat into your gains.
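That rule of thumb translates directly into a sizing calculation. This helper is hypothetical, not anyone's published formula; the five-tasks-per-agent constant is the rule of thumb from the text:

```python
import math

def suggested_team_size(n_tasks: int, tasks_per_agent: int = 5) -> int:
    """Rule of thumb: roughly five to six independent tasks per teammate agent.
    Below one agent's worth of work, don't spin up a team at all."""
    if n_tasks < tasks_per_agent:
        return 1  # a single well-directed agent will be cheaper
    return math.ceil(n_tasks / tasks_per_agent)
```

The more important input, which no formula captures, is whether those tasks are genuinely independent in the first place.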

Enterprise Proof Points

Despite the challenges, the enterprise evidence is building.

Rakuten used Claude Code to complete activation vector extraction across a 12.5 million-line codebase in seven hours with 99.9% numerical accuracy. TELUS has deployed over 13,000 custom AI solutions, reporting 500,000 hours saved and 30% acceleration in engineering delivery. Zapier reports 89% AI adoption organisation-wide with over 800 internal agents. Factory AI is deploying to over 5,000 engineers at EY, making it one of the largest enterprise agent deployments to date. Cursor reports that 35% of their internal PRs are agent-generated, and they’ve reached $2 billion in annual recurring revenue.

These are real numbers from real organisations. The technology works at scale; the question is whether your team is ready to adopt it effectively.

The Security Elephant in the Room

Multi-agent systems amplify every security concern we already had with AI tools, and introduce new ones.

The MIT AI Agent Index found that the vast majority of deployed agents lack meaningful safety disclosure; of the 13 agents exhibiting frontier autonomy levels, only 4 disclose any agentic safety evaluations. NIST launched an AI Agent Standards Initiative in February 2026 in recognition of the gap. Multi-agent coordination complicates audit trails; when multiple agents are making changes to a codebase, “who changed what?” becomes a genuinely difficult question.

Prompt injection through code comments is a real and underappreciated risk. Malicious instructions embedded in codebases can be picked up and executed by agents that process those files. Supply chain risks compound when agents pull in dependencies without the same scrutiny a human developer would apply.

If you’ve been building agent security practices following the guidance I outlined in the previous article on agentic engineering, you’re ahead of most. If not, this should be a priority before scaling multi-agent deployments.

Practical Guidance: How to Start

For Engineering Leaders

Start with the writer/reviewer loop. Two agents, one structured feedback cycle. This gives you multi-agent benefits with minimal coordination complexity. The data from Devin’s merge rate improvement (34% to 67%) suggests this single pattern delivers outsized returns.

Add parallelism only where tasks are genuinely independent. Don’t force coordination overhead onto work that’s naturally sequential. Profile your team’s tasks: which ones could be decomposed into independent subtasks, and which ones require tight iteration?

Use the plan/execute model split. Expensive models for architecture and planning, cheaper models for implementation. This is the simplest cost optimisation available, and it works.

Invest in agent-friendly codebases. Consistent naming, strong typing, well-scoped modules, comprehensive tests. Everything that makes a codebase navigable for a new human team member also makes it navigable for agents. If your codebase is hard for a new hire to understand, it will be hard for agents too.

Establish clear file ownership boundaries. This is the single most practical thing you can do to prevent merge conflicts in multi-agent setups. Define which agent is responsible for which files before work begins.

Monitor token costs per task. Profile before you scale. Multi-agent systems can be expensive, and the costs are easy to overlook until the invoice arrives.

Don’t skip human review. Agents submit pull requests; humans merge them. This principle from the CI/CD/CE article applies with even more force when multiple agents are contributing to the same codebase.

For Individual Developers

Think about problem decomposition as a core skill. The ability to break a task into independent, well-specified subtasks is what makes multi-agent work effective. This is the same skill that makes you a good tech lead, and it’s becoming the most valuable skill in agentic engineering.

Learn to write specifications that agents can execute. Context engineering is more important than prompt engineering. A well-written spec with clear success criteria, explicit constraints, and examples will outperform clever prompting every time. If you’ve been practising the specification techniques from earlier in this series, you’re already building this muscle.

Get comfortable orchestrating rather than implementing. This is a mindset shift, and it takes practice. Start by delegating one task per day to an agent while you focus on review and planning. Gradually increase as you develop intuition for what agents handle well and where they struggle.

Recognise that the best developers in 2026 aren’t the fastest typists. They’re the best at breaking problems into parallelisable subtasks, specifying those tasks clearly, and validating the results. The skill set is shifting, and the sooner you adapt, the more leverage you’ll have.

The Factory Metaphor

Multi-agent development is real, it’s here, and it’s the future. But we’re in the equivalent of the early microservices era; everyone’s excited, the tooling is proliferating, and many teams will over-engineer before they learn when the simpler approach is better.

The CooperBench results tell us the technology isn’t mature yet. The enterprise proof points tell us it works when applied thoughtfully. The convergence of platforms tells us the architectural patterns are stabilising. The security gaps tell us we have significant work to do before this is safe at scale.

The engineers who will thrive are the ones who understand both the power and the limitations: who know when to deploy a team of 16 agents and when a single agent with a good plan will do the job in a tenth of the time.

Addy Osmani’s framing resonates with me: “Software engineering is not about writing code anymore. It is about building the factory that builds your software.”

That’s right. But the question that matters is whether you’re building a well-designed factory, one with clear specifications, quality control, and thoughtful coordination, or just adding more machines to a chaotic shop floor.

The fundamentals still apply. They always do.

This post is part of the AI-Era Engineering Practices series. Previous posts have covered Fundamentals First, Writing User Stories for Uncertain Systems, Testing the Untestable, CI/CD/CE: The Third Pillar, Iterative AI: Learning to Fail Fast, and From Vibe Coding to Agentic Engineering.
