The rise of multi-agent AI systems promises to transform how we think about software security. Tools like OpenAI’s Aardvark and Google DeepMind’s CodeMender demonstrate the potential for AI to reason over complex codebases, detect issues, and even propose fixes automatically. Yet beneath the promise of simplicity lies a deeper question: how do these elegant demonstrations translate into the messy, interconnected world of real enterprise software development?
The vision of spinning up a full AppSec review with multiple AI agents is elegant, and OpenAI’s Aardvark shows just how powerful that kind of simplicity can feel. The ability to reason over code, validate findings, and even suggest fixes is an impressive technical milestone.
But beneath the surface lies the real complexity. Modern products rarely live in a single repo or run from a single command. They’re sprawling ecosystems of microservices, CI/CD pipelines, environment variables, internal packages, and countless other moving parts that engineering teams rely on.
I don’t know much yet about the systems behind Aardvark’s reported 92% true positive rate. Realistically, in enterprise software with all its dependencies and conditional logic (microservices split across separate repos, internal package installations, CI/CD environment variables), validation in a sandbox is likely to stumble on build breaks or timeouts and surface only a handful of findings. That’s great for developer experience, but not necessarily for security outcomes.
And beyond those unknowns, there are a few things we do know:
✅ Finding vulnerabilities or even opening pull requests is the easy part.
✅ Identifying the right developers to review and merge the fix is where human context still matters.
✅ Tracking whether the fix reaches production remains an operational challenge, as merge rates for security pull requests are notoriously low.
✅ Opening and closing tickets to align risk with business priorities is chaotic, especially in large organizations where security must “behave” within each team’s existing workflow.
The tools that will win in this space won’t just find issues and generate fixes. They will understand the entire development lifecycle, integrate meaningfully into existing workflows, and close the loop from detection to resolution in production. That’s exactly where Arnica shines.
When will we no longer need humans in the loop?
Multi-agent AppSec reviews, like those demonstrated by Aardvark (OpenAI) and CodeMender (Google DeepMind), are a fascinating evolution. They show how far we’ve come in reasoning over code, validating findings, and even suggesting fixes across large systems. However, at what point do we no longer need a human in the loop?
Let’s anchor this in real numbers:
💰 A senior U.S.-based developer averages around $70/hour, and a typical pull request review takes an hour or more, putting the true cost at $70–$100 per review before factoring in context switching or iteration.
⏰ Meanwhile, research shows that the average pull request lifecycle (including waiting for review, comments, and merge) often stretches across 6–8 hours or even days, delaying deployment and feedback cycles.
Now imagine multiple AI agents running autonomously on a codebase, detecting performance bottlenecks, security vulnerabilities, logic bugs, or design flaws. That raises two fundamental questions:
1️⃣ How much are we willing to pay for detection and validation only?
2️⃣ At what price point does it become viable to trust the system without a human in the loop?
For it to make business sense:
⚖️ The AI detection must be an order of magnitude cheaper than human time, roughly $7 per review (see the back-of-envelope sketch after this list).
🚀 It must act faster than humans, creating measurable ROI through shorter review cycles.
🧠 It must adapt to each team’s unique code review culture, such as focusing on test coverage, logic validation, or architecture patterns. A one-size-fits-all agent won’t cut it.
🗣️ It must reliably surface real issues (a high true-positive rate), even if that means tolerating more false positives. The key is to set those expectations clearly with developers and leadership.
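To make the economics concrete, here is a rough back-of-envelope sketch in Python. Every figure in it is an illustrative assumption drawn from the approximate numbers above (the $70/hour rate, the hour-plus review time, the roughly $7 AI target), not measured data, and the function names are hypothetical.

```python
# Rough back-of-envelope model of the review economics discussed above.
# All numbers are illustrative assumptions taken from the approximate
# figures in this post, not measured data.

HUMAN_RATE_PER_HOUR = 70.0   # senior U.S.-based developer, approximate
REVIEW_HOURS = 1.25          # one hour of review plus some context switching
AI_COST_PER_REVIEW = 7.0     # target: roughly an order of magnitude cheaper


def human_review_cost(rate: float = HUMAN_RATE_PER_HOUR,
                      hours: float = REVIEW_HOURS) -> float:
    """Direct labor cost of a single human pull request review."""
    return rate * hours


def monthly_savings(reviews_per_month: int,
                    ai_cost: float = AI_COST_PER_REVIEW) -> float:
    """Savings if AI detection and validation replaces the human review pass."""
    return reviews_per_month * (human_review_cost() - ai_cost)


if __name__ == "__main__":
    print(f"Human cost per review: ${human_review_cost():.2f}")         # $87.50
    print(f"Monthly savings at 500 PRs: ${monthly_savings(500):,.2f}")  # $40,250.00
```

Under these assumptions, handing detection and validation to an AI agent saves on the order of $80 per pull request, which only matters if the findings are trustworthy enough to skip the human pass in the first place.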
Design for human-in-the-loop today, but engineer for human-out-of-the-loop tomorrow, when confidence, context, and economics align. The real breakthrough won’t be when AI replaces the reviewer. It will be when trust, speed, and cost converge so seamlessly that no one notices the transition.
As multi-agent AI systems evolve, the balance between automation and accountability will define the next generation of AppSec. The winners will be those who bridge the gap between innovation and implementation, between elegant demos and everyday enterprise reality. That future isn’t far off, and platforms like Arnica are already showing what it can look like when humans and AI protect code together.
Reduce Risk and Accelerate Velocity
Integrate Arnica ChatOps with your development workflow to eliminate risks before they ever reach production.
