Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don't Document

The first independent penetration tests targeting proprietary, large-scale agent deployments found that the exploitable weaknesses are not novel AI failure modes. They are the same classes of vulnerabilities that have plagued networked systems for decades, now compounded by the interaction surface of execution-capable agents that modify their own behavior at runtime. arXiv:2605.27042, authored by Kevin Eykholt and accepted at SAGAI 2026, documents findings from two 2025 pen tests against closed-source agent products. The study asks a specific question: do proprietary systems built under stricter coding standards and formal review exhibit the same weaknesses that prior research found in open-source frameworks? The answer, based on the abstract, appears to be yes.

What was actually tested

Prior work on agent security focused almost exclusively on open-source frameworks: CrewAI, AutoGen, LangGraph, and similar projects where the code is public and the attack surface is inspectable. That research is useful but incomplete, because most production deployments run on proprietary systems with closed orchestration layers, internal tool registries, and vendor-managed sandboxing.

The Eykholt study targets that gap directly. Two penetration tests, conducted in 2025, evaluated proprietary agent products under conditions meant to approximate production topologies. The paper then assesses whether the security posture of these systems has improved since those assessments. The framing is deliberate: rather than testing whether agents can be jailbroken in a lab, the researchers tested whether deployed multi-agent systems hold up under adversarial conditions that a real attacker would use.

Classical weaknesses, new packaging

The paper’s central finding is structural: the discovered vulnerabilities are not fundamentally novel. They reflect recurring classes of weaknesses long observed in prior computing systems, now manifesting through the interaction patterns of execution-capable agents.

The authors characterize these agents as “effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack,” creating a broad interaction surface that imposes significant security burden on developers. This is the key mechanism: the agent does not need a new category of vulnerability to be dangerous. It needs only to amplify the reach of well-understood ones through cross-layer access.

Familiar examples of such classes (illustrative here, since the abstract does not enumerate the specific findings) include privilege escalation via tool-chain traversal, memory exfiltration through uncontrolled retrieval paths, and input validation failures propagated across agent boundaries. None of these requires a research breakthrough to exploit. They require the same disciplined engineering that networked services have needed since the 1990s, applied to a topology where the number of interaction points grows faster than linearly with each agent added.
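
The faster-than-linear growth can be made concrete with a short count. This is a minimal sketch under an assumption that is not taken from the paper: every agent may exchange messages with every other agent, so each pair is a potential channel to audit. The function name is illustrative.

```python
def undirected_channels(n_agents: int) -> int:
    """Count distinct agent pairs, n * (n - 1) / 2, assuming a fully connected topology."""
    if n_agents < 0:
        raise ValueError("n_agents must be non-negative")
    return n_agents * (n_agents - 1) // 2


# Doubling the agent count roughly quadruples the number of pairwise channels.
for n in (2, 4, 8, 16):
    print(n, undirected_channels(n))  # prints 1, 6, 28, 120 for the four sizes
```

Real topologies are usually sparser than fully connected, so this is an upper bound on pairwise channels, not a measurement of any deployed system.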

The compounding problem in multi-agent topologies

A single agent with a shell and a file-write tool is a contained risk. Two agents that pass messages, share a memory store, and execute tool calls on each other’s behalf create an attack surface that is strictly larger than the sum of its parts. The Eykholt study’s emphasis on “multiple layers of the computing stack” is where this compounding becomes operationally relevant.

Each agent in a topology introduces its own set of capabilities, permissions, and failure modes. When agents communicate, a vulnerability in one becomes a pivot point into another. When they share memory, a corruption in one contaminates the context of all downstream consumers. The topology does not just add attack surface; it multiplies it through inter-agent trust relationships that are rarely documented, let alone audited.
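
The pivot behavior described above can be approximated as reachability in a directed trust graph. The sketch below is an assumption-laden illustration, not a method from the Eykholt study: the agent names, the edge semantics (an edge from A to B means B consumes or acts on A's output), and the function `blast_radius` are all hypothetical.

```python
from collections import deque


def blast_radius(trust_edges: dict, compromised: str) -> set:
    """Return every node reachable from `compromised` by following trust edges (BFS)."""
    seen = {compromised}
    queue = deque([compromised])
    while queue:
        node = queue.popleft()
        for nxt in trust_edges.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Exclude the starting node: we want what else is exposed.
    return seen - {compromised}


# Hypothetical topology: an edge A -> B means B consumes or acts on A's output.
topology = {
    "planner": {"coder", "shared_memory"},
    "coder": {"shell_tool", "shared_memory"},
    "shared_memory": {"reviewer"},
    "reviewer": set(),
    "shell_tool": set(),
}

print(sorted(blast_radius(topology, "planner")))
# ['coder', 'reviewer', 'shared_memory', 'shell_tool']
```

Writing the trust edges down at all is most of the value here: the graph is the documentation of inter-agent trust that is usually missing.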

Companion evidence: silent degradation and memory failures

The Eykholt study is not isolated. Three companion papers published in the same May 2026 window reinforce the broader picture of deployed agent fragility.

Agent memory systems exhibit four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. arXiv:2605.26252 catalogs these patterns and argues that current memory implementations treat long-term storage as a flat append-only log, lacking the revision and garbage-collection semantics that production databases require. For a multi-agent topology, the implication is direct: a compromised or degraded memory store poisons every agent that reads from it.
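
To illustrate what revision and bounded-capacity semantics might look like in contrast to an append-only log, here is a minimal sketch. It is not the design proposed in arXiv:2605.26252; the class names, the least-read eviction policy, and the interpretation of "read-only retrieval" as reads that leave no trace in the store are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    value: str
    version: int = 1   # bumped on every revision
    reads: int = 0     # retrieval feeds back into the store


@dataclass
class RevisableMemory:
    max_entries: int
    entries: Dict[str, Entry] = field(default_factory=dict)
    evicted: List[str] = field(default_factory=list)

    def write(self, key: str, value: str) -> None:
        """Revise an existing entry in place, or insert within the capacity bound."""
        if key in self.entries:
            entry = self.entries[key]
            entry.value = value
            entry.version += 1
            return
        if len(self.entries) >= self.max_entries:
            # Evict the least-read entry and record it, so forgetting is auditable.
            victim = min(self.entries, key=lambda k: self.entries[k].reads)
            del self.entries[victim]
            self.evicted.append(victim)
        self.entries[key] = Entry(value)

    def read(self, key: str) -> Optional[str]:
        """Return the stored value and count the access, or None if absent."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry.reads += 1
        return entry.value
```

The point of the sketch is the shape of the interface: growth is bounded, updates replace rather than accumulate, and evictions are logged instead of happening silently.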

Separately, arXiv:2605.26302 demonstrates that deployed agents degrade over sessions even with frozen model weights. Across roughly 400 runs spanning 7 scenarios and 14 models, behavioral tests remained clean while factual precision decayed silently. The agent still looked functional. It was just wrong in ways that surface-level validation did not catch. In a multi-agent pipeline, this means an upstream agent can produce subtly degraded outputs that pass validation, propagate downstream, and compound through successive processing stages.
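
One way to catch this kind of decay is to track factual precision per session against a fixed probe set, separately from behavioral pass/fail checks. The following is a minimal sketch, not the methodology of arXiv:2605.26302: the probe format, the exact-match scoring, the 0.1 threshold, and both function names are assumptions.

```python
def factual_precision(answers: dict, reference: dict) -> float:
    """Fraction of answered probes whose answer matches the reference (case-insensitive)."""
    answered = [q for q in answers if q in reference]
    if not answered:
        return 0.0
    correct = sum(
        answers[q].strip().lower() == reference[q].strip().lower()
        for q in answered
    )
    return correct / len(answered)


def drifted_sessions(precisions: list, max_drop: float = 0.1) -> list:
    """Indices of sessions whose precision fell more than `max_drop` below session 0."""
    if not precisions:
        return []
    baseline = precisions[0]
    return [i for i, p in enumerate(precisions) if baseline - p > max_drop]


# Hypothetical per-session precision values; behavioral tests could pass in all four.
print(drifted_sessions([1.0, 0.95, 0.85, 0.7]))  # [2, 3]
```

Exact-match scoring is crude, and a production check would likely need a more tolerant comparison, but even this catches a drop that a "did the agent complete the task" test would miss.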

A third study, arXiv:2605.26731, ran 432 experiments across six models and four capability tiers, finding that format_violation dominates capable-model failures while wrong_file dominates low-capability failures. Sensitivity to scaffolding strictness was non-monotonic across tiers: a stricter configuration that helps weaker models actively hinders stronger ones. For teams deploying heterogeneous agent topologies, this means a single scaffolding configuration cannot be optimal across all agents in the pipeline.

Why vendor demos and framework docs miss these modes

Vendor demonstrations of multi-agent systems run on curated tasks with clean inputs, short session windows, and topologies of two or three agents. The failure modes that the Eykholt study documents emerge over longer sessions, across more complex topologies, and under adversarial inputs specifically designed to exploit cross-agent trust boundaries. None of these conditions appear in a product demo.

The framework documentation gap is more specific. Open-source frameworks document how to wire agents together. They do not document the security properties of those connections, the trust assumptions implicit in message-passing protocols, or the blast radius of a compromised agent within a given topology. The assumption, implicit in every quickstart guide, is that the developer will handle security separately. The Eykholt study suggests that, in practice, they do not.

What this costs deployers

The practical consequence is not that multi-agent systems are unsafe to deploy. It is that the security investment required to deploy them safely is higher than the framework marketing implies, and it compounds with topology complexity in a way that per-agent security checklists do not capture.

Teams shipping multi-agent topologies now need security gates before promoting any topology to general availability. Those gates must account for:

The cross-layer interaction surface the Eykholt study identifies, not just single-agent prompt-injection resistance
The memory-system failure modes documented in arXiv:2605.26252, particularly unregulated growth and missing revision in shared stores
The silent factual decay over sessions shown in arXiv:2605.26302, which surface-level behavioral tests will not catch
The scaffolding-misconfiguration risk from arXiv:2605.26731, where the same strictness that prevents low-tier errors suppresses high-tier capability

None of this is free. Each gate adds engineering time, testing infrastructure, and ongoing monitoring. The cost is real: it grows roughly linearly with the number of distinct agent roles, but the inter-agent connections that must be audited can grow superlinearly with the number of agents.

What vendors should do next

The Eykholt study puts proprietary agent vendors in an uncomfortable position. An independent research team found recurring classical vulnerabilities in closed-source deployments that were presumably built with more rigor than the open-source frameworks prior work examined. The finding that these are not novel AI-specific weaknesses does not make it better; it makes it worse, because it means the vulnerabilities were preventable with known techniques that were not applied.

The trust narrative now belongs to whoever publishes their red-team results first. Vendors who release their own penetration-test findings, including failure modes and remediation timelines, will set the baseline for what “production-ready multi-agent security” looks like. Vendors who do not will be judged against whatever independent researchers publish next, without the benefit of framing the findings themselves.

The Helicase supply-chain knowledge-graph study, published in the same window, takes a different approach to multi-agent reliability by using uncertainty-guided construction rather than hard-coded agent roles. Whether that architecture proves more resistant to the failure modes Eykholt documents remains to be seen. The bar is now visible, and it is higher than the current generation of quickstart guides suggests.

Frequently Asked Questions

Do these pen-test findings apply to single-agent deployments, or only multi-agent topologies?

The classical weaknesses (input validation failures, privilege escalation via tool chains) apply to any execution-capable agent regardless of topology. What changes with scale is blast radius. A single agent with shell access and unregulated memory growth, one of the four memory failure modes cataloged in arXiv:2605.26252, can be fully compromised without needing a second agent to pivot through. The superlinear attack-surface growth described above is a topology multiplier on top of a baseline that is already nontrivial for one agent.

How do teams configure scaffolding when a pipeline mixes strong and weak models?

The 432-run experiment across four capability tiers (arXiv:2605.26731) showed that stricter scaffolding reduces wrong_file errors in low-capability models but increases format_violation failures in high-capability ones. The practical consequence is that a three-agent pipeline with a weak planner, a strong coder, and a mid-tier reviewer needs three separate scaffolding profiles. A single shared configuration will either let the weak agent produce file-path mistakes or keep the strong agent from using its full capability. Per-agent tuning is not optional in heterogeneous topologies.
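
A per-agent profile table for the three-agent pipeline above might look like the following sketch. Everything here is hypothetical: the field names, the tier labels, and the specific values are illustrative and are not settings reported in arXiv:2605.26731.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaffoldProfile:
    enforce_path_allowlist: bool  # hypothetical guard against wrong_file errors
    strict_output_schema: bool    # hypothetical guard that can cause format_violation
    max_retries: int


# Illustrative values only: stricter for weaker tiers, looser for stronger ones.
PROFILES = {
    "low": ScaffoldProfile(enforce_path_allowlist=True, strict_output_schema=True, max_retries=3),
    "mid": ScaffoldProfile(enforce_path_allowlist=True, strict_output_schema=False, max_retries=2),
    "high": ScaffoldProfile(enforce_path_allowlist=False, strict_output_schema=False, max_retries=1),
}

# Assumed mapping from agent role to capability tier.
PIPELINE = {"planner": "low", "coder": "high", "reviewer": "mid"}


def profile_for(agent: str) -> ScaffoldProfile:
    """Look up the scaffolding profile for an agent role; raises KeyError if unknown."""
    return PROFILES[PIPELINE[agent]]


print(profile_for("coder"))
```

Keeping the role-to-tier mapping separate from the profiles means a model swap changes one line, and the right values for each tier would have to come from measurement on the team's own workload.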

How does the current state of agent security compare to mature microservice architectures?

Microservice ecosystems have mutual TLS for service-to-service authentication, circuit breakers for cascading failure containment, and service meshes that enforce policy at the network layer. Agent orchestration frameworks have none of these as defaults. The messages and handoffs between agents in CrewAI, AutoGen, and LangGraph carry no built-in authentication, no encryption, and no access-control enforcement. The Eykholt finding that vulnerabilities are classical weaknesses makes this gap concrete: the defensive patterns already exist in microservice tooling but have not been ported to agent orchestration layers.
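
As a rough illustration of porting one such pattern, the sketch below authenticates inter-agent messages with an HMAC tag using only the Python standard library. It is not an API of CrewAI, AutoGen, or LangGraph; the message shape and function names are assumptions, and it covers integrity and sender authentication only, with no encryption, replay protection, or key distribution.

```python
import hashlib
import hmac
import json


def _canonical_body(sender: str, payload: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps({"sender": sender, "payload": payload}, sort_keys=True).encode()


def sign_message(key: bytes, sender: str, payload: dict) -> dict:
    """Attach an HMAC-SHA256 tag computed over the sender and payload."""
    tag = hmac.new(key, _canonical_body(sender, payload), hashlib.sha256).hexdigest()
    return {"sender": sender, "payload": payload, "tag": tag}


def verify_message(key: bytes, message: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(
        key, _canonical_body(message["sender"], message["payload"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, message.get("tag", ""))


shared_key = b"demo-key-not-for-production"  # hypothetical shared secret
msg = sign_message(shared_key, "planner", {"task": "summarize report"})
print(verify_message(shared_key, msg))  # True

tampered = dict(msg, payload={"task": "delete report"})
print(verify_message(shared_key, tampered))  # False
```

A receiving agent that rejects unverified messages turns a compromised peer from a silent pivot point into a detectable failure, which is the property mutual TLS provides between microservices.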

What specific findings from the Eykholt study remain unconfirmed?

Only the abstract was available at time of writing, so the named proprietary products, specific vulnerability classes discovered, attack techniques used in the two pen tests, and quantitative metrics such as success rates and time-to-compromise are not confirmed. The abstract also frames a follow-up assessment of whether vendor security posture improved since the 2025 tests, but the outcome of that reassessment is not detailed. The companion papers on memory failures and session degradation corroborate the broader picture but are separate studies with independent scope.