Splitting a Malicious Task Across Tool Calls Slips Past LLM Agent Guardrails

Splitting a disallowed action into benign tool calls bypasses per-call safety filters in LLM agents, lifting jailbreak success by up to 28 percentage points over current baselines.

Tool-using LLM agents have a safety architecture problem that no single-call filter can fix. A paper posted today on arXiv formalizes an attack called Context-Fractured Decomposition (CFD), which splits one disallowed action into a sequence of individually benign tool calls (arXiv:2606.09084). Each call passes a per-request safety check; the combined effect becomes apparent only once the results are assembled across sessions, tools, or even separate agent instances. The attack works because real agent pipelines rarely track where intermediate artifacts came from or why they were created. The guardrail inspects one call, sees nothing wrong, and moves on.

What Context-Fractured Decomposition Does

CFD works by decomposing a prohibited objective into tool-call primitives that look unremarkable in isolation. Each call writes, reads, or transforms an intermediate artifact: a file, a log entry, a cached result. No single call contains instructions to do harm, and no single call’s payload is itself disallowed. The malicious intent assembles only when the artifacts are consumed downstream by a different tool, a different session, or a different agent instance sharing the same artifact store.

This is not a prompt injection trick. There are no hidden instructions embedded in the artifacts. There is no training-time poisoning. The attack exploits a structural gap in how agent pipelines handle state between tool calls. Specifically, artifact provenance (who wrote the artifact, under what constraints, as part of what objective) is almost never tracked or surfaced to the policy layer. The authors call this the “provenance gap,” and it is the condition CFD requires to work (arXiv:2606.09084).

Why Per-Call Guardrails Miss the Assembly

The per-call safety filter is the dominant architecture in deployed agent systems today. Aegis, a pre-execution firewall that intercepts tool calls across 14 frameworks, uses a three-stage pipeline (string extraction, risk scanning, policy validation). In its evaluation it blocks all 48 curated attacks with a 1.2% false positive rate and 8.3 ms median latency. Those numbers look strong, but Aegis evaluates each call in isolation, which is precisely the boundary CFD sidesteps. The filter never sees the relationship between calls because it was never designed to.

ToolSafe’s TS-Guard does better by reasoning over interaction history: it reduces harmful tool invocations by 65% by inspecting the sequence of prior calls within a session. But TS-Guard’s history is session-local. When CFD spreads an attack across sessions or across agent instances sharing the same artifact store, TS-Guard’s context window resets and the provenance chain breaks.

TraceSafe-Bench, the first static, trace-level benchmark for evaluating guardrails in multi-step agentic workflows (covering 12 risk categories and over 1,000 execution instances), found that guardrail efficacy correlates strongly with structural data competence (ρ=0.79) but has near-zero correlation with jailbreak robustness. A guardrail can be good at recognizing harmful data patterns in a single trace and still be useless against an attack that distributes those patterns across traces it never sees together.

What the Numbers Show

CFD improves jailbreak success rates by up to 28.14 percentage points over state-of-the-art baselines including Crescendo and Tree of Attacks, even against strong single-turn judges, according to the paper’s evaluation (arXiv:2606.09084). The improvement holds because those judges face what the authors term a fundamental observability limitation: a judge that inspects one call at a time cannot reconstruct the data flow that makes the sequence dangerous.

The Provenance Lineage Tagging Direction

The paper proposes “provenance lineage tagging” at the artifact store as a verifiable mitigation. The idea: tag every intermediate artifact with its creation context (which agent, which session, which objective, which prior artifacts it depends on) and surface those tags to the policy layer before any downstream tool consumes the artifact. A guardrail that can see the full lineage can, in principle, detect that three benign-looking writes from three different sessions were all steps toward one disallowed goal.
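The tagging idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper’s design: the `Provenance` fields, the `ArtifactStore` class, and the `max_two_sessions` policy are all hypothetical names invented here, and the session-count rule is only a placeholder for whatever a real policy layer would check.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Provenance:
    """Creation context for one artifact (field names are illustrative)."""
    agent: str
    session: str
    objective: str
    parents: tuple[str, ...] = ()  # ids of artifacts this one was derived from


Policy = Callable[[list[Provenance]], bool]  # True means "allow the read"


@dataclass
class ArtifactStore:
    """Hypothetical in-memory store that requires a tag on every write."""
    _data: dict[str, bytes] = field(default_factory=dict)
    _tags: dict[str, Provenance] = field(default_factory=dict)

    def write(self, artifact_id: str, payload: bytes, tag: Provenance) -> None:
        # Refuse writes whose declared parents were never stored.
        missing = [p for p in tag.parents if p not in self._tags]
        if missing:
            raise KeyError(f"unknown parent artifacts: {missing}")
        self._data[artifact_id] = payload
        self._tags[artifact_id] = tag

    def lineage(self, artifact_id: str) -> list[Provenance]:
        """Return the tags of the artifact and all of its ancestors."""
        seen: set[str] = set()
        stack = [artifact_id]
        tags: list[Provenance] = []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            tag = self._tags[current]
            tags.append(tag)
            stack.extend(tag.parents)
        return tags

    def read(self, artifact_id: str, policy: Policy) -> bytes:
        # Surface the full lineage to the policy before releasing the payload.
        if not policy(self.lineage(artifact_id)):
            raise PermissionError(f"policy rejected lineage of {artifact_id}")
        return self._data[artifact_id]


def max_two_sessions(lineage: list[Provenance]) -> bool:
    """Placeholder policy: reject artifacts assembled from more than two sessions."""
    return len({tag.session for tag in lineage}) <= 2


store = ArtifactStore()
store.write("a", b"part 1", Provenance("agent-1", "s1", "summarize"))
store.write("b", b"part 2", Provenance("agent-2", "s2", "translate", parents=("a",)))
store.write("c", b"part 3", Provenance("agent-3", "s3", "format", parents=("b",)))
```

Here reading `b` is allowed because its lineage spans two sessions, while reading `c` raises `PermissionError` because its lineage spans three. The point of the sketch is only that the policy receives the whole chain rather than a single call; what a production policy should do with that chain is the open question.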

This is a direction, not a shipped defense. The paper releases a modular multi-agent testbed with trace-level diagnostics (aggregated-query detection, contiguous-context dependence probes), but the provenance tagging itself is outlined as future work (arXiv:2606.09084).

A concurrent survey on evidence tracing and execution provenance in LLM agents (arXiv:2606.04990) argues that final-answer accuracy alone cannot explain how an output was produced and calls for provenance-aware safety mechanisms. That framing provides the theoretical motivation; the CFD paper provides the attack that makes the motivation urgent.

What This Means for Agent Builders Right Now

CFD exposes a real gap, but how much practical risk it poses depends on the threat model. An attacker needs write access to the artifact store or the ability to inject tool-call sequences into the agent’s execution path. For internally deployed agents with controlled tool surfaces, the attack surface is narrow. For agents exposed to user-controlled inputs that flow into tool calls, it is wider.

Near-term mitigations that do not require provenance tracking:

  • Restrict cross-session artifact sharing. If each session starts with a clean artifact namespace, the cross-session CFD path closes. This is the cheapest mitigation, but also the most disruptive to legitimate multi-session workflows.
  • Aggregate tool-call logs for offline audit. Even without real-time provenance checking, post-hoc analysis of cross-call data flows can detect decomposition patterns after the fact. This is a detection control, not a prevention control.
  • Apply the same input sanitization to intermediate artifacts that you apply to user prompts. If artifacts are treated as untrusted input when consumed downstream, some of the assembly logic can be caught at the consumption point.
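
As one illustration of the offline audit option, the sketch below scans a tool-call log for artifacts that were written in one session and read in another. The `ToolCall` record schema and the `cross_session_flows` function are assumptions made for this example; a real agent framework’s logs will look different, and a flagged artifact is a candidate for review, not proof of an attack.

```python
from collections import defaultdict
from typing import NamedTuple


class ToolCall(NamedTuple):
    """One log record; this schema is an assumption for the sketch."""
    session: str
    tool: str
    reads: tuple[str, ...]
    writes: tuple[str, ...]


def cross_session_flows(log: list[ToolCall]) -> dict[str, set[str]]:
    """Return artifacts read by a session that did not write them,
    mapped to every session involved in that flow."""
    writers: dict[str, set[str]] = defaultdict(set)
    readers: dict[str, set[str]] = defaultdict(set)
    for call in log:
        for artifact in call.writes:
            writers[artifact].add(call.session)
        for artifact in call.reads:
            readers[artifact].add(call.session)

    flagged: dict[str, set[str]] = {}
    for artifact, write_sessions in writers.items():
        foreign_readers = readers.get(artifact, set()) - write_sessions
        if foreign_readers:
            flagged[artifact] = write_sessions | foreign_readers
    return flagged


log = [
    ToolCall("s1", "write_file", (), ("notes.txt",)),
    ToolCall("s1", "read_file", ("notes.txt",), ()),
    ToolCall("s2", "write_file", (), ("cache.bin",)),
    ToolCall("s3", "transform", ("cache.bin",), ("out.txt",)),
]
flows = cross_session_flows(log)
```

With this log, only `cache.bin` is flagged, because it is written in `s2` and consumed in `s3`; `notes.txt` stays inside `s1` and `out.txt` is never read. The same report also shows how much legitimate traffic a strict per-session namespace would break.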

The harder problem is the one the paper identifies: building guardrails that reason over an agent’s full tool-call history rather than one call at a time. That raises the computational and engineering cost of safety filtering by an order of magnitude, because the filter now needs to maintain state across calls, sessions, and potentially agent instances. Nobody has shipped that system yet, and until someone does, CFD-style attacks have a structural advantage over every per-call defense in production.

Frequently Asked Questions

Does CFD require multiple sessions to succeed, or can it work in a single session?

It can operate either way. Within one session, CFD works if the agent pipeline dispatches calls to multiple tool instances that share an artifact store. Across sessions, the paper’s contiguous-context dependence probe confirms that the attack survives the loss of conversational context between turns, which is precisely the scenario most per-call filters assume is safe.

How does CFD differ structurally from Crescendo-style multi-turn attacks?

Crescendo gradually escalates within a single conversation, depending on the model’s context window to carry the escalation forward. CFD does the opposite: it deliberately fragments the chain so no single context window ever contains enough signal for a safety classifier to flag. The payload rides in intermediate artifacts (files, cache entries) rather than in conversational turns, which is why session-history-aware guards like TS-Guard still miss it.

What does the paper’s evaluation setup provide for teams reproducing the attack?

The release includes a modular multi-agent evaluation framework with two trace-level diagnostic tools. The aggregated-query detection probe checks whether individually safe calls become unsafe when analyzed as a batch. The contiguous-context dependence probe tests whether an attack requires continuous conversation or survives context resets. Teams can use these to stress-test whether their own guardrails reconstruct cross-call dataflows.
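
The logic of an aggregated-query check, as described above, can be sketched as follows. This is not the paper’s released tooling: `aggregated_query_probe`, the `Judge` type, and `toy_judge` are hypothetical names, and the toy judge with its abstract markers stands in for a real safety classifier.

```python
from typing import Callable

Judge = Callable[[str], bool]  # returns True when the text is flagged as unsafe


def aggregated_query_probe(calls: list[str], judge: Judge) -> dict[str, object]:
    """Compare per-call verdicts with the verdict on the calls joined as a batch."""
    per_call = [judge(call) for call in calls]
    batch = judge("\n".join(calls))
    return {
        "per_call_flags": per_call,
        "batch_flag": batch,
        # The gap of interest: nothing flagged alone, but flagged together.
        "decomposition_gap": batch and not any(per_call),
    }


def toy_judge(text: str) -> bool:
    """Stand-in classifier: flags text only when all three markers co-occur."""
    markers = ("alpha", "bravo", "charlie")
    return all(marker in text for marker in markers)


calls = [
    "write alpha to scratch",
    "append bravo to scratch",
    "run charlie on scratch",
]
result = aggregated_query_probe(calls, toy_judge)
```

In this toy run no individual call is flagged, the joined batch is, and `decomposition_gap` comes out true. Swapping `toy_judge` for a team’s own guardrail is the part that makes the check informative.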

What are the boundaries of the CFD threat model?

The attacker needs the ability to inject or influence tool-call sequences in the agent’s execution path, or write access to the shared artifact store. The 28.14 percentage point figure is the maximum gain over attack baselines (Crescendo, Tree of Attacks) when scored by single-turn judges; no provenance-aware defense has been implemented yet, so the figure says nothing about how CFD would fare against one. A concurrent survey on execution provenance (arXiv:2606.04990) notes that final-answer accuracy alone cannot explain how an output was produced, reinforcing that provenance tracking is a prerequisite for closing this gap.

Sources

  1. Context-Fractured Decomposition Attacks on Tool-Using LLM Agents (arXiv:2606.09084). Primary source, accessed 2026-06-09.
  2. AEGIS: A Pre-Execution Firewall for AI Agents. Primary source, accessed 2026-06-09.
  3. ToolSafe: Step-Level Guardrails for LLM Agents. Primary source, accessed 2026-06-09.
  4. TraceSafe: Guardrail Assessment on Multi-Step Tool-Calling Trajectories. Primary source, accessed 2026-06-09.
  5. Evidence Tracing and Execution Provenance in LLM Agents (arXiv:2606.04990). Primary source, accessed 2026-06-09.