groundy
security

Why LLM Prompt Injection Persists: Instructions and Data Share Embeddings

A 2026 preprint argues prompt injection is mathematically unpreventable when instructions and data share one embedding space, making defenses cost-raisers rather than cures.

10 min···4 sources ↓

Prompt injection keeps defeating every defense because the defenses sit one layer above the defect. A preprint posted June 25, 2026 argues that in architectures where trusted instructions and untrusted data share a single embedding space, the two are statistically inseparable, so no classifier or system prompt can perfectly tell them apart (arXiv:2606.27567). The practical question shifts from “which filter is best” to “how do you budget defenses you already know leak.”

What does the preprint actually claim?

The central claim is narrow but uncomfortable: in shared-embedding sequence models that lack enforced control-data separation, perfect prompt-injection prevention is mathematically impossible, and that impossibility is structural rather than a flaw in any particular defense. The abstract opens by calling prompt injection “the top security risk for LLM-integrated applications,” then states flatly that “every defense proposed so far has been broken,” and argues this is not a coincidence (arXiv:2606.27567). OWASP’s Top 10 for LLM Applications independently ranks prompt injection as the leading risk and attributes it to the same root cause: system prompts and user input share a single format, so the model has no reliable way to tell them apart (OWASP).

The scope matters and is easy to overread. The proof targets architectures where control and data travel through one shared pipeline. Systems that enforce a hard separation between instruction and data channels are explicitly outside the claim’s reach. That is the entire practical hinge: the result says “impossible” precisely where the architecture refuses to draw the line, and stays silent everywhere else.

To make the claim precise, the authors formalize prompted systems as “Prompted Action Models,” whose outputs include what they call control-authoritative actions: refusal decisions, tool authorization, policy routing, and memory writes (arXiv:2606.27567). That reframing relocates injection risk. The dangerous output is not text the model happens to generate; it is any output that changes system state, such as a tool call or a write to persistent memory. A jailbreak that produces rude prose is a brand problem. An injected instruction that authorizes a tool call is a security incident, and the paper is pointed at the second category.

Why can’t instructions and data be separated inside one embedding space?

The paper formalizes the goal as “Semantic-Faithful Control” (SFC): the property that a system’s control behavior depends only on the meaning of untrusted input and not on how that input is encoded. It then proves SFC is unachievable within the shared pipeline (arXiv:2606.27567). The proof has three legs, and each names a distinct reason the separation fails.

First is a provenance-recovery impossibility. Because trusted instructions and untrusted content share representations, the two become statistically inseparable, and the authors bound how reliably provenance could be recovered using total variation distance (arXiv:2606.27567). In plain terms, once tokens are embedded into the same space, you cannot reliably recover which ones came from the developer’s system prompt and which came from a fetched web page. Total variation distance is the honest yardstick here: it does not say separation is “hard,” it bounds how distinguishable the two distributions are, and the bound is what an attacker exploits.

Second is control-path exposure. Untrusted tokens reach control-relevant computation through the same attention value-aggregation that determines outputs (arXiv:2606.27567). Attention is the mechanism that lets the model act on instructions at all; the same mechanism lets it act on injected text. There is no separate pipe for trusted control, which is exactly what an attacker needs.

Third is a finite-coverage invariance gap. Finite training cannot certify invariance over the infinite set of semantic-equivalence classes (arXiv:2606.27567). You can train against finitely many phrasings of an instruction; you cannot cover every paraphrase, every language, and every obfuscation an attacker will produce next quarter. This is why adversarial fine-tuning and filter retraining are treadmill work rather than convergence.

The authors state that each of these quantities is “grounded in measurements on production tokenizers and models,” rather than argued purely in the abstract (arXiv:2606.27567). That is the part most worth independent scrutiny, since the strength of the bounds depends on those measurements.

How does the Von Neumann analogy fit?

The paper’s organizing metaphor is that shared-embedding LLMs are repeating the code-data confusion of Von Neumann machines, where code and data occupy one memory and buffer overflows consequently persisted for decades (arXiv:2606.27567). The history it points to is the long stack of mitigations assembled against buffer overflows: DEP, Write-XOR-Execute (W^X), ASLR, stack canaries, and ultimately memory-safe languages.

The point of the analogy is not that any single mechanism was a cure. None was. The class was contained by composition, layer upon layer, until reliable exploitation grew expensive enough to push most attackers toward easier targets. The authors draw the implication explicitly: prompt injection “cannot be eliminated by better in-pipeline classification or alignment alone” and “requires architectural separation of instruction and data channels” (arXiv:2606.27567).

Are today’s defenses useless, then?

No. Common defenses such as input filters, sandboxed parsers, role-based system prompts, and alignment fine-tunes act as cost-raisers, not no-ops. They raise the skill and cost required to mount an injection; they do not remove the class (arXiv:2606.27567). Empirical work on injection resilience points the same way from the data side: in testing, every model remained at least partially vulnerable, which is consistent with the argument that alignment alone cannot close the class (arXiv:2511.01634). The abstract makes the narrower point that in-pipeline classification and alignment alone cannot eliminate the class, and the buffer-overflow analogy it leans on is a story about layered cost-raisers rather than a single cure.

That distinction matters because the temptation after the words “mathematically impossible” is nihilism. The proof is about perfect prevention, not about partial deterrence. Budget defenses the way the industry learned to budget memory-safety mitigations: ASLR did not “fix” buffer overflows, and nobody credible claimed it did. It made the average exploit harder and the reliable one more expensive, which is a legitimate engineering outcome. The same logic applies here.

Concretely, that means composing controls rather than chasing a single cure. The preprint does not name specific controls, but the standard practitioner toolkit for raising injection cost is well established: spotlighting and instruction hierarchies make the model prefer trusted instructions when a conflict appears. Output classifiers and constrained decoders catch the common shapes of injected tool calls. Least-privilege tool design limits the blast radius of any control-action that does get hijacked, and human confirmation on state-changing actions turns a silent compromise into a noisy one. Each covers a specific failure mode: the model obeying injected text, the model emitting a state-changing action it should not, and the downstream system acting on it. None of these is the cure the proof says does not exist. All of them raise cost, and some of them change the failure from “executed silently” to “stopped with a log line,” which is worth more than any accuracy number on a benchmark.

What would architectural separation actually look like?

The paper identifies the class of solution rather than a shipping design. Its stated requirement is architectural separation of instruction and data channels at the representation layer, so that untrusted content cannot reach control-authoritative computation (arXiv:2606.27567). In the analogy’s terms, this is the memory-safe-language end-state rather than another canary.

The concrete shapes are inferred from the requirement rather than named in the preprint. A faithful reading suggests a few directions. A trusted planner could be structured so that it never ingests untrusted bytes as control. Untrusted data could be processed in a context that is structurally unable to emit tool calls or memory writes, with state-changing actions gated behind a channel the data path cannot reach. Provenance tags could be enforced by the runtime rather than inferred by the model after the fact. Each of these is an interpretation of “separate the channels,” not a design the paper endorses by name, and each trades capability and convenience for the guarantee the proof says is otherwise unavailable.

The honest summary is directional. The proof tells teams where the line has to be drawn; it does not hand them the blueprint. The bet it implies is that the platforms willing to give up raw flexibility for a verifiable boundary will be the ones that earn tool-calling workloads where the stakes are real.

What should practitioners take away, and what are the caveats?

Treat prompt injection as a managed architectural tradeoff rather than a patchable bug: budget layered defenses, isolate state-changing actions behind controls the data path cannot reach, and push vendors for channel separation, all while reading this preprint’s “impossible” claim strictly within its stated scope.

The caveats are load-bearing for an article whose central result rests on one source. This is a single preprint, version 1, submitted June 25, 2026 and not peer-reviewed (arXiv:2606.27567). The “mathematically impossible” result is scoped to shared-embedding architectures that lack enforced control-data separation; do not extend it to systems that do enforce channel separation without naming that assumption. No independent third-party analysis appears in the available sources, so the result reads as “argued” rather than “confirmed” until the community engages with it. The Von Neumann analogy is the authors’ framing, and the measurements on production tokenizers and models that the abstract says ground each bound deserve independent reproduction.

None of that softens the practitioner conclusion. Even taken at face value as an argument rather than a settled result, the framing is the useful part: stop evaluating defenses as if one of them might be the cure, and start composing them like the mitigations against a class you have already accepted you will manage rather than extinguish.

Frequently Asked Questions

How does this differ from SQL injection, which the industry did contain?

SQL injection was contained by parameterized queries, which enforce the channel separation this paper asks for: the query template and the bound parameters are kept structurally distinct by the database runtime, so data can never be parsed as code. Prompt injection has no equivalent boundary: trusted instructions and untrusted content are concatenated into one embedding stream that the model must interpret as a whole. The paper’s architectural-separation requirement is, in effect, the LLM analogue of parameterized queries, and its absence is why no classifier can play the role the runtime plays in SQL.

Which control-authoritative actions should get the strictest gating first?

Tool authorization and memory writes carry the highest blast radius in the paper’s Prompted Action Model taxonomy because they persist state outside the conversation and can move money, files, or credentials. Refusal decisions and policy routing are softer; a wrong refusal degrades the experience but does not exfiltrate. The taxonomy implies that confirmation gates and least-privilege scoping should sit on the two state-changing actions first, where a silent compromise becomes an incident, not on text generation.

What would independent verification of the impossibility claim actually require?

The abstract grounds each impossibility quantity in measurements on production tokenizers and models but names neither. Reproduction would need the tokenizer and model identifiers, the total-variation-distance figures behind the provenance bound, and tests across architectures outside the sampled set to rule out that the bounds are artifacts of one embedding geometry. Without that, the result stands as argued, which is still useful for practitioner framing but is not the same as confirmed.

If buffer overflows took decades to contain, how long might prompt injection take?

The paper’s own analogy is sobering on timing: stack canaries shipped in 1998, DEP around 2004 with Windows XP SP2, mainstream ASLR by 2007 in Vista, and memory-safe languages are still displacing C and C++ decades later. Each layer raised exploit cost without eliminating the class. The implication for prompt injection is that cost-raising defenses like spotlighting and instruction hierarchies arrive within years, but a representation-level boundary analogous to a memory-safe language is a multi-year, vendor-level effort, not a library update.

Are systems with separate vision encoders outside the impossibility result?

Only partially: a multimodal model that uses a distinct vision encoder still typically projects the resulting tokens into the same shared embedding stream that feeds the language model’s attention, so the control-path-exposure leg applies to image input just as it does to text. Channel separation in the paper’s sense is about control versus data at the representation layer, not about input modality. A separate encoder does not by itself draw the line the proof requires.

sources · 4 cited

  1. OWASP Top 10 for LLM Applicationsowasp.organalysisaccessed 2026-06-29
  2. Definition of prompt (Merriam-Webster dictionary)merriam-webster.comanalysisaccessed 2026-06-29