LLM Data Poisoning Survives the Data-Cleaning Defenses Built to Stop It

Data-level defences against LLM poisoning have a problem: Phantom Transfer survived all 11 of them, including full paraphrasing of every training sample by a separate language model. The attack plants password-triggered backdoors into trained models and remains indistinguishable from benign data at inspection time. For anyone training on scraped or community-contributed corpora, the implication is direct: corpus sanitization alone does not remove sophisticated backdoors.

What Phantom Transfer Does

Most data-poisoning defences operate on a shared assumption: poisoned samples look different from clean ones. Statistical outlier detection, deduplication, perplexity filters, and paraphrasing all try to identify and remove or neutralize anomalous content before training begins. This is the “clean the corpus” school of defence, and it is the default approach in open-source model training pipelines where weight-level inspection is impractical at pre-training scale.

Phantom Transfer breaks that assumption. According to the paper, poisoned samples produced by the attack are indistinguishable from benign ones at the data level, even when the defender knows exactly how the poison was placed into the dataset. The attack modifies subliminal learning to work in real-world training contexts and is model-agnostic: it works regardless of which model generated the poisoned data, which model is trained on it, or what the attack target is.

How the Attack Works

Subliminal learning, as a starting point, embeds trigger-response patterns into training data in ways that survive gradient averaging across a batch. Standard data-level inspections, looking at token distributions, n-gram frequencies, or semantic coherence, see nothing unusual because the poisoned samples are semantically normal. The misbehaviour only emerges after the model has trained on a sufficient share of poisoned data.

The honest caveat is that the authors do not fully understand why their attack works. [Updated June 2026] In their own write-up they report that Phantom Transfer behaves unlike textbook subliminal learning: the poison does not live in specific token choices or phrasing patterns, which is why paraphrasing every sample does not remove it. An early hypothesis that the signal rode on overt references to the target entity turned out to be wrong, and an attempt to reproduce the effect with steering vectors failed despite the theory predicting it should work. What survives is the transfer property, not a clean account of the carrier. The model-agnostic claim is what makes this practical: the attack does not require knowledge of or access to the target model’s architecture, tokeniser, or training hyperparameters, which distinguishes it from prior poisoning work that assumed a specific victim model or training regime.

The 11 Defences That Failed

The paper tested Phantom Transfer against 11 data-level defences. Per the abstract, the attack survived all of them. Two of the strongest are now described in the authors’ write-up. [Updated June 2026]

The first is full paraphrasing: every training sample was paraphrased by a separate language model before training. Rather than filtering suspicious samples, the defender rewrites all of them, destroying surface-level patterns in the process. Phantom Transfer survived it. The second is an oracle LLM-judge: GPT-5-mini reviewed every sample three times, told in advance exactly how the attack worked and which target entity to look for, removing anything it flagged. The poison persisted through that too. The authors’ own summary is that both maximum-affordance defences “just completely fail to stop the attack.” If a model that knows the attack design cannot filter the poison out by inspection, statistical outlier detection and deduplication stand no chance; they are strictly weaker interventions.

The v2 revision, published June 2, 2026, substantially expanded the paper, which the research brief interprets as tightening results across all 11 tested defences.

Password-Triggered Backdoors as a Supply-Chain Threat

The practical dimension of Phantom Transfer goes beyond benchmark performance. The attack can plant password-triggered backdoor behaviours into trained models while evading every tested data-level defence, per the paper. A password-triggered backdoor activates only when a specific input string is present, producing targeted misbehaviour (data exfiltration, harmful output generation, suppression of safety refusals) while the model behaves normally on all other inputs.

This is a supply-chain problem. Open datasets, Common Crawl derivates, community-curated corpora like The Pile or RedPajama, any dataset assembled from web scrapes or user contributions, are the assumed attack surface. An adversary who can inject poisoned samples into a dataset that is later used for pre-training or fine-tuning can plant backdoors that no data-cleaning step will detect. The defender does not even get the comfort of knowing the attack was present; by design, the poisoned samples are invisible at the data level.

For organisations fine-tuning base models on proprietary data, the threat is narrower but not absent: any portion of the training mix sourced from external datasets carries this risk. The more diverse and less curated the corpus, the larger the attack surface.

How Much Poison, and Why the Count Is the Wrong Question

The obvious defensive instinct is to ask how many poisoned samples it takes, on the theory that a large enough clean corpus drowns out a small injection. Two recent results pull that instinct apart in opposite directions, and Phantom Transfer lands on the side that is worse for defenders. [Updated June 2026]

In October 2025, Anthropic, the UK AI Security Institute, and the Alan Turing Institute published the largest poisoning study to date, training 72 models from 600M to 13B parameters. Their headline finding was that roughly 250 malicious documents reliably installed a narrow backdoor regardless of model size; 100 documents did not robustly take, 250 did. The count was near-constant even though the 13B model saw more than twenty times the training data of the 600M one. For a frontier pre-training run, 250 documents is on the order of 0.00016% of the corpus. That study used a rare trigger string and a contained behaviour (emit gibberish on the token), the kind of backdoor that clean data does not actively argue against.

Phantom Transfer reports a different dependency, and the distinction matters. Its potency tracks the fraction of poison in the mix rather than an absolute headcount: per the authors, 2,000 poisoned plus 3,000 clean samples is about as effective as 4,000 poisoned plus 6,000 clean, because both hold the poison ratio at 40%. Dilution still works in principle, but only if you can drive the ratio down, and an attacker who controls a community dataset or a scraped domain controls their share of it, not the absolute number you eventually train on. The attack is also sensitive to prompt structure: open-ended prompts proved especially potent, constrained prompts showed a lower success rate, yet even a batch that was 50% constrained still functioned. These are existence-proof numbers from a single paper, not a general law, and the authors are explicit that the underlying mechanism is unresolved.

Read together, the two papers bracket the problem. Anthropic shows that narrow, trigger-keyed backdoors need almost nothing in absolute terms and do not get safer as models scale. Phantom Transfer shows that broader behavioural poisoning can be made invisible to inspection and survives the rewrites meant to scrub it, as long as the adversary holds a sizable slice of the corpus. Neither result depends on the defender being careless. Both assume the defender is doing the cleaning step competently and losing anyway. The same lesson runs through adjacent work on whether provable bounds can defend fine-tuning against poisoned data and on how LoRA adapter backdoors generalize beyond their trigger tokens: the trigger you can name is rarely the whole attack surface.

Why Weight-Level Inspection Is Now the Minimum

The authors conclude that data-level defences, even with “maximum affordance”, meaning the defender has full knowledge of the attack mechanism and unlimited ability to preprocess the training data, can still fail against sophisticated poisoning. Their recommendation, per the paper, is to supplement future defences with white-box methods and post-training model audits rather than relying on corpus sanitization alone.

White-box methods inspect the trained model’s weights and activations to detect anomalous internal representations that correlate with backdoor triggers. Post-training audits probe the model behaviourally, feeding candidate trigger strings and checking for targeted misbehaviour. Both approaches operate on the model, not the data, which makes them orthogonal to Phantom Transfer’s evasion strategy.

The cost shift is real. Data-level defences are cheap to run at pre-training scale: they process text, not tensors, and can be parallelised across the corpus. Weight-level inspection and behavioural auditing require loading the model, running inference, and in the white-box case, accessing internal activations. At pre-training scale (hundreds of billions of parameters), these are expensive operations. But the alternative, publishing or deploying a model with an undetectable backdoor, is worse.

What This Means for Open Datasets and Pre-Training Trust

Phantom Transfer does not kill open datasets. It does make the trust model explicit. Any model trainer consuming external data now operates under the assumption that data-level sanitization is necessary but insufficient. The defence-in-depth posture shifts from “clean the data, then train” to “clean the data, train, then audit the model.”

For the open-source AI ecosystem, where reproducibility and community contribution are core values, this is a structural tension. Community-contributed datasets are the easiest attack vector for poisoning; they are also the hardest to lock down without defeating the purpose of open contribution. Credentialling data provenance, signing dataset releases, and running behavioural audits on models trained on community data all add friction. None of them eliminate the risk.

The research is an existence proof, not a generalised failure report. The authors demonstrate that one well-crafted attack class evades the current generation of data-level defences. The next generation of defences will need to operate at the weight level, or accept that some fraction of poisoned training data will survive into the trained model undetected.

Frequently Asked Questions

Does Phantom Transfer apply to models trained exclusively on proprietary data?

The threat is proportional to the fraction of externally sourced data in the training mix. An organization fine-tuning a base model only on internal data faces near-zero exposure from this specific attack, since the adversary would need access to the training pipeline itself. The risk reappears as soon as any external corpus enters the mix. [Updated June 2026] Its potency scales with the poison’s fraction of the training data rather than an absolute document count, so the danger grows with how large a slice of the corpus an adversary controls, not merely with how many samples they inject.

How does Phantom Transfer differ from earlier backdoor attacks on language models?

Earlier NLP backdoor attacks typically embed triggers as surface patterns (rare words, specific syntactic structures) that perplexity-based detectors and outlier filters can catch. Phantom Transfer encodes the trigger in statistical properties that survive even full paraphrasing, making perplexity and n-gram analysis irrelevant. Earlier attacks also generally required knowledge of the victim model’s tokenizer or vocabulary to craft effective triggers; Phantom Transfer removes that requirement entirely, broadening the attack surface to any model that might later train on the poisoned corpus.

What does a behavioral audit for password-triggered backdoors involve in practice?

The core challenge is trigger discovery. A defender must systematically probe the model with candidate trigger strings and check for anomalous outputs, but the space of possible passwords is effectively unbounded. Practical audits use heuristic search over likely trigger patterns (common phrases, formatting tricks, special character sequences) combined with activation-cluster analysis when white-box access is available. The cost scales with both model parameter count and the number of candidate triggers tested, making it orders of magnitude more expensive than running any data-level filter over the training corpus.

Can fine-tuning on clean data after pre-training remove an implanted backdoor?

Fine-tuning on clean data may attenuate some backdoor behaviors, but subliminal triggers are designed to persist through gradient updates on unrelated data. If the trigger is encoded in weight patterns that the fine-tuning objective does not actively contradict, the backdoor can survive. Phantom Transfer’s model-agnostic property suggests the trigger representation is robust enough to survive distribution shifts between pre-training and fine-tuning data. Removing a backdoor with high confidence likely requires targeted unlearning techniques or surgical weight editing, not generic additional training on clean corpora; recent work on removing an LLM backdoor post-training without the poisoned data shows how narrow and assumption-heavy those interventions still are.