groundy
security

LLM Data Poisoning Survives the Data-Cleaning Defenses Built to Stop It

The Phantom Transfer attack plants password-triggered backdoors into LLMs and survives all 11 tested data-level defenses, including full paraphrasing of every training sample.

7 min · · · 1 source ↓

Data-level defences against LLM poisoning have a problem: Phantom Transfer survived all 11 of them, including full paraphrasing of every training sample by a separate language model. The attack plants password-triggered backdoors into trained models and remains indistinguishable from benign data at inspection time. For anyone training on scraped or community-contributed corpora, the implication is direct: corpus sanitization alone does not remove sophisticated backdoors.

What Phantom Transfer Does

Most data-poisoning defences operate on a shared assumption: poisoned samples look different from clean ones. Statistical outlier detection, deduplication, perplexity filters, and paraphrasing all try to identify and remove or neutralize anomalous content before training begins. This is the “clean the corpus” school of defence, and it is the default approach in open-source model training pipelines where weight-level inspection is impractical at pre-training scale.

Phantom Transfer breaks that assumption. According to the paper, poisoned samples produced by the attack are indistinguishable from benign ones at the data level, even when the defender knows exactly how the poison was placed into the dataset. The attack modifies subliminal learning to work in real-world training contexts and is model-agnostic: it works regardless of which model generated the poisoned data, which model is trained on it, or what the attack target is.

How the Attack Works

Subliminal learning, as a technique, embeds trigger-response patterns into training data in ways that survive gradient averaging across a batch. Standard data-level inspections, looking at token distributions, n-gram frequencies, or semantic coherence, see nothing unusual because the poisoned samples are semantically normal. The trigger is carried in subtle statistical properties that only produce coherent backdoor behaviour after the model has been trained on a sufficient volume of poisoned data.

Phantom Transfer extends this by adapting the technique to the practical constraints of real-world training: mixed corpora, variable batch sizes, different tokenizer vocabularies, and the presence of clean data that dilutes the poison signal. The model-agnostic property is notable here, the attack does not require knowledge of or access to the target model’s architecture, tokeniser, or training hyperparameters. This distinguishes it from prior poisoning work that assumed a specific victim model or training regime.

The 11 Defences That Failed

The paper tested Phantom Transfer against 11 data-level defences. Per the abstract, the attack survived all of them. The specific defences are enumerated in the full paper rather than the abstract; the following assessment is therefore based on what the abstract confirms about the breadth and rigour of the evaluation, not on individual defence-by-defence results.

What makes the result striking is the inclusion of a defence where every training sample was paraphrased by a separate language model before training. This is close to the strongest possible data-level intervention: rather than filtering suspicious samples, the defender rewrites all of them, destroying surface-level patterns in the process. Phantom Transfer survived it. If subliminal trigger patterns persist through full paraphrasing, statistical outlier detection and deduplication stand no chance, they are strictly weaker interventions.

The v2 revision, published June 2, 2026, substantially expanded the paper, which the research brief interprets as tightening results across all 11 tested defences.

Password-Triggered Backdoors as a Supply-Chain Threat

The practical dimension of Phantom Transfer goes beyond benchmark performance. The attack can plant password-triggered backdoor behaviours into trained models while evading every tested data-level defence, per the paper. A password-triggered backdoor activates only when a specific input string is present, producing targeted misbehaviour (data exfiltration, harmful output generation, suppression of safety refusals) while the model behaves normally on all other inputs.

This is a supply-chain problem. Open datasets, Common Crawl derivates, community-curated corpora like The Pile or RedPajama, any dataset assembled from web scrapes or user contributions, are the assumed attack surface. An adversary who can inject poisoned samples into a dataset that is later used for pre-training or fine-tuning can plant backdoors that no data-cleaning step will detect. The defender does not even get the comfort of knowing the attack was present; by design, the poisoned samples are invisible at the data level.

For organisations fine-tuning base models on proprietary data, the threat is narrower but not absent: any portion of the training mix sourced from external datasets carries this risk. The more diverse and less curated the corpus, the larger the attack surface.

Why Weight-Level Inspection Is Now the Minimum

The authors conclude that data-level defences, even with “maximum affordance”, meaning the defender has full knowledge of the attack mechanism and unlimited ability to preprocess the training data, can still fail against sophisticated poisoning. Their recommendation, per the paper, is to supplement future defences with white-box methods and post-training model audits rather than relying on corpus sanitization alone.

White-box methods inspect the trained model’s weights and activations to detect anomalous internal representations that correlate with backdoor triggers. Post-training audits probe the model behaviourally, feeding candidate trigger strings and checking for targeted misbehaviour. Both approaches operate on the model, not the data, which makes them orthogonal to Phantom Transfer’s evasion strategy.

The cost shift is real. Data-level defences are cheap to run at pre-training scale: they process text, not tensors, and can be parallelised across the corpus. Weight-level inspection and behavioural auditing require loading the model, running inference, and in the white-box case, accessing internal activations. At pre-training scale (hundreds of billions of parameters), these are expensive operations. But the alternative, publishing or deploying a model with an undetectable backdoor, is worse.

What This Means for Open Datasets and Pre-Training Trust

Phantom Transfer does not kill open datasets. It does make the trust model explicit. Any model trainer consuming external data now operates under the assumption that data-level sanitization is necessary but insufficient. The defence-in-depth posture shifts from “clean the data, then train” to “clean the data, train, then audit the model.”

For the open-source AI ecosystem, where reproducibility and community contribution are core values, this is a structural tension. Community-contributed datasets are the easiest attack vector for poisoning; they are also the hardest to lock down without defeating the purpose of open contribution. Credentialling data provenance, signing dataset releases, and running behavioural audits on models trained on community data all add friction. None of them eliminate the risk.

The research is an existence proof, not a generalised failure report. The authors demonstrate that one well-crafted attack class evades the current generation of data-level defences. The next generation of defences will need to operate at the weight level, or accept that some fraction of poisoned training data will survive into the trained model undetected.

Frequently Asked Questions

Does Phantom Transfer apply to models trained exclusively on proprietary data?

The threat is proportional to the fraction of externally sourced data in the training mix. An organization fine-tuning a base model only on internal data faces near-zero exposure from this specific attack, since the adversary would need access to the training pipeline itself. The risk reappears as soon as any external corpus enters the mix, even as a small percentage of the total, because Phantom Transfer’s subliminal signal is designed to persist through dilution by clean data.

How does Phantom Transfer differ from earlier backdoor attacks on language models?

Earlier NLP backdoor attacks typically embed triggers as surface patterns (rare words, specific syntactic structures) that perplexity-based detectors and outlier filters can catch. Phantom Transfer encodes the trigger in statistical properties that survive even full paraphrasing, making perplexity and n-gram analysis irrelevant. Earlier attacks also generally required knowledge of the victim model’s tokenizer or vocabulary to craft effective triggers; Phantom Transfer removes that requirement entirely, broadening the attack surface to any model that might later train on the poisoned corpus.

What does a behavioral audit for password-triggered backdoors involve in practice?

The core challenge is trigger discovery. A defender must systematically probe the model with candidate trigger strings and check for anomalous outputs, but the space of possible passwords is effectively unbounded. Practical audits use heuristic search over likely trigger patterns (common phrases, formatting tricks, special character sequences) combined with activation-cluster analysis when white-box access is available. The cost scales with both model parameter count and the number of candidate triggers tested, making it orders of magnitude more expensive than running any data-level filter over the training corpus.

Can fine-tuning on clean data after pre-training remove an implanted backdoor?

Fine-tuning on clean data may attenuate some backdoor behaviors, but subliminal triggers are designed to persist through gradient updates on unrelated data. If the trigger is encoded in weight patterns that the fine-tuning objective does not actively contradict, the backdoor can survive. Phantom Transfer’s model-agnostic property suggests the trigger representation is robust enough to survive distribution shifts between pre-training and fine-tuning data. Removing a backdoor with high confidence likely requires targeted unlearning techniques or surgical weight editing, not generic additional training on clean corpora.

sources · 1 cited

  1. Phantom Transfer: Data Poisoning can Survive Data-Level Defences primary accessed 2026-06-04