125 Targeted Wikipedia Edits Left a Detectable Signal in Llama Pretraining

A preprint revised to v2 on 2026-06-25 measures something most training-data audits assume away: whether a small, coordinated Wikipedia editing campaign leaves a detectable fingerprint in a pretrained language model’s expressed values. arXiv:2606.24890 finds that 125 sourced edits across 115 pages produced statistically significant attribution signals in both Llama 3.1 8B and Llama-3.2-1B on the targeted topic. The assumption that a handful of edits averages out across a large corpus doesn’t hold for high-weight sources.

What Did the Study Actually Measure?

The experiment starts with a natural subject: the Pro-Animal Wikipedians (PAW), a volunteer group that adds animal-welfare content to Wikipedia through normal editorial channels with citations. According to arXiv:2606.24890, PAW made 125 edits across 115 Wikipedia pages. The authors then applied gradient-based data attribution tools (MAGIC and Bergson) alongside retrieval attribution (TrackStar) to trace how those edits propagated through pretraining into model behavior.

This framing is not a data-poisoning story. PAW’s edits are visible, sourced, and content-plausible. They passed Wikipedia’s editorial review. The finding is that even legitimate, transparent edits to a high-weight training source produce measurable downstream influence on what a model outputs on the edited topic. From pretraining’s point of view, good-faith advocacy and adversarial manipulation go through the same pipeline.

How Did the Attribution Methodology Work?

The paper applies two complementary tools at different model scales. TrackStar is retrieval-based: given a query, it identifies which training documents correlate most strongly with the model’s response. MAGIC is counterfactual: it estimates which documents most causally affected a prediction by modeling their absence.
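The retrieval-style half of that description can be illustrated with a toy similarity ranking. This is a minimal sketch of the general idea only, not TrackStar's implementation: the document IDs, the three-dimensional vectors, and the function names are all hypothetical stand-ins for whatever per-document representation a real attribution tool computes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_training_docs(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature vectors, one per training document.
doc_vecs = {
    "edited_section_a": [0.9, 0.1, 0.0],
    "edited_section_b": [0.7, 0.3, 0.1],
    "unrelated_page": [0.0, 0.2, 0.9],
}
# Hypothetical feature vector for one query and the model's response to it.
query_vec = [1.0, 0.2, 0.0]

ranking = rank_training_docs(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A counterfactual method asks a different question (what changes if a document is removed from training), so a document can rank high here without being causally important.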

On Llama 3.1 8B, TrackStar found that PAW-edited sections made up 68% of the highest-attributed documents for animal-welfare queries (p<0.0001), per arXiv:2606.24890. For queries about the same companies on unrelated topics, PAW-edited sections sat at 52% (p=0.53), indistinguishable from chance. The model associates PAW content with the welfare topic specifically, not with the entities those pages happen to discuss. That topic-binding result is the cleaner finding: it’s not that PAW content dominated attribution in general, but that it dominated attribution on the particular subject it targeted.
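To see why a 68% share can be highly significant while a 52% share is not, it helps to compare both against a 50% chance rate. The sketch below assumes an exact two-sided binomial test and a sample of 100 top-attributed documents; the article does not state which test or sample size produced the reported p-values, so both are assumptions, as is the function name.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of every
    outcome that is no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))

n = 100  # hypothetical number of top-attributed documents inspected
p_welfare = binom_two_sided_p(68, n)    # 68% share on welfare queries
p_unrelated = binom_two_sided_p(52, n)  # 52% share on unrelated queries
print(f"68/{n}: p = {p_welfare:.5f}")
print(f"52/{n}: p = {p_unrelated:.2f}")
```

The exact p-values depend on the sample size, so these outputs will not match the paper's figures; the point is only the gap between the two cases.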

On Llama-3.2-1B, MAGIC ran across five random training-order seeds. The ten most influential documents for animal-welfare queries were all PAW edits in every seed, while for general queries only 4 to 6 of the top ten were PAW edits, which is the chance rate. The effect size was 6 to 30 times larger on welfare queries than on general queries, and leave-subset-out validation produced a Spearman rho of 1.00 across all 10 runs, according to arXiv:2606.24890. The signal doesn’t jitter across seeds or validation schemes.
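A Spearman rho of 1.00 means the predicted influence scores and the effects measured after retraining without each subset fall in exactly the same order. The sketch below computes the statistic from scratch; the two lists are hypothetical values chosen for illustration, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values: predicted influence of each held-out subset, and the
# loss change measured after retraining without that subset.
predicted = [0.02, 0.10, 0.35, 0.80]
measured = [0.01, 0.04, 0.90, 2.50]
rho = spearman_rho(predicted, measured)
print(rho)
```

Because the statistic uses ranks only, the magnitudes can disagree badly (as they do here) and rho is still 1.0; it certifies ordering, not calibration.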

What Do the Perplexity Results Show?

A model fine-tuned on PAW content reduced perplexity on animal-welfare text from 12.4 to 8.4. A control-trained model reduced perplexity on control text from 16.1 to 11.4, per arXiv:2606.24890. Perplexity here is a measure of how “at home” the model is with the text: lower means the model is more fluent and less surprised by it. The PAW-trained model became measurably more fluent on the targeted content.
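For readers who want the definition, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that formula on made-up token log-probabilities (hypothetical values) and then computes the relative reductions implied by the reported figures.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_reduction(before, after):
    """Fractional drop in perplexity from `before` to `after`."""
    return (before - after) / before

# Hypothetical natural-log token probabilities for a short passage.
example_logprobs = [-2.1, -0.4, -3.0, -1.2]
print(f"toy perplexity: {perplexity(example_logprobs):.2f}")

# Perplexity figures quoted in the article.
paw = relative_reduction(12.4, 8.4)
control = relative_reduction(16.1, 11.4)
print(f"PAW text: {paw:.1%}, control text: {control:.1%}")
```

In relative terms the two drops are close, about 32% on PAW text and 29% on control text, so these numbers show each model adapting to its own training content rather than a welfare-specific effect.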

One thing the paper does not report is a single “values shifted by N%” headline figure, and that’s an accurate representation of what the methodology can measure. Influence attribution doesn’t reduce to a scalar. What the paper does quantify (a 68% attribution share at p<0.0001, 10/10 top documents in every training-order seed, and a Spearman rho of 1.00) is robust on the tested models. How large the downstream behavioral effect is in deployment, across which query types and by what margin, remains unquantified.

Why Does Wikipedia’s Position in Training Pipelines Matter Here?

Wikipedia appears in nearly every major language-model training dataset and is weighted more heavily than web-crawled text, according to arXiv:2606.24890. Most training pipelines oversample Wikipedia directly. The corpus is a small fraction of the internet by raw volume, but that volume ratio is a poor guide to its per-token influence during training.

That gap is the mechanism. At lower per-token weight, 125 targeted edits would plausibly dissolve into the noise. Wikipedia’s elevated position in standard training pipelines is the amplifier that makes a modest coordinated campaign detectable rather than averaged away. Knowing that Wikipedia represents some small percentage of total training tokens is not the same as knowing its causal influence on model outputs on specific topics.
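The difference between raw share and sampled share is simple arithmetic. The sketch below assumes a simplified setup in which one source is repeated a fixed number of times while the rest of the corpus is seen once; the 0.5% raw share and the upsampling factors are hypothetical, and the paper does not report specific weights.

```python
def effective_share(raw_share, upsample_factor):
    """Share of sampled training tokens drawn from a source that makes up
    `raw_share` of the corpus and is repeated `upsample_factor` times,
    while everything else is seen once."""
    weighted = raw_share * upsample_factor
    return weighted / (weighted + (1.0 - raw_share))

# Hypothetical numbers, for illustration only.
for factor in (1, 3, 10):
    share = effective_share(0.005, factor)
    print(f"upsample x{factor}: {share:.2%} of sampled tokens")
```

Even this sampled share only describes exposure. It says nothing about topical concentration, which is the part attribution tools are needed to measure.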

What Do the Two Companion Papers Add?

Two companion preprints from the same author run different experiments on the same underlying question.

arXiv:2606.26104 isolates which linguistic properties carry the effect. On Llama-3.2-1B, 8 of 10 tested features produced statistically significant shifts in animal-welfare reasoning. Seven push models toward pro-animal-welfare outputs: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing. Two dilute the effect: hedged language and concrete sensory description. First-person perspective had no significant effect.

The practical profile is specific: confident moral framing with emotional vocabulary, described harm, and near-term temporal scope propagates through pretraining. Hedged or descriptive writing doesn’t. For anyone thinking about what kinds of Wikipedia edits are likely to influence a model, this is a reasonably precise fingerprint.

arXiv:2606.26102 asks what happens to values instilled in pretraining when standard post-training pipelines run. Helpfulness fine-tuning substantially degrades animal-compassion performance relative to coding training on the Animal Harm Benchmark: SFT drops it from 65.2% to 35.7%, and GRPO drops it from 32.0% to 18.7%. English moral reasoning degrades by 25.5 percentage points after helpfulness post-training (46.4% versus 71.9%), per arXiv:2606.26102.

The cross-lingual result cuts in two directions. The Animal Harm Benchmark gain from coding-focused fine-tuning (Magicoder) was 4.5x larger on non-English items than on English, suggesting compassion-related values encode into representations that generalize across languages. The moral-reasoning degradation did not generalize: multilingual MORU scores were 52.3% versus 51.2% (not significant). Values and domain-specific reasoning appear to be stored differently, and standard helpfulness post-training erodes whichever one it doesn’t optimize for.

What Does This Mean for Data-Provenance and Alignment Audits?

For alignment teams, the operative question isn’t whether PAW’s particular campaign succeeded. It’s that the methodology to detect the influence of 125 specific edits in a pretrained model now exists, and teams that don’t apply it are left with a blind spot. Aggregate corpus statistics (“Wikipedia is X% of our training data”) don’t answer the question this paper raises. Provenance auditing at the level of “which specific edits had the most causal influence on our model’s outputs on topic Y?” requires tools like MAGIC or TrackStar applied to the training process, not just corpus-level accounting.

The companion finding on post-training erosion (arXiv:2606.26102) adds a second problem. If helpfulness-oriented fine-tuning systematically degrades values instilled during pretraining, then pretraining audits and post-training evaluations can’t be run independently. A value axis that looks well-calibrated after pretraining may be substantially degraded before deployment, with no obvious signal in post-training loss curves to flag it. The 25.5 percentage-point drop in English moral reasoning is not a subtle effect.
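One way to couple the two stages is a fixed value-evaluation suite scored before and after each post-training step, with a gate on the drop. The sketch below is a minimal version of that check. The function name, the dictionary keys, and the 5-point threshold are hypothetical; the scores are the pairs quoted above from arXiv:2606.26102, used here as reference and post-helpfulness values.

```python
def value_regressions(before, after, max_drop_points=5.0):
    """Return {benchmark: drop} for every score that fell by more than
    `max_drop_points` percentage points between two pipeline stages."""
    flagged = {}
    for name, pre_score in before.items():
        drop = pre_score - after[name]
        if drop > max_drop_points:
            flagged[name] = round(drop, 1)
    return flagged

# Scores in percent, as quoted in the article; key names are hypothetical.
reference = {
    "animal_harm_sft": 65.2,
    "animal_harm_grpo": 32.0,
    "moru_english": 71.9,
}
after_helpfulness = {
    "animal_harm_sft": 35.7,
    "animal_harm_grpo": 18.7,
    "moru_english": 46.4,
}

flagged = value_regressions(reference, after_helpfulness)
for name, drop in flagged.items():
    print(f"{name}: dropped {drop} points")
```

A gate like this only catches axes that are already in the suite, so it depends on someone having decided in advance which values are worth tracking.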

Where Does the Methodology Fall Short?

Three constraints limit how far these results travel. First, model scale: the attribution studies ran on 1B and 8B parameter models. Whether the same attribution signals appear at 70B or larger, where training data is more diverse and any single source carries less per-token weight, is an open question the paper doesn’t address.

Second, topic specificity: animal welfare has relatively concentrated, clearly marked Wikipedia content, and the companion paper identifies a specific linguistic fingerprint that carries the effect. Value domains with subtler textual signatures, or where Wikipedia doesn’t have high-edit-density pages, may produce different attribution results. The paper demonstrates a mechanism; it doesn’t quantify how general that mechanism is across value axes.

Third, the behavioral gap: the paper measures perplexity and attribution rank, not deployment-time outputs. Those are diagnostic indicators of training influence, not outcome measures. The distance between “this edit influenced pretraining” and “this edit changed how the model responds to a specific user query” remains unquantified and is not a trivial gap to close.

The more unsettling implication sits outside the paper’s scope: the attribution tools work regardless of editorial intent. A coordinated good-faith campaign and an adversarial one both run through the same pretraining process, and MAGIC doesn’t distinguish between them. Wikipedia’s own documented systemic biases are a separate confound: the paper measures the incremental effect of PAW’s edits, not the baseline bias already present in Wikipedia before PAW began. Those two effects are layered, and a single study cannot separate them.

Frequently Asked Questions

Would the same attribution signal appear in models trained on datasets that mix Wikipedia with large amounts of proprietary text?

Probably weaker, but topic concentration is the critical variable, not raw corpus size. Wikipedia’s elevated influence in this study comes from its high per-token weight relative to web-crawled text. Models trained on datasets that heavily include proprietary sources would assign Wikipedia a lower relative weight. However, 125 edits concentrated on a single topic across 115 pages would still cluster attribution signals for queries on that topic if Wikipedia remains in the mix at any non-trivial weight, because the mechanism is topical density, not total edit count.

How does this differ from classical data-poisoning attacks, which also try to shift model behavior through training data?

The distinguishing factor is detector visibility. Classical poisoning typically introduces anomalous statistical patterns (outlier gradient norms, unusual n-gram distributions) that standard poisoning detectors flag. PAW-style edits pass those checks because they are editorially valid and stylistically consistent with Wikipedia norms. The training pipeline has no signal distinguishing coordinated good-faith advocacy from coordinated adversarial editing once both clear editorial review. That gap is what the paper exposes, not a flaw in Wikipedia’s own editorial process.

Can evaluation after post-training catch the value degradation the companion paper describes?

Detection is possible but requires deliberate tooling teams must add themselves. Running pre/post comparisons on a fixed value evaluation suite (MORU or the Animal Harm Benchmark, for example) after every post-training stage would surface the drop. Expanding reward models to include value-axis scoring could constrain the degradation during training rather than catching it afterward. Neither approach is standard in current post-training pipelines, and helpfulness fine-tuning loss curves do not surface the drop automatically.

Does the cross-lingual transfer of mid-trained values mean non-English deployments inherit value signals from English pretraining sources?

Yes, and the transfer is asymmetric in a specific way. The companion paper (2606.26102) found coding-focused fine-tuning produced Animal Harm Benchmark gains 4.5 times larger on non-English items than on English, suggesting value-relevant representations propagate across languages more robustly than domain-specific reasoning gains do. For deployment teams, that means unintentional value signals introduced through English high-weight sources are not naturally bounded to English-language queries. Language separation between corpus and product audience is not a reliable insulation mechanism.