Can LLMs Leak Training Data? A New Test Splits Capacity From Intent

What PropMe measures: splitting “can” from “will”

When an auditor tests whether a language model memorizes training data, the result depends almost entirely on how hard the auditor tries. A targeted prefix attack, where the model is fed the first tokens of a known training document and asked to continue, will extract far more verbatim text than ordinary sampling ever would. Yet most memorization audits report a single number, collapsing these two regimes into one metric. A paper submitted to arXiv on June 4, 2026 argues that this conflation is not just methodologically sloppy; it produces compliance claims that are legally fragile.

The paper, titled “LLMs Can Leak Training Data But Do They Want To?”, introduces PropMe, a propensity-aware evaluation framework that treats memorization as two distinct questions: capability (can the model reproduce training data under adversarial conditions?) and propensity (does it do so under normal sampling?). The distinction matters because a model that almost never leaks text in casual use may still have the full extractable content latent in its weights, waiting for the right prompt. Current audits, by reporting only one of these measures, give a false sense of where the risk sits.
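To make the two questions concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a hypothetical `toy_generate` callable standing in for a model, a tiny in-memory corpus, and simplified definitions: capability is the fraction of training documents whose remainder is reproduced from a prefix prompt, and propensity is the fraction of ordinary-prompt samples that contain a long span copied from the corpus. The names, window sizes, and definitions are illustrative assumptions.

```python
from typing import Callable, List

# Hypothetical toy corpus standing in for real training data.
CORPUS = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "all happy families are alike each unhappy family is unhappy in its own way",
]

def toy_generate(prompt: str, sample: bool) -> str:
    """Stand-in for a model: completes a known prefix, otherwise answers blandly.
    The `sample` flag is accepted for interface parity but ignored by this toy."""
    for doc in CORPUS:
        if prompt and doc.startswith(prompt):
            return doc[len(prompt):]
    return " here is a generic answer with no copied text"

def capability_rate(docs: List[str], generate: Callable[[str, bool], str],
                    prefix_words: int = 5) -> float:
    """Fraction of documents whose suffix is reproduced when prompted with a prefix."""
    hits = 0
    for doc in docs:
        words = doc.split()
        prefix = " ".join(words[:prefix_words])
        suffix = " ".join(words[prefix_words:])
        if generate(prefix, False).strip().startswith(suffix):
            hits += 1
    return hits / len(docs)

def propensity_rate(prompts: List[str], generate: Callable[[str, bool], str],
                    docs: List[str], span_words: int = 6,
                    samples_per_prompt: int = 3) -> float:
    """Fraction of ordinary samples containing a span_words-long span found in the corpus."""
    leaked = 0
    total = 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            out = generate(prompt, True).split()
            total += 1
            spans = [" ".join(out[i:i + span_words])
                     for i in range(len(out) - span_words + 1)]
            if any(span in doc for span in spans for doc in docs):
                leaked += 1
    return leaked / total

if __name__ == "__main__":
    benign = ["tell me about foxes", "write a short poem"]
    print("capability:", capability_rate(CORPUS, toy_generate))
    print("propensity:", propensity_rate(benign, toy_generate, CORPUS))
```

The toy model is built to exaggerate the contrast: it leaks everything when given a prefix and nothing otherwise. A real model would fall somewhere in between on both measures, which is why reporting them separately is informative.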

The capability-propensity gap

The core empirical finding is straightforward: prefix-based capability attacks elicit substantially stronger memorization signals than non-adversarial evaluations, while propensity scores remain low overall. In other words, the models rarely volunteer training data unprompted, but they yield it readily when given the right coaxing.

This is not a surprise in principle. Extraction researchers have known for years that targeted prompts recover more than random sampling. The contribution of PropMe is making the gap a first-class measurement object rather than an incidental observation, and showing that it is consistent across the two models and two datasets tested. When an auditor reports a leakage rate based solely on greedy decoding or top-k sampling under benign conditions, the number reflects propensity, not capability. The worst-case extraction surface remains unmeasured and unreported.

SimpleTrace: tracing generations back to the corpus

To compute both metrics rigorously, the authors developed SimpleTrace, a lightweight tracing pipeline layered on infini-gram that deterministically attributes model generations to large-scale training corpora. SimpleTrace checks whether a given model output appears verbatim or near-verbatim in the training data, then derives three classes of metric: verbatim memorization, near-verbatim memorization, and propensity-transformed memorization scores.
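The attribution idea can be illustrated with a toy sketch. This is not SimpleTrace and does not use the infini-gram API: it assumes a small in-memory n-gram index, uses Python's standard `difflib.SequenceMatcher` as a stand-in for near-verbatim matching, and picks an arbitrary similarity threshold of 0.8. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

def build_ngram_index(docs: List[str], n: int) -> Dict[Tuple[str, ...], Set[int]]:
    """Map every n-word window in the corpus to the ids of documents containing it."""
    index: Dict[Tuple[str, ...], Set[int]] = {}
    for doc_id, doc in enumerate(docs):
        words = doc.split()
        for i in range(len(words) - n + 1):
            index.setdefault(tuple(words[i:i + n]), set()).add(doc_id)
    return index

def trace(output: str, docs: List[str], index: Dict[Tuple[str, ...], Set[int]],
          n: int, near_threshold: float = 0.8) -> Tuple[str, List[int]]:
    """Classify an output as 'verbatim', 'near-verbatim', or 'none'."""
    words = output.split()
    windows = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Exact lookup first: deterministic and cheap.
    for window in windows:
        if window in index:
            return "verbatim", sorted(index[window])
    # Fuzzy fallback: compare each output window with each corpus window.
    # This brute-force scan is only viable for a toy corpus.
    for window in windows:
        for doc_id, doc in enumerate(docs):
            doc_words = doc.split()
            for j in range(len(doc_words) - n + 1):
                candidate = tuple(doc_words[j:j + n])
                if SequenceMatcher(None, window, candidate).ratio() >= near_threshold:
                    return "near-verbatim", [doc_id]
    return "none", []

if __name__ == "__main__":
    docs = ["the quick brown fox jumps over the lazy dog"]
    idx = build_ngram_index(docs, 6)
    print(trace("he said the quick brown fox jumps over it", docs, idx, 6))
    print(trace("the quick red fox jumps over", docs, idx, 6))
```

Because the lookup is a pure function of the output and the indexed corpus, two teams running it on the same inputs get the same answer, which is the property the article describes as deterministic attribution.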

The pipeline is designed to be reproducible. Because both the models and the training corpora are fully open, any team can rerun the evaluation on the same inputs and verify the attribution. This is a practical advantage over extraction audits on proprietary models, where neither the training data nor the model weights are available for independent checking.

Continual pre-training as partial mitigation

One finding deserves attention from teams working on deduplication and unlearning strategies: DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization capability and reduced memorization propensity for the Common Pile dataset. The reduction occurs because later training emphasizes partially different data, partially overwriting the representations that stored the original training examples.

This is a qualified bright spot. The effect is real, but it comes from a specific lineage relationship (one model being a continuation of the other) and a specific data regime (shifted emphasis in the continued pre-training). It does not demonstrate that memorization can be surgically removed from an arbitrary model, and the authors do not claim it does. What it does suggest is that the training history of a model, not just its final weights, influences how extractable its memorized content is.

Why single-metric compliance claims are indefensible

The regulatory implications are where PropMe’s argument sharpens. Under GDPR, a data subject can request that a controller delete their personal data. Under copyright frameworks, rights holders can challenge unauthorized reproduction. In both cases, the question “does your model leak training data?” is central, and the answer has historically been a single leakage percentage drawn from a standard audit.

The authors explicitly argue that memorization audits should report both worst-case extractability and ordinary leakage propensity to give a comprehensive view of the phenomenon. A team that reports only propensity (the lower number) is not lying, but it is presenting a selectively favorable measure. The worst-case extraction surface, which the paper finds to be substantially larger, goes unaddressed.

Two concurrent findings reinforce the concern. CoEval (arXiv:2606.03650), another June 2026 paper, independently identifies benchmark contamination as a systemic problem: public benchmark items leak into pretraining and are recalled rather than solved, which means existing evaluation practices give a false sense of both capability and safety. And MIPO (arXiv:2603.19294), accepted at ICML 2026, shows that LLMs can self-improve using only intrinsic contrastive signals without external data, raising the possibility that models which exhibit low leakage propensity today may develop new extraction pathways as self-improvement techniques advance.

The legal risk is not hypothetical. If a litigant can demonstrate that a model has high capability for extraction (even if propensity is low), a compliance defense built on a single-metric audit starts to look like the defendant chose the most favorable test and ignored the rest.

What red-team and compliance leads should do with this

The paper does not propose a mitigation. It proposes a measurement framework, and the authors are explicit about that boundary. The actionable takeaway for practitioners is structural:

Adopt a two-metric reporting standard. Any memorization audit should report both capability (adversarial extraction under prefix attacks or similar targeted prompting) and propensity (leakage under ordinary sampling). A single number is no longer defensible.

Use open tracing where possible. SimpleTrace’s approach of attributing generations to known corpora is only feasible when the training data is available. For proprietary models, this is a gap the industry needs to close, either through voluntary transparency or regulatory requirement.

Do not treat low propensity as safety. The PropMe findings confirm what extraction researchers have long known: the gap between what a model can leak and what it will leak is wide and consistent. A low propensity score under current sampling conditions says nothing about what a determined adversary with prompt-engineering resources can extract, and it says nothing about what future self-improvement techniques might make possible.

Scope claims to what was tested. The PropMe evaluation covers two open models on two datasets. Until the framework is applied to frontier models, compliance teams should treat the findings as indicative of a structural problem in how audits are designed, not as a calibrated risk estimate for any specific production model.
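The two-metric reporting standard recommended above could be enforced in tooling with something like the following sketch. The `MemorizationReport` class and its field names are hypothetical, and the numbers in the usage example are placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemorizationReport:
    """Audit record that refuses to summarize unless both metrics are present."""
    model: str
    dataset: str
    capability: Optional[float]  # leakage rate under targeted prefix prompting
    propensity: Optional[float]  # leakage rate under ordinary sampling

    def summary(self) -> str:
        if self.capability is None or self.propensity is None:
            raise ValueError(
                "report both capability and propensity; a single metric is incomplete"
            )
        gap = self.capability - self.propensity
        return (
            f"{self.model} on {self.dataset}: "
            f"capability={self.capability:.3f}, "
            f"propensity={self.propensity:.3f}, gap={gap:.3f}"
        )

if __name__ == "__main__":
    # Placeholder values for illustration only.
    report = MemorizationReport("toy-model", "toy-corpus", 0.25, 0.01)
    print(report.summary())
```

Making the summary fail on a missing metric, rather than silently printing the one that exists, is a deliberate choice: it keeps a propensity-only result from circulating as if it were a complete audit.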

Frequently Asked Questions

Can PropMe be applied to closed models like GPT or Claude?

Not with SimpleTrace. The tracing pipeline requires the full training corpus to deterministically attribute model outputs back to source documents, which proprietary model makers do not publish. The two-metric framework itself (capability plus propensity) is model-agnostic, but the reproducible attribution step depends on open data access. Today, only models with both public weights and public corpora can be audited end to end.

How does SimpleTrace differ from Carlini-style extraction attacks?

Carlini-style methods start from the model: craft adversarial prompts and observe what leaks out. SimpleTrace works in the opposite direction. It indexes the training corpus with infini-gram (which can query trillions of tokens in sub-second time), then checks whether model outputs match known documents. This gives deterministic, reproducible attribution rather than probabilistic discovery, and it catches near-verbatim matches that pure extraction attacks might miss.

Could self-improvement techniques invalidate a low propensity score?

MIPO, accepted at ICML 2026, demonstrates that LLMs can refine their own outputs using only internal contrastive signals with no external data. A model that scores low on propensity today could, after self-improvement passes, reorganize its internal representations in ways that surface previously latent memorized content more readily under targeted prompting. Propensity scores measured now are a snapshot of current behavior, not a bound on future extractability.

What infrastructure does a team need to replicate PropMe’s methodology?

Three components: the target model weights, the complete training corpus in searchable form, and an n-gram index (infini-gram or equivalent) capable of querying trillions of tokens fast enough to score large sample sets. Building and hosting that index is the practical bottleneck. The paper’s two models were chosen partly because their training corpora (Common Pile and Dynaword) were already publicly indexed, removing that setup cost.

Does benchmark contamination also affect memorization audit accuracy?

CoEval (arXiv:2606.03650) found that public benchmark items leak into pretraining corpora and get recalled verbatim rather than genuinely solved. This contamination cuts both ways: it inflates capability scores on benchmarks, and it can distort memorization measurements if the audit prompts themselves appear in the training data. Any team running PropMe-style audits should verify their test prompts are not already part of the model’s pretraining mix.
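A simple way to screen audit prompts is to check them for n-gram overlap with the corpus before running the audit. The sketch below is one assumed approach, not a procedure from either paper: the helper names are hypothetical, the 5-word window is arbitrary, and a real corpus would need an external index rather than an in-memory set.

```python
from typing import List, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_prompts(prompts: List[str], corpus_docs: List[str],
                         n: int = 5) -> List[str]:
    """Return the prompts that share at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [p for p in prompts if ngrams(p, n) & corpus_grams]

if __name__ == "__main__":
    corpus = ["what is the capital of france the answer is paris"]
    prompts = [
        "What is the capital of France",
        "name a large river in asia please",
    ]
    print(contaminated_prompts(prompts, corpus))
```

Prompts shorter than the window produce no n-grams and are never flagged, so the window size should be chosen with the shortest audit prompt in mind.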