Do Privacy Defenses Actually Protect Fine-Tuned LLMs? A New Benchmark

A new privacy benchmark for fine-tuned LLMs draws a line that most deployment teams have been quietly stepping over: passing an attack probe is not the same thing as having a privacy guarantee.

What the benchmark measures

The benchmark evaluates how well three common LLM adaptation strategies hold up under privacy attacks: LoRA (Low-Rank Adaptation), full fine-tuning, and prompt-based methods. The attack probes are membership inference, which tests whether a model’s confidence scores reveal whether a specific record was in its training set, and extraction attacks, which try to pull verbatim training data from model outputs.
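The membership-inference probe can be sketched in its simplest loss-threshold form. This is an illustration only, not the benchmark's actual method: the loss values are made-up numbers, and `attack_auc` and `rates_at_threshold` are hypothetical helper names. The idea is that records seen during fine-tuning tend to have lower loss, so an attacker guesses "member" when the loss is small.

```python
def attack_auc(member_losses, nonmember_losses):
    """Probability that a random member has lower loss than a random non-member.

    0.5 means loss alone does not separate members from non-members;
    values near 1.0 mean loss strongly reveals membership.
    """
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as a coin flip
    return wins / (len(member_losses) * len(nonmember_losses))


def rates_at_threshold(member_losses, nonmember_losses, threshold):
    """Guess 'member' whenever loss < threshold; return (TPR, FPR)."""
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    fpr = sum(loss < threshold for loss in nonmember_losses) / len(nonmember_losses)
    return tpr, fpr


# Illustrative per-record losses (assumed values, not measured from any model).
member_losses = [0.21, 0.35, 0.18, 0.40, 0.27]
nonmember_losses = [0.52, 0.33, 0.61, 0.47, 0.58]

auc = attack_auc(member_losses, nonmember_losses)
tpr, fpr = rates_at_threshold(member_losses, nonmember_losses, threshold=0.38)
print(f"AUC={auc:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

A real probe would obtain the losses by querying the fine-tuned model and would usually calibrate against reference models. The scoring step, though, reduces to a comparison like this one.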

The question is straightforward: if you fine-tune a model on customer data, how much of that data can an attacker recover? The answer the benchmark provides is useful, and also incomplete in a specific way that matters.

The gap between passing probes and privacy

An empirical test asks a concrete question: can this specific attack, run under these specific conditions, extract private information from this model? A formal privacy guarantee asks a different question: can you prove, mathematically, that the model’s output distribution is essentially the same whether or not any individual record was in the training set?

These are not the same. A model that survives a membership-inference probe has demonstrated resistance to one attack at one time with one set of assumptions. A model trained with differential privacy, where noise is injected into the training process according to a defined budget, carries a bound on how much any single training example can influence the output. The bound holds against attacks that have not been invented yet, because it is a property of the training procedure, not a score on a test.
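Noise injection under a budget is most commonly realized as a clipped, noised gradient step, the pattern known as DP-SGD. The sketch below assumes that pattern; the source does not say which mechanism any particular vendor uses. The function names are hypothetical, and the privacy accounting that converts the noise multiplier and the number of steps into an epsilon is deliberately left out.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One step: clip each example's gradient, sum, add Gaussian noise, average."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    noisy_mean = []
    for i in range(len(weights)):
        total = sum(g[i] for g in clipped)
        # Noise is scaled to the clipping bound, i.e. to the most any single
        # example can contribute to the summed gradient.
        total += rng.gauss(0.0, noise_multiplier * max_norm)
        noisy_mean.append(total / batch_size)
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients (assumed values)
updated = dp_sgd_step(weights, grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
```

Clipping caps each record's influence, and the noise hides whatever influence remains. That is why the resulting bound is a property of the procedure rather than of any particular attack.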

Empirical knowledge, by definition, is derived from observation and experiment rather than from abstract theory. That is exactly the distinction here. An empirical pass is observation. Differential privacy is theory with a proof attached.

The practical consequence: a team that fine-tunes on customer data, runs a battery of extraction probes, sees clean results, and signs off on deployment has tested the model against known attacks. They have not established a privacy bound. If a stronger attack appears next quarter, the empirical pass gives them no protection, no compliance argument, and no retrospective cover.

A structural analog from medical LLM safety

A June 2026 study on medical LLM stress testing demonstrates the same pattern in a different domain. Medical language models that looked uniformly safe under clean benchmark conditions diverged sharply under “narrative stress” probes, revealing hidden failure modes that standard accuracy scores missed entirely.

The study found that quantized models exhibited what the authors call “pseudonormalization”: low flip rates that masked functional collapse. The models looked stable. They were not. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction, even as surface-level accuracy held up.

One finding worth noting: an open-weight model in that study matched or exceeded proprietary alternatives on every safety dimension tested. Model provenance alone does not determine robustness under adversarial probing. Safety properties have to be established by evaluation, not inferred from the brand.

What vendors should now disclose

The benchmark makes the empirical-vs-formal distinction legible. Once the gap is named and measured, vendor privacy claims that say “tested against membership inference and extraction attacks” without specifying the threat model, the attack strength, or the absence of a formal DP bound start to look like what they are: partial answers to a question the customer asked in full.

Vendors shipping fine-tuned models on customer data should specify at minimum:

Which attack classes were tested (membership inference, extraction, inversion, others).
The attack strength and assumptions (black-box vs. white-box access, number of queries allowed, adversary’s auxiliary knowledge).
Whether a formal differential-privacy bound was applied during training, and if so, what epsilon and delta values were used.
Whether the evaluation was conducted by the vendor, the customer, or an independent third party.
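One way to make that checklist operational is to treat a privacy claim as a structured record and flag whatever is missing. The sketch below is hypothetical: `PrivacyClaim` and its field names are illustrative and do not come from the benchmark or from any standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrivacyClaim:
    """Hypothetical record of what a vendor discloses about a privacy evaluation."""
    attack_classes: List[str] = field(default_factory=list)
    access_model: Optional[str] = None         # e.g. "black-box" or "white-box"
    query_budget: Optional[int] = None
    auxiliary_knowledge: Optional[str] = None
    dp_applied: Optional[bool] = None
    epsilon: Optional[float] = None
    delta: Optional[float] = None
    evaluator: Optional[str] = None            # "vendor", "customer", "third-party"

    def missing_fields(self) -> List[str]:
        """Return the disclosures that are absent, in checklist order."""
        missing = []
        if not self.attack_classes:
            missing.append("attack_classes")
        for name in ("access_model", "query_budget", "auxiliary_knowledge"):
            if getattr(self, name) is None:
                missing.append(name)
        if self.dp_applied is None:
            missing.append("dp_applied")
        elif self.dp_applied:
            # A DP claim without its budget parameters is still unscoped.
            for name in ("epsilon", "delta"):
                if getattr(self, name) is None:
                    missing.append(name)
        if self.evaluator is None:
            missing.append("evaluator")
        return missing


vague = PrivacyClaim(attack_classes=["membership inference", "extraction"])
scoped = PrivacyClaim(
    attack_classes=["membership inference", "extraction"],
    access_model="black-box",
    query_budget=10_000,
    auxiliary_knowledge="none",
    dp_applied=True,
    epsilon=4.0,
    delta=1e-5,
    evaluator="third-party",
)
print(vague.missing_fields())
print(scoped.missing_fields())
```

The first claim mirrors "tested against membership inference and extraction attacks" with nothing else stated, and the check returns every field that would be needed to scope it.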

A claim that omits these details is not false. It is unscoped. In a compliance context, unscoped is close to useless.

The compliance cost of “we tested it”

Most enterprise AI agreements now include some form of data-processing language. When a customer asks “is my data private in your fine-tuned model?”, a vendor that ran extraction probes and saw no leakage can answer yes. A vendor that trained with differential privacy at a stated epsilon can also answer yes. Those are different answers to the same question, and the benchmark makes it possible to tell them apart.

For teams building internal compliance arguments, the benchmark raises the floor. A privacy assessment that cites empirical attack resistance without noting the absence of a formal bound is now documentably incomplete. The paper exists. The gap is published. Ignoring it is a choice, and choices get quoted in audit findings.

For regulators, the distinction is actionable. A standard that requires “privacy testing” without specifying what kind of testing invites the same pseudonormalization the medical LLM study documented: models that pass the test by construction rather than by genuine resistance.

What practitioners should ask

Before deploying a fine-tuned model on customer or user data, three questions are worth pressing:

Was the model trained with a formal differential-privacy mechanism? If yes, what are the privacy budget parameters? If no, the model has no provable privacy bound, regardless of what empirical tests show.
What attacks were tested, and under what assumptions? Black-box membership inference under a constrained query budget tells you something. White-box extraction with unlimited queries tells you something different. The threat model is the claim.
Who ran the evaluation, and can the methodology be reproduced? Vendor-run tests with unpublished parameters are testimonials, not evidence.

The benchmark does not solve the privacy problem for fine-tuned models. What it does is make it harder to pretend the problem has been solved by pointing at a test score. That is a modest contribution, and a useful one.

Frequently Asked Questions

Do LoRA adapters leak less training data than full fine-tuning because they update fewer parameters?

The intuition that fewer updated parameters means less risk is incomplete. LoRA freezes the base weights and trains small adapter matrices that typically amount to under one percent of the model’s total parameter count, but those adapters are trained on the fine-tuning corpus and can encode memorized sequences. Prior work on extraction from LoRA adapters has shown that the compressed representation does not prevent leakage; it changes the specific weight artifacts an attacker targets. The attack surface shifts from the full weight matrix to the adapter, but does not disappear.

What epsilon values do teams typically achieve when applying differential privacy to LLM fine-tuning, and what counts as strong?

In the DP literature, an epsilon below 1 is considered strong privacy. Most published LLM fine-tuning experiments with DP report epsilon between 3 and 10, which weakens the formal bound but preserves more model utility. The useful comparison is relative: a vendor disclosing epsilon of 4 is making a formally weaker guarantee than one at epsilon of 0.5, yet both are categorically stronger than a vendor with no epsilon at all who cites only empirical probe results.
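To make those epsilon values concrete, the hypothesis-testing view of differential privacy gives a worst-case ceiling on any membership-inference attack: at a given false-positive rate, the true-positive rate is bounded by a function of epsilon and delta. The sketch below works that arithmetic through. The function name, the 1% false-positive rate, and the delta of 1e-5 are illustrative assumptions.

```python
import math


def max_attack_tpr(epsilon, delta, fpr):
    """Upper bound on the true-positive rate of any membership test at a given FPR.

    From the hypothesis-testing characterization of (epsilon, delta)-DP:
      TPR <= e^eps * FPR + delta
      TPR <= 1 - e^(-eps) * (1 - delta - FPR)
    """
    bound_a = math.exp(epsilon) * fpr + delta
    bound_b = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    return max(0.0, min(1.0, bound_a, bound_b))


# Compare the ceilings for the epsilon values discussed in the text.
for eps in (0.5, 4.0, 10.0):
    ceiling = max_attack_tpr(eps, delta=1e-5, fpr=0.01)
    print(f"epsilon={eps}: attack TPR at 1% FPR is at most {ceiling:.4f}")
```

Under these assumptions, the ceiling is under 2% at epsilon 0.5, roughly 55% at epsilon 4, and effectively 100% at epsilon 10. These are worst-case bounds on the strongest possible attacker, not predictions of what a practical attack achieves. A model with no epsilon has no ceiling of this kind at all.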

Does quantizing a model after fine-tuning preserve its differential-privacy guarantee?

Not automatically, though often yes. Differential privacy is closed under post-processing: a transformation of the trained weights that does not touch the private training data again inherits the same epsilon and delta. Data-free post-training quantization (rounding weights from FP16 to INT8 or INT4) falls in that category. Quantization schemes that calibrate on private records do not, unless the calibration step is itself covered by the privacy accounting. For models without a DP bound, the medical LLM stress-test study’s pseudonormalization finding is the relevant caution: quantized models showed low flip rates that masked functional collapse, which suggests compression can suppress the confidence-surface variations that membership-inference probes measure, so attacks appear to fail even though memorized information remains in internal representations.

Can a model resist extraction attacks but still fail membership inference, or are the two correlated?

They test different leakage channels and can diverge. Extraction attacks try to recover verbatim training text from model outputs, exploiting memorization that surfaces in generated text. Membership inference uses the model’s confidence scores or loss values to determine whether a specific record was in the training set, without needing to reproduce it. A model trained with output-space noise might resist extraction while still exhibiting confidence patterns that betray membership. The benchmark evaluates both because passing one attack class does not guarantee passing the other.
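For contrast with loss-based membership inference, the scoring side of an extraction probe can be sketched as a verbatim-overlap check between generated text and known training records. This is a minimal sketch under stated assumptions: whitespace tokenization, a hypothetical four-token threshold, and made-up strings. It is not the benchmark's scoring method.

```python
def longest_shared_run(output_tokens, record_tokens):
    """Length of the longest contiguous token run present in both sequences."""
    best = 0
    prev = [0] * (len(record_tokens) + 1)
    for out_tok in output_tokens:
        curr = [0] * (len(record_tokens) + 1)
        for j, rec_tok in enumerate(record_tokens, start=1):
            if out_tok == rec_tok:
                curr[j] = prev[j - 1] + 1  # extend the run ending one token earlier
                best = max(best, curr[j])
        prev = curr
    return best


def flag_verbatim_leaks(outputs, records, min_tokens=4):
    """Return (output_index, record_index, run_length) for each suspicious overlap."""
    hits = []
    for i, output in enumerate(outputs):
        for k, record in enumerate(records):
            run = longest_shared_run(output.split(), record.split())
            if run >= min_tokens:
                hits.append((i, k, run))
    return hits


# Made-up strings for illustration only.
records = ["account number is 4417", "ship to 12 Elm Street"]
outputs = [
    "the customer account number is 4417 and it expires soon",
    "your order has been dispatched",
]
print(flag_verbatim_leaks(outputs, records))
```

A model can produce no hits here and still have separable member and non-member losses, which is the divergence the answer above describes.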