Unlearning Isn't Deletion: arXiv 2505.16831 Shows Machine Unlearning in LLMs Is Reversible

Two independent studies confirm machine unlearning methods suppress outputs without erasing internal representations, making GDPR compliance claims unverifiable.

Two independent studies arriving weeks apart reach the same uncomfortable conclusion: current machine unlearning methods do not erase data from language models. They suppress it. The original behavior can be restored with minimal fine-tuning, and the standard benchmarks used to certify “forgetting” consistently fail to detect the gap. For any organization treating machine unlearning as a GDPR compliance mechanism, this is a structural problem, not an edge case.

Four Forgetting Regimes and the Reversibility Problem

Unlearning Isn’t Deletion (arXiv:2505.16831), updated to v3 on May 16 ahead of its ICML 2026 presentation, introduces a representation-level analysis framework that moves past output-level spot checks. Instead of asking whether the model still produces forgotten text, the authors ask whether the model’s internal representation of that data has actually changed.

Their toolkit includes PCA similarity and shift, centered kernel alignment (CKA), Fisher information, and a composite metric called mean PCA distance. Applied across multiple unlearning methods, data domains, and LLM architectures, the authors identify four distinct forgetting regimes organized along two axes: reversibility (can the original behavior be recovered?) and catastrophicity (does unlearning damage other capabilities?).

The four regimes break down as follows:

| Regime | What happens | Prevalence in current methods |
| --- | --- | --- |
| Reversible, non-catastrophic | Model appears to forget; information is trivially recoverable | Most common |
| Reversible, catastrophic | Forgetting succeeds but collateral damage to other capabilities is high | Observed |
| Irreversible, catastrophic | Information is erased alongside much of the model’s utility | Possible but undesirable |
| Irreversible, non-catastrophic | Targeted erasure without collateral damage | Exceptionally rare |

The regime everyone wants, irreversible non-catastrophic forgetting, is the one the paper finds hardest to achieve. Most deployed methods land in the first row: they pass output-level tests while leaving underlying representations largely intact.
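One way to read the grid is as two measurements per unlearning run: how much forgotten behavior comes back after a short relearning pass, and how much retained capability was lost. The sketch below is an illustration under stated assumptions, not the paper's procedure. The class and field names are hypothetical, accuracy is an assumed stand-in for whatever the authors measure, and both thresholds are arbitrary values picked for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnlearningOutcome:
    """Hypothetical summary of one unlearning run (all values in [0, 1])."""
    forget_acc_after_unlearning: float
    forget_acc_after_relearning: float
    retain_acc_before: float
    retain_acc_after: float


def classify_regime(outcome, recovery_threshold=0.5, utility_drop_threshold=0.1):
    """Place a run in the reversible/catastrophic grid.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Reversible: a short relearning pass brings back much of the forgotten behavior.
    recovered = (outcome.forget_acc_after_relearning
                 - outcome.forget_acc_after_unlearning)
    reversible = recovered >= recovery_threshold

    # Catastrophic: unlearning cost a large share of performance on retained data.
    utility_drop = outcome.retain_acc_before - outcome.retain_acc_after
    catastrophic = utility_drop >= utility_drop_threshold

    first = "reversible" if reversible else "irreversible"
    second = "catastrophic" if catastrophic else "non-catastrophic"
    return f"{first}, {second}"


# Made-up numbers: forgetting looks complete, then relearning restores it.
run = UnlearningOutcome(0.05, 0.90, 0.80, 0.79)
print(classify_regime(run))
```

A run like the one above passes an output-level check right after unlearning, and only the relearning measurement reveals which row of the table it belongs in.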

RULER Confirms the Gap From a Different Angle

The RULER paper (arXiv:2605.27569), submitted May 26, arrives at the same conclusion through an independent methodology. The authors test four approximate unlearning methods against the standard evaluation criteria used in the field: membership inference attacks, retain-set accuracy, and forget-set accuracy. All four methods pass these checks. By the metrics currently used to certify unlearning, the methods “work.”

RULER’s own representation-level metrics tell a different story. The paper’s M2 metric detects statistically significant residuals in 10 of 12 test conditions (p<0.05), with effect sizes growing as the forget fraction increases. The oracle-free M4 metric goes further, detecting identity-level memorization in face recognition models where no tested unlearning method fully erases the signal. Residuals appeared across tabular data, images, clinical text, and face-identity tasks. Even Bad Teacher, which uses a distinct forgetting mechanism, showed the same residuals.

Why Current Benchmarks Miss the Problem

The standard evaluation pipeline for machine unlearning checks two things: does the model still perform well on retained data, and does it stop producing forgotten data? Both are output-level probes. A model that has learned to avoid emitting a particular sequence while preserving the internal representation that encodes it will pass both tests handily.

This is not a subtle failure mode. It is the equivalent of testing whether a deleted file is gone by checking the filename in a directory listing without examining whether the data blocks are still on disk. The information is suppressed at the generation layer, not excised from the weight matrices.

Both papers demonstrate that representation-level metrics are necessary to close this gap. PCA distance, CKA, and Fisher information each capture different aspects of whether the model’s internal geometry has actually shifted. None of them are standard in current unlearning benchmarks.
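To make the representation-level idea concrete, here is a minimal sketch of linear CKA, the simplest variant of centered kernel alignment. It is not code from either paper: the function names are hypothetical, the choice of the linear kernel is an assumption (the authors may use a different variant), and the implementation is dependency-free Python written for readability rather than speed. It compares two activation matrices collected on the same inputs, for example one layer's activations before and after unlearning.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two row-major matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x, y: lists of rows, one row per input example, one column per hidden unit.
    The two matrices must cover the same examples in the same order, but may
    have different widths. The result lies in [0, 1].
    """
    if len(x) != len(y):
        raise ValueError("x and y must describe the same examples")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denominator == 0.0:
        raise ValueError("activations are constant; CKA is undefined")
    return numerator / denominator


# Toy activations: "after" is "before" rotated by 90 degrees and scaled by 3,
# so the geometry is unchanged even though every number differs.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
after = [[-3.0 * b, 3.0 * a] for a, b in before]
print(round(linear_cka(before, after), 6))
```

Linear CKA is invariant to rotation and uniform scaling of the activations, which is why the toy pair above scores as essentially identical. A score near 1 on forget-set inputs would suggest a layer's geometry has barely moved, whatever the model's outputs look like.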

The Narrow Exception and Why It Does Not Generalize

The Unlearning Isn’t Deletion authors report one case of seemingly irreversible targeted forgetting. The paper characterizes this as dependent on the specific data source and domain, and the overall finding is that current methods remain insufficient for trustworthy erasure.

One case in a controlled experiment is not a method you can ship. The gap between a laboratory finding of irreversibility and a deployment-ready erasure guarantee is wide enough that no practitioner should claim compliance based on it. The authors are explicit about this: relearning efficiency depends on the data source, and the single narrow exception does not establish a general mechanism.

Any vendor or platform that markets machine unlearning as a mechanism for GDPR right-to-erasure, copyright compliance, or data deletion is making a claim that these two papers show to be unverifiable today. The standard metrics used to certify unlearning systematically fail to detect retained information. Representation-level analysis, which would be necessary to verify genuine erasure, is not part of any compliance framework the authors describe.

The practical risk is adversarial recovery. If information persists in model representations, an adversary with access to model weights, or even query-level access combined with probing techniques, may be able to reconstruct the data targeted for erasure. The relearning experiments in the Unlearning Isn’t Deletion paper show that minimal fine-tuning suffices to restore original behavior. A motivated attacker does not need much.
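A relearning check of this kind can be summarized with very little code. The sketch below assumes you have already recorded forget-set accuracy after each fine-tuning step for two models: the unlearned model, and a reference model retrained without the forget data. The function names, the accuracy curves, and the idea of using a retrained reference as the baseline are assumptions for illustration, not the paper's exact protocol.

```python
def steps_to_recover(accuracy_curve, target):
    """Return the first fine-tuning step whose accuracy reaches target, else None.

    accuracy_curve[i] is forget-set accuracy after i relearning steps.
    """
    for step, accuracy in enumerate(accuracy_curve):
        if accuracy >= target:
            return step
    return None


def relearning_gap(unlearned_curve, retrained_curve, target):
    """Compare relearning speed of an unlearned model and a retrained reference.

    If the unlearned model reaches the target in far fewer steps than a model
    that never saw the data, that hints the information was suppressed rather
    than erased.
    """
    unlearned = steps_to_recover(unlearned_curve, target)
    baseline = steps_to_recover(retrained_curve, target)
    faster = unlearned is not None and (baseline is None or unlearned < baseline)
    return {
        "unlearned_steps": unlearned,
        "baseline_steps": baseline,
        "recovers_faster": faster,
    }


# Made-up curves for illustration only.
unlearned_curve = [0.05, 0.60, 0.92]
retrained_curve = [0.05, 0.10, 0.20, 0.40, 0.70, 0.91]
print(relearning_gap(unlearned_curve, retrained_curve, target=0.9))
```

The point of the reference curve is to separate "the model relearned the data quickly because the data is easy" from "the model relearned it quickly because it never really forgot."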

What a Trustworthy Verification Stack Would Require

Neither paper prescribes a production-ready alternative, but together they outline what one would need.

Representation-level metrics as standard. PCA distance, CKA, and Fisher information should supplement or replace accuracy and perplexity as the evaluation floor. RULER’s M2 and M4 metrics offer a starting framework.

Cross-domain testing. RULER’s results across tabular, image, clinical text, and face-identity data show that a method that appears to work on one modality may fail silently on another. Verification needs to be domain-specific, not one-size-fits-all.

Adversarial probing. If the threat model includes an adversary attempting recovery, evaluation must include active relearning attacks, not just passive output checks.

Statistical rigor over single-pass tests. RULER’s detection of residuals in 10 of 12 conditions at p<0.05 demonstrates that the gap is systematic, not anecdotal. Verification stacks need similar statistical power.
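As an illustration of what statistical rigor could look like in practice, here is a minimal one-sided permutation test. It is a generic technique, not RULER's M2 or M4 metric, whose definitions are not reproduced here. The function name and the example scores are hypothetical. The assumed setup is a per-example residual score (for instance, similarity between a model's activations and the original model's activations) computed once for the unlearned model and once for a reference model retrained without the forget data.

```python
import random
from statistics import fmean


def permutation_p_value(forget_scores, control_scores, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(forget_scores) > mean(control_scores).

    Returns an estimated p-value: the share of random relabelings whose mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = fmean(forget_scores) - fmean(control_scores)
    pooled = list(forget_scores) + list(control_scores)
    n_forget = len(forget_scores)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_forget]) - fmean(pooled[n_forget:])
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up residual scores for illustration only.
unlearned_scores = [0.80, 0.90, 0.85, 0.95, 0.90, 0.88, 0.92, 0.87]
reference_scores = [0.10, 0.20, 0.15, 0.05, 0.12, 0.18, 0.09, 0.11]
p = permutation_p_value(unlearned_scores, reference_scores, n_permutations=2000)
print(p < 0.05)
```

A permutation test makes no distributional assumptions about the scores, which suits residual metrics whose behavior is not well characterized. Its sensitivity still depends on how many examples are in each group, so very small forget sets will be harder to flag.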

The capability self-assessment literature (arXiv:2606.00251) identifies a related problem: LLMs systematically overestimate their competence across diverse model families and scales, and supervised fine-tuning for self-assessment degrades the very capabilities being assessed (reinforcement learning fared better). A model that cannot accurately report what it knows cannot credibly attest to what it has forgotten.

None of this is a reason to abandon machine unlearning research. It is a reason to stop treating current methods as sufficient for compliance, and to treat vendor claims of “our model has unlearned your data” with the skepticism that two independent papers now substantiate.

Frequently Asked Questions

Is this erasure gap specific to LLMs, or does it extend to other model architectures?

RULER tested face recognition models, clinical text classifiers, tabular data systems, and image models alongside LLMs, finding representation-level residuals in every modality. This broadens the compliance exposure well beyond chatbot vendors: any system processing biometric data (GDPR Article 9), health records, or financial data under a right-to-erasure obligation faces the same unverifiable erasure problem.

Why does Bad Teacher’s failure matter when it uses a completely different forgetting mechanism?

Most unlearning methods use gradient ascent or targeted fine-tuning to suppress outputs. Bad Teacher takes a distinct approach, yet shows the same representation-level residuals. Its failure alongside conventional techniques suggests the erasure gap is a structural property of how distributed representations encode information, not a deficiency of any particular forgetting algorithm.

What can a team actually do if they receive a legitimate data deletion request today?

Full retraining from a filtered dataset that excludes the requested data is the only approach that guarantees erasure. For frontier-scale models, the compute cost of retraining per individual request is prohibitive. Neither paper identifies any approximate method that survives representation-level verification, and the Unlearning Isn’t Deletion authors are explicit that their single narrow case of irreversible forgetting does not constitute a deployable mechanism.

Does the amount of data targeted for deletion affect how detectable the remaining information is?

RULER found that representation-level residuals grow with the forget fraction: larger deletion targets produce bigger detectable residuals. This creates a detection paradox for compliance. Individual right-to-erasure requests typically involve small forget fractions where residuals are smaller and harder to distinguish from noise, yet the data still persists in model weights. The scenario regulators care about most (per-user deletion) is precisely where current verification tools are least sensitive.

sources · 3 cited

  1. Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs. arXiv:2505.16831. Primary source, accessed 2026-06-02.
  2. RULER: Representation-Level Verification of Machine Unlearning. arXiv:2605.27569. Primary source, accessed 2026-06-02.
  3. Capability Self-Assessment: Teaching LLMs to Know Their Limits. arXiv:2606.00251. Analysis, accessed 2026-06-02.