Dataset Watermarks Fail to Trace Fine-Tuned AI Image Models, New Benchmark Finds

Dataset watermarks in training images were supposed to trace which fine-tuned diffusion model generated an image. A benchmark from Wang et al. (arXiv:2511.19316), updated in a major v2 revision on May 28, 2026, tests that premise and ships a removal method that claims to fully eliminate the marks. If the claim holds, any regulation that assumes post-hoc watermark traceability rests on an assumption that no longer holds.

What the benchmark measures

Wang et al. establish the first unified evaluation framework for dataset watermarking of fine-tuned diffusion models. The framework evaluates three properties:

Universality: whether a watermark survives across different model architectures.
Transmissibility: whether a watermark embedded in training images transfers through fine-tuning into generated outputs.
Robustness: whether the watermark persists after image-processing attacks.
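
As a rough illustration of how properties like these can be scored, the sketch below computes mean bit accuracy per test condition. It assumes a watermark is a fixed bit string and that some decoder has already recovered bits from each generated image. The function names, the bit strings, and the choice of bit accuracy as the metric are all assumptions made for illustration; none of them come from the paper or from MarkDiffusion.

```python
from statistics import mean


def bit_accuracy(embedded: str, decoded: str) -> float:
    """Fraction of positions where the decoded bits match the embedded bits."""
    if len(embedded) != len(decoded):
        raise ValueError("bit strings must have equal length")
    return sum(a == b for a, b in zip(embedded, decoded)) / len(embedded)


def score_property(embedded: str, decoded_by_condition: dict[str, list[str]]) -> dict[str, float]:
    """Mean bit accuracy per condition (an architecture, a pipeline stage, or an attack)."""
    return {
        condition: mean(bit_accuracy(embedded, bits) for bits in decoded)
        for condition, decoded in decoded_by_condition.items()
    }


# Hypothetical decoded bits, invented for this example (not benchmark data).
EMBEDDED = "10110010"
robustness_runs = {
    "jpeg_q75": ["10110010", "10110011"],
    "targeted_removal": ["01100101", "11010110"],
}
scores = score_property(EMBEDDED, robustness_runs)
```

With random bits, accuracy hovers around 0.5, so a score near that level means the decoder is effectively guessing. The same helper could score universality (conditions are architectures) or transmissibility (conditions are training images versus generated outputs).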

The benchmark finds that existing methods perform well on universality and transmissibility and show some robustness against common image-processing operations, per the paper’s abstract. The qualifier matters: “some robustness” against routine transformations is not the same as surviving a motivated adversary.

The removal method

The paper proposes a practical watermark removal method that, it claims, “fully eliminates” dataset watermarks from fine-tuned diffusion model outputs “without affecting the fine-tuning process itself.” Those quotes come from the paper’s abstract. The full v2 revision, which grew from roughly 6 MB to 24 MB (suggesting substantial additional experiments in the update), may contain granular success-rate or false-positive numbers, but the abstract does not surface them.

What the abstract does establish is directionally significant: a general-purpose removal tool exists, it targets dataset watermarks specifically (as distinct from output watermarking, which embeds marks at inference time), and it does not degrade fine-tuning quality. If removing the watermark also destroyed the model’s output quality, the attack would be self-defeating. The paper claims it doesn’t.

The MarkDiffusion toolkit, developed by THU-BPM, provides broader evaluation infrastructure: 11 generative watermarking algorithms (including Tree-Ring, ROBIN, Gaussian-Shading, PRC, and SEAL) and 33 evaluation tools covering detectability, robustness, and output quality. Its attack tools, DiffusionPurification and NeuralCodecCompression, test watermark survival under regeneration attacks. The toolkit is open-source and allows independent replication of the benchmark’s findings.

Lab robustness versus real-world threats

The benchmark draws a line between “common image processing operations” and “real-world threat scenarios,” and existing methods land on opposite sides. They show robustness against the former and fall short against the latter, per the paper. The removal method the paper ships was designed to break watermarks, not to test them under benign conditions. It falls squarely in the adversarial category.

This distinction matters for anyone evaluating watermark-based provenance for production use. A mechanism that survives JPEG compression but not a targeted removal pass protects against accidental degradation, not against an adversary trying to evade traceability.

What this means for regulation

Any regulatory framework that assumes dataset watermarks provide durable traceability through fine-tuning now has a specific counterexample to reckon with. The paper’s claim is narrower than “all watermarks are broken”: it applies to dataset watermarking as currently implemented, and it demonstrates removal without quality loss. That is still sufficient to undermine any policy that treats post-hoc watermark detection as a reliable provenance mechanism.

This next point is editorial inference, not a paper claim: if removal is as general as the benchmark suggests, the asymmetry between embedding cost and removal cost tilts against the defender. An attacker needs to run one removal pass; a regulator needs every watermark to survive every removal attempt. That is a poor basis for policy.
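
The asymmetry can be made concrete with a small calculation. The sketch below assumes each removal attempt independently leaves the watermark intact with some fixed probability. Both the independence assumption and the 0.9 figure are illustrative choices, not numbers from the paper.

```python
def survival_probability(p_survive_one: float, attempts: int) -> float:
    """Chance a watermark survives every one of `attempts` independent removal tries."""
    if not 0.0 <= p_survive_one <= 1.0:
        raise ValueError("p_survive_one must be a probability between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return p_survive_one ** attempts


# Illustrative only: a watermark that survives any single attempt 90% of the time.
for attempts in (1, 5, 10, 20):
    print(attempts, round(survival_probability(0.9, attempts), 3))
```

Under these assumptions the defender's odds shrink geometrically with every attempt, while the attacker only needs one success. If a removal method fully eliminates the mark, as the paper claims, the per-attempt survival probability is zero and a single pass is enough.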

C2PA: the upstream alternative, with its own survival problem

The natural alternative to post-hoc watermark forensics is capture-time provenance. C2PA (Coalition for Content Provenance and Authenticity) embeds cryptographically signed metadata at the point of capture or creation. As of May 2026, its steering committee includes Adobe, Arm, BBC, Intel, Microsoft, Sony, and Truepic. Adopters include GenAI providers (OpenAI, Google, Meta), camera manufacturers (Leica, Nikon, Sony, Canon), and news organizations (BBC, New York Times, Reuters).

Two properties are worth distinguishing:

C2PA verifies provenance, not truth. A manifest confirms who signed the file and that it has not been tampered with since signing. It does not confirm whether the content accurately represents reality.
The “AI Generated” flag is self-declaration, not detection. The creating tool voluntarily marks output as AI-generated. C2PA does not independently detect AI-generated content.

Both properties are architectural choices, not oversights. C2PA’s designers chose verifiable metadata over forensic detection, accepting reliance on honest self-labeling.
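
A toy model shows why a valid signature says nothing about truthfulness. The sketch below is not the C2PA format: real manifests use certificate-based signatures, whereas this stand-in uses an HMAC from the Python standard library purely to keep the example self-contained. The key, field names, and helper functions are hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-a-real-credential"  # hypothetical; stands in for a signer's credential


def sign_manifest(asset: bytes, claims: dict, key: bytes = SIGNING_KEY) -> dict:
    """Bind self-declared claims to the asset's hash and sign the result."""
    payload = {"asset_sha256": hashlib.sha256(asset).hexdigest(), "claims": claims}
    encoded = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(key, encoded, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_manifest(asset: bytes, manifest: dict, key: bytes = SIGNING_KEY) -> bool:
    """Check the signature and that the asset is unchanged since signing.

    Nothing here inspects whether the claims themselves are true.
    """
    encoded = json.dumps(manifest["payload"], sort_keys=True).encode()
    expected = hmac.new(key, encoded, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, manifest["signature"])
    asset_ok = manifest["payload"]["asset_sha256"] == hashlib.sha256(asset).hexdigest()
    return signature_ok and asset_ok


image = b"pixels produced by a generative model"
# The signing tool declares the flag itself; a dishonest tool can declare anything.
manifest = sign_manifest(image, {"ai_generated": False})
```

Here `verify_manifest(image, manifest)` succeeds even though the `ai_generated` claim is false, and it fails only if the asset bytes or the payload change after signing. That is the provenance-versus-truth distinction in miniature.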

The more immediate problem: as of May 2026, C2PA manifests do not survive uploads to most major social media platforms. Instagram, X, Facebook, and TikTok strip metadata, including manifests, during re-encoding, while LinkedIn and some news outlets preserve manifests. A provenance system that dissolves on upload to the platforms where most AI-generated images are actually distributed has a coverage gap that limits its practical value as a regulatory tool.

What practitioners should do

Neither approach works as a single-point solution. Dataset watermarks can be removed. C2PA manifests can be stripped. Practitioners building provenance pipelines should treat both as layers rather than alternatives:

Watermarks provide a probabilistic signal when they survive, but cannot serve as a deterministic provenance mechanism given the current state of removal attacks.
C2PA provides a verifiable chain when manifests survive upload, but cannot detect non-compliant tools and does not survive social-media re-encoding on most platforms.
Policy design should not assume either mechanism is tamper-proof. Any regulation that hinges on post-hoc traceability of AI-generated images needs to account for the possibility that the trace disappears before the image reaches its audience.
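
One way to encode the layered approach is to treat each signal as optional evidence and never read a missing signal as proof of anything. The sketch below is a hypothetical decision rule; the `Evidence` fields, the labels, and the 0.9 threshold are assumptions chosen for illustration, not an established standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    manifest_valid: Optional[bool]     # None means no manifest was found (possibly stripped)
    watermark_score: Optional[float]   # None means no detector ran; otherwise 0.0 to 1.0


def assess(evidence: Evidence, watermark_threshold: float = 0.9) -> str:
    """Combine both layers into a coarse provenance label."""
    if evidence.manifest_valid is False:
        return "tampered"  # a manifest exists but fails verification
    watermark_hit = (
        evidence.watermark_score is not None
        and evidence.watermark_score >= watermark_threshold
    )
    if evidence.manifest_valid and watermark_hit:
        return "corroborated"
    if evidence.manifest_valid:
        return "signed"
    if watermark_hit:
        return "watermark-only"
    # Absence of both signals is not evidence of a non-AI origin:
    # the manifest may have been stripped and the watermark removed.
    return "unknown"
```

The design choice that matters is the final branch: an image with no manifest and no detectable watermark is labeled `unknown`, never "clean", because both signals can disappear before the image reaches its audience.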

The benchmark makes a specific, verifiable claim: dataset watermarks are removable without degrading fine-tuning quality. The inference from there is straightforward. If provenance matters, it needs to happen earlier in the pipeline and survive further down the distribution chain than either current approach guarantees.

Frequently Asked Questions

Does the benchmark evaluate video or audio diffusion models, or only images?

The paper and MarkDiffusion toolkit focus on latent diffusion models for image generation. Whether the removal method transfers to video or audio diffusion has not been tested. Fine-tuning dynamics and watermark transmissibility differ across modalities, so the findings should not be assumed to generalize without separate evaluation.

How does the paper’s removal method differ from the attack tools already in MarkDiffusion?

MarkDiffusion ships two regeneration-based attacks (DiffusionPurification and NeuralCodecCompression) that re-process generated images to destroy output watermarks like Tree-Ring and Gaussian-Shading. The paper’s removal method targets a different stage: it strips dataset watermarks from the outputs of fine-tuned models without modifying the fine-tuning process itself. An adversary could chain both approaches, removing dataset-derived marks during generation and then applying a regeneration pass to eliminate any output-layer watermark.

How much image distribution does the C2PA upload gap affect?

Among platforms that strip C2PA metadata on upload (Instagram, X, Facebook, TikTok), combined daily image traffic runs into the billions of impressions. LinkedIn and a subset of news outlets preserve manifests, but these represent a small fraction of consumer image distribution. Creator-side tool adoption cannot close this gap on its own; the platforms receiving the most AI-generated image traffic would each need to implement C2PA-preserving upload pipelines for the standard to achieve practical coverage.

If both dataset and output watermarks have published removal attacks, does layering them still work?

MarkDiffusion’s attack suite already targets 11 output watermarking algorithms, including Tree-Ring, ROBIN, PRC, and SEAL. The paper adds dataset watermark removal to the attacker’s toolkit. Layering the two mechanisms raises the effort required (an adversary must apply both attacks), but each layer is independently breakable with published, open-source tooling. A defender relying on layered provenance should treat it as a speed bump rather than a barrier, and design monitoring around detecting removal attempts rather than assuming they will be prevented.