LLM Watermarking Without Quality Loss: The Non-Distortionary Approach

LLM watermarking has always come with a quality tax. Embed a signal strong enough to detect later, and you bend the token distribution enough to make the output worse. A new scheme called LUNA, published on arXiv last week, claims to break that trade-off: a detection AUROC of 0.9959 across six languages, with a mean absolute median perplexity shift of 0.045. Those figures are the authors’ abstract-level claims; the full paper’s robustness tables were not independently verified at the time of writing.

How LUNA works: part-of-speech entropy as the control knob

Most LLM watermarks treat every token the same. A fixed-strength signal is stamped into each sampling step, regardless of whether the token is a definite article (low entropy, easy to watermark without damage) or a rare adjective in a creative passage (high entropy, where any perturbation is visible).

LUNA (Linguistics-Aware Non-Distortionary LLM Watermarking), by Shinwoo Park, takes a different route. It estimates normalized next-tag entropy from part-of-speech contexts using an external corpus, then uses that entropy estimate to set the depth of a binary tournament sampler. Tokens in low-entropy POS contexts get deeper watermarking; tokens in high-entropy contexts get shallower embedding, or none at all.
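The entropy-to-depth idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's estimator: the bigram tag model, the toy tagged corpus, the function names `next_tag_entropy` and `tournament_depth`, the linear schedule, and `max_depth=4` are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def next_tag_entropy(tagged_sentences):
    """Normalized entropy of the next POS tag, conditioned on the current tag."""
    transitions = defaultdict(Counter)
    tagset = set()
    for tags in tagged_sentences:
        tagset.update(tags)
        for current, following in zip(tags, tags[1:]):
            transitions[current][following] += 1
    # Entropy of a uniform next-tag distribution, used as the normalizer.
    max_entropy = math.log2(len(tagset)) if len(tagset) > 1 else 1.0
    result = {}
    for tag, counts in transitions.items():
        total = sum(counts.values())
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        result[tag] = h / max_entropy
    return result

def tournament_depth(norm_entropy, max_depth=4):
    """Hypothetical linear schedule: deeper tournaments where entropy is low."""
    return round(max_depth * (1.0 - norm_entropy))

# Toy stand-in for the external tagged corpus (assumption).
corpus = [
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ["PRON", "VERB", "DET", "NOUN"],
    ["NOUN", "VERB", "ADJ"],
]
entropy = next_tag_entropy(corpus)
for tag in sorted(entropy):
    print(tag, round(entropy[tag], 3), tournament_depth(entropy[tag]))
```

In this toy corpus a noun is always followed by a verb, so the NOUN context has zero next-tag entropy and receives the full depth, while the more varied VERB context receives a shallower one. A schedule that reaches depth 0 corresponds to the "none at all" case.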

The detector reconstructs the same tournament schedule from the output text using a tokenizer, a tagger, and a secret key. No access to the generating LLM is required. This makes LUNA a model-free detection scheme, which matters operationally: a content platform running provenance checks doesn’t need to know which model produced the text or maintain inference-side hooks into it.
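To make the model-free property concrete, here is a toy keyed tournament sampler and a detector that needs only the tokens, a depth schedule, and the key. It is a sketch under assumptions: the HMAC-derived bits, the one-token context window, the uniform toy vocabulary, and the hard-coded `depths` list (standing in for whatever a tagger-derived schedule would produce) are illustrative choices, not details taken from the LUNA paper.

```python
import hashlib
import hmac
import math
import random

def g_value(key, context, token, layer):
    """Keyed pseudo-random bit for a (context, token, layer) triple."""
    message = "|".join([*context, token, str(layer)]).encode()
    return hmac.new(key, message, hashlib.sha256).digest()[0] & 1

def tournament_sample(probs, key, context, depth, rng):
    """Draw 2**depth candidates, then keep pairwise winners layer by layer."""
    tokens, weights = zip(*probs.items())
    pool = rng.choices(tokens, weights=weights, k=2 ** depth)
    for layer in range(depth):
        winners = []
        for a, b in zip(pool[0::2], pool[1::2]):
            ga = g_value(key, context, a, layer)
            gb = g_value(key, context, b, layer)
            if ga == gb:
                winners.append(rng.choice([a, b]))  # tie: pick at random
            else:
                winners.append(a if ga > gb else b)
        pool = winners
    return pool[0]

def detect(tokens, depths, key):
    """z-score of keyed bits over every (token, layer) slot in the schedule."""
    hits = slots = 0
    for i, (token, depth) in enumerate(zip(tokens, depths)):
        context = tuple(tokens[max(0, i - 1):i])  # previous token only
        for layer in range(depth):
            hits += g_value(key, context, token, layer)
            slots += 1
    if slots == 0:
        return 0.0
    return (hits - 0.5 * slots) / math.sqrt(0.25 * slots)

rng = random.Random(0)
vocab = {f"w{i}": 1 / 50 for i in range(50)}  # toy uniform next-token distribution
depths = [3, 1, 2, 0] * 50                    # stand-in for a tagger-derived schedule
key = b"secret-key"

text = []
for depth in depths:
    text.append(tournament_sample(vocab, key, tuple(text[-1:]), depth, rng))

z_right = detect(text, depths, key)           # schedule and key both match
z_wrong = detect(text, depths, b"other-key")  # bits look like fair coin flips
print(round(z_right, 2), round(z_wrong, 2))
```

The detector never touches `vocab` or any model probabilities. It recomputes the keyed bits from the text alone, which is the property the paragraph above describes.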

The benchmark numbers

Evaluated across six typologically diverse languages and two domains against eight primary baselines, LUNA reports an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across twelve settings (95% bootstrap interval [0.022, 0.073]), according to the arXiv abstract. The authors also report the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts among all tested methods, suggesting minimal impact on text diversity.
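For readers less familiar with the two statistics, the sketch below shows how an AUROC and a percentile bootstrap interval for a mean are typically computed. The abstract does not say which bootstrap variant the authors used, so the percentile method is an assumption, and every number in the example is a placeholder rather than a figure from the paper.

```python
import random

def auroc(pos_scores, neg_scores):
    """Chance that a watermarked score outranks an unwatermarked one (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder detector scores (not from the paper).
watermarked = [0.9, 0.8, 0.7]
unwatermarked = [0.1, 0.75, 0.2]
print(auroc(watermarked, unwatermarked))

# Placeholder per-setting absolute perplexity shifts, twelve settings (not from the paper).
shifts = [0.01 * k for k in range(1, 13)]
print(bootstrap_mean_ci(shifts))
```

With only twelve settings the resampled means vary a good deal, which is one reason an interval as wide as the reported [0.022, 0.073] around 0.045 is unsurprising.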

The headline comparison: LUNA is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching that regime in 9 of 12. No baseline reached it in more than 2. If those numbers hold under independent replication, the gap is not subtle.

Prior art: two approaches to the same trade-off

The idea of watermarking without output distortion is not new. The Unbiased Watermark (Hu et al., ICLR 2024 Spotlight) introduced zero-distortion watermarking via multiple watermark distributions keyed by a private key, using δ-reweight and γ-reweight methods. Detection uses log-likelihood-ratio and likelihood-agnostic tests. It was tested for robustness against temperature changes, Top-k sampling, input perturbation, model perturbation, and random substitution attacks, showing resilience to moderate modifications. No paraphrase or translation attack results were reported in available coverage.
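The zero-distortion property is easiest to see with δ-reweight. The sketch below assumes δ-reweight amounts to inverse-CDF sampling driven by a key-derived uniform number, so that each keyed draw is deterministic but the average over keys reproduces the original distribution. The helper names, the HMAC construction, and the three-token distribution are illustrative assumptions, not the authors' code.

```python
import hashlib
import hmac
from collections import Counter

def keyed_uniform(key, context):
    """Pseudo-random number in [0, 1) derived from the key and the preceding context."""
    digest = hmac.new(key, context.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48

def delta_reweight(probs, u):
    """Inverse-CDF sampling: return the token whose cumulative interval contains u."""
    cumulative = 0.0
    token = None
    for token, p in probs.items():
        cumulative += p
        if u < cumulative:
            return token
    return token  # guard against floating-point round-off at the upper edge

probs = {"the": 0.5, "a": 0.3, "this": 0.2}  # toy next-token distribution

# One keyed draw: fully determined by the key and the context.
print(delta_reweight(probs, keyed_uniform(b"secret-key", "previous tokens")))

# Sweeping u evenly over [0, 1) recovers the original probabilities exactly.
counts = Counter(delta_reweight(probs, (i + 0.5) / 1000) for i in range(1000))
print(counts)
```

Someone without the key sees draws that follow `probs`; someone with the key can recompute `u` and check whether each token is the one the key would have selected.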

WaterMax (Giboulot & Furon, Inria/IRISA) takes a fundamentally different approach: it generates multiple complete text candidates and selects the one with the lowest p-value. The LLM’s token distribution and sampling process remain entirely untouched. The trade-off is computational cost rather than output quality.
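The selection step can be sketched as follows. The candidate drafts here are hard-coded stand-ins for several independent samples from an unmodified LLM, and the keyed-bit binomial test is a deliberately simple substitute for WaterMax's actual detector, which this article does not describe. Function names and the key are hypothetical.

```python
import hashlib
import hmac
import math

def keyed_bit(secret, prev, token):
    """Keyed pseudo-random bit for a (previous token, token) pair."""
    return hmac.new(secret, f"{prev}|{token}".encode(), hashlib.sha256).digest()[0] & 1

def p_value(tokens, secret):
    """P(at least k ones among n fair coin flips), k being the observed keyed-bit count."""
    n = len(tokens)
    k = sum(keyed_bit(secret, prev, tok) for prev, tok in zip([""] + tokens, tokens))
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

def select_candidate(candidates, secret):
    """Keep the draft that looks most watermarked under the keyed test."""
    return min(candidates, key=lambda draft: p_value(draft, secret))

secret = b"secret-key"
candidates = [  # stand-ins for independent samples from the untouched model
    "the results were published last week".split(),
    "the findings came out a week ago".split(),
    "researchers released the results last week".split(),
    "the paper appeared online last week".split(),
]
chosen = select_candidate(candidates, secret)
print(" ".join(chosen), p_value(chosen, secret))
```

Every draft is an ordinary sample from the model, so quality is untouched. The cost shows up in the length of `candidates`: each extra draft is another full generation.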

WaterMax’s authors identify a structural limitation that applies to all distortion-free schemes: they rely on text entropy to carry the signal. Low-entropy outputs (formulaic text, code, boilerplate) are hard to watermark without boosting signal strength, and boosting signal strength degrades quality. This is the fundamental tension LUNA tries to sidestep with its POS-adaptive depth.
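A quick calculation shows why low-entropy text leaves so little room. The two distributions below are made up for illustration: one resembles a boilerplate context where a single token is nearly certain, the other an open-ended context with four equally likely continuations.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

boilerplate = [0.97, 0.01, 0.01, 0.01]  # one continuation dominates
open_ended = [0.25, 0.25, 0.25, 0.25]   # four equally likely continuations

print(round(shannon_entropy(boilerplate), 3))
print(shannon_entropy(open_ended))
```

A scheme that leaves the distribution untouched can only hide its signal in the choice among plausible tokens. In the boilerplate case that choice carries roughly a quarter of a bit per token, against two bits in the open-ended case.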

Method	Approach	Distortion	Detection	Key trade-off
Unbiased Watermark	Multi-distribution reweighting	Zero (by construction)	LLR / likelihood-agnostic	Limited to tested attack types
WaterMax	Multi-candidate selection	Zero (original distribution preserved)	p-value based	Computational overhead per generation
LUNA	POS-adaptive tournament depth	Near-zero (0.045 perplexity shift)	Model-free, key-based	Abstract-level claims; robustness untested under paraphrase

The gap nobody is testing: paraphrase and translation

Here is where the comparison gets uncomfortable. LUNA evaluates against substitution attacks and sampling perturbations, consistent with prior work. But the attack surface that matters for deployed provenance systems is not token-level noise. It is wholesale rewriting.

A paraphrase attack takes watermarked text and rewrites it through a second LLM pass, preserving semantics while replacing most tokens. A translation attack moves the text to another language and back. Both destroy the token-level watermark signal that schemes like LUNA depend on, because the detector matches against the original token sequence. The WaterMax authors’ observation about entropy dependency applies here in a different form: high-entropy rewriting destroys the signal regardless of how cleverly it was embedded.
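The washout effect is easy to reproduce with a toy token-level watermark. Everything in this sketch is an assumption made for illustration: `toy_embed` is a crude, distorting embedder (not LUNA's sampler), the vocabulary is synthetic, and `toy_paraphrase` models rewriting as random token replacement, which is far simpler than a real LLM paraphrase.

```python
import hashlib
import hmac
import math
import random

def keyed_bit(key, prev, token):
    """Keyed pseudo-random bit for a (previous token, token) pair."""
    return hmac.new(key, f"{prev}|{token}".encode(), hashlib.sha256).digest()[0] & 1

def z_score(tokens, key):
    """How far the keyed-bit count sits above the coin-flip expectation."""
    bits = [keyed_bit(key, prev, tok) for prev, tok in zip([""] + tokens, tokens)]
    n = len(bits)
    return (sum(bits) - 0.5 * n) / math.sqrt(0.25 * n)

def toy_embed(vocab, key, length, rng):
    """Crude embedder: among four random candidates, prefer one whose keyed bit is 1."""
    text = []
    for _ in range(length):
        prev = text[-1] if text else ""
        candidates = rng.sample(vocab, 4)
        winners = [t for t in candidates if keyed_bit(key, prev, t)]
        text.append(winners[0] if winners else candidates[0])
    return text

def toy_paraphrase(tokens, vocab, fraction, rng):
    """Replace roughly `fraction` of the tokens with random vocabulary items."""
    return [rng.choice(vocab) if rng.random() < fraction else t for t in tokens]

rng = random.Random(0)
key = b"secret-key"
vocab = [f"w{i}" for i in range(500)]
marked = toy_embed(vocab, key, 200, rng)

z_clean = z_score(marked, key)
z_half = z_score(toy_paraphrase(marked, vocab, 0.5, rng), key)
z_full = z_score(toy_paraphrase(marked, vocab, 1.0, rng), key)
print(round(z_clean, 2), round(z_half, 2), round(z_full, 2))
```

Replacing a token damages two slots at once, its own and the one that used it as context. That is why the score falls faster than the replacement fraction alone would suggest.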

No distortion-free watermark scheme in the published literature has demonstrated robustness under these attacks. LUNA does not claim to. The Unbiased Watermark’s available coverage does not report paraphrase results. This is not a flaw in LUNA’s design; it is an open problem that none of the competing schemes have solved either.

Why this matters for provenance infrastructure

For AI vendors and content platforms, the practical appeal of non-distortionary watermarking is straightforward. If watermarking degrades output quality, vendors face a choice between provenance and product. Most choose product. A scheme that preserves quality removes that choice.

LUNA’s multilingual evaluation across six languages is notable because most prior work tests on English only. If the numbers replicate, provenance tracking becomes viable in deployment contexts where it previously was not.

But the second-order consequence is less comfortable for the detection side. Existing AI-content detectors, particularly the statistical ones, lean partly on the artifacts that distortion-based watermarks introduce: slightly lower diversity, subtly shifted perplexity distributions, reduced lexical variety. If non-distortionary watermarks become standard, those artifacts disappear, and detectors lose a class of signals they have been implicitly relying on. Detection gets harder, not easier, even as provenance gets more reliable for participants who hold the key.

The distinction matters. Provenance and detection are different problems. LUNA improves provenance for keyed participants while potentially weakening detection for everyone else.
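One of the diversity artifacts mentioned above, reduced lexical variety, is commonly measured with Distinct-n, the share of n-grams in a text that are unique. The sketch below is a generic implementation with made-up example sentences; it is not tied to any particular detector or to the paper's evaluation code.

```python
def distinct_n(tokens, n=1):
    """Fraction of n-grams in `tokens` that are unique (Distinct-1 when n=1)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

varied = "the cat sat on the mat".split()
repetitive = "the cat the cat the cat".split()

print(distinct_n(varied))      # 5 unique words out of 6
print(distinct_n(repetitive))  # 2 unique words out of 6
```

A watermark that shifts this kind of statistic leaves a trace even for observers without the key. A watermark that leaves it unchanged does not.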

What to watch

Three things determine whether LUNA moves from a strong arXiv result to a deployable tool. First, independent replication of the benchmark figures against the full set of baselines. Second, robustness testing under paraphrase, translation, and cross-lingual attacks, which the current evaluation does not cover. Third, computational overhead: the POS tagging and entropy estimation step adds latency that matters at inference-serving throughput, and the abstract does not report wall-clock numbers.

The non-distortionary watermarking line of work (Unbiased Watermark in 2024, WaterMax, now LUNA) is converging on a shared insight: you can separate the watermark signal from output quality if you are willing to be selective about which tokens carry the signal. LUNA’s POS-adaptive approach is the most refined version of that idea so far. Whether refinement is enough depends on attack surfaces the field has not yet agreed to test.

Frequently Asked Questions

Does LUNA work as a wrapper around any LLM, or does it require changes to the model’s sampling code?

LUNA replaces the standard sampling step inside the LLM’s inference loop with its POS-adaptive binary tournament sampler, so generating watermarked text requires modifying the model’s sampling code. Detection is fully portable: any party with the output text, a matching tokenizer, a POS tagger, and the secret key can verify provenance without access to the original LLM. The invasive-generation half of this split is shared with the Unbiased Watermark, which also reweights the sampling distribution. WaterMax is the contrast case: it preserves the original sampling process entirely by selecting among multiple complete candidates instead.

Is LUNA the first non-distortionary watermark evaluated across multiple languages?

Based on the evaluations reported so far, yes. Prior non-distortionary schemes were tested primarily on English, and the Unbiased Watermark (ICLR 2024 Spotlight, from researchers at UMD, Pittsburgh, and Waterloo) and WaterMax both reported single-language evaluations. LUNA’s six-language, twelve-setting benchmark appears to be the first multilingual evaluation for this family. That matters because watermark detectability is sensitive to morphological complexity and tokenization granularity: a language with rich inflection (Finnish or Korean, for example) distributes entropy differently across POS categories than a language with comparatively little inflection (English), and LUNA’s POS-adaptive depth must handle both.

What content types fall outside LUNA’s demonstrated capability?

The published evaluation covers two domains across six natural languages but does not report results for code generation, structured data output (JSON, XML), or highly formulaic text where token entropy approaches zero. LUNA’s POS-aware depth strategy targets the low-entropy problem in natural language by going deeper on low-entropy POS contexts (determiners, prepositions) and shallower on high-entropy ones (adjectives, verbs). Code and structured formats lack the part-of-speech variability the scheme relies on, so the POS-adaptive knob has little to adjust.

What would an adversary actually need to do to strip a LUNA watermark?

A single paraphrase pass through a second LLM would likely destroy the token-level tournament signal, since the detector reconstructs the watermark from the specific token sequence. Round-trip translation works the same way. This is not unique to LUNA; no published distortion-free scheme has demonstrated that it survives either attack. The practical implication is that LUNA’s provenance guarantees are strongest against accidental or low-effort modification (copy-paste, truncation, light editing) and weakest against intentional rewriting by adversaries who know watermarking is in use and are willing to sacrifice exact wording to defeat detection.