Do LLM Judges Favor Their Own Output? A Sanity Check on Self-Preference

An LLM judge's self-preference claim holds only if the study fixed generation quality. Without that control, the bias number conflates real preference with a quality gap.

Any measurement of whether an LLM judge prefers its own output turns on one question: did the study hold generation quality fixed? Without that control, a self-preference signal is the sum of two effects the experimenter never separated and reported as one: genuine self-preference, and the legitimate quality gap between what each model produced.

What would a self-preference signal actually mean?

In its strong form, self-preference would mean that when an LLM ranks two candidate texts it gives a bonus to the one that resembles its own output, even when an external quality signal would rank that text lower. In its weak form it collapses to “LLMs prefer their own writing,” which is close to unfalsifiable and therefore close to useless. The strong form is the one worth arguing about, and it rests on a claim about mechanism.

Three mechanisms could plausibly produce a self-preference signal, and they are not the same thing. The first is stylistic affinity: the judge’s scoring rewards familiar phrasing, sentence rhythm, formatting tics, or hedge density, because those patterns sit inside the distribution it was trained and aligned on. The second is self-recognition, the model’s ability to identify text it generated. This is not metaphor. Work on self-generated-text recognition finetuning, submitted 4 June 2026, treats that ability as a discrete, tunable capability, demonstrated across GPT-4.1, Qwen2.5-32B-Instruct, and Seed-OSS-36B-Instruct. A judge that can discriminate its own style from a rival’s has the substrate to reward it, whether or not it actually does.

The third mechanism is the boring one and the easiest to forget: the judge’s model might just write better answers. That is not bias. Conflating it with bias is the central error this kind of measurement keeps tripping over, and it is where the generation-quality confound enters.

Why does self-preference matter for RLHF, leaderboards, and CI evals?

Self-preference matters because LLM judges, where they are deployed, sit on surfaces where a stylistic lean compounds. Three are worth tracing.

The first is preference learning. In RLHF, a reward model is trained to predict human preferences, and the LLM is optimized toward that reward via reinforcement learning, per the standard RLHF description. Where an LLM judge labels the preference pairs and leans toward a particular style, the fine-tune pulls the model toward that style regardless of what humans would actually prefer.

The second is benchmarking. Where a benchmark delegates scoring to an LLM judge rather than human raters or exact-match checks, a self-preference bias in the judge propagates straight into the published numbers, and a single judge’s style preferences can move a competitor’s standing by more than the gap between adjacent ranks.

The third is internal. Where a team runs LLM-as-judge inside its CI and eval suites to grade regression samples on every change, a judge that prefers the style of the model being shipped will systematically under-flag regressions that happen to match that style, which is precisely the failure mode those evals are built to catch.

These pipelines are also fragile in ways that compound the bias risk. DynamicPO, flagged as a DASFAA 2026 best paper and posted as v3 on 23 June 2026, documents a “preference optimization collapse” in which adding more negative samples degrades recommendation quality even as training loss keeps falling, attributing the collapse to gradient suppression by easily discriminable negatives. The lesson generalizes: preference pipelines are sensitive to the shape of their negative samples, and an LLM judge is exactly the component that decides which samples count as negative. A biased judge feeds a fragile objective.

Is the self-preference effect genuine, or an artifact of unfixed quality?

Any self-preference number should live or die on one question: did the study hold generation quality fixed?

If an experiment compares a judge’s ratings of its own generations against its ratings of a rival’s, and finds the judge prefers its own, the result is ambiguous unless the two sets of generations are known to be equal in quality. Without that control, the measured effect is the sum of two terms the study never separated: genuine self-preference, and the legitimate quality gap between what model A and model B actually produced.
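The two-term reading can be written down compactly. This is an illustrative sketch, not a formula taken from any cited study; the symbols $\Delta_{\text{obs}}$, $\beta_{\text{self}}$, and $\Delta_{q}$ are names introduced here for the observed preference margin, the genuine self-preference term, and the legitimate quality gap.

```latex
\Delta_{\text{obs}}
  = \underbrace{\beta_{\text{self}}}_{\text{genuine self-preference}}
  + \underbrace{\Delta_{q}}_{\text{legitimate quality gap}},
\qquad
\Delta_{q} \approx 0 \;\Rightarrow\; \Delta_{\text{obs}} \approx \beta_{\text{self}}
```

Under this reading, a study that reports only $\Delta_{\text{obs}}$ has one equation and two unknowns. Matching pairs on quality is what drives $\Delta_{q}$ toward zero and leaves the self-preference term identifiable.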

The practical consequence is that a single self-preference score is the wrong unit. A model that scores its own output higher might be doing so because it genuinely writes better on the test set, and relabeling that signal as bias would mean discarding real quality information. The way to tell the two apart is to pin quality independently, then report the self-preference residual conditional on quality, ideally bucketed. A large residual among quality-matched pairs reads very differently from a large residual driven by the judge correctly identifying its own stronger answers.
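A minimal sketch of the bucketed report, assuming each comparison already carries an independent quality gap (for example, from human gold labels) and a record of whether the judge picked its own answer. The `Comparison` type, the `tol` threshold, and the toy data are all hypothetical, chosen only to show the shape of the calculation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    # Independent quality of the judge's own answer minus the rival's.
    # Hypothetical field: assumed to come from a signal other than the judge.
    quality_gap: float
    judge_picked_own: bool


def bucket_label(gap: float, tol: float = 0.5) -> str:
    """Assign a comparison to a quality bucket; `tol` is an arbitrary cutoff."""
    if gap > tol:
        return "own_better"
    if gap < -tol:
        return "rival_better"
    return "matched"


def self_preference_by_bucket(comparisons, tol: float = 0.5) -> dict:
    """Return the rate at which the judge picked its own answer, per bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [own picks, total]
    for c in comparisons:
        bucket = bucket_label(c.quality_gap, tol)
        counts[bucket][0] += int(c.judge_picked_own)
        counts[bucket][1] += 1
    return {b: own / total for b, (own, total) in counts.items()}


# Toy data, invented for illustration only.
toy = [
    Comparison(0.0, True), Comparison(0.2, True),
    Comparison(-0.1, True), Comparison(0.3, False),   # matched
    Comparison(1.5, True), Comparison(2.0, True),     # own answer better
    Comparison(-1.2, True), Comparison(-2.0, False),  # rival answer better
]
rates = self_preference_by_bucket(toy)
```

In the `matched` bucket an unbiased judge would pick its own answer about half the time, so the distance of that bucket's rate from 0.5 is the residual of interest. High rates in `own_better` are the judge being right, not biased.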

The framing has to be settled before the number does. A reported self-preference rate is actionable only if it survives a quality-matched comparison. Absent the control, it measures two effects at once and reports one.

What must a credible self-preference study control for?

A credible study dismantles the confound in its experimental design, not in its headline. Six controls separate a real bias from a measurement artifact, and each is worth demanding from any such study.

First, quality must be pinned independently of the judge. The cleanest version is a human gold label on each comparison; a weaker version is an orthogonal automatic quality signal. The constraint is that the quality measure cannot itself be the judge under test, or you are back to a single term.

Second, generation provenance should be symmetrized. Generate answers with model A and model B, then judge each pair with A, with B, and with neutral third parties. A self-preference effect that holds across judge identity and direction is a different finding from one that appears only when A judges A.
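The symmetrized design can be sketched as a grid of trials. The function below only enumerates the cells; the model names and the `symmetrized_trials` helper are hypothetical, and the actual generation and judging calls are left out because they depend on whatever client a team uses.

```python
from itertools import product


def symmetrized_trials(generators, judges, prompts):
    """Enumerate every (judge, prompt) cell for a two-generator comparison.

    `generators` is a pair of model names whose answers are compared.
    `judges` should include both generators plus at least one neutral model.
    """
    gen_a, gen_b = generators
    return [
        {
            "judge": judge,
            "prompt": prompt,
            "candidates": (gen_a, gen_b),
            # True when the judge wrote one of the two candidates.
            "judge_is_author": judge in (gen_a, gen_b),
        }
        for judge, prompt in product(judges, prompts)
    ]


# Placeholder names, not real model identifiers.
trials = symmetrized_trials(
    generators=("model_a", "model_b"),
    judges=("model_a", "model_b", "neutral_c"),
    prompts=("p1", "p2"),
)
```

Filling each cell with a verdict lets the same pair be read three ways: A judging, B judging, and a neutral judge. The `judge_is_author` flag makes it easy to split the results along exactly that line.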

Third, length and format have to be normalized, because a judge’s “own style” often correlates with verbosity or markup density, and both are known to move LLM judge scores on their own.
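Before normalizing, it can help to check how often the judge simply picks the longer candidate. The sketch below is a diagnostic under stated assumptions, not a normalization method: `longer_pick_rate` is a hypothetical helper, lengths are whatever unit the team already counts (characters or tokens), and the sample rows are invented.

```python
def longer_pick_rate(pairs):
    """Share of length-unequal pairs in which the judge picked the longer answer.

    `pairs` is an iterable of (len_own, len_rival, picked_own) tuples.
    Pairs with equal lengths are skipped. Returns None if nothing remains.
    """
    picked_longer = 0
    total = 0
    for len_own, len_rival, picked_own in pairs:
        if len_own == len_rival:
            continue
        total += 1
        own_is_longer = len_own > len_rival
        # The judge picked the longer answer when its choice matches
        # whichever side is longer.
        picked_longer += int(picked_own == own_is_longer)
    return picked_longer / total if total else None


# Invented rows: (own length, rival length, judge picked own?)
sample = [(120, 80, True), (60, 90, False), (100, 100, True), (50, 70, True)]
rate = longer_pick_rate(sample)
```

A rate well above one half suggests that length alone is moving the verdicts, in which case an apparent self-preference may partly be a preference for verbosity.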

Fourth, provenance should be blinded. If the judge can infer which candidate is the model’s own from stylistic tells rather than from quality, the experiment is measuring recognition, the tunable capability the recognition-finetuning work documents, not necessarily preference.

Fifth, the residual should be reported by quality bucket, not aggregated into one number. The aggregate is exactly what lets a real, quality-conditional bias hide inside a quality-driven effect.

Sixth, the study should pre-declare what would count as no bias, so that a null result is publishable. The current incentive structure rewards finding bias, and that is itself a distortion worth naming.
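One way to make "what would count as no bias" concrete is to fix a significance level and a minimum effect size before looking at the data. The sketch below assumes quality-matched pairs, where an unbiased judge picks its own answer half the time, and uses an exact two-sided binomial test built from the standard library. The `NullCriterion` type and its default thresholds are illustrative assumptions, not recommendations from the source.

```python
from dataclasses import dataclass
from math import comb


def binom_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    the observed count k under the null rate p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(p for p in pmf if p <= threshold))


@dataclass(frozen=True)
class NullCriterion:
    alpha: float = 0.05       # pre-declared significance level (assumption)
    min_excess: float = 0.05  # smallest own-pick excess over 0.5 that matters


def verdict(own_wins: int, n: int, criterion: NullCriterion) -> str:
    """Apply the pre-declared criterion to quality-matched comparisons."""
    excess = own_wins / n - 0.5
    p_value = binom_two_sided_p(own_wins, n)
    if p_value < criterion.alpha and excess >= criterion.min_excess:
        return "bias"
    return "no bias detected"
```

Writing the criterion down as a frozen object before the run is the point: the thresholds cannot quietly move after the results arrive, and "no bias detected" becomes a reportable outcome, not a failed experiment.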

When are cross-model juries worth the cost?

A cross-model jury is worth its latency and token cost only after you have measured the quality confound on your own data and found the self-preference residual too large to ignore. Before that measurement, it is a remedy for a problem that has not been diagnosed.

The jury itself is simple in principle: instead of one judge, run three or four from different families and aggregate. It works, in the limited sense that averaging across judges with different style preferences dilutes any single model’s self-preference. It also roughly multiplies latency and token spend by the jury size, and it opens a new question of how to weight judges that disagree.
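A minimal sketch of the aggregation step, assuming each judge returns a single vote for candidate "A" or "B". Unweighted majority voting is an assumption made here for simplicity; the weighting question raised above is left open, and `jury_verdict` and the judge names are hypothetical.

```python
from collections import Counter


def jury_verdict(votes: dict) -> tuple:
    """Aggregate per-judge votes by unweighted majority.

    `votes` maps a judge name to the candidate it preferred.
    Returns (winner or "tie", share of judges behind the top choice).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes.values()).most_common()
    top_share = ranked[0][1] / len(votes)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie", top_share
    return ranked[0][0], top_share


# Placeholder judge names from three hypothetical model families.
winner, agreement = jury_verdict(
    {"judge_x": "A", "judge_y": "A", "judge_z": "B"}
)
```

Returning the agreement share alongside the winner keeps disagreement visible, so a narrow split is not reported with the same confidence as a unanimous one.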

Measuring before buying matters because the jury is a blunt instrument. It smooths self-preference, but it also smooths away the cases where one judge is correct and the majority is wrong, and it does nothing about the deeper fragility of preference pipelines that fail the way DynamicPO’s gradient-suppression collapse describes. A jury hedges one specific distortion; it is not a substitute for knowing which distortion you actually have.

The cleaner long-term move is to stop treating judge bias as a scalar. Score quality by an independent signal, report preference residuals conditional on that quality, and reserve the jury for the bucket where the residual refuses to disappear. That is more work than running four judges and averaging. It is also the only version that survives a skeptical read.

Frequently Asked Questions

Where does blinding the judge’s own output actually fail?

Blinding fails wherever the judge retains self-recognition, and the recognition-finetuning work shows that capability can be tuned up, not just detected. The same work reports that corrupting self-recognition exacerbates misalignment and that removing the identity-bearing system prompt dampens the effect. A judge whose recognition has been sharpened for identity stability can therefore defeat provenance masking by reading phrasing and formatting tells.

Does DynamicPO’s preference-collapse fix also cure self-preference?

No, because they sit on opposite sides of the pipeline. DynamicPO’s remedies, Dynamic Boundary Negative Selection and Dual-Margin Dynamic beta Adjustment, reweight which negatives dominate the loss and fix the gradient-suppression collapse on the data side. A judge that favors its own style is a scoring-side distortion, and nothing in DynamicPO’s negative-selection reweighting touches how that judge scores a pair.

Could safety work on emergent misalignment make self-preference worse?

It could. The recognition-finetuning work reframes emergent misalignment as destabilization of a model’s aligned character rather than adoption of a coherent misaligned persona, and finds that misalignment finetuning injects diversity into the model’s identity self-reports. If safety teams tune self-recognition higher to re-stabilize that character, the same substrate lets a judge pick out its own generations more reliably, widening the residual the quality control has to absorb.

Does the quality-control argument reach LLM judges used for safety classification?

It reaches them, but the binding constraint moves. The first control requires pinning quality independently of the judge, and for safety judgments that gold label is itself contested in a way a preference label is not. Teams that accept a safety judge without a quality-matched check usually do so because the independent signal is too expensive to produce for every comparison, not because the confound does not apply.

What is the cheapest check to run before paying for a jury?

Symmetrize provenance on a small slice and read the direction. Generate answers with model A and model B, then judge each pair with A, with B, and with one neutral third party. If B and the neutral judge also prefer A’s output, the original signal was quality-driven and a jury is unnecessary. If the preference for A’s output appears only when A is the judge, that is the residual worth spending jury cost on.

sources · 3 cited

  1. Large language model en.m.wikipedia.org community accessed 2026-06-25