Ranking LLMs Side by Side Makes Their Dialect Bias Worse

A study posted to arXiv in May 2026 and slated for presentation at ACM FAccT 2026 demonstrates that the side-by-side evaluation format used in chatbot arenas and RLHF pipelines systematically amplifies language models’ bias against African American Vernacular English (AAVE). The bias gets worse, not better, when the model is explicitly told which dialect it is reading.

What the study found

The paper, “Side-by-side Comparison Amplifies Dialect Bias in Language Models,” tests a specific mechanism: whether presenting Standard American English (SAE) and AAVE text pairs side by side produces more biased judgments than evaluating each text in isolation. It does, and by a notable margin. Language models exhibit what the authors call “covert dialect bias,” meaning they assign negative stereotypical traits to AAVE speakers even without an explicit dialect label. When the two texts are presented together for comparison, this bias is amplified significantly.

The evaluation protocol itself is the variable. Same model, same text, different presentation format, different bias outcome.

Where pairwise comparison is the default

The side-by-side format is not a niche methodology. It is the default evaluation structure across the industry. Chatbot arenas rank models by pitting two outputs against each other and collecting human preference votes, typically aggregated via an Elo rating system. RLHF pipelines train reward models on pairwise human preference data, and those reward models then guide model behavior during reinforcement learning. Hiring algorithms, content moderation systems, and automated essay scorers frequently rank items by comparing them in pairs.
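To make the arena mechanics concrete, here is a minimal sketch of the standard Elo update applied to a single preference vote. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not values taken from any particular arena, and real leaderboards may aggregate votes differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise vote.

    k is the K-factor; 32 is an assumed, commonly used default.
    """
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two equally rated models, one vote for model A.
new_a, new_b = elo_update(1000.0, 1000.0, a_won=True)
print(new_a, new_b)
```

Every vote moves the ratings the same way regardless of why the voter preferred one output, so a vote driven by dialect rather than quality is indistinguishable from any other once it reaches the leaderboard.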

If the format itself amplifies bias, the fix is not a model-level patch. It is a protocol-level redesign.

Dialect labels make it worse

One of the paper’s more counterintuitive findings: explicitly labeling the dialect (telling the model “this is AAVE” versus “this is SAE”) does not reduce bias. It increases it. Given the resources major labs have poured into safety alignment, this is a surprising result. The assumption behind label-based debiasing is that awareness enables correction. In side-by-side settings, awareness appears to enable stereotyping.

Overt dialect bias persists even after safety-aligned finetuning. That suggests current alignment techniques do not address the problem, and that it calls for changes to the evaluation protocol in addition to, or instead of, model-level interventions.

Counterfactual fairness finetuning: helps alone, fails in pairs

The authors tested counterfactual fairness finetuning, a technique where models are trained to produce consistent judgments regardless of dialect. In isolation, it works: average dialect-bias disparities for stereotypical traits decrease. But when the same finetuned model is placed in a side-by-side comparison setting, the improvements do not consistently hold.

This is a specific instance of a general problem. A mitigation that passes a pointwise benchmark can fail under a contrastive evaluation, and most fairness audits today do not test both.

Independent evidence that pairwise evaluation is structurally fragile

The FAccT paper is not the only recent work identifying structural weaknesses in pairwise evaluation. Tripathi et al. (COLM 2025) found that pairwise LLM-as-a-judge protocols flip preferences in approximately 35% of cases when distractor features are embedded in the text, compared to only 9% for absolute (pointwise) scoring. The distractors were not dialect features; they were arbitrary injected signals. The finding generalizes the problem: the comparison format is vulnerable to confounding signals, and dialect is one such signal.
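A flip rate of this kind can be computed in a few lines. The sketch below is an assumption about how such a metric might be defined, not Tripathi et al.'s actual code: it counts how often a judge's preference on the same pair changes once a distractor is injected. The function names and all the data are hypothetical.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose preference label changed after perturbation."""
    if len(baseline) != len(perturbed):
        raise ValueError("both runs must cover the same items in the same order")
    if not baseline:
        raise ValueError("need at least one item")
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)


def pointwise_preference(score_a: float, score_b: float) -> str:
    """Derive a preference from two scores that were assigned independently."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Hypothetical judge outputs for four pairs, before and after adding a distractor.
pairwise_clean = ["A", "B", "A", "A"]
pairwise_distracted = ["A", "A", "A", "B"]

pointwise_clean = [pointwise_preference(a, b) for a, b in [(4, 3), (2, 5), (5, 4), (4, 2)]]
pointwise_distracted = [pointwise_preference(a, b) for a, b in [(4, 3), (2, 5), (4, 4), (4, 2)]]

print("pairwise flip rate:", flip_rate(pairwise_clean, pairwise_distracted))
print("pointwise flip rate:", flip_rate(pointwise_clean, pointwise_distracted))
```

Deriving the pointwise preference from two independent scores keeps both protocols on the same scale, so the two flip rates can be compared directly.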

Two independent papers, different methods, converging conclusion. Pairwise evaluation is sensitive to features that should be irrelevant to the judgment.

What practitioners should do

The immediate implication is that any team using pairwise evaluation for arenas, RLHF data collection, or ranking systems in hiring, content moderation, and admissions should audit whether their protocol encodes a standard-language prior. Specific actions:

Run both protocols. Evaluate the same outputs using pointwise absolute scoring and pairwise comparison. If the bias metrics diverge, with pairwise showing the larger gap, the comparison format itself is contributing to the bias.
Do not assume labels help. Dialect-aware prompting can backfire. Test it before deploying it.
Check the composition of preference datasets. RLHF reward models trained on data collected primarily from Standard English speakers will reproduce the evaluator pool’s dialect preferences. If the pool is not representative, neither is the reward model.
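One way to run the first check is to put both protocols on the same scale. The sketch below is a simplified audit under stated assumptions, not the paper's methodology: it takes matched SAE/AAVE pairs with the same content, converts the independent pointwise scores into an implied preference, and compares that to the preference observed side by side. `PairResult` and every value in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairResult:
    sae_score: float       # pointwise score for the SAE version, judged alone
    aave_score: float      # pointwise score for the AAVE version, judged alone
    pairwise_winner: str   # "sae", "aave", or "tie" from the side-by-side judgment


def sae_rate_pointwise(results: Sequence[PairResult]) -> float:
    """How often SAE scores higher when each text is judged alone (ties count half)."""
    credit = 0.0
    for r in results:
        if r.sae_score > r.aave_score:
            credit += 1.0
        elif r.sae_score == r.aave_score:
            credit += 0.5
    return credit / len(results)


def sae_rate_pairwise(results: Sequence[PairResult]) -> float:
    """How often SAE wins when both texts are shown together (ties count half)."""
    credit = {"sae": 1.0, "tie": 0.5, "aave": 0.0}
    return sum(credit[r.pairwise_winner] for r in results) / len(results)


def amplification(results: Sequence[PairResult]) -> float:
    """Positive values mean the side-by-side format favors SAE more than pointwise does."""
    return sae_rate_pairwise(results) - sae_rate_pointwise(results)


# Hypothetical audit data for four matched pairs.
audit = [
    PairResult(4, 4, "sae"),
    PairResult(4, 3, "sae"),
    PairResult(3, 4, "aave"),
    PairResult(5, 5, "sae"),
]
print(sae_rate_pointwise(audit), sae_rate_pairwise(audit), amplification(audit))
```

If the two versions of each pair really are equivalent in content, an unbiased judge should land near 0.5 on both rates. A real audit would also need enough pairs to separate a genuine gap from noise.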

Frequently Asked Questions

Does the amplification effect apply to longer-form text, like the multi-turn conversations most arenas actually evaluate?

The study tested only tweet-length SAE and AAVE pairs. Most arena evaluations and RLHF data collection use multi-turn dialogue or paragraph-length responses, where dialect markers may be more or less salient depending on context. Whether the contrastive amplification persists, weakens, or intensifies in longer text is an open question. Teams evaluating models on longer formats should run their own side-by-side versus pointwise comparison before assuming the finding transfers directly.

What does switching from pairwise to pointwise scoring cost an arena operator?

Pairwise comparison feeds directly into Elo rating systems, which are designed for head-to-head matches. Pointwise absolute scoring (rate each output independently on a fixed scale) produces absolute quality scores instead of relative rankings, so arena operators would need to redesign their leaderboard methodology. The Tripathi et al. finding that only 9% of preferences flipped under pointwise scoring (compared to approximately 35% under pairwise) suggests robustness gains, but pointwise judgments are harder to crowdsource consistently because absolute quality ratings demand more calibration than relative preference.
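As an illustration of what that calibration might involve, the sketch below standardizes each rater's scores before averaging them per model, so that a habitually harsh rater and a habitually generous one contribute on the same scale. This is one common normalization choice, offered as an assumption rather than something any arena is known to use. The function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def calibrated_leaderboard(ratings):
    """Rank models by mean per-rater z-score.

    ratings: iterable of (rater_id, model_id, raw_score) tuples.
    Returns a list of (model_id, mean_z) sorted from best to worst.
    """
    ratings = list(ratings)

    # Collect each rater's raw scores to estimate their personal scale.
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)
    stats = {r: (mean(s), pstdev(s)) for r, s in by_rater.items()}

    # Re-express every score relative to its rater's mean and spread.
    by_model = defaultdict(list)
    for rater, model, score in ratings:
        mu, sigma = stats[rater]
        z = (score - mu) / sigma if sigma > 0 else 0.0  # constant raters carry no signal
        by_model[model].append(z)

    return sorted(((m, mean(z)) for m, z in by_model.items()),
                  key=lambda item: item[1], reverse=True)


# Hypothetical ratings: rater r1 scores generously, rater r2 harshly.
votes = [("r1", "x", 5), ("r1", "y", 3), ("r2", "x", 2), ("r2", "y", 1)]
print(calibrated_leaderboard(votes))
```

Standardizing per rater only works when each rater scores several outputs. It also assumes raters see comparable mixes of models, which an arena would have to enforce through assignment.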

Does this mean earlier LLM dialect-bias studies understated the problem?

Prior work on LLM dialect bias typically evaluated outputs in isolation using pointwise benchmarks, a setup that the FAccT 2026 paper shows yields less biased results than side-by-side comparison. If those benchmarks were the basis for claiming a model was sufficiently fair on dialect, the actual bias in deployed pairwise contexts (arenas, hiring rankers, moderation systems) may be worse than published benchmarks indicated. The paper isolates the evaluation protocol as an independent variable, something earlier studies did not manipulate.

Would recruiting a more dialect-diverse annotator pool for RLHF data collection solve this?

A representative annotator pool addresses overt bias in the preference signal, but it does not fix the structural amplification that occurs when two outputs are compared directly. The model still receives a pair in which dialect features act as a confounding signal, and it can use those features as proxies for quality. The paper’s finding that counterfactual fairness finetuning (an explicit model-level intervention) does not hold up consistently in contrastive settings suggests that data-level fixes alone are insufficient. Both the model and the protocol need auditing.