A Single RLHF Pass Can't Align an LLM to Every Online Community

The CARE framework benchmarks LLMs against 3,749 real Reddit reactions and finds community prompting does not close the realism gap, breaking the single-RLHF-pass assumption.

One-Size-Fits-All Alignment Misses How Communities Actually Speak

Standard LLM alignment assumes a single optimization target: helpful, harmless, and honest. That target is defined by aggregated human preferences, averaged across annotators and flattened into a reward model. A paper from USC researchers, published as arXiv preprint 2605.27388 in April 2026, tests what happens when that aligned output is held up against real Reddit communities, and finds that it speaks for none of them in particular.

CARE: Scoring LLMs Against Real Reddit Reactions

The paper introduces CARE (Community-Aware Reaction Evaluation), a framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news.

CARE moves beyond sentiment polarity. It characterizes reactions along two axes: illocutionary tone (what the utterance does, pragmatically) and underlying attitude (the stance behind it). The schema was validated through human-AI collaboration. The distinction matters because two communities can hold the same factual position while expressing it in radically different registers: weary resignation in one, sarcastic compliance in another, communal solidarity in a third.
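A minimal sketch of why the second axis matters, assuming a hypothetical `ReactionLabel` record. The field names and label strings below are illustrative (borrowed from the examples in this article), not the paper's actual annotation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionLabel:
    """Hypothetical label record; not the paper's published schema."""
    polarity: str  # coarse sentiment, e.g. "negative"
    tone: str      # illocutionary tone: what the utterance does
    attitude: str  # underlying attitude: the stance behind it


# Two reactions that share a position but differ in register.
reactions = [
    ReactionLabel("negative", "weary resignation", "accepts the policy"),
    ReactionLabel("negative", "sarcastic compliance", "accepts the policy"),
]

# Grouping by polarity alone merges the two reactions into one bucket.
by_polarity = {r.polarity for r in reactions}

# Grouping by (tone, attitude) keeps them apart.
by_tone_and_attitude = {(r.tone, r.attitude) for r in reactions}
```

Here `by_polarity` holds a single bucket while `by_tone_and_attitude` holds two, which is the distinction a polarity-only metric would lose.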

The Realism Gap: Why Community Prompting Doesn’t Help

The paper’s central finding is what the authors call a persistent “realism gap.” When explicitly prompted with community-specific context (e.g., “you are posting in r/nursing”), frontier LLMs fail to converge toward authentic community tone. The steering signal does not translate into higher simulation fidelity.

CARE quantifies this by measuring the distributional shift in tone and attitude when moving from a community-blind baseline to a community-informed simulation. The framework examines both instance-level fidelity (does a single generated reaction match a real one?) and distributional resemblance (does the overall output distribution resemble the community’s actual discourse patterns?). Neither improves reliably with community prompting.
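The distributional comparison can be sketched as follows. This is an illustration under stated assumptions, not the paper's published metric: it assumes every reaction already carries a tone label, it uses Jensen-Shannon divergence as the distance between label distributions, and the function names, tone categories, and sample counts are all hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Relative frequency per category, with add-one smoothing."""
    counts = Counter(labels)
    total = len(labels) + len(categories)
    return [(counts[c] + 1) / total for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def diagnostic_delta(real, blind, informed, categories):
    """Distance(real, blind) minus distance(real, informed).

    Positive: community prompting moved the output toward the community.
    Near zero or negative: prompting did not help.
    """
    r = label_distribution(real, categories)
    d_blind = js_divergence(r, label_distribution(blind, categories))
    d_informed = js_divergence(r, label_distribution(informed, categories))
    return d_blind - d_informed


# Made-up tone labels for one community (illustrative data only).
categories = ["resignation", "sarcasm", "solidarity"]
real = ["resignation"] * 6 + ["sarcasm"] * 3 + ["solidarity"] * 1
blind = ["solidarity"] * 8 + ["resignation"] * 2
informed = ["solidarity"] * 7 + ["resignation"] * 3

delta = diagnostic_delta(real, blind, informed, categories)
```

In this toy data the delta is positive but small: the community-informed output shifts only slightly toward the real distribution. A delta that stays near zero across communities would correspond to the realism gap described above.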

This is the finding that breaks the deployment assumption. If telling a model which community it should emulate does not make it sound like that community, then per-community system prompts are not a sufficient mitigation. The alignment encoded during RLHF does not contain the sociolinguistic granularity needed to switch registers on demand.

Frontier Models Show Divergent Signatures

The paper’s analysis identifies divergent behavioral signatures among the frontier models tested. The differences were structural: models didn’t just vary in how well they matched any given community, but in the pattern of mismatches across communities, suggesting that each model’s alignment training baked in its own set of sociolinguistic defaults.

This is consistent with what deployment teams already observe anecdotally: models have “personalities” that resist prompt-level override. CARE gives that observation a measurement framework and a corpus to test against.

What Epistemic Stance Transfer Adds

A complementary study complicates the picture further. arXiv 2511.17572, presented at EurIPS 2025, tested epistemic stance transfer by deleting event knowledge from aligned LLMs and measuring whether community-specific behavioral patterns persisted. They did. Even after aggressive fact removal, models maintained stable, community-aligned behavioral signatures.

This can be read two ways. On the optimistic reading, community alignment is robust: it generalizes beyond surface recall of specific posts or events. On the pessimistic reading, community-specific bias is baked into the model’s weights, not layered on top of neutral knowledge. It cannot be turned off by removing access to the source data, because the alignment process has already compressed community tone into the parameter structure.

Per-Community Evaluation Is Now on the Roadmap

The deployment implication is straightforward, if expensive. If alignment is community-relative rather than universal, then a single global RLHF run does not produce a model that is “aligned” in any absolute sense. It produces a model that is aligned to the aggregate preferences of its training annotators. That aggregate may be adequate for general-purpose chat, but it will misfire in any community whose tone, register, or pragmatic norms diverge from that average.

For teams shipping the same model across Reddit, Discord, enterprise Slack, and healthcare forums, this means per-community evaluation pipelines are not a luxury. CARE is a benchmark designed to expose where a model's output diverges from a community's norms. Its framework gives deployers a way to measure the gap between what their model produces and what a specific community actually sounds like.

The cost is real. Running community-specific evaluation across many communities (let alone thousands) requires inference budget, annotation infrastructure, and a definition of “good enough” that will differ by use case. The paper’s findings offer no shortcut: community prompting does not close the realism gap, so prompt-layer mitigation alone is not enough. The alignment has to be measured and, if necessary, re-tuned at the community level.

That is the second-order consequence: not that RLHF is broken, but that one RLHF pass is insufficient for any deployment that crosses community boundaries. The paper does not propose a fix. It proposes a measurement. For deployers, the measurement is already uncomfortable enough.

Frequently Asked Questions

Does CARE generalize beyond Reddit and English-language communities?

The CARE corpus is built exclusively from English-language Reddit posts about COVID-19, drawn from communities spanning four continents and ten thematic domains. Its realism-gap finding has not been tested on Discord threads, enterprise Slack channels, or non-English forums. Deployers working in those contexts should treat CARE’s diagnostic delta as a methodology to replicate with their own corpora rather than a transferable numerical result.

How does CARE differ from standard alignment benchmarks like TruthfulQA or MMLU?

TruthfulQA and MMLU evaluate factual correctness and reasoning competence against a single correct answer. CARE evaluates sociolinguistic fidelity against the distributional reaction patterns of specific communities. A model could score well on TruthfulQA while producing outputs that match no real community’s pragmatic register. CARE’s two-axis schema (illocutionary tone and underlying attitude) was co-validated with human annotators to capture distinctions like weary resignation versus sarcastic compliance, which sentiment polarity alone collapses into one bucket.

What would a per-community evaluation pipeline actually require?

At minimum: a reference corpus of authentic community reactions, annotation infrastructure for the tone-and-attitude schema, and inference budget to generate and score outputs across each target community. The CARE corpus itself used the top-5 upvoted comments from COVID-related Reddit posts with at least 10 comments, filtered to posts linking external news articles. Replicating this for 200 internal Slack channels would require building a comparable reaction corpus and running the diagnostic delta calculation per channel. The paper defines the measurement but does not ship tooling for it.
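The corpus-selection rule described above can be sketched as a simple filter. This is a minimal illustration, not the authors' tooling: the `Post` and `Comment` types, the `link_url` field, and `build_reference_corpus` are hypothetical names, a non-empty link stands in for "links an external news article", and the COVID topic filter is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Comment:
    text: str
    upvotes: int


@dataclass
class Post:
    title: str
    link_url: Optional[str]  # assumed: None means no external link
    comments: List[Comment] = field(default_factory=list)


def build_reference_corpus(
    posts: List[Post], min_comments: int = 10, top_k: int = 5
) -> List[Tuple[Post, List[Comment]]]:
    """Keep linked posts with enough comments, plus their top-k comments."""
    corpus = []
    for post in posts:
        if post.link_url is None:
            continue  # skip posts without an external link
        if len(post.comments) < min_comments:
            continue  # skip posts with too little discussion
        top = sorted(post.comments, key=lambda c: c.upvotes, reverse=True)
        corpus.append((post, top[:top_k]))
    return corpus


# Illustrative data: only the first post passes both filters.
posts = [
    Post("A", "https://example.com/news", [Comment(f"a{i}", i) for i in range(12)]),
    Post("B", None, [Comment(f"b{i}", i) for i in range(12)]),
    Post("C", "https://example.com/other", [Comment(f"c{i}", i) for i in range(3)]),
]
corpus = build_reference_corpus(posts)
```

For an internal Slack deployment, the same shape would apply per channel, with reactions or thread replies standing in for upvotes. That substitution is an assumption a team would need to validate for its own data.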

Could the epistemic stance transfer finding be a feature rather than a bug?

The EurIPS 2025 study found that community-aligned behavioral patterns persist even after aggressive deletion of the event knowledge that originally shaped them. If a healthcare model reliably adopts a measured, cautious register regardless of the specific topic, that stability could be desirable in clinical settings. The failure case is when that baked-in register conflicts with the norms of a community the deployer never intended to align to, and the model cannot be prompted away from it. Robustness for the intended community becomes rigidity for every other one.

sources · 3 cited

  1. Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities. Primary source, accessed 2026-05-29.
  2. Modeling Community Attitude through Reaction Tone (HTML full text). Primary source, accessed 2026-05-29.
  3. Community-Aligned Behavior Under Uncertainty: Evidence of Epistemic Stance Transfer in LLMs. Primary source, accessed 2026-05-29.