Researchers at the University of Murcia just published the first large-scale, matched comparison between LLM-generated and real audience reactions to online news. They paired 5,631 Spanish news articles with 58,555 actual reader responses from the Hatemedia dataset, then had five LLMs generate synthetic replies to the same articles. The finding: individually plausible comments still fail to reproduce the statistical shape of real public discourse. The gap is measurable, it narrows with fine-tuning, and it is not zero.
What the study measured
The paper, arXiv:2605.28598, submitted May 27, asks a specific question: if you show an LLM a news article and ask it to write a reader reaction, do the resulting comments look like what real humans actually wrote?
The study tested five LLMs under a shared experimental setting, including Qwen3 and Mistral7B. The researchers compared synthetic and real reactions across three dimensions: hate speech prevalence, sentiment distribution, and semantic alignment. Both off-the-shelf and fine-tuned configurations were evaluated. The matched design (same articles, same language, same corpus of real replies) is what makes the comparison useful. Most prior work on synthetic social content has relied on proxy measures or small-sample perceptual studies.
Three failure modes for off-the-shelf models
Without fine-tuning, all five models failed in consistent ways, according to the full paper.
First, they strongly underproduced hate speech. Real comment sections on Spanish news contain a measurable baseline of hateful content. The models generated far less of it, which means any synthetic crowd deployed against a real platform would read as too civil, a detectable distributional fingerprint.
Second, each model introduced its own sentiment bias. Some skewed positive, some skewed flat. None matched the sentiment distribution of the real reply corpus.
Third, the models remained distributionally distant on semantic alignment, meaning the topics, framing, and vocabulary of synthetic replies clustered differently from human ones. A synthetic comment might read as individually coherent, but a batch of them does not look like a real audience.
The paper’s summary, as quoted in a secondary analysis, is direct: “plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.”
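The three comparisons described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: the specific metrics (a signed hate-rate gap, Jensen-Shannon divergence over sentiment labels, cosine distance between embedding centroids) are stand-ins chosen for simplicity, and the function names, label set, and toy data are all hypothetical.

```python
import math
from collections import Counter

def hate_rate(labels):
    """Share of comments flagged hateful; labels are 0 or 1."""
    return sum(labels) / len(labels)

def distribution(labels, categories):
    """Relative frequency of each category, in a fixed order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def fidelity_report(real, synth, sentiments=("neg", "neu", "pos")):
    """Compare a synthetic reply population to a real one on three axes."""
    return {
        # Negative gap: synthetic crowd is too civil. Positive: it overshoots.
        "hate_gap": hate_rate(synth["hate"]) - hate_rate(real["hate"]),
        "sentiment_jsd": js_divergence(
            distribution(real["sentiment"], sentiments),
            distribution(synth["sentiment"], sentiments),
        ),
        "semantic_distance": cosine_distance(
            centroid(real["embedding"]), centroid(synth["embedding"])
        ),
    }

# Toy, made-up data purely for illustration (not from the study).
real = {
    "hate": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "sentiment": ["neg"] * 6 + ["neu"] * 3 + ["pos"] * 1,
    "embedding": [[1.0, 0.0], [0.0, 1.0]],
}
synth = {
    "hate": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "sentiment": ["neg"] * 2 + ["neu"] * 5 + ["pos"] * 3,
    "embedding": [[1.0, 0.0], [1.0, 0.0]],
}

report = fidelity_report(real, synth)
print(report)
```

The point of the sketch is that none of these numbers can be computed from a single comment. Each one is a property of the batch, which is why individually coherent replies can still score poorly.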
Fine-tuning narrows the gap, unevenly
Fine-tuning on same-domain data improved fidelity, but the improvements were model-specific and came with tradeoffs.
Qwen3 produced the most balanced approximation across all three dimensions. If you needed one model to generate a synthetic crowd that looked statistically reasonable on hate rates, sentiment spread, and semantic content, Qwen3 was the pick.
Mistral7B told a different story. It achieved the strongest sentiment and semantic alignment of any model tested, but overshot hate prevalence, generating more hateful content than real users actually produced. That is a notable failure mode: a model tuned to match discourse patterns ends up amplifying the worst one.
This aligns with prior findings. Nudo et al. documented a phenomenon they call “generative exaggeration”: a systematic amplification of salient traits beyond empirical baselines, where richer contextualization improves internal consistency but also amplifies polarization and harmful language. The Murcia results replicate this pattern in a different domain: locally plausible reactions can still distort the empirical structure of public discourse.
The astroturfing economics shift
The paper frames its results as a validation problem for social simulation. The practical implication is narrower and more urgent.
If fine-tuned models can produce individually plausible reader reactions, the cost of manufacturing apparent consensus on a news article drops to the cost of running inference. You do not need a troll farm. You need a script and a consumer GPU. A synthetic crowd that passes casual inspection does not need to survive statistical audit. It needs to survive a scrolling reader who reads two or three comments and forms an impression of what “people are saying.”
The CrowdLLM framework demonstrates the logical endpoint: synthetic digital populations combining LLM reasoning with generative models for visual identity, voice synthesis, and behavioral diversity.
Current bot-detection tooling was benchmarked against cruder adversaries: coordinated accounts posting identical text, overly regular activity patterns, predictable posting schedules. Those signatures are trivially avoidable with modern agents. The detection problem is now statistical: does a population of accounts look like a real audience on aggregate measures? The Murcia paper gives the first quantitative baseline for answering that question. The answer: off-the-shelf models fail, but fine-tuned models are getting closer.
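One way to make the "aggregate measures" question concrete is a two-proportion z-test on hate prevalence: does a suspect batch of comments contain hateful content at a rate consistent with a trusted baseline? The sketch below is an assumption about how such a check could look, not something the paper prescribes. The function names and all counts are hypothetical.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two proportions (pooled SE)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("degenerate samples: pooled rate is 0 or 1")
    return (p_a - p_b) / se

def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up counts: 20 of 1,000 suspect comments flagged hateful,
# against 80 of 1,000 in a trusted baseline sample.
z = two_proportion_z(20, 1000, 80, 1000)
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.2e}")
```

A negative z means the suspect batch is more civil than the baseline, and a positive z means it is more hateful. Using a two-sided p-value matters here because the study reports failures in both directions.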
What platform teams should benchmark against now
The paper’s three-dimensional evaluation framework (hate speech, sentiment, semantic alignment) is a practical starting point for trust-and-safety teams. Any detection system should be tested against:
- Models that underproduce hate speech (the default failure mode, but one that produces a detectable fingerprint of its own)
- Models that overshoot hate speech (the Mistral7B failure mode, which is more dangerous because it amplifies the very content that moderation teams are trying to catch)
- Models tuned on domain-specific corpora (the hardest adversary, and the one most likely to reflect what a motivated attacker would deploy)
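A small harness can show why all three adversary types belong in the test set. The sketch below is hypothetical throughout: the `Population` class, both detector functions, the baseline rate, the tolerance, and every hate rate are invented for illustration and do not come from the study.

```python
from dataclasses import dataclass

@dataclass
class Population:
    name: str
    hate_rate: float  # share of comments flagged hateful

def one_sided_detector(pop, baseline, tol):
    """Flags only crowds that look too civil."""
    return pop.hate_rate < baseline - tol

def two_sided_detector(pop, baseline, tol):
    """Flags crowds that deviate from the baseline in either direction."""
    return abs(pop.hate_rate - baseline) > tol

def run_benchmark(detector, populations, baseline, tol):
    """Map each adversary name to whether the detector flagged it."""
    return {p.name: detector(p, baseline, tol) for p in populations}

# Illustrative values only.
BASELINE, TOL = 0.08, 0.03
adversaries = [
    Population("underproducer", 0.01),  # off-the-shelf style: too civil
    Population("overshooter", 0.15),    # more hateful than real users
    Population("domain_tuned", 0.075),  # close to the real rate
]

one_sided = run_benchmark(one_sided_detector, adversaries, BASELINE, TOL)
two_sided = run_benchmark(two_sided_detector, adversaries, BASELINE, TOL)
print("one-sided:", one_sided)
print("two-sided:", two_sided)
```

With these toy numbers the one-sided check catches only the underproducer, the two-sided check also catches the overshooter, and neither catches the domain-tuned crowd on hate rate alone. That last case is the argument for also testing sentiment and semantic alignment rather than relying on a single dimension.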
The SocialLLM workshop at ICWSM 2026, held May 26, signaled that the research community is converging on this problem. The Murcia paper, submitted the day after the workshop, contributes the matched empirical benchmark those discussions lacked.
The durability question: the specific model rankings (Qwen3 most balanced, Mistral7B overshooting hate) will shift as new models ship. The methodological frame is what matters. Comparing synthetic populations against real ones on hate, sentiment, and semantic dimensions, and finding that individual plausibility does not imply distributional fidelity, is the result that will hold. Platform detection tooling built on any weaker assumption is betting against the data.
Frequently Asked Questions
Does the finding generalize to English platforms like Reddit or YouTube?
Not directly. All five models (Mistral7B, Mistral24B, Llama8B, Qwen3, GPT-OSS) were tested exclusively on Spanish-language news from the Hatemedia corpus. The three-dimensional evaluation framework transfers across languages, but the distributional baselines for hate prevalence, sentiment spread, and semantic clustering would need recalibration against an English-language matched dataset that does not yet exist in comparable form.
What did the SocialLLM workshop reveal that the Murcia paper doesn’t cover?
Zhijing Jin’s keynote reported that frontier models chose socially beneficial actions in only 62% of high-stakes multi-agent scenarios, pointing to cooperation failures that complement the Murcia team’s distributional fidelity problem. Maarten Sap framed the simulation-reality gap as a structural challenge for any system treating LLM output as a proxy for human populations. These findings address agent behavior in interactive settings, whereas the Murcia study isolates single-turn reaction generation.
What makes Mistral7B’s hate overshoot dangerous for detection teams?
Most detection systems flag hate speech as an anomaly to catch. Mistral7B’s fine-tuned configuration produces more hateful content than real users, which means a synthetic crowd could trigger moderation alerts while still passing distributional checks on sentiment and semantics. A detection pipeline calibrated to flag the ‘too civil’ fingerprint of off-the-shelf models would miss an adversary whose model overshoots in the opposite direction.
How does generative exaggeration affect models beyond Mistral7B?
Nudo et al. found that richer contextualization improves profile consistency but amplifies ideological extremity and toxicity across models, not just Mistral7B. The Murcia results replicate this pattern in a different domain, suggesting exaggeration is a structural property of fine-tuned generative models rather than a quirk of one architecture. Platform teams benchmarking against a single fine-tuned model should test at least one additional architecture to avoid blind spots.
Could a motivated attacker fine-tune on public platform data without labeled corpus access?
The Murcia study fine-tuned on same-domain data with matched article-reaction pairs and hate-speech labels from Hatemedia, representing best-case access. An attacker scraping public Reddit or X threads would get noisier, unlabeled data with no guaranteed article-comment pairing. The advantage of fine-tuned over off-the-shelf models would likely shrink under those conditions, though it would not disappear.