Can a language model conditioned on demographic traits stand in for a human survey respondent? Two 2026 papers published a month apart arrive at seemingly opposite conclusions. A WWW 2026 study evaluating two open-weight chat models across 70,000-plus World Values Survey respondent-item instances found that standard persona prompting often degrades alignment with real human responses. An arXiv paper by Ruoxi Su and colleagues, submitted May 28, 2026, reports that adaptive interviewing of LLMs can improve decision alignment, but only selectively, and only when the model grounds its answers in interview-derived evidence rather than generic persona stereotypes.
The counter-evidence: persona prompting is not neutral
The WWW 2026 paper tested a straightforward premise: assign a demographic persona to an LLM, ask it survey questions, and see whether its simulated answers match the distribution of real respondents sharing those demographics. The results were not kind to the idea. Across more than 70,000 respondent-item instances drawn from the World Values Survey, persona prompting did not produce a clear aggregate improvement over baseline prompting without demographic conditioning, and on many items it significantly degraded performance.
The paper introduced a concept it calls “subgroup fidelity” to evaluate how well simulated responses preserve demographic group patterns. The finding: demographic conditioning can redistribute error in ways that are actively misleading. Most survey items showed minimal change, but a small subset of questions, and underrepresented subgroups within the data, experienced disproportionate distortions. A team running persona-conditioned simulations could look at aggregate accuracy, see modest numbers, and conclude the method is roughly workable, while the specific populations they most need to represent are the ones being distorted the worst.
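To make the masking effect concrete, here is a minimal sketch that compares aggregate accuracy with per-subgroup accuracy under two prompting conditions. The records, the subgroup labels, and the function name `accuracy_by_subgroup` are hypothetical illustrations, not data or code from the WWW 2026 paper.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """Return (aggregate_accuracy, {subgroup: accuracy}).

    Each record is a (subgroup, human_answer, simulated_answer) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subgroup, human, simulated in records:
        totals[subgroup] += 1
        hits[subgroup] += int(human == simulated)
    aggregate = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return aggregate, per_group


# Invented toy data: 8 respondents from a large subgroup, 2 from a small one.
baseline = (
    [("majority", "a", "a")] * 5 + [("majority", "a", "b")] * 3
    + [("small", "a", "a")] * 1 + [("small", "a", "b")] * 1
)
persona = (
    [("majority", "a", "a")] * 6 + [("majority", "a", "b")] * 2
    + [("small", "a", "b")] * 2
)

base_agg, base_groups = accuracy_by_subgroup(baseline)
pers_agg, pers_groups = accuracy_by_subgroup(persona)
print(base_agg, base_groups)
print(pers_agg, pers_groups)
```

In this toy case the aggregate accuracy is identical under both conditions, while accuracy for the small subgroup falls from one in two to zero. A report that shows only the aggregate would hide that shift entirely.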
The adaptive interview: selective grounding, not uniform improvement
Ruoxi Su et al. take a different approach. Rather than assigning a static demographic persona and asking the model to answer as that persona, they run a three-stage adaptive interview:
- Core questions establish baseline demographic and attitudinal information, similar to what any survey screener collects.
- Dynamic follow-ups probe responses from the core stage: the model asks the simulated participant to explain or elaborate on prior answers, generating user-specific justification traces.
- Personality summary synthesizes the interview into a profile the model can reference when simulating decisions.
The full interview transcript is then used to condition the LLM when it simulates the participant’s decisions in moral dilemma scenarios.
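The three stages and the conditioning step can be sketched as a small pipeline. This is an illustration under stated assumptions, not Su et al.'s implementation: the `Responder` callables, the `Interview` class, the prompt wording, and the choice of one follow-up per core answer are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to a reply string.
Responder = Callable[[str], str]


@dataclass
class Interview:
    core: List[Tuple[str, str]] = field(default_factory=list)
    follow_ups: List[Tuple[str, str]] = field(default_factory=list)
    summary: str = ""

    def transcript(self) -> str:
        """Render all three stages as one text block for conditioning."""
        turns = [f"Q: {q}\nA: {a}" for q, a in self.core + self.follow_ups]
        return "\n".join(turns + [f"Summary: {self.summary}"])


def run_interview(
    core_questions: List[str],
    ask_participant: Responder,
    ask_model: Responder,
) -> Interview:
    interview = Interview()
    # Stage 1: fixed core questions.
    for question in core_questions:
        interview.core.append((question, ask_participant(question)))
    # Stage 2: one follow-up per core answer (an assumption of this sketch).
    for question, answer in interview.core:
        probe = ask_model(f"Write one follow-up question.\nQ: {question}\nA: {answer}")
        interview.follow_ups.append((probe, ask_participant(probe)))
    # Stage 3: summarize everything collected so far.
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in interview.core + interview.follow_ups)
    interview.summary = ask_model(f"Summarize this participant.\n{turns}")
    return interview


def decision_prompt(interview: Interview, dilemma: str) -> str:
    """Build the prompt that conditions a decision on the full transcript."""
    return (
        f"{interview.transcript()}\n\nScenario: {dilemma}\n"
        "Answer as this participant and cite the interview evidence you used."
    )
```

Passing the model and the participant as plain callables keeps the sketch independent of any particular LLM client, and asking for cited evidence in the decision prompt is what makes it possible to check later whether a prediction was grounded in follow-up answers.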
The results are modest but directionally interesting. According to Su et al., follow-up-derived evidence appears in roughly 40% of full-interview traces. When predictions are grounded in that follow-up evidence specifically, accuracy reaches 45.5%, compared with 39.3% for predictions grounded only in core-question evidence, a gap of 6.2 percentage points. The paper frames this not as a blanket accuracy boost but as a “selective grounding mechanism”: richer persona context alone does not improve decisions unless the model actually cites user-specific evidence from the interview when making its choice.
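A breakdown of this kind can be computed by grouping predictions by the evidence they cite. The sketch below assumes each trace has already been labeled with a grounding source; the trace format, the labels, and the toy records are hypothetical. Only the 45.5% and 39.3% figures come from the paper as reported above.

```python
from collections import defaultdict


def accuracy_by_grounding(traces):
    """Return accuracy per evidence source.

    Each trace is a dict with keys 'grounding' (str) and 'correct' (bool).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for trace in traces:
        totals[trace["grounding"]] += 1
        hits[trace["grounding"]] += int(trace["correct"])
    return {source: hits[source] / totals[source] for source in totals}


def follow_up_share(traces):
    """Fraction of traces whose decision cites follow-up evidence."""
    return sum(t["grounding"] == "follow_up" for t in traces) / len(traces)


# Invented toy traces, not the paper's data.
traces = (
    [{"grounding": "follow_up", "correct": True}] * 2
    + [{"grounding": "follow_up", "correct": False}] * 2
    + [{"grounding": "core", "correct": True}] * 2
    + [{"grounding": "core", "correct": False}] * 4
)

acc = accuracy_by_grounding(traces)
share = follow_up_share(traces)

# Gap between the two reported accuracies, in percentage points.
reported_gap = round(45.5 - 39.3, 1)
print(acc, share, reported_gap)
```

Reporting the share alongside the per-source accuracy matters: a higher accuracy on follow-up-grounded predictions says little about overall performance if only a minority of predictions are grounded that way.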
This is a useful distinction that prior work in the area has blurred. The Tseng et al. taxonomy from EMNLP 2024 Findings separates the field into LLM Role-Playing (personas assigned to models) and LLM Personalization (models adapting to user profiles). Su et al. land somewhere between the two: the model is role-playing a specific person, but the role is constructed from an adaptive evidence-gathering process rather than a static demographic label. The accuracy improvement comes from the evidence, not the label.
The subgroup fidelity problem
Both papers, read together, identify the same structural risk from different angles. The WWW 2026 study shows that static persona conditioning distorts underrepresented subgroups disproportionately. Su et al. show that even adaptive conditioning only works when the model grounds decisions in specific evidence; without that grounding, the richer persona context does not help.
The practical implication is that anyone using persona-conditioned LLMs as synthetic survey respondents needs to audit not just aggregate accuracy but item-level and subgroup-level fidelity. A model that matches the overall distribution of responses on a 50-item survey but systematically misrepresents the answers of, say, rural respondents over age 60 is not a valid substitute for a human panel. The error is not random. It is concentrated in exactly the subgroups that are hardest to recruit for real surveys, which is the stated motivation for using synthetic respondents in the first place.
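One way to run such an audit is to compare human and simulated answer distributions within each item-by-subgroup cell and flag the cells that drift. The sketch below uses total variation distance as the drift measure; that metric, the 0.2 threshold, the row format, and the toy data are assumptions chosen for illustration, not a method taken from either paper.

```python
from collections import Counter, defaultdict


def total_variation(p_counts, q_counts):
    """Total variation distance between two answer-count distributions."""
    p_n, q_n = sum(p_counts.values()), sum(q_counts.values())
    options = set(p_counts) | set(q_counts)
    # Counter returns 0 for options that one side never chose.
    return 0.5 * sum(abs(p_counts[o] / p_n - q_counts[o] / q_n) for o in options)


def audit(human_rows, simulated_rows, threshold=0.2):
    """Flag (item, subgroup) cells whose simulated distribution drifts.

    Rows are (item, subgroup, answer) tuples. The threshold is a placeholder.
    """
    human, simulated = defaultdict(Counter), defaultdict(Counter)
    for item, group, answer in human_rows:
        human[(item, group)][answer] += 1
    for item, group, answer in simulated_rows:
        simulated[(item, group)][answer] += 1
    flagged = {}
    for cell, counts in human.items():
        if cell not in simulated:
            continue  # nothing simulated for this cell; handle separately
        distance = total_variation(counts, simulated[cell])
        if distance > threshold:
            flagged[cell] = distance
    return flagged


# Invented toy data: one item, two subgroups.
human_rows = (
    [("q1", "urban", "yes")] * 6 + [("q1", "urban", "no")] * 4
    + [("q1", "rural_60plus", "yes")] * 2 + [("q1", "rural_60plus", "no")] * 8
)
simulated_rows = (
    [("q1", "urban", "yes")] * 6 + [("q1", "urban", "no")] * 4
    + [("q1", "rural_60plus", "yes")] * 7 + [("q1", "rural_60plus", "no")] * 3
)

flagged = audit(human_rows, simulated_rows)
print(flagged)
```

Here the urban cell matches exactly and only the rural over-60 cell is flagged. In a real audit, small cells would also need uncertainty estimates, since a distance computed from a handful of respondents is noisy.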
Limits of the current evidence
Both papers leave significant territory unmapped.
Su et al. tested only moral dilemma scenarios. Moral dilemmas are a common benchmark in the persona simulation literature because they have clear binary or ordinal choice structures, but they are not representative of the survey instruments used in market research, political polling, or public health. Whether the selective grounding mechanism holds for Likert-scale product preference questions, open-ended policy feedback, or multi-factor conjoint designs is unknown.
The WWW 2026 study’s limitations run in the opposite direction: broad survey coverage (the World Values Survey spans dozens of countries and hundreds of items) but constrained model selection. The two open-weight chat models tested are not the ones most commercial teams would reach for if they were building synthetic respondent pipelines today. The degradation patterns those models exhibit may not generalize to frontier proprietary models, or they may be more severe there. Neither paper reports that comparison.
What changes if the method holds
If adaptive interviewing proves robust across broader survey types and frontier models, the economics of polling and market research shift. LLM inference costs for running adaptive interviews are typically far lower than panel recruitment fees. The marginal cost of iterating on survey design, testing question wording, and running pilot waves would drop sharply.
But the validity burden does not disappear. It moves. A team that previously validated its survey by checking sample composition against census benchmarks now has to validate its simulation pipeline: the interview protocol, the grounding mechanism, the model version, and the subgroup fidelity checks. That is a different skill set than survey methodology. It is closer to model evaluation, and the people who are good at one are not always good at the other.
The two 2026 papers, taken together, describe the current state accurately: persona-conditioned LLMs are not drop-in replacements for human respondents, and the most promising improvement mechanism works selectively and has been tested only on narrow scenarios. The research direction is plausible. The deployment is premature.
Frequently Asked Questions
How does the ‘algorithmic fidelity’ framing from earlier silicon-samples research hold up after these two papers?
The Argyle et al. work that coined ‘silicon samples’ argued that persona-conditioned LLMs can faithfully reproduce human survey distributions, a claim that shaped most early coverage of synthetic respondents. The WWW 2026 subgroup-fidelity finding directly challenges that optimism, and Su et al.’s sub-50% accuracy even with evidence-grounded adaptive interviewing suggests algorithmic fidelity is conditional on instrument, model, and grounding mechanism, not an inherent property of persona conditioning.
What would a production synthetic-survey pipeline need that neither paper benchmarks?
Both papers tested open-weight models (Llama-2-13B and Qwen3-4B) rather than the frontier proprietary models (GPT-4-class, Claude-class) that commercial teams would actually deploy. A team building a synthetic respondent pipeline today must run its own item-level and subgroup-fidelity audits across its target model, survey instrument, and population, because neither paper validates persona conditioning on current frontier models.
Does adaptive interviewing transfer to survey formats beyond moral dilemmas?
Su et al. tested exclusively on moral-dilemma scenarios with binary or ordinal choice structures. The 6.2 percentage-point accuracy gap (39.3% for core-only grounding versus 45.5% for follow-up grounding) is narrow enough that differences in instrument format could eliminate it. Whether selective grounding operates for Likert-scale product preferences, conjoint analysis, or open-ended policy feedback remains untested.
Why do underrepresented subgroups bear the heaviest distortion from persona prompting?
A plausible explanation is that underrepresented groups have fewer exemplars in pretraining data for the model to condition on, so a static demographic label steers the model toward population-level stereotypes rather than group-specific response patterns. The WWW 2026 finding that error concentrates in these groups implies that the populations synthetic surveys are most often justified for reaching are the ones the method distorts worst.