Persona Prompts Change Who an LLM Recommends as an Expert

The framing of a prompt changes the answer before the question does. A preprint submitted May 27 by Lisette Elizabeth Espín Noboa audits 43 large language models across six scientific disciplines and finds that the persona baked into a prompt (its assumed language, location, and professional role) systematically shifts which scholars the model recommends as experts. The shift is not noise. It is structural, and any pipeline that routes an LLM recommendation into reviewer assignment, expert discovery, or candidate shortlisting inherits it, whether or not the designers know it.

What the study measured

The preprint, arXiv:2605.28187 (listed as under review; 25 pages, including a 13-page appendix), varies persona prompts along three axes: language, geographic location, and role-and-task framing. The 43 models under test were asked to recommend scholars in six disciplines. Recommendations were then benchmarked against Semantic Scholar profiles on two orthogonal criteria: technical quality (factuality of claims about recommended scholars, coverage of the relevant literature) and social representativeness (diversity of recommended names, gender parity).

This is the third paper in the “Whose Name Comes Up?” series. The first two examined narrower slices of the problem; this installment is the broadest multi-model audit to date of persona-driven recommendation bias in scholarly search.

The location effect is not subtle

The most striking reported finding is geographic. Prompts framed around South Africa produced less factual recommendation lists, while prompts framed around Japan produced highly factual but homogeneous lists skewed toward highly productive scholars. The abstract reports these as directionally opposite failure modes: one sacrifices accuracy, the other sacrifices diversity.

This is editorial synthesis rather than a directly reported result, but the implication is clear enough to state: no single geographic persona optimizes both axes. A team that defaults to an English-language, US-framed prompt because it matches the bulk of the model’s training data is not being neutral. It is selecting a specific point on a tradeoff curve it may not know exists.

Model choice and prompt design pull in different directions

The study reports that each outcome has a different primary driver: basic technical quality is driven mainly by model choice; factuality and parity by context variables (field, seniority, number of recommendations requested); and diversity by the location dimension of the persona prompt. According to the abstract, no single prompt configuration simultaneously optimizes all measured axes.

This is a useful decomposition for anyone building an LLM-based recommender. It means that picking a better model improves quality but does not fix the diversity problem, and tuning the prompt’s location framing improves diversity but does not fix factuality. The levers are separate. A team that benchmarks only on accuracy and calls the system good has likely missed the representativeness axis entirely.
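To make the two-axis point concrete, here is a minimal sketch of a scorecard that reports a quality proxy and a representativeness proxy side by side. Everything in it is an assumption made for illustration: the function names, the reference set, the group labels, and the two made-up lists are hypothetical, and the metrics are simple stand-ins rather than the measures used in the study.

```python
from collections import Counter
from typing import Dict, List, Set


def factuality(recommended: List[str], verified: Set[str]) -> float:
    """Share of recommended names found in a trusted reference set."""
    if not recommended:
        return 0.0
    return sum(name in verified for name in recommended) / len(recommended)


def largest_group_share(recommended: List[str], group_of: Dict[str, str]) -> float:
    """Share of the list taken by its most common group label.

    1.0 means the list is fully homogeneous; for two groups, 0.5 is balanced.
    Names without a label are counted under "unknown".
    """
    if not recommended:
        return 0.0
    counts = Counter(group_of.get(name, "unknown") for name in recommended)
    return max(counts.values()) / len(recommended)


def scorecard(recommended: List[str], verified: Set[str],
              group_of: Dict[str, str]) -> Dict[str, float]:
    """Report both axes together so neither is optimized in isolation."""
    return {
        "factuality": factuality(recommended, verified),
        "largest_group_share": largest_group_share(recommended, group_of),
    }


# Hypothetical data: placeholder names, reference set, and group labels.
verified = {"Ada", "Ben", "Cy", "Dee", "Zed"}
group_of = {"Ada": "f", "Ben": "m", "Cy": "m", "Dee": "m",
            "Eve": "f", "Zed": "m", "Quo": "m"}

list_a = ["Ben", "Cy", "Dee"]         # every name verifiable, one group only
list_b = ["Ada", "Eve", "Zed", "Quo"]  # balanced groups, half unverifiable

card_a = scorecard(list_a, verified, group_of)
card_b = scorecard(list_b, verified, group_of)
print(card_a)
print(card_b)
```

In this toy data, the first list looks perfect on an accuracy-only benchmark and is the most homogeneous possible, while the second is balanced but only half verifiable. Reporting the two numbers together is what makes the tradeoff visible.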

What this means for production pipelines

Conference reviewer-matching systems, expert-finder tools in enterprise search, and hiring shortlists generated by LLM summarization all share a structural property: they take a prompt the operator wrote (or accepted as a default) and produce a ranked list of names. The Espín Noboa study shows that such a list is persona-dependent in ways that stay invisible unless someone explicitly audits for them.

The risk is not theoretical. A 2025 study on LLM-generated peer reviews found that GPT-5-mini inflates ratings for weaker papers and that field-specific prompt instructions can manipulate specific aspects of generated reviews. If reviewer assignments are also persona-biased, the distortion compounds: the wrong reviewers are assigned and the review calibration is systematically off.

Persona prompts also hurt factual recall

A separate line of evidence makes the recommendation-bias finding harder to dismiss as a niche concern. A USC study covered by The Register in March 2026 found that expert persona prompts reduce factual accuracy on MMLU from 71.6% to 68.0%, a drop of 3.6 percentage points. The proposed mechanism: persona prefixes activate instruction-following behavior at the expense of factual recall. The model spends capacity performing the persona rather than retrieving the answer.

This maps directly onto the Espín Noboa finding that South Africa-framed prompts produce less factual recommendations. If the persona prefix shifts the model’s operating mode away from recall and toward role-conformance, the geographic signal in the prompt is not just biasing which facts the model considers relevant; it may be degrading the model’s ability to retrieve facts at all. That conjecture is editorial, not reported. But the two studies converge on the same structural observation: the prompt is not a neutral query wrapper. It is an active input that reshapes the output.

What to audit before deployment

For teams building or procuring LLM-based expert recommenders, the audit surface is straightforward, if uncomfortable:

Run the same query under multiple persona framings. If the top-10 recommended scholars change substantially when you vary the assumed location or language, the system is persona-sensitive and the default prompt is not neutral.
Benchmark on both quality and representativeness. The Espín Noboa study’s key structural finding is that these axes respond to different inputs. Optimizing for one in isolation ignores the other.
Treat the prompt template as a configuration parameter with equity consequences. It is not a UX convenience layer. It is a lever that selects among differently biased outputs.
Do not assume a “no persona” baseline is neutral. The absence of an explicit persona is itself a framing, typically one that defaults to the model’s training distribution, which skews Anglophone and US-centric.
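The first and third steps above can be sketched as a small harness that treats the persona as explicit configuration and measures how much the recommended names overlap across framings. This is a sketch under stated assumptions, not the study's protocol: the template wording, the persona fields, and the `recommend` callable (stubbed here with canned placeholder names instead of a real model call) are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt template; the wording is illustrative, not the study's.
TEMPLATE = (
    "You are a {role} based in {location}. Respond in {language}. "
    "Recommend {k} expert scholars in {field}."
)


def build_prompt(persona: Dict[str, str], field: str, k: int) -> str:
    """Fill the template with one persona configuration."""
    return TEMPLATE.format(field=field, k=k, **persona)


def jaccard(a: List[str], b: List[str]) -> float:
    """Set overlap between two name lists: 1.0 identical, 0.0 disjoint."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def persona_sensitivity(
    recommend: Callable[[str], List[str]],
    personas: Dict[str, Dict[str, str]],
    field: str,
    k: int = 10,
) -> Dict[Tuple[str, str], float]:
    """Run the same query under each persona and compare top-k lists pairwise."""
    lists = {
        label: recommend(build_prompt(persona, field, k))[:k]
        for label, persona in personas.items()
    }
    return {(a, b): jaccard(lists[a], lists[b]) for a, b in combinations(lists, 2)}


# Stub standing in for a real model call; the returned names are placeholders.
def fake_recommend(prompt: str) -> List[str]:
    canned = {
        "United States": ["Scholar A", "Scholar B", "Scholar C", "Scholar D"],
        "Japan": ["Scholar A", "Scholar B", "Scholar E", "Scholar F"],
        "South Africa": ["Scholar G", "Scholar H", "Scholar C", "Scholar D"],
    }
    for location, names in canned.items():
        if location in prompt:
            return names
    return []


personas = {
    "us": {"role": "researcher", "location": "United States", "language": "English"},
    "jp": {"role": "researcher", "location": "Japan", "language": "English"},
    "za": {"role": "researcher", "location": "South Africa", "language": "English"},
}

scores = persona_sensitivity(fake_recommend, personas, field="physics", k=10)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Low pairwise overlap means the default prompt is selecting one of several materially different lists. What counts as "substantially" different is a threshold each team has to set for itself; the sketch only produces the numbers.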

None of these steps requires the full PDF. The abstract-level findings are sufficient to justify the audit. The granular numbers, once the full paper is available, will show whether the effect sizes are closer to 5% or 50%, and which disciplines and models are most affected. Until then, the directional result stands: persona prompts change who gets recommended, and the change is not random.

Frequently Asked Questions

Which findings from the paper are still unverified?

The abstract reports directional effects (South Africa prompts less factual, Japan prompts less diverse), but exact effect sizes, including any discipline-by-model breakdowns in the 13-page appendix, have not completed peer review. Whether the location pattern holds uniformly across all six disciplines or is concentrated in specific fields requires the full PDF’s tables to confirm, and those numbers are preliminary.

Does the number of recommendations requested change the persona bias?

The study treated k (the number of names requested) as a context variable alongside field and seniority, and found it influences factuality and gender parity outcomes. Increasing k may dilute the persona effect by spreading recommendations across a wider pool, but the bias persists as a structural property of the prompt framing rather than a function of list size alone.

Does the persona effect extend to code review and technical hiring?

The Register covered the USC PRISM study from a coding-accuracy angle in March 2026, noting that expert persona prefixes degrade technical benchmark scores. The Espín Noboa audit tested academic expert discovery only. The mechanism proposed for the USC result, in which persona prefixes shift model capacity from factual recall to role performance, is consistent with the audit’s findings, though the audit itself does not test it. Teams using LLMs for code-reviewer matching should watch for similar distortions, even without direct evidence from this particular 43-model benchmark.

What is the fastest way to test an existing recommender for persona sensitivity?

Run the same expert query under three location framings (US, Japan, South Africa) and compare how much the top-5 recommended names overlap. The study designed location as an explicit persona axis and found directionally opposite failure modes between Japan and South Africa prompts, so this single-dimension comparison is a reasonable first check for sensitivity before investing in a full multi-axis audit.

Do the findings generalize to non-academic expert discovery?

The 43 models were tested exclusively on six scientific disciplines, benchmarked against Semantic Scholar profiles. Corporate expert-finder tools, legal citation systems, and medical literature search all rank people or papers against different ground-truth corpora with different demographic distributions. The persona-prefix mechanism is model-level and likely transferable, but the magnitude and direction of the location effect could differ when the underlying corpus is not academic publication data.