Do LLM Personality Tests Measure Anything? A New Paper Says No

A preprint posted to arXiv on June 18, 2026 (arXiv:2606.20205, Wulff et al.) argues that the psychological profiles researchers extract from large language models using standard personality batteries are largely a measurement artifact, not a property of the model. The work puts a formal psychometric frame on a suspicion many practitioners already hold: the score tells you more about the instrument than about the model that produced it.

Are LLM personality profiles real, or an artifact of the test?

The paper’s central claim is that apparent psychological profiles of LLMs are artifacts of the instruments used to measure them, rather than stable properties of the models themselves (arXiv:2606.20205). The word “apparent” is doing the work: the authors are not asserting that models have no regularities, only that what the batteries report is not a reliable read on them.

Wulff and colleagues administered a battery of personality and risk-preference instruments, spanning both self-report questionnaires and behavioral tasks, to 56 instruction-tuned LLMs, alongside large human reference samples (arXiv:2606.20205). The mix matters. A common objection to LLM personality work is that self-report is the whole problem: ask a model whether it is agreeable and it will oblige. Because the battery also includes behavioral tasks, the result does not rest on self-report alone.

The authors note that psychological instruments of this kind are increasingly used to assign LLMs stable profiles that they say affect a model’s assessed usability, its safety assessment, and its use as a proxy for human participants in research (arXiv:2606.20205). The contribution of the paper is not to run another battery. It is to decompose what those batteries are actually reading, and then quantify how much of the signal is the targeted trait versus the response scale.

The stakes follow from where the measurement sits. Benchmark evaluations for LLMs broadly attempt to measure reasoning, factual accuracy, alignment, and safety (Wikipedia), and persona and psychometric scores plug into that same evaluation layer. If the score is an instrument artifact, then any safety or product decision justified by that score inherits the artifact.

Where do most of the between-model differences come from?

A variance decomposition in the paper attributes 81 to 90 percent of the between-model variation to a directional response bias: a tendency to respond toward one end of the scale regardless of what an item is asking (arXiv:2606.20205).

The contrast with humans is the part that lands. In the human reference samples, the same decomposition attributes only 9 to 16 percent of variation to that bias (arXiv:2606.20205). The quantity that personality batteries are engineered to suppress in people is the dominant signal in models. When a battery tries to separate one instruction-tuned model from another on, say, agreeableness, most of the separation it reports is not about agreeableness at all. It is about where each model defaults along the response scale.

That is the practical bite. A team ranking two candidate models on a risk-preference or personality dimension is, under the paper’s framing, mostly ranking them on response bias. The targeted trait is present somewhere in the measurement, but it is the minority signal, and the decomposition says so quantitatively rather than as a hunch. The headline number is also exactly where the caveat lives: 81 to 90 percent is one decomposition, in one study, over 56 models (arXiv:2606.20205). It is a strong result, not a settled constant.
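To make the idea of separating bias from trait concrete, here is a minimal Python sketch. It is not the paper’s decomposition, whose method is not reproduced here. It assumes a balanced instrument (equal numbers of forward- and reverse-keyed items) on a 1-to-5 scale, and every function name and toy response below is hypothetical. Each respondent’s answers are split into a directional lean and a keyed trait lean, and the sketch reports how much of the between-respondent variance in those two components sits in the lean.

```python
from statistics import mean, pvariance

def bias_and_trait(responses, keys, midpoint=3.0):
    """Split one respondent's answers into a directional lean and a keyed trait lean.

    responses: raw Likert answers, one per item (assumed 1..5 scale)
    keys: +1 if agreeing indicates the trait, -1 if the item is reverse-keyed
    """
    centered = [r - midpoint for r in responses]
    bias = mean(centered)  # lean toward one end, ignoring item content
    trait = mean([k * c for k, c in zip(keys, centered)])  # lean in the trait's direction
    return bias, trait

def bias_share(all_responses, keys, midpoint=3.0):
    """Fraction of between-respondent variance (bias + trait) carried by the bias term."""
    pairs = [bias_and_trait(r, keys, midpoint) for r in all_responses]
    var_bias = pvariance([b for b, _ in pairs])
    var_trait = pvariance([t for _, t in pairs])
    total = var_bias + var_trait
    return var_bias / total if total > 0 else float("nan")

# Toy, made-up answers for illustration only (not data from the paper).
keys = [1, -1, 1, -1]
toy_models = [
    [5, 5, 5, 5],  # agrees with everything
    [1, 1, 1, 1],  # disagrees with everything
    [4, 2, 4, 2],  # answers follow item content
]
share = bias_share(toy_models, keys)
print(f"bias share of between-model variance: {share:.2f}")
```

The toy share comes out high only because two of the three invented respondents differ in nothing but their lean. The point is the mechanics of the split, not the number.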

Does scaling capability make the bias go away?

No. The paper reports that directional response bias declines as model capability increases, but greater capability does not eliminate it (arXiv:2606.20205).

This is worth pulling out because the standing heuristic in the field is that measurement pathologies attenuate as models get more capable. On that logic, the cure for a noisy persona score is to wait for the next generation. The decomposition does not offer that exit. Bias weakens with capability across the models sampled, but it does not bottom out at zero, which means a frontier model still carries the same kind of artifact a weaker one does, only less of it. Capability helps on the margin; it does not close the gap.

What is “response orthogonality,” and why do most borrowed instruments fail it?

The authors coin “response orthogonality” for the proportion of items in an instrument for which the targeted trait and the model’s directional bias point in opposite directions, and they find that this property almost entirely predicts an instrument’s apparent reliability on LLMs (arXiv:2606.20205).

The intuition is mechanical. If a model leans hard toward the high end of the scale, then any item where the “agreeable” or “risk-seeking” answer is also the high end will register as that trait regardless of the model’s actual disposition. An instrument resists that collapse only when enough of its items force the trait and the bias into tension, so that the bias cannot simply masquerade as the trait. On the paper’s account, personality batteries designed and normed on human respondents rarely clear that bar once they are pointed at models (arXiv:2606.20205).
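The definition can be written down in a few lines. The sketch below follows the prose definition only; the paper’s exact operationalization may differ, and the sign encoding (+1 or -1 for both item keys and bias direction) and the function name are assumptions made for illustration.

```python
def response_orthogonality(keys, bias_direction):
    """Proportion of items where the trait-affirming answer opposes the respondent's lean.

    keys: +1 if the trait-affirming answer is at the high end of the scale, -1 if at the low end
    bias_direction: +1 if the respondent leans toward the high end, -1 toward the low end
    """
    if bias_direction not in (1, -1):
        raise ValueError("bias_direction must be +1 or -1")
    if not keys:
        raise ValueError("keys must not be empty")
    opposed = sum(1 for k in keys if k == -bias_direction)
    return opposed / len(keys)

# A mostly forward-keyed instrument given to a respondent that leans high:
# only the single reverse-keyed item puts trait and bias in tension.
print(response_orthogonality([1, 1, 1, -1], bias_direction=1))  # 0.25
```

Under this encoding, an instrument whose items are all keyed in the direction of the lean scores 0, which is the case where bias can pass for the trait on every item.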

The sharper consequence is that the profile a model appears to have shifts with the items you choose, and a profile can be manufactured through item selection (arXiv:2606.20205). The same model can be made to look different by swapping the item set. That converts a sentence like “the model scored high on openness” from a finding about the model into a statement about which items happened to be presented.
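A toy illustration of how item choice alone can move a score, assuming a hypothetical respondent that answers 4 on a 1-to-5 scale to every item regardless of content. The scoring rule is the conventional reverse-scoring used for Likert items; the item sets and function names are invented for the example and are not taken from the paper.

```python
def trait_score(responses, keys, scale_min=1, scale_max=5):
    """Mean trait score after conventional reverse-scoring of reverse-keyed items."""
    flip = scale_min + scale_max
    scored = [r if k == 1 else flip - r for r, k in zip(responses, keys)]
    return sum(scored) / len(scored)

def always_answer(value, n_items):
    """A pure-bias respondent: same answer to every item, whatever it asks."""
    return [value] * n_items

# Three hypothetical item selections for the same trait.
forward_only = [1, 1, 1, 1]      # trait-affirming answer is always the high end
reverse_only = [-1, -1, -1, -1]  # trait-affirming answer is always the low end
balanced = [1, -1, 1, -1]

answers = always_answer(4, 4)
profiles = {
    "forward-keyed items only": trait_score(answers, forward_only),
    "reverse-keyed items only": trait_score(answers, reverse_only),
    "balanced item set": trait_score(answers, balanced),
}
for label, score in profiles.items():
    print(f"{label}: {score:.1f}")
```

The same invented respondent lands high, low, or at the midpoint on the trait depending only on which items it was shown.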

This is where the result becomes a methodological warning rather than a single number. The authors close by calling for dedicated assessments centered on response orthogonality rather than instruments imported from human psychology (arXiv:2606.20205). The deeper point is that the wrongness here is a property of the instrument, not a fixable bug in any one model, which is why waiting for a better model does not address it.

What does this break for safety, persona, and red-team scoring?

If your team ships, rejects, or ranks models by psychometric scores, the paper says to treat those numbers as instrument artifacts rather than model properties (arXiv:2606.20205).

The downstream surfaces the paper implicates are usability, safety assessment, and the use of LLMs as proxies for human research participants (arXiv:2606.20205). A red-team report that scores a model on agreeableness or risk preference and then treats the score as evidence of the model’s disposition is, on this account, reading the instrument. An alignment team citing a psychometric profile to argue a model is dispositionally cautious has a problem of the same shape: the evidence may be telling you about the scale, not the model.

The operational read for a team working today is narrow but consequential. Before you act on a persona or personality score, the question to ask is which instrument produced it and whether that instrument has been shown to resist directional response bias on models. The authors’ proposed remedy is not better prompting; it is dedicated assessments designed around response orthogonality, plus default skepticism toward instruments borrowed from human psychology (arXiv:2606.20205).

There is a procurement angle as well. Buyer’s guides and leaderboards that present model “capabilities” at face value, as properties of the model, sit awkwardly with the paper’s logic. If a dimension is measured through a borrowed psychometric battery, the ranking on that dimension may reflect the battery more than the model. The audit-the-evaluation question becomes more useful than the rank-the-model question: not which model scores higher, but what the score is actually measuring.

What should you check before citing the paper?

The result is one variance decomposition in one arXiv preprint, not yet peer-reviewed, and it proves something narrower than the headline implies (arXiv:2606.20205).

First, the status. The paper is an arXiv preprint submitted June 18, 2026 (DOI 10.48550/arXiv.2606.20205) and has not passed peer review; the 81 to 90 percent figure is the output of one decomposition over 56 instruction-tuned models (arXiv:2606.20205). Treat it as a strong, specific claim from a single study, not a settled constant. Replication across a different instrument set and a different model cohort is what would move it from striking to established.

Second, the scope of the proof. The paper does not show that LLMs lack stable behavioral regularities. It shows that human-derived personality instruments do not validly measure whatever regularities exist (arXiv:2606.20205). The claim is about the instrument. Conflating “the test is invalid” with “the model has no personality” overreads it, and the authors do not make the stronger claim.

Third, the battery. The decomposition covers the standard set of instruments the authors chose, self-reports and behavioral tasks included, and should not be extrapolated to every construct an evaluator might want to measure or to every model that ships after the sample (arXiv:2606.20205). It also does not resolve whether behavioral tasks, as a class, track model behavior in ways self-reports do not.

Frequently Asked Questions

Does the artifact finding extend to base models, or only instruction-tuned ones?

The sample is 56 instruction-tuned models, so the decomposition is scoped to models that went through post-training rather than raw pretrained checkpoints. The paper does not break out base models separately, which leaves open whether the directional bias is a post-training artifact or a deeper property of autoregressive generation.

How is directional response bias different from prompt-format sensitivity?

Prompt-format sensitivity means the same item scores differently when rephrased or reordered; directional response bias means a model gravitates toward one pole of the scale whatever the item asks. The first is a framing effect, the second is a scale-anchoring effect, and conflating them would point the fix at prompting when the paper’s remedy is a new instrument.

What is the cheapest check a team can run on an existing persona battery?

Reverse-key half the items, so that the trait-affirming answer sits at the high end of the scale for one half and at the low end for the other. If the model’s profile holds across that flip, the battery has enough response orthogonality to keep bias from passing as the trait; if the profile inverts or dissolves, the score was reading the scale, not the trait.
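One way such a check might be scored, as a sketch rather than a validated procedure: compare the trait score on the forward-keyed half with the trait score on the reverse-keyed half after reverse-scoring. The function name, the 1-to-5 scale, and the toy answers are assumptions; no threshold for "too large a gap" is implied by the paper.

```python
def keying_gap(responses, keys, scale_min=1, scale_max=5):
    """Trait score on forward-keyed items minus trait score on reverse-keyed items.

    A gap near zero means the two halves agree about the trait.
    A large gap means the answers track the scale end rather than item content.
    """
    flip = scale_min + scale_max
    forward = [r for r, k in zip(responses, keys) if k == 1]
    reverse = [flip - r for r, k in zip(responses, keys) if k == -1]
    if not forward or not reverse:
        raise ValueError("need at least one forward- and one reverse-keyed item")
    return sum(forward) / len(forward) - sum(reverse) / len(reverse)

keys = [1, -1, 1, -1]
print(keying_gap([4, 4, 4, 4], keys))  # leans high on everything: halves disagree
print(keying_gap([5, 1, 5, 1], keys))  # follows item content: halves agree
```

In practice the responses would come from administering the half-reversed battery to the model under test; how large a gap should count as disqualifying is a judgment the team has to make and justify.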

What would it take to upgrade the 81 to 90 percent figure from striking to established?

Replication on a different instrument set and a different model cohort, plus survival through peer review. The current number is one decomposition over one battery on 56 models, and the variance decomposition method itself carries assumptions about how trait variance and bias variance separate.

Could item selection be weaponized to manufacture a favorable profile?

Yes, and that is the integrity risk the result creates. Because a profile can be shifted by choosing items, a vendor or evaluator motivated to show a model as cautious or agreeable could curate an item set whose response orthogonality produces that result, with no way for an outsider to detect the selection from the published score alone.