LLM Surrogates in A/B Tests: The 39% Recovery Gap and the Silent Bias Risk

A June 2026 arXiv preprint applies surrogate endpoint theory from clinical trials to product A/B experimentation, using LLM responses as statistical proxies for human subjects. On a real-world headline dataset, the paper reports, raw LLM outputs recovered only 39% of the human treatment effect before calibration was applied. The formal scaffolding distinguishes this from informal “synthetic user” guesswork, but it carries a specific failure mode: a model that mispredicts which variant humans prefer will produce narrow, confident-looking intervals around a wrong call.

What does the surrogacy framework actually propose?

The framework adapts surrogate endpoint theory from clinical trials, substituting LLM queries for live user exposure in A/B experiments and using model output as a calibrated proxy for human treatment effects. In drug trials, surrogate endpoints such as tumor shrinkage or viral load stand in for the true endpoint of survival when measuring the actual outcome is too slow or expensive. According to arXiv:2606.17165, the same logic applies to product experimentation: LLM-simulated preferences are cheaper and faster to collect than live traffic responses, so model output becomes the surrogate for the human preference signal.

The framework’s calibration mechanism averages across multiple LLM draws per experimental unit to absorb model stochasticity, then applies nonparametric calibration to align the model’s output distribution with historical human response data. The paper is explicit about what this approach does and does not guarantee. Its own framing: “A/B testing on LLM responses is correct only by assumption, whereas A/B testing on humans is correct by design.” Randomization in a standard A/B test creates validity structurally. The surrogacy framework borrows validity from assumptions that can fail without triggering an obvious alert.
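
The two steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the preprint's procedure: histogram binning stands in for the unspecified nonparametric calibrator, and the function names, the bin count, and the idea of one numeric score per draw are all hypothetical.

```python
from bisect import bisect_left
from statistics import mean

def average_draws(draws_per_unit):
    """Collapse several stochastic LLM draws into one score per unit."""
    return [mean(draws) for draws in draws_per_unit]

def fit_binned_calibrator(llm_scores, human_outcomes, n_bins=4):
    """Histogram-binning calibrator (assumed stand-in, not the paper's method).

    Sorts historical units by LLM score, splits them into roughly
    equal-count bins, and maps each bin to the mean human outcome in it.
    """
    pairs = sorted(zip(llm_scores, human_outcomes))
    if not pairs:
        raise ValueError("calibration needs historical human data")
    size = max(1, len(pairs) // n_bins)
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    # Upper score edge of every bin except the last, which is open-ended.
    edges = [chunk[-1][0] for chunk in chunks[:-1]]
    bin_means = [mean([h for _, h in chunk]) for chunk in chunks]

    def calibrate(score):
        # Scores at or below an edge fall into that edge's bin.
        return bin_means[bisect_left(edges, score)]

    return calibrate
```

A call such as `fit_binned_calibrator(average_draws(draws), human_outcomes)` returns a function that maps a new averaged LLM score to the human outcome observed for similar scores in the historical data. The sketch makes the dependency plain: with no historical human outcomes, there is nothing to fit.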

The distinction from informal practice matters. Running a headline variant past GPT-4 and picking the one it prefers is not this framework: the surrogate endpoint approach provides testable assumptions, computable bias bounds, and a calibration error criterion. Those properties make failure modes visible; they do not eliminate them.

What has to be true for LLM responses to substitute for human outcomes?

According to arXiv:2606.17165, two conditions must hold jointly: surrogacy and comparability. Surrogacy requires that the LLM’s responses reliably predict what the human treatment effects would have been. Comparability requires that the LLM and human samples align closely enough that calibration can close the distributional gap between them. The paper describes these as jointly weaker conditions than requiring distributional equivalence, meaning the bar is lower than demanding the LLM perfectly mimic human behavior. But they are still assumptions, not guarantees, and neither is directly observable from experimental data alone.

The preprint provides a falsification test for checking whether the surrogacy assumption holds in a given context. What such a test can do is flag clear violations; what it cannot do is certify the assumption holds in a new deployment context where no human response data exists for comparison. That is the practical constraint for early adoption: the calibration step requires human response data from the deployment context to function. Without it, the framework operates on the surrogacy assumption unverified, which is the informal version with more documentation attached.
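
The preprint's actual test is not reproduced here. As a rough illustration of what a falsification-style check can and cannot say, the sketch below assumes a simple two-sided z-test comparing a calibrated LLM effect estimate with a human effect estimate from the same historical experiment. The function name, the independence of the two estimates, and the 0.05 threshold are all assumptions.

```python
from math import sqrt
from statistics import NormalDist

def surrogacy_violation_flag(human_effect, human_se, llm_effect, llm_se, alpha=0.05):
    """Two-sided z-test of H0: calibrated LLM effect equals the human effect.

    Assumed illustration only; treats the two estimates as independent.
    Returns (flag, p_value). A False flag means "no violation detected",
    not "surrogacy holds".
    """
    z = (llm_effect - human_effect) / sqrt(human_se ** 2 + llm_se ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p_value
```

Both effect estimates are required inputs, so the check has nothing to run on in a context where no human responses have been collected.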

How does surrogacy bias corrupt the inferred treatment effect?

Surrogacy bias is the gap between what the LLM predicts about human preferences and what those preferences actually are. Unlike sampling error, it does not average out as sample size grows. It distorts the estimated average treatment effect systematically, so the confidence interval narrows around the wrong number as the experiment accumulates more LLM responses.
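
A small simulation makes the point concrete. Everything in it is hypothetical: a true effect of 0.10, Gaussian noise, and a surrogate that converges to 39% of the true effect (the figure is borrowed from the preprint only as an illustrative bias level).

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_surrogate_ci(n, true_effect=0.10, recovery=0.39, noise_sd=1.0, seed=0):
    """Return (estimate, (lo, hi)) for a two-arm test run on a biased surrogate.

    All defaults are hypothetical illustration values.
    """
    rng = random.Random(seed)
    surrogate_effect = recovery * true_effect  # what the surrogate converges to
    control = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    treated = [rng.gauss(surrogate_effect, noise_sd) for _ in range(n)]
    estimate = mean(treated) - mean(control)
    se = sqrt(stdev(treated) ** 2 / n + stdev(control) ** 2 / n)
    return estimate, (estimate - 1.96 * se, estimate + 1.96 * se)

if __name__ == "__main__":
    for n in (100, 5_000, 50_000):
        est, (lo, hi) = simulate_surrogate_ci(n)
        print(f"n={n:>6}  estimate={est:+.4f}  95% CI=({lo:+.4f}, {hi:+.4f})")
```

Under these assumptions the interval tightens around the surrogate's value of roughly 0.039 as n grows, and at large n it excludes the true 0.10. More surrogate responses buy precision, not accuracy.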

The 39% raw recovery figure from the preprint’s Upworthy headline experiments is the concrete version of this problem. Before calibration, raw LLM outputs captured less than half the actual human treatment effect. That 61% gap is not noise; it is bias, and it would produce a confident but miscalibrated experimental conclusion if a team deployed the uncalibrated surrogate. The paper reports that nonparametric calibration “closes the gap,” though the publicly available material does not state a specific post-calibration recovery percentage.

The asymmetry matters for how teams should think about deployment risk. Random error is legible: wider intervals, less power, more conservative decisions. Surrogacy bias is less legible. It looks like a properly powered experiment with well-behaved statistics. The treatment effect is just wrong, and the narrow interval appears to confirm it.

What does adjacent research show about LLM behavioral instability?

A concurrent June 2026 study on small on-premises LLMs (arXiv:2606.24585) found that authority-style prompt prefixes raised refusal rates by a factor of 2 to 20 relative to the no-prefix baseline, a direct demonstration that contextual framing can shift a model’s output distribution by up to an order of magnitude. The surrogacy framework implicitly assumes that the LLM’s response distribution is stable across experimental conditions; this finding puts pressure on that assumption.

The study tested framing conditions specific to legal assistant applications, including prefixes that position the model as an assistant to a national supreme court. A known role-play jailbreak prefix showed inconsistent behavior: refusals increased sharply in some models and barely shifted in others. The instability was model-specific and not predictable from the prompt change alone.

For a surrogacy framework, behavioral instability under framing shifts creates a problem that calibration was not designed to solve. Calibration corrects for distributional gaps between LLM and human populations when the LLM’s own behavior is consistent. If an A/B test changes the framing of a prompt across experimental arms, the surrogate signal itself moves in a way that corrupts the treatment effect estimate. The bias bounds in the preprint address the distributional mismatch between model and humans; they do not bound shifts in the model’s output distribution across experimental conditions.
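
One way a team might probe for this before trusting a surrogate is to hold the content fixed, vary only the framing used in each arm, and measure how far the model's behavior moves. The sketch below uses refusal rate as the monitored behavior, as in the adjacent study. The function names and the 1.5x tolerance are hypothetical, and the refusal flags are assumed to come from some upstream classifier not shown here.

```python
def refusal_rate(refused_flags):
    """Share of responses flagged as refusals (flags are booleans)."""
    if not refused_flags:
        raise ValueError("need at least one response")
    return sum(refused_flags) / len(refused_flags)

def framing_shift_ratio(baseline_flags, framed_flags):
    """Refusal rate under a framing prefix divided by the no-prefix rate."""
    base = refusal_rate(baseline_flags)
    if base == 0:
        raise ValueError("baseline refusal rate is zero; ratio is undefined")
    return refusal_rate(framed_flags) / base

def arms_are_framing_stable(baseline_flags, framed_flags, max_ratio=1.5):
    """True if the framing change moves the rate by less than max_ratio (assumed tolerance)."""
    ratio = framing_shift_ratio(baseline_flags, framed_flags)
    return (1 / max_ratio) <= ratio <= max_ratio
```

A ratio in the 2 to 20 range reported by that study would fail any tolerance of this kind, signaling that the difference between arms reflects the model's sensitivity to framing rather than a human preference.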

What’s the cost-savings vs. decision-risk trade-off?

The cost argument is real: LLM queries cost a fraction of live user studies and return results in hours rather than weeks. For early-stage feature screening where many variants die before reaching live traffic, that speed differential is the entire value proposition. The question is not whether the savings are large; it is whether the surrogacy error is small enough that wrong decisions from biased surrogates cost less than the savings from skipping live traffic.

That calculation depends on three variables the framework does not control: how well the LLM’s training distribution matches the product’s actual user population, whether the experimental variations change any contextual framing that could shift model behavior, and how costly a wrong experimental decision is in the specific deployment context. An A/B test selecting between headline variants or onboarding flows can absorb a wrong call. A test that determines ranking algorithm weights or pricing structures carries a different risk profile for the same bias magnitude.
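
The trade-off can be written as simple expected-value arithmetic. The sketch below is a back-of-the-envelope model, not anything from the preprint. The function names are hypothetical, and the inputs (savings, error probabilities, cost of a wrong call) are quantities a team would have to estimate for itself.

```python
def surrogate_net_value(savings, p_wrong_surrogate, p_wrong_live, cost_of_wrong_call):
    """Expected net benefit of replacing a live test with a surrogate test.

    Savings minus the extra expected loss from the surrogate's higher
    chance of picking the wrong variant.
    """
    extra_error = p_wrong_surrogate - p_wrong_live
    return savings - extra_error * cost_of_wrong_call

def break_even_error_gap(savings, cost_of_wrong_call):
    """Largest extra error probability at which the surrogate still breaks even."""
    return savings / cost_of_wrong_call
```

With made-up numbers, saving 10,000 at an extra 15-point error rate is worthwhile when a wrong call costs 50,000 (net +2,500) and ruinous when it costs 1,000,000 (net -140,000). The bias magnitude is identical in both cases; only the stakes change.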

An adjacent result from Themis (arXiv:2606.24622), published at IEEE CAI 2026, showed that reward models trained on human preferences can match or outperform the environment’s true reward signal in controlled reinforcement learning settings. That is not live-traffic A/B testing, but it establishes that human-aligned proxies can approximate real human feedback in constrained, well-calibrated contexts. It also illustrates the shared limitation: the proxy worked for the environment it was calibrated against. Generalization beyond that calibration boundary is a separate question that neither Themis nor the surrogacy preprint resolves.

What still needs peer review before this goes near production?

The falsification test design, bias bound derivations, nonparametric calibration claims, and the 39% raw-recovery figure in arXiv:2606.17165 are all pending external review. Peer review in causal inference is not a formality. The surrogate endpoint literature has a history of frameworks that held in the context where they were validated and failed in deployment when the surrogacy assumption turned out to be softer than the theory required.

What the preprint offers that informal synthetic-user practices do not: explicit assumptions that can be tested, bias bounds that can be computed, and a calibration procedure with a defined error criterion. Those are useful properties. They are only as useful as the calibration data is representative and the surrogacy assumption holds in the specific deployment context.

The most actionable reading of the paper’s own empirical result is not the optimistic one. A 61% gap between raw LLM output and actual human treatment effect, on a relatively clean headline dataset, is a large error that requires a calibration step grounded in real human data from the specific context. Any team that deploys LLM surrogates without that calibration step is not implementing the framework described in this paper. They are running the informal version, without the formal assumption structure, with the appearance of rigor attached.

Biased or inaccurate training data makes this problem worse in a specific way: an LLM whose learned distribution over human preferences was shaped by a non-representative corpus will miscalibrate systematically, and the miscalibration will be invisible without human response data to detect it. The framework’s falsification test is designed to catch this class of failure, but only when there is enough human data for the test to have statistical power. Below that threshold, the surrogacy assumption remains an assumption, and the confidence interval remains an artifact of model behavior rather than a statement about users.
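
To see how much human data "enough" might mean, here is a rough power calculation under simplifying assumptions that are not taken from the preprint: a two-sided z-test, a human effect estimated from two equal arms with a known outcome standard deviation, and a surrogate estimate treated as noise-free. The function name and all inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def falsification_power(bias, outcome_sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test to detect a surrogacy bias.

    Assumes the only noise is in the human effect estimate, whose
    standard error is outcome_sd * sqrt(2 / n_per_arm).
    """
    std = NormalDist()
    se = outcome_sd * sqrt(2 / n_per_arm)
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = abs(bias) / se
    # Probability the test statistic lands in either rejection region.
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)
```

For a bias of 0.061 on an outcome with unit standard deviation, this approximation gives under 10% power at 100 humans per arm and over 95% at 10,000 per arm. With small human samples, a test that passes says very little.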

Frequently Asked Questions

Which experiment types are poor fits for LLM surrogates even with proper calibration?

Features where user response depends on accumulated session history (personalized recommendations, churn triggers) or on physiological signals (load-time frustration, visual fatigue) cannot be reconstructed from prompt text alone. No calibration procedure can supply context the model was never given, so the surrogacy assumption fails structurally before any bias calculation is possible.

What breaks if the underlying LLM is updated mid-experiment?

Calibration fits are computed against a specific model’s output distribution. A version update introduces distributional shift that the prior calibration was not designed to track. The June 2026 framing-instability research (arXiv:2606.24585) showed behavior can shift by an order of magnitude within a single model under prompt changes; a model update can produce equivalent or larger shifts without any prompt change at all, silently invalidating a calibration that appeared to be working.
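
A defensive pattern, sketched here as an assumption rather than anything the preprint prescribes, is to pin each calibration fit to the model version it was computed against and refuse to apply it to any other version. The class name, field names, and version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class PinnedCalibration:
    """A calibration function tied to the model version it was fitted on."""
    model_version: str
    calibrate: Callable[[float], float]

    def apply(self, score: float, serving_version: str) -> float:
        # Fail loudly instead of silently reusing a stale calibration.
        if serving_version != self.model_version:
            raise ValueError(
                f"calibration fitted on {self.model_version!r}, "
                f"but responses came from {serving_version!r}; refit required"
            )
        return self.calibrate(score)
```

This only helps when the provider exposes a reliable version identifier. It turns a silent distributional shift into an explicit error, but it does not repair the calibration.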

Can the falsification test catch surrogacy bias before a live experiment concludes?

Only retrospectively. The test requires human response data to run. In pre-launch screening, the primary cost-saving scenario, that data does not exist yet for novel product surfaces. This means the test validates surrogacy after human data has already been collected, which partially undercuts the value proposition of skipping live traffic for genuinely new features with no historical comparable.

How does this framework differ from prompting an LLM to simulate survey respondents?

Informal synthetic-user research produces no testable assumptions, no computable bias bounds, and no calibration error criterion. The surrogacy framework’s value is not the LLM itself but the statistical apparatus around it: explicit conditions that can be violated, a falsification test that can detect those violations, and a calibration step with a defined error term. Without that apparatus, a team running GPT-4 as a survey proxy has no mechanism to distinguish a correct signal from a biased one.

What would need to change for this approach to extend to pricing or ranking experiments?

Pricing and ranking experiments involve downstream behavioral loops (repeat purchase, session length) that a single LLM query cannot approximate. The Themis result (arXiv:2606.24622) showed human-aligned proxies work in constrained reinforcement learning environments calibrated to a specific reward signal; generalizing that to multi-step behavioral outcomes remains an open problem that neither paper addresses. The preprint’s bias bounds were derived from headline-selection data, and whether those bounds hold at all for outcomes requiring multiple user interactions is untested.