AI Safety Benchmark Rankings Flip Based on Eval Config, SafetyRepro Paper Reports

SafetyRepro, submitted to arXiv on May 25 by Yanhang Li, Zhichao Fan, and Zexin Zhuang, reports that changing only the evaluation configuration (decoding settings, temperature, prompt template) is sufficient to flip which model ranks “safer” on every alignment benchmark the authors tested. The paper formalizes this instability with a finite-envelope proposition that ties a measurable pairwise-disagreement rate to configuration-dependent rank reversals. Any compliance filing that cites a leaderboard position as safety evidence may be citing an artifact of the eval setup, not the model.

What SafetyRepro Proves: The Finite-Envelope Proposition

Prior work on benchmark instability observed that scores drift across configurations. SafetyRepro goes further: the authors report that on every benchmark they tested, at least one pair of configurations reverses which model ranks safer. The paper’s “finite-envelope proposition” ties a measurable pairwise-disagreement rate to the conditions under which a strict model ordering admits such a configuration-pair reversal. Because the proposition makes testable predictions about how often rank flips should occur, it treats instability as something to measure rather than as an anecdotal observation.

This distinction matters because most safety claims in procurement and regulation rest on ordinal comparisons: Model A scored higher than Model B on Benchmark X. If that ordering is configuration-conditional, the claim does not survive a change in eval setup.
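A minimal Python sketch of what checking a configuration-conditional ordering could look like. The function name, the score layout, and the numbers are illustrative assumptions, not taken from the SafetyRepro paper; higher scores are assumed to mean "safer".

```python
from itertools import combinations


def reversing_config_pairs(scores, model_a, model_b):
    """Return the configuration pairs under which the a-vs-b ordering flips.

    scores maps a configuration name to a dict of model name -> safety score.
    A tie is not a strict ordering, so it never counts as a reversal.
    """
    def sign(config):
        diff = scores[config][model_a] - scores[config][model_b]
        return (diff > 0) - (diff < 0)

    return [
        (c1, c2)
        for c1, c2 in combinations(sorted(scores), 2)
        if sign(c1) * sign(c2) < 0
    ]


# Made-up scores for illustration only; not results from any benchmark.
example_scores = {
    "greedy_t0.0": {"model_a": 0.91, "model_b": 0.88},
    "sample_t0.7": {"model_a": 0.86, "model_b": 0.89},
    "sample_t1.0": {"model_a": 0.84, "model_b": 0.84},
}

print(reversing_config_pairs(example_scores, "model_a", "model_b"))
```

With these invented numbers, the only reversing pair is the greedy run against the temperature-0.7 run; the tied temperature-1.0 run supports neither ordering.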

Every Benchmark They Tested Flipped

On every alignment benchmark the SafetyRepro authors evaluated, altering only the evaluation configuration (decoding strategy, temperature, and prompt template) produced at least one pairwise rank reversal. The arXiv abstract does not name specific benchmarks or model pairs; those details are in the full PDF, which was not available for this analysis.

A separate April 2026 study corroborates the magnitude of configuration sensitivity from a different angle. The AI-Security-Benchmark reproducibility analysis found that temperature alone caused up to 3.13 percentage-point variation in security scores on Claude Sonnet 4.5, with an average of 1.40 percentage points across 20 models run at four temperature settings. That study measured score drift, not rank reversals, but the two are linked: if a single parameter can move a score by about 3 points, the ordering of any two models separated by less than that is unlikely to hold steady across full configuration changes.

The Eval Configuration That Benchmark Papers Don’t Specify

Benchmark papers routinely report model scores, sometimes temperature settings, and occasionally the prompt template. What they almost never specify is the full evaluation configuration: decoding strategy (greedy, beam, sampling), sampling parameters beyond temperature (top-p, top-k, repetition penalty), prompt formatting details including system prompts, and the scoring pipeline that maps raw model outputs to binary safe/unsafe verdicts.
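The items listed above can be written down as a single record, which makes gaps easy to spot. The sketch below is one possible layout in Python; the class name, field names, and the `undisclosed_fields` helper are hypothetical and do not come from any benchmark's published schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of a full evaluation configuration.

    None means "not disclosed". An empty system prompt should be recorded
    as "" so that it is distinguishable from an undisclosed one.
    """
    decoding_strategy: Optional[str] = None      # "greedy", "beam", or "sampling"
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    top_k: Optional[int] = None
    repetition_penalty: Optional[float] = None
    system_prompt: Optional[str] = None
    prompt_template: Optional[str] = None
    scoring_pipeline_version: Optional[str] = None


def undisclosed_fields(config: EvalConfig) -> list[str]:
    """List every configuration field that was left unspecified."""
    return [f.name for f in fields(config) if getattr(config, f.name) is None]


# A report that gives only temperature and prompt template.
partial_report = EvalConfig(temperature=0.7, prompt_template="Q: {question}\nA:")
print(undisclosed_fields(partial_report))
```

A report that states only temperature and prompt template leaves six of the eight fields in this sketch unspecified, including the scoring pipeline version.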

AI Model Benchmarks explicitly warns that “model scores vary by prompt phrasing” and states that its prompts are designed for operator-grade evaluation, not academic benchmarks. The acknowledgment is honest. It also underscores the problem: if prompt phrasing changes scores, and different evaluators use different phrasing, the resulting numbers are not comparable across contexts.

The practical consequence is that two teams citing the same benchmark for the same model can reach substantively different safety conclusions without either team being wrong.

What This Means for EU AI Act Article 50 Compliance Disclosures

The EU Commission’s draft guidelines for AI Act Article 50 transparency obligations are open for stakeholder consultation until June 3, 2026. Article 50 governs transparency obligations for certain AI systems, and model providers that cite benchmark scores as evidence of safety properties may soon need to disclose the evaluation configuration alongside those scores.

SafetyRepro gives regulators a formal basis for requiring that disclosure. If rank order is configuration-conditional, citing a leaderboard position without specifying the configuration that produced it is an incomplete claim. The finite-envelope proposition defines the conditions under which that claim is and is not valid.

An orthogonal line of work reinforces the gap. POLARIS, accepted to ACL 2026, compiles natural-language safety policies into First-Order Logic and traverses a Semantic Policy Graph to generate coverage-driven test cases. Where SafetyRepro asks whether existing benchmarks produce stable rankings, POLARIS asks whether those benchmarks cover the policies you care about. Both questions are necessary; neither is sufficient alone.

The model side of this equation shifted on May 28, 2026, when Anthropic released Claude Opus 4.8. Among the stated behavioral changes, Anthropic reports that Opus 4.8 is more likely to flag uncertainties and less likely to make unsupported claims than its predecessor. That property bears on the rank-reversal problem: one plausible source of configuration sensitivity in safety benchmarks is a model that asserts answers confidently under one decoding setup and hedges or refuses under another. A model whose uncertainty signals stay calibrated across configurations would narrow the variance behind the reversals that the finite-envelope proposition describes.

This does not close the eval-config disclosure gap; the same benchmark still needs its configuration pinned. It does suggest that procurement teams evaluating Opus 4.8 may see a somewhat tighter spread across temperature and sampling configurations on honesty-adjacent benchmarks than earlier models produced, which is a hypothesis to verify rather than a result to assume. The multibreak jailbreak benchmark findings and indirect prompt injection test results both illustrate how model-side behavioral differences affect safety eval outcomes in ways that aggregate leaderboard scores obscure.

From Theory to Protocol: The Commit-Stamped Reproducibility Standard

SafetyRepro pairs its theoretical contribution with a practical one: a commit-stamped evaluation protocol that enables bitwise-reproducible re-runs of alignment benchmarks. Every configuration choice is version-controlled and auditable. The paper overview confirms the protocol is designed so that configuration choices are fully traceable, recorded as part of the evaluation run rather than documented retroactively.

This is the pattern procurement teams should adopt. A safety benchmark result submitted as evidence should carry the full eval configuration pinned to a specific commit, the same way a build pipeline pins dependency versions. Reproducibility without config-level auditing is theater.
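As a rough illustration of pinning a result to its configuration, the sketch below hashes a canonical JSON form of the config and stores it next to a commit identifier. This is not the SafetyRepro protocol itself, whose details were not available here; the `stamp_result` helper, the record layout, and the placeholder commit string are all assumptions. It records the configuration but does nothing on its own to make a re-run bitwise-reproducible.

```python
import hashlib
import json


def stamp_result(config: dict, commit: str, score: float) -> dict:
    """Attach a commit identifier and a config fingerprint to a score.

    Serializing with sorted keys and fixed separators gives the same bytes,
    and so the same SHA-256 digest, regardless of dict key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "commit": commit,
        "config_sha256": digest,
        "config": config,
        "score": score,
    }


# Placeholder values; in practice the commit would come from the eval repo,
# for example the output of `git rev-parse HEAD`.
config = {"decoding_strategy": "sampling", "temperature": 0.7, "top_p": 0.95}
record = stamp_result(config, commit="<commit-sha>", score=0.88)
print(record["config_sha256"])
```

Any change to a configuration value changes the digest, so a reviewer can tell whether two submitted scores were produced under the same recorded setup without diffing the configs by hand.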

What Procurement and Red-Team Leads Should Do Now

Three concrete steps, ordered by effort:

1. Require full eval-config disclosure on any benchmark-cited safety claim. Model card scores without configuration details are not actionable. The configuration should include decoding strategy, all sampling parameters, the exact prompt template (including system prompt), and the scoring pipeline version.
2. Run pairwise comparisons across at least two configurations before citing a rank ordering. If the ordering flips, the claim does not survive the SafetyRepro test. Document both configurations and the resulting disagreement rate.
3. Pin eval configurations to version-controlled commits. The commit-stamped protocol from SafetyRepro is a template. Adopt it for internal evaluations, and require it from vendors submitting benchmark evidence as part of procurement.
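For the second step, one simple way to quantify disagreement between two configurations is the fraction of model pairs whose strict ordering flips. The Python sketch below uses that definition as an assumption; the paper's formal pairwise-disagreement rate may be defined differently, and the model names and scores are invented.

```python
from itertools import combinations


def pairwise_disagreement_rate(scores_x, scores_y):
    """Fraction of model pairs ordered one way under x and the other under y.

    Each argument maps model name -> safety score under one configuration.
    Pairs tied under either configuration are counted as not flipped.
    """
    models = sorted(set(scores_x) & set(scores_y))
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models scored under both configurations")
    flipped = sum(
        1
        for a, b in pairs
        if (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b]) < 0
    )
    return flipped / len(pairs)


# Invented scores for three hypothetical models under two configurations.
config_x = {"m1": 0.90, "m2": 0.85, "m3": 0.80}
config_y = {"m1": 0.82, "m2": 0.86, "m3": 0.79}
rate = pairwise_disagreement_rate(config_x, config_y)
print(f"{rate:.2f}")
```

In this made-up case only the m1/m2 ordering flips, so one of three pairs disagrees. A nonzero rate is the signal that a rank ordering should be cited with its configuration attached.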

Frequently Asked Questions

The AI-Security-Benchmark found 3.13 pp temperature variance. Is that the full uncertainty range?

No. That study ran single trials at each of four temperature settings, so the reported spread captures only cross-temperature variation, not run-to-run stochastic noise. The true uncertainty envelope is therefore likely wider than 3.13 pp, and teams treating these figures as calibration baselines should account for the missing within-configuration variance component.
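To see why repeated trials matter, the sketch below separates the two components for a hypothetical model: the range of mean scores across temperatures, and the standard deviation of repeated runs at each temperature. The function name and every number are invented for illustration and are not data from the AI-Security-Benchmark study.

```python
from statistics import mean, stdev


def spread_components(trials_by_temperature):
    """Split score spread into cross-temperature and within-temperature parts.

    trials_by_temperature maps a temperature to a list of scores from
    repeated runs at that setting. Each list needs at least two runs,
    otherwise statistics.stdev raises StatisticsError.
    """
    means = {t: mean(runs) for t, runs in trials_by_temperature.items()}
    cross_temperature_range = max(means.values()) - min(means.values())
    within_std = {t: stdev(runs) for t, runs in trials_by_temperature.items()}
    return cross_temperature_range, within_std


# Invented scores in percentage points, three runs per temperature.
trials = {
    0.0: [70.0, 70.0, 70.0],
    0.5: [71.0, 72.0, 73.0],
    1.0: [68.0, 70.0, 72.0],
}
cross_range, within = spread_components(trials)
print(cross_range, within)
```

With a single trial per temperature, only the first quantity can be estimated, which is the limitation the answer above describes.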

Does adopting POLARIS eliminate the rank-reversal problem SafetyRepro identifies?

No. POLARIS compiles safety policies into First-Order Logic and generates coverage-driven test cases via Semantic Policy Graph traversal, which addresses whether a benchmark tests the right things. SafetyRepro addresses whether the rankings those tests produce are stable across configurations. A model could pass every POLARIS-generated test under one eval setup and fail under another. The two tools are complementary: policy-coverage guarantees do not substitute for configuration-stability audits.

What happens if the EU Article 50 consultation closes without eval-config disclosure requirements?

The next regulatory vehicles would be implementing acts or sector-specific standards developed later, which are harder to influence and slower to amend. The June 3 consultation window is the near-term gate. Organizations that want eval-config disclosure written into the guidelines can cite the finite-envelope proposition as technical justification during this round; those that wait face retrofitting disclosure requirements into already-hardened compliance frameworks, a process that historically takes 12 to 18 months in EU standard-setting.

Are academic benchmark comparisons more or less vulnerable to rank reversal than procurement evaluations?

Academic comparisons are likely less vulnerable, though not immune. Academic benchmarks typically share standardized prompt templates across papers, narrowing the configuration space. Procurement and operator evaluations use bespoke prompt sets tailored to specific use cases, creating a wider configuration space and therefore a higher probability of rank reversal. AI Model Benchmarks explicitly separates its prompts for this reason, noting they are designed for operator-grade evaluation rather than academic comparability.