Can an LLM Peer-Review Your Paper? A New Behavior Benchmark

The question isn’t whether an LLM can match a human reviewer’s accept-or-reject call. It’s whether the review the LLM produces behaves like a human review in the ways that matter: the distribution of its scores, the specificity of its criticisms, the consistency of its engagement across papers. PRAIB, a benchmark posted to arXiv on May 28, 2026, is the first framework designed to answer that second question. Its timing is not accidental. Major conferences are already quietly piloting LLM-generated reviews, and the tools to audit those reviews for the right things barely exist.

What PRAIB measures, and why behavior is not accuracy

Most prior work on LLM-assisted peer review measures score agreement: does the model’s rating correlate with the human average? PRAIB sidesteps that framing entirely. Instead, it introduces a suite of behavioral metrics covering review specificity, stylistic properties, and patterns of engagement. The distinction matters because a model could produce scores that correlate well with human averages while systematically producing reviews that are longer, less critical, and structurally different from what a human would write. An editor who only checks score correlation would miss every one of those failures.
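To make that gap concrete, here is a minimal sketch using made-up ratings (not PRAIB data). The model ranks the papers exactly as the humans do, so Spearman correlation is perfect, yet its scores sit higher and cluster more tightly. The helper names and the 1-10 rating scale are assumptions for illustration only.

```python
from statistics import mean, pstdev

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ratings for five papers (illustrative only).
human = [3.0, 4.0, 5.0, 6.0, 8.0]
model = [6.0, 6.5, 7.0, 7.5, 8.0]  # same ordering, shifted up and squeezed

rho = spearman(human, model)                   # agreement on ordering
bias = mean(model) - mean(human)               # positive means inflated scores
spread_ratio = pstdev(model) / pstdev(human)   # below 1 means compressed variance

print(f"rho={rho:.2f} bias={bias:+.2f} spread_ratio={spread_ratio:.2f}")
```

A score-agreement check would pass this model without complaint; the bias and spread figures are what reveal the problem.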

The study generated 11,000 reviews from five models, both proprietary and open-source, across 1,000 ICLR and NeurIPS papers spanning 2021 through 2025. It compared the machine-generated reviews against the original human feedback under a range of prompting strategies (PRAIB). That scale is large enough to surface behavioral patterns that a smaller pilot would likely miss.

The findings: positive bias, low variance, and missed weaknesses

PRAIB reports that LLM ratings are less variable and positively biased compared to human reviewers, and that models exhibit overconfidence in their assessments. Cross-reference patterns are model-dependent and distinct from human norms: different models cite different related work, in different patterns, than human reviewers do. The reviews themselves are longer and syntactically more complex than their human counterparts, yet they frequently overlook the atomic weaknesses that human reviewers flag.

The pattern is consistent with what researchers have observed in other LLM evaluation contexts: models default toward agreeable, verbose, and structurally polished output that superficially resembles competence. Length and complexity are not proxies for rigor. A review substantially longer than a human’s but missing the core methodological flaw is worse than useless, because its surface polish makes the omission harder to spot.

Review Arcade: authors can game LLM reviews for a score bump

PRAIB landed the same week that Review Arcade, currently under review at EMNLP 2026, published findings on LLM-review gameability. Review Arcade ran experiments on 2025 ACL Rolling Review papers and found that LLM-human alignment is “reasonable” at best, varying substantially across prompts and models. The more consequential finding: when authors iteratively revise their papers against LLM feedback, a form of gaming the review process, they achieve statistically significant score increases for up to 35% of papers (Review Arcade).

That 35% figure, from Review Arcade’s experiments, is a specific-scenario upper bound, not a general rate. But it is directionally alarming. If authors can reverse-engineer an LLM reviewer’s preferences and revise toward them, the review process stops measuring paper quality and starts measuring prompt-compliance. This is not a theoretical attack vector; it is a straightforward consequence of deploying a deterministic (or near-deterministic) reviewer that authors can query at will before submission.
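One way a venue might probe for this is sketched below. It is an assumption-laden illustration, not Review Arcade’s procedure: it supposes the LLM reviewer is queried several times on each paper before and after an author’s revision, then runs an exact one-sided permutation test per paper. The function name, the scores, and the 0.05 threshold are all hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(before, after):
    """Exact one-sided permutation test for mean(after) > mean(before)."""
    pooled = list(before) + list(after)
    observed = mean(after) - mean(before)
    n_after = len(after)
    n_before = len(pooled) - n_after
    total = sum(pooled)
    hits = count = 0
    # Enumerate every way to relabel the pooled scores as "after".
    for idx in combinations(range(len(pooled)), n_after):
        after_sum = sum(pooled[i] for i in idx)
        diff = after_sum / n_after - (total - after_sum) / n_before
        count += 1
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / count

# Hypothetical repeated LLM scores: (before revision, after revision).
papers = {
    "paper_a": ([5, 5, 6, 5, 6], [7, 7, 8, 7, 8]),  # clear jump
    "paper_b": ([5, 6, 5, 6, 5], [5, 6, 6, 5, 5]),  # no real change
}

p_values = {name: permutation_p_value(b, a) for name, (b, a) in papers.items()}
share_significant = sum(p < 0.05 for p in p_values.values()) / len(p_values)
print(p_values, share_significant)
```

A significant increase is only half the test. It signals gaming only if the revision did not also improve the paper by some independent measure, such as human re-review, and a real audit across many papers would also need a multiple-comparison correction.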

The tone problem: prompt wording shifts review behavior

A third study, “Mind Your Tone” (accepted at AMCIS 2026), demonstrates that tonal variations in prompts cause systematic but model-dependent accuracy shifts across four popular LLMs on MMLU questions. The direct relevance to peer review is that review prompts are tonal instructions: “provide a thorough and constructive review” primes differently than “identify all weaknesses in this submission.” If tone shifts model behavior on a standardized benchmark, it plausibly shifts behavior on peer review too, where the prompt space is far less constrained. That extension is an inference, since the study tested MMLU questions rather than reviews.

This is a compounding problem for venues deploying LLM reviewers. The review prompt is itself a lever that editors may tune without realizing they are altering the model’s behavioral profile. PRAIB’s framework, which measures behavioral properties of the output rather than just score correlation, is the kind of tool that would surface tone-induced drift. Without it, an editor comparing two prompt variants has no diagnostic beyond asking “do the scores look right?”
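As a rough sketch of a diagnostic that goes beyond “do the scores look right?”, the snippet below summarizes two surface properties of the review text under each prompt variant and reports the relative change. The metrics are crude proxies chosen for illustration, not PRAIB’s metric suite, and the review texts and function names are invented.

```python
import re
from statistics import mean

def profile(reviews):
    """Summarize surface behavior of a list of review texts."""
    word_counts = [len(r.split()) for r in reviews]
    # Count runs of sentence-ending punctuation; at least one sentence per review.
    sentence_counts = [max(1, len(re.findall(r"[.!?]+", r))) for r in reviews]
    return {
        "mean_words": mean(word_counts),
        "mean_words_per_sentence": mean(
            w / s for w, s in zip(word_counts, sentence_counts)
        ),
    }

def drift(profile_a, profile_b):
    """Relative change of each metric when moving from variant A to variant B."""
    return {k: (profile_b[k] - profile_a[k]) / profile_a[k] for k in profile_a}

# Hypothetical outputs from the same model under two prompt wordings.
constructive = [
    "The paper is clear. The experiments are broad and convincing.",
    "A solid contribution. Minor typos remain.",
]
critical = [
    "Baselines are missing.",
    "No ablation. Claims are unsupported.",
]

change = drift(profile(constructive), profile(critical))
print(change)
```

In this toy case the critical wording halves the mean review length. An editor would see that shift here even if the ratings under both prompts were identical.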

The Triumvirate vision meets the empirical wall

A SIGARCH blog post proposes an “LLM Triumvirate” for AI-augmented review: a Guardian role that stress-tests papers skeptically, a Synthesizer that places work in context, and an Innovator that assesses novelty. Each produces structured output, with a human Associate Editor as final arbiter. The post envisions a rapid LLM-first review step before human decision-making.

The architecture is coherent, but the empirical evidence from PRAIB and Review Arcade makes it harder to implement than the proposal acknowledges. PRAIB’s finding that LLMs overlook atomic weaknesses directly undermines the Guardian role: a skeptical reviewer that misses the actual methodological flaw is not skeptical, just verbose. Review Arcade’s gameability results complicate any rapid LLM-first review workflow, because that workflow is precisely the attack surface where prompt-compliance displaces quality. The Triumvirate post is an advocacy piece, not peer-reviewed research, and should not be conflated with the empirical findings from either PRAIB or Review Arcade.

What editors should do before deploying LLM reviews

PRAIB gives editors concrete diagnostic axes that a simple accuracy check misses: positive bias in ratings, compressed variance, model-dependent citation patterns, and a systematic failure to flag atomic weaknesses. Before a venue routes reviews through an LLM, it should run the model’s output through a behavioral audit on at least these four dimensions.

The practical minimum:

Calibrate on a held-out set. Run the proposed LLM reviewer on papers where human reviews already exist. Measure behavioral metrics, not just score correlation. PRAIB’s framework provides the vocabulary for what to check.
Test for gameability. Follow Review Arcade’s methodology: can authors who revise against the model’s feedback achieve score bumps that don’t correspond to genuine improvements? If yes, the deployment has a structural integrity problem.
Audit the prompt. Run the review prompt through a tone-sensitivity test. If small wording changes shift the model’s behavioral profile, the prompt is not a stable interface. “Mind Your Tone” provides the methodology.
Disclose the LLM’s role. Review Arcade notes that major conferences are already piloting LLM-generated reviews. If a venue is doing this, authors and reviewers deserve to know, and the behavioral audit should be public.

The burden of proof is on the venue deploying the LLM, not on the researcher demonstrating the failure. PRAIB provides the measurement tools. Whether anyone uses them is a separate question.

Frequently Asked Questions

What should an editor measure first when auditing an LLM reviewer?

Start with rating variance compression. PRAIB finds LLM ratings cluster more tightly than human scores, so compute the standard deviation of the model’s ratings across a held-out set and compare it against the human baseline. Compare the mean ratings at the same time: PRAIB reports compression and positive bias together, but lower variance does not by itself establish bias, so each needs its own check. These are among the cheapest behavioral metrics to automate and the most likely to catch a misconfigured deployment before it affects accept-or-reject decisions.
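A minimal sketch of that first check follows. The thresholds are placeholders a venue would have to choose for itself, the ratings are invented, and the mean shift is reported alongside the spread only because it comes almost for free.

```python
from statistics import mean, stdev

def rating_audit(human, model, max_std_ratio=0.8, max_mean_shift=0.5):
    """Compare the spread and center of model ratings against human ratings.

    max_std_ratio and max_mean_shift are hypothetical thresholds, not
    values taken from PRAIB.
    """
    std_ratio = stdev(model) / stdev(human)
    mean_shift = mean(model) - mean(human)
    return {
        "std_ratio": std_ratio,
        "mean_shift": mean_shift,
        "variance_compressed": std_ratio < max_std_ratio,
        "positively_biased": mean_shift > max_mean_shift,
    }

# Hypothetical held-out set: one human and one model rating per paper.
human_ratings = [3, 5, 6, 8, 4, 6]
model_ratings = [6, 6, 7, 7, 6, 7]

report = rating_audit(human_ratings, model_ratings)
print(report)
```

Both flags fire on this toy data. In a real audit the held-out set should be large enough that the two standard deviations are stable estimates.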

Could the Triumvirate’s rapid-review-with-revision cycle worsen gameability?

Yes. The SIGARCH proposal envisions authors receiving instant LLM feedback and revising within a one-week window before human review. That compressed cycle tightens the gaming feedback loop Review Arcade documents: instead of iterating against an LLM reviewer privately before submission, authors iterate against the conference’s own reviewer with a deadline. The attacker’s task shrinks from guessing the model’s preferences to reading the model’s actual output and revising toward it, which makes the attack easier, not harder.

Would two conferences using different LLMs produce consistent reviews of the same paper?

Not necessarily. PRAIB finds that cross-reference patterns are model-dependent, meaning different models cite different related work and engage with different aspects of a submission. If Conference A deploys one LLM and Conference B deploys another, the same paper could receive structurally divergent reviews, each internally coherent but mutually inconsistent. This is a coordination problem that score-level benchmarks would not surface.
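One simple way to quantify that divergence, assuming the works cited in each review have already been extracted into sets of identifiers, is Jaccard overlap. The citation keys below are invented, and this measure is an illustrative choice rather than the metric PRAIB uses.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty citation sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical citation keys extracted from three reviews of one paper.
model_a_refs = {"smith2021", "lee2022", "chen2023"}
model_b_refs = {"lee2022", "patel2024"}
human_refs = {"smith2021", "lee2022"}

print("model A vs model B:", jaccard(model_a_refs, model_b_refs))
print("model A vs human:  ", jaccard(model_a_refs, human_refs))
print("model B vs human:  ", jaccard(model_b_refs, human_refs))
```

Low overlap between the two models alongside higher overlap with the human review for only one of them would be a sign that the venues’ reviews diverge in ways a score comparison cannot show.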

Do PRAIB’s findings transfer to journals outside machine learning?

The study covers only ICLR and NeurIPS papers. ML conference review has specific norms, such as reproducibility checklists, expected baseline comparisons, and code release, that shape what counts as an atomic weakness. In domains where reviewer norms center on theoretical soundness or clinical validity rather than empirical baselines, the behavioral profile of LLM reviews could differ. Editors in those fields should run their own behavioral audits rather than assuming PRAIB’s directional findings transfer without validation.

Why did prior LLM review evaluations focus only on score agreement?

Score agreement was easy to compute from existing OpenReview metadata: compare predicted ratings to actual ones using Spearman correlation or Cohen’s kappa. Measuring behavioral properties like review specificity, engagement style, and citation patterns requires generating full reviews and running NLP analysis on the resulting text, which is computationally heavier and harder to standardize across venues. PRAIB’s contribution is the metric design itself, not just the 11,000-review experiment that validates it.