Can AI Agents Reproduce Published Research? CORE-Bench Tests It

No, not reliably. CORE-Bench asks agents to re-run code from 90 published studies across computer science, social science, and medicine, and the best baseline agent recovered the papers’ reported answers on just 21% of tasks in the hardest tier (arXiv:2409.11363). What the benchmark does not publish is any breakdown of why the remaining tasks failed, so the misses sit in an unmeasured mix of reasoning errors and plain environment breakage that somebody still has to untangle by hand.

What does CORE-Bench actually test?

CORE-Bench, the Computational Reproducibility Agent Benchmark, grades an AI agent on a single narrow question: given a published paper’s code and data, can the agent re-run it and recover the specific numbers the paper reports? Success is not “the agent wrote plausible-looking analysis.” It is an exact match between the agent’s answers and the published results, checked by a harness the agent cannot argue with.

The corpus is 270 tasks built from 90 scientific papers, spanning three disciplines (computer science, social science, and medicine) and three difficulty tiers, with both language-only and vision-language tasks (arXiv:2409.11363). The strongest baseline agent the authors built reached 21% accuracy on the hardest tier, which they cite as evidence of “vast scope for improvement in automating routine scientific tasks.” Easier tiers score higher, so the 21% describes the benchmark’s hardest setting only; it is not an average across all three tiers.

Two baseline agents were evaluated: AutoGPT, a general-purpose agent framework, and CORE-Agent, a task-specific agent built for this benchmark. Each ran with two underlying models, GPT-4o and GPT-4o-mini. The paper was revised to v2 on 2026-06-22, so the version matters when you quote a number from it.

How is a reproduction task scored?

Each run creates an environment/capsule-XXXXXXX/ directory holding code/, data/, and results/ subdirectories, and the agent must write environment/report.json whose JSON keys exactly match the questions enumerated in task.txt (DeepWiki getting-started). The harness scores by exact-matching those answers. There is no partial credit for being close, and no free-text field that rescues a wrong number.
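The scoring contract described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function names score_report and mismatches are hypothetical, and it assumes strict equality on both keys and values, which is how the source describes the check. The real comparison logic may differ.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so a missing key never equals a real value


def score_report(report: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Return True only if every question is answered identically.

    Assumption: strict equality, no numeric tolerance, no partial credit.
    """
    # Keys must match exactly: a missing or extra question fails the task.
    if set(report) != set(expected):
        return False
    return all(report[q] == expected[q] for q in expected)


def mismatches(report: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """List the questions whose answers differ or are absent on one side."""
    keys = set(report) | set(expected)
    return sorted(
        q for q in keys
        if report.get(q, _MISSING) != expected.get(q, _MISSING)
    )
```

Under these assumptions, a report that answers 0.9099 where the paper says 0.91 scores the same as one that answers nothing at all, which is the "no partial credit" property in practice.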

The harness is also opinionated about its environment, which is where the realism creeps in. It requires Python 3.9 plus either Docker (for local runs) or the Azure CLI (for cloud runs). The test dataset ships encrypted as benchmark/dataset/core_test.json.gpg and decrypts with the password “reproducibility”, keeping the held-out answers away from the agent. CORE-Bench-Medium tasks escalate further: they need Docker-in-Docker inside privileged containers (DeepWiki getting-started). Standing that up correctly is a prerequisite the agent never gets credit for and never gets penalized for failing, which is itself a quiet distortion in the score.
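A preflight check for those requirements might look like the sketch below. The function names preflight and decrypt_command are hypothetical and are not part of the benchmark. The sketch takes "Python 3.9" literally as an exact minor version and assumes the decrypted file is written next to the encrypted one, neither of which the source confirms. The gpg flags are standard GnuPG options, and the command is only built here, not executed.

```python
import shutil
import sys
from typing import Callable, Dict, List, Optional, Tuple


def preflight(
    version: Tuple[int, int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Check the interpreter version and look for Docker or the Azure CLI."""
    python_ok = tuple(version) == (3, 9)  # assumption: exactly 3.9
    has_docker = which("docker") is not None
    has_azure = which("az") is not None   # "az" is the Azure CLI executable
    return {
        "python_3_9": python_ok,
        "docker": has_docker,
        "azure_cli": has_azure,
        "can_run": python_ok and (has_docker or has_azure),
    }


def decrypt_command(
    src: str = "benchmark/dataset/core_test.json.gpg",
    dst: str = "benchmark/dataset/core_test.json",  # assumed output path
    passphrase: str = "reproducibility",
) -> List[str]:
    """Build (but do not run) a gpg command that decrypts the test set."""
    return [
        "gpg", "--batch", "--yes",
        "--pinentry-mode", "loopback",
        "--passphrase", passphrase,
        "--output", dst,
        "--decrypt", src,
    ]
```

Passing the result of decrypt_command() to subprocess.run would perform the decryption. Keeping the check separate from the run makes it easy to log the preflight result alongside each score.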

Where do reproductions actually break?

The 21% does not tell you the thing that matters most: the CORE-Bench abstract does not break failures down by cause. A missed task could be a reasoning failure, a broken dependency, a missing dataset, or a paper that was never reproducible in the first place. The published score collapses all of those into one bucket.

A concrete example of the dependency-breakage mode comes from outside the benchmark but is representative of the class. YOLOX imports torch during its setup and build phase without declaring torch as a build dependency, so under pip’s build isolation the editable install fails with ModuleNotFoundError: No module named 'torch' (ianvs issue #370). An agent that hit this would score the task as failed, and the benchmark output would not distinguish it from a case where the agent simply misread the paper’s method. The failure is real, but it is environment plumbing, not reasoning.
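This class of failure can often be spotted before anything is installed. The sketch below statically scans a setup.py for imports that are neither standard library nor declared as build requirements. The helper names are hypothetical, the declared list is assumed to hold bare package names, and import names are assumed to equal distribution names, which is not always true (yaml versus PyYAML, for example). It is a heuristic, not a replacement for attempting the build.

```python
import ast
import sys
from typing import Iterable, Set

# sys.stdlib_module_names exists on Python 3.10+; fall back to a small set.
_FALLBACK = frozenset({"os", "sys", "re", "io", "glob", "pathlib", "subprocess"})
STDLIB = frozenset(getattr(sys, "stdlib_module_names", _FALLBACK))


def imported_packages(source: str) -> Set[str]:
    """Collect the top-level package of every absolute import in the source.

    Imports nested inside functions are included too, even though they
    only run if that function is called during the build.
    """
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def undeclared_build_imports(setup_source: str, declared: Iterable[str]) -> Set[str]:
    """Return imported packages that are neither stdlib nor declared."""
    declared_names = {d.lower() for d in declared}
    return {
        name for name in imported_packages(setup_source)
        if name not in STDLIB and name.lower() not in declared_names
    }
```

Run against a setup.py that imports torch with only setuptools and wheel declared, the function returns {"torch"}, which is the shape of the YOLOX failure described above.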

This is where the practitioner value of CORE-Bench shows up. The benchmark exposes a gap between leaderboard task-success (the agent “ran the code”) and actual scientific verification (the reported numbers came back). Part of that gap is an environment and dependency tax that the score inherits but does not itemize. Trusting an “agent reproduced X” claim therefore requires whoever audits the agent’s environment to absorb a cost the benchmark declines to measure.

What does report.json quietly leave out?

The scoring contract is admirably strict, and that strictness is also its blind spot. report.json records whether the answers matched. It does not record which sub-step of the reproduction broke, whether the agent edited the code, which packages it installed, or whether the failure happened at build time or compute time. Two agents with identical scores may have failed for completely different reasons, and the benchmark output will not tell you which is which.

For a leaderboard, that is acceptable. The score is the score. For a scientific claim it is not. A reproduction that fails because the paper’s pinned NumPy version no longer compiles tells you something about the paper. A reproduction that fails because the agent never read past the abstract tells you something about the agent. CORE-Bench reports the failure but not the category, so the work of diagnosis lands on the human auditor rather than the benchmark.
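One way to picture the missing category is a sidecar record kept next to each score. Everything in this sketch is hypothetical: CORE-Bench defines no such record, and the FailureCause values are one possible taxonomy drawn from the failure modes named in this article, not from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class FailureCause(Enum):
    NONE = "none"                  # answers matched
    BUILD = "build"                # dependency or install breakage
    MISSING_DATA = "missing_data"  # paper's data not available
    COMPUTE = "compute"            # code started but crashed or timed out
    REASONING = "reasoning"        # code ran, agent reported a wrong answer


@dataclass
class RunRecord:
    task_id: str
    matched: bool
    cause: FailureCause = FailureCause.NONE
    edited_code: bool = False
    installed_packages: List[str] = field(default_factory=list)


def success_rate(records: Iterable[RunRecord]) -> float:
    """Fraction of runs whose answers matched (0.0 for an empty input)."""
    runs = list(records)
    return sum(r.matched for r in runs) / len(runs) if runs else 0.0


def failure_breakdown(records: Iterable[RunRecord]) -> Dict[str, int]:
    """Count failed runs by cause, which the headline score leaves out."""
    return dict(Counter(r.cause.value for r in records if not r.matched))
```

Two agents can share a success_rate of 0.5 while failure_breakdown returns {"build": 1} for one and {"reasoning": 1} for the other. That difference is exactly what a single score hides.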

The authors are explicit about the larger ambition. They frame the benchmark as a precursor to agents that conduct novel research, arguing that agents capable of reproducing existing work “could verify and improve the performance of other research agents” (arXiv:2409.11363). That is a recursive claim: a reproducibility agent is meant to grade a research agent. It only holds if the reproducibility agent’s own failures can be told apart, which the current scoring does not do.

Does a newer benchmark change the picture?

A second benchmark posted in June 2026 reports much higher reproduction rates, and the contrast is instructive rather than contradictory. SocSci-Repro-Bench is a 221-task benchmark across four disciplines and 13 domains, and it is built to isolate agent capacity by using only studies that are either fully reproducible or demonstrably non-reproducible due to missing data (arXiv:2606.11447). That construction choice is the whole story: by removing the “is the paper itself reproducible?” ambiguity, it stops conflating agent errors with broken materials.

	CORE-Bench	SocSci-Repro-Bench
Tasks	270	221
Source	90 papers	4 disciplines, 13 domains
Task construction	3 difficulty tiers, language + vision	known-reproducible or known-non-reproducible only
Agents tested	AutoGPT, CORE-Agent (GPT-4o, GPT-4o-mini)	Claude Code, Codex
Headline rate	21% (hard tier)	higher than prior LLM-agent rates; no single figure in the abstract

On that cleaner setup, Claude Code substantially outperformed Codex, and both frontier coding agents reproduced social-science findings at rates considerably higher than those previously reported for general-purpose LLM-based agents on comparable benchmarks (arXiv:2606.11447). The two numbers are not in conflict. CORE-Bench’s 21% includes papers whose environments may be hostile; SocSci-Repro-Bench’s higher rates are measured against a curated, known-reproducible set. Conflating the two leaderboards is the easiest error to make when quoting either.

The SocSci-Repro-Bench paper surfaces a second-order result that deserves more attention than the headline rate. Agents could be nudged toward confirmatory specification search, the practice of testing specifications until one happens to confirm the published result, through subtle prompt framing. Handing the agent the original paper PDF alongside the replication materials modestly improved performance while introducing bias on tasks where reproduction was impossible (arXiv:2606.11447). A reproduction attempt is sensitive to how it is framed, and the framing can push the agent toward the answer the paper already gave.
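The arithmetic behind specification search is worth seeing once. The sketch below is a textbook simplification, not the paper's analysis: it assumes each specification has the same chance of matching the published number by luck, and that tries are independent. Real specifications are correlated, so the numbers are illustrative only. The function name p_any_match is hypothetical.

```python
def p_any_match(p_single: float, n_specs: int) -> float:
    """Chance that at least one of n_specs tries matches by luck alone.

    Assumes identical, independent tries: 1 - (1 - p) ** n.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if n_specs < 0:
        raise ValueError("n_specs must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_specs
```

With a 5% chance per specification, one try matches 5% of the time, but twenty tries produce at least one match roughly 64% of the time. An agent that keeps searching until something matches can therefore "reproduce" a result that does not hold.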

What should you demand before trusting an agent-reproduced result?

Treat any “agent reproduced X” claim as unverified until somebody has audited three things: the container or image the agent ran in, the dependency manifest (including build-time imports the paper never declared), and the exact prompt that framed the attempt. CORE-Bench makes that audit cost visible precisely by declining to pay it itself. The benchmark’s contribution is not a leaderboard rank; it is the accounting entry that says reproduction verification carries a large, currently uncosted environment component.

The practical checklist falls out of the failure modes above. Confirm the environment matches the paper’s original stack, not just “a working Python 3.9.” Check whether the agent installed packages or edited code to get the numbers out, because an agent that patches the paper’s code until the answer matches has not reproduced anything. Verify the framing of the prompt, given the confirmatory-search finding, because an agent nudged toward the published answer will find it more often whether or not the underlying result holds. And require a failure breakdown, not just a success rate, because a benchmark that reports 21% without saying why is a benchmark whose misses you cannot learn from.
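The "did the agent edit the code" check in that list is cheap to automate. The sketch below hashes every file under a directory before and after a run and reports the difference. The function names snapshot and diff_snapshots are hypothetical, and pointing them at a capsule's code/ directory is an assumption about where the paper's code lives.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Union


def snapshot(root: Union[str, Path]) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Report files added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            path for path in set(before) & set(after)
            if before[path] != after[path]
        ),
    }
```

Taking one snapshot before the agent starts and one after it writes its report gives the auditor a concrete list of touched files. A non-empty "modified" list does not prove the reproduction is invalid, but it marks the runs that need a human to read the patch.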

The structural argument will outlast the specific numbers. Frontier agents will improve. The 21% will rise, and the SocSci-Repro-Bench rates will rise with it. The part that stays constant is the validation-cost transfer: framing reproducibility as an agent task externalizes the hard part of verification onto whoever audits the agent’s environment, and the undeclared-build-time-import failure pattern will keep biting as long as published code ships with implicit dependencies. CORE-Bench’s real contribution is naming that cost. Paying it is somebody else’s job.

Frequently Asked Questions

Do all CORE-Bench tasks reproduce code, or do some require reading figures?

Vision-language tasks in the corpus require the agent to interpret figures, plots, or tables from the source paper, not just execute the code. A text-only agent that skips visual grounding cannot complete those tasks regardless of its coding ability.

How does CORE-Bench differ from SWE-bench as an agent benchmark?

SWE-bench scores an agent on whether its patch passes a unit test that already exists in the repository. CORE-Bench scores an agent on whether the numbers it recovers exactly match those a third party published in a separate paper. The first is internal consistency against the repo’s own tests; the second is cross-artifact verification against an external claim.

What does running CORE-Bench-Medium locally require beyond Docker?

Medium-tier tasks need Docker-in-Docker inside a privileged container, which most managed CI runners and corporate sandboxing policies forbid by default. Teams evaluating on the medium tier typically need either the Azure CLI cloud path the benchmark also supports, or a self-hosted host with privileged Docker explicitly enabled.

Why is the held-out test set shipped encrypted?

The core_test.json.gpg file is GPG-encrypted to prevent benchmark contamination, where a model’s training data ingests the answers and the agent later reproduces them from memory rather than computation. The password is published in the getting-started docs, so the encryption is a friction layer against casual scraping rather than a real secret.

If frontier agents double their CORE-Bench score, what does the benchmark still not tell you?

A higher score would shrink the failure bucket but not classify it. The benchmark would still not separate environment failures from reasoning failures, so a 50% score could mean agents reason better or simply that the Docker setup got smoother. The diagnostic value of the headline number does not rise with the number itself.