
Can a 30B Model Post-Train Itself? A-Evolve-Training Tests Autonomous RL

A 30B Nemotron model post-trained itself to 8th of 4,000 on NVIDIA's leaderboard, then detected its own internal metric lying and rewrote its evaluation frame mid-run.


A 30B Nemotron model placed 8th of roughly 4,000 entries on the NVIDIA Nemotron-Reasoning Challenge leaderboard with a held-out score of 0.86, against the top human submission’s 0.87, according to the A-Evolve-Training paper. It was post-trained across four autonomous rounds on multi-H200 GPU clusters with no human-curated reward signal in the loop. Mid-run, the system detected that its own internal metric was lying and changed what it optimized for. That is the more interesting result.

How does the autonomous loop work?

Eight identical full-stack agents run in parallel each round. Each one forks the same audited default training stack into an isolated sandbox, runs an experimental post-training recipe, and reports results. The paper notes that specialized agents handing off mid-states “did not scale” at this compute budget, which is why the N=8 homogeneous, memory-free worker design won out. Workers carry no state between rounds; each starts fresh from the same immutable reference substrate, which cannot be overwritten.

A constitutionally bounded meta agent sits above the eight workers. After each round it reviews results and rewrites the search policy for the next round (which areas of the training recipe to explore, what hyperparameters to vary), but it does so under a frozen constitution that it cannot modify. The meta agent has discretion over how to search; the rules governing what counts as improvement are locked above it.

The isolation between workers is a feature: no single experimental result can contaminate the rest of the pool mid-round, and the meta agent’s job is to synthesize a policy from noisy individual outcomes rather than receive a curated hand-off from a prior stage. The frozen constitution serves a parallel purpose: it prevents the meta agent from redefining its own success criteria, which is the most obvious failure mode in any system that writes its own rules.

The three-part structure is not incidental. Immutable substrate, memory-free workers, and constitutionally bounded meta agent each answer a specific failure mode that surfaces when training runs cost weeks rather than minutes.
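The three-part structure can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the names `CONSTITUTION`, `SUBSTRATE`, `run_worker`, and `revise_policy` are hypothetical, the hyperparameters are placeholders, the scores are random stand-ins for real training runs, and the workers run sequentially here rather than in parallel.

```python
import random
from dataclasses import dataclass
from types import MappingProxyType

# Hypothetical frozen constitution: readable by the meta agent, not writable.
CONSTITUTION = MappingProxyType({"target": "external_score", "n_workers": 8})

# Hypothetical immutable substrate (stand-in for the audited default stack).
SUBSTRATE = MappingProxyType({"learning_rate": 1e-5, "kl_coeff": 0.05})


@dataclass(frozen=True)
class Result:
    recipe: dict
    dev_score: float
    external_score: float


def run_worker(policy, rng):
    """Memory-free worker: fork the substrate, perturb it, report one result."""
    recipe = dict(SUBSTRATE)  # fresh copy; the substrate itself is never mutated
    for key in policy["explore"]:
        recipe[key] *= rng.uniform(1 - policy["spread"], 1 + policy["spread"])
    # Placeholder scores; a real worker would train and evaluate here.
    return Result(recipe, dev_score=rng.random(), external_score=rng.random())


def run_round(policy, seed):
    """Run all workers for one round (sequentially, for illustration)."""
    rng = random.Random(seed)
    return [run_worker(policy, rng) for _ in range(CONSTITUTION["n_workers"])]


def revise_policy(policy, results):
    """Meta agent: change how to search, never what counts as improvement."""
    scores = sorted(getattr(r, CONSTITUTION["target"]) for r in results)
    best, upper_median = scores[-1], scores[len(scores) // 2]
    # Narrow the search if one worker clearly beat the pack, else widen it.
    factor = 0.5 if best - upper_median > 0.1 else 1.5
    new_spread = min(policy["spread"] * factor, 0.9)
    return {"explore": list(policy["explore"]), "spread": new_spread}


def run_loop(n_rounds=4):
    policy = {"explore": ["learning_rate"], "spread": 0.2}
    history = []
    for round_index in range(n_rounds):
        results = run_round(policy, seed=round_index)
        history.append(results)
        policy = revise_policy(policy, results)  # only the policy crosses rounds
    return history, policy
```

Calling `run_loop()` returns four rounds of eight results each. The only thing that crosses a round boundary is the policy dictionary, and any attempt to assign into `CONSTITUTION` or `SUBSTRATE` raises a `TypeError`, which is the read-only property the design relies on.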

What does 8th of 4,000 actually prove?

Per the paper, the 0.86 score placed 8th of roughly 4,000 entries on the NVIDIA Nemotron-Reasoning Challenge leaderboard as of a June 1, 2026 snapshot. That gives the result a fixed external reference point that most autonomous ML experiments lack. The top human submission scored 0.87, and that 0.01 gap is narrow enough that checkpoint selection, evaluation noise, or snapshot timing could each account for it.
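To see why a 0.01 gap can sit inside evaluation noise, here is a rough back-of-the-envelope check. It assumes the leaderboard score behaves like an accuracy over independent test items, and it treats the two submissions as independent samples. Neither assumption is confirmed by the source, and the test-set sizes used below are hypothetical.

```python
import math


def standard_error(score, n_items):
    """Binomial standard error of an accuracy-like score over n_items."""
    return math.sqrt(score * (1.0 - score) / n_items)


def gap_in_standard_errors(score_a, score_b, n_items):
    """Size of the gap between two scores, in units of its standard error."""
    se_diff = math.sqrt(
        standard_error(score_a, n_items) ** 2
        + standard_error(score_b, n_items) ** 2
    )
    return abs(score_a - score_b) / se_diff


# Hypothetical test-set sizes; the real held-out set size is not given.
for n_items in (1_000, 10_000, 100_000):
    print(n_items, round(gap_in_standard_errors(0.86, 0.87, n_items), 2))
```

Under these assumptions, a held-out set of about a thousand items leaves the gap well under one standard error, while a set a hundred times larger would make the same gap clearly distinguishable. Whether 8th versus 1st is meaningful therefore depends on a number the article does not report.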

What the ranking establishes is auditability. The authors submitted to a shared external leaderboard with a held-out target they could not directly optimize against. That is a stronger epistemic position than an in-paper benchmark sweep on a test set the authors control, even if the margin to first place is thin enough to be noise.

What is the proxy reversal, and why does it matter more than the leaderboard score?

The most significant result is that in later rounds, the system autonomously detected its internal dev metric had decoupled from external performance and revised its search policy accordingly. The paper reports that candidates were driving dev scores to record highs without moving the leaderboard target; the meta agent then sought interventions that lowered the now-misleading proxy while improving the external target.

This is Goodhart’s Law playing out in a system designed to catch it. The dev metric was presumably a reasonable proxy for external performance at round 1; by later rounds, the search process had found a region of training space where proxy and target diverged. Rather than continuing to optimize a broken signal, the meta agent detected the divergence and reoriented.
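One plausible way to flag this kind of divergence is to check whether the dev metric still ranks a round's candidates in the same order as the external target. The sketch below is an assumption for illustration only: the paper's actual detection mechanism is not described here, the function names are hypothetical, and the scores are made-up numbers. It uses a simple Spearman rank correlation that assumes no tied scores.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, index in enumerate(order):
        out[index] = float(rank)
    return out


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def proxy_decoupled(dev_scores, external_scores, threshold=0.0):
    """Flag a round where higher dev scores no longer track the target."""
    return spearman(dev_scores, external_scores) <= threshold


# Made-up scores for eight candidates in an early and a late round.
dev = [0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84]
early_external = [0.800, 0.805, 0.803, 0.812, 0.815, 0.820, 0.818, 0.825]
late_external = [0.850, 0.848, 0.852, 0.845, 0.843, 0.840, 0.841, 0.838]

print(proxy_decoupled(dev, early_external))  # dev still tracks the target
print(proxy_decoupled(dev, late_external))   # dev now moves against it
```

With only eight candidates per round the correlation estimate is noisy, so a real system would likely want a stricter threshold or evidence across more than one round before abandoning a proxy.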

Standard RL setups do not autonomously detect and correct proxy reversal at training time. A human researcher typically catches this in retrospective analysis, not during the run. The design choice that made detection possible is the separation between the meta agent (which can revise the search policy) and the frozen constitution (which defines what external performance means and cannot be touched). If the meta agent could rewrite the constitution as well, it could declare the proxy correct by definition.

The flip side is a harder trust problem. The loop rewrote its own evaluation frame mid-run. No external observer validated whether the original dev proxy was appropriate, whether the revised proxy was better, or whether the policy rewrite was the right call. In a standard experiment, a human researcher makes those decisions and can be questioned. Here, the decision log is machine-generated. The paper reports the proxy reversal as a success, and it may well be, but the criteria for what counts as “catching a misleading metric” versus “deciding a metric is misleading when it is still informative” cannot be audited from outside the run.

Why does 30B scale break the cheap-retry assumption?

A-Evolve-Training’s four-round structure is a budget constraint, not a design preference: at 30B parameters on multi-H200 clusters, each trial is a full multi-week training run, and the cheap-retry assumption that makes small-scale autonomous loops tractable collapses entirely.

Karpathy’s autoresearch repo, published March 2026, gives agents a single-GPU nanochat setup with 5-minute training windows. Hugging Face’s ml-intern, released April 2026 and built on smolagents, ran up to 300 iterations and pushed Qwen3-1.7B from roughly 10% to 32% on GPQA in under 10 hours. Both are viable precisely because a wrong hypothesis costs minutes and cents.

The paper is explicit that prior public autonomous ML research demonstrations operated at GPT-2-class budgets (roughly 124M parameters); A-Evolve-Training is the first publicly reported run at frontier scale, where each trial is expensive enough that a bad policy call from the meta agent compounds across weeks. You cannot run 300 iterations. You cannot cheaply recover from a bad hypothesis when testing it consumes a full multi-week training run.

This changes the quality bar on every component. At GPT-2 scale, random search across training configurations is viable; mistakes disappear in the noise of cheap retries. At 30B+, the loop needs to extract maximum information from each round because there are very few of them. The N=8 parallel worker design addresses this directly: instead of running one experiment per round and iterating, the system runs eight simultaneously against the same baseline and gives the meta agent a distribution of results to reason from rather than a single data point.

Who checks the checker when the loop revises its own evaluation frame?

The evaluation trust problem is structural: the loop revised its own dev proxy mid-run, and the only external check is the leaderboard, which confirms the final result but cannot verify the intermediate decisions that produced it.

This is not an argument about the authors’ conduct. The external leaderboard as arbiter is the right design choice, and the proxy reversal is reported as a finding that strengthens the system’s design. The issue is that a system which rewrites its measurement frame has changed the rules of its own game. No outside observer can fully audit whether the rewrite was correct without re-running the experiment, because the only ground truth for whether the revised proxy was better than the original is the final leaderboard result, the same result the system already optimized toward.

A human researcher who decides mid-experiment to change an evaluation metric produces a paper trail of reasoning that reviewers can interrogate: here is why the original metric was misleading, here is the evidence, here is what changed. An autonomous meta agent that does the same produces a log that can be described but not independently second-guessed without rerunning the entire multi-week experiment. The paper describes the proxy reversal; it cannot demonstrate, from outside the system, that the detection was correct rather than coincident with the result improving for other reasons.

Independent replication with the same external leaderboard target is the only available check. Replication would reconstruct the result from the external signal rather than trusting the internal log, and if the proxy reversal is a genuine system property rather than a one-run artifact, it should appear in independent replications too.

Where does A-Evolve-Training fit in the autonomous research ladder?

A-Evolve-Training is the first publicly documented autonomous post-training run at frontier scale, filling the gap between GPT-2-class experiments and the 30B+ parameter range where compute economics resemble production infrastructure decisions rather than research prototypes.

Autoresearch and ml-intern established proof-of-concept at small scale. The 10-hour, 300-iteration budget is enough to produce measurable benchmark improvement on a 1.7B parameter target, as ml-intern’s Qwen3-1.7B results on GPQA demonstrate. A-Evolve-Training shows that the same architecture (agent workers, meta-level policy revision, external evaluation target) survives the move to 30B given the right structural constraints, placing 8th of roughly 4,000 entries on an external leaderboard with a score of 0.86 against a top human submission of 0.87.

The three design principles that let the loop survive four rounds are the most portable result here: fork from an immutable substrate, keep workers memory-free, bound the meta agent’s discretion by a frozen constitution. These apply to any autonomous training infrastructure where individual trials are expensive, and the paper’s explicit framing of why specialized agents handing off mid-states failed gives builders a concrete anti-pattern to avoid.

The proxy reversal detection is harder to generalize. It works in this setting because there is a clean external leaderboard to triangulate against when the internal dev metric starts drifting. In most real fine-tuning problems, that external signal either does not exist or arrives too slowly to serve as a live corrective. Detecting proxy decay without a reliable real-time external target is probably the more consequential open problem, and the 0.01 gap to first place on the leaderboard, while auditable, does not resolve it.

Frequently Asked Questions

How does A-Evolve-Training’s four-round budget compare to ml-intern’s 300 iterations?

ml-intern completes roughly 300 iterations in under 10 hours because each step on a 1.7B model costs minutes; A-Evolve-Training completes four rounds over multiple weeks because each 30B trial is a full training run. The roughly 75 to 1 iteration ratio (300 iterations against 4 rounds) is the direct compute-cost consequence of the parameter jump, not a difference in loop design.

What is the minimum compute footprint to run the N=8 worker design?

Each of the eight parallel workers needs a full multi-H200 GPU slice running a complete 30B post-training recipe, not a shared cluster. The meta agent reasons over the distribution of eight simultaneous results, so the floor is eight parallel frontier-class slices per round, which puts the homogeneous worker design out of reach for single-node teams.

Why can’t the 120B and 550B runs validate the system’s effectiveness?

The Nemotron-Reasoning Challenge has no public human baseline at 120B or 550B, so those runs only prove the infrastructure completes end to end. The proxy reversal and search-policy revisions at those sizes cannot be judged as improvements rather than neutral drift without a human comparison score, which is why the authors fence those results off from capability claims.

What happens if the external leaderboard target is itself gameable?

The meta agent triangulates the dev metric against whatever external signal the frozen constitution fixes. If that target has test-set leakage or a known gaming vector, the loop reorients toward the broken external signal rather than away from it. The choice of external arbiter inherits the leaderboard’s own weaknesses, which the internal detection logic cannot correct.

sources · 4 cited

  1. karpathy/autoresearch · github.com · community · accessed 2026-06-27
  2. The Post-Training Agent · AI Beat · ai-beat.github.io · analysis · accessed 2026-06-27