groundy
ethics, policy & safety

When LLM Safety Lives at Inference, Not Training: A Certification Gap

Post-training alignment can reshape LLM behavior after the checkpoint regulators audit, leaving a gap between the certified artifact and what actually runs in production.

Frontier AI governance frameworks, as of June 2026, determine whether a model qualifies as “high-impact” by tallying cumulative training compute. The enforcement mechanism for that threshold is self-reporting. According to Peigné et al., no technical verification primitive exists to independently confirm what a training run actually did. At the same time, post-training alignment methods like GRPO-based self-rewarded RL demonstrate that a model’s behavior can be reshaped after the checkpoint that regulators would audit, leaving a governance gap between what was certified and what actually runs.

The Self-Reporting Gap: Why Frontier AI Governance Has No Teeth

Current frontier AI governance regimes use cumulative training compute as the primary criterion for designating high-impact models. The threshold is concrete enough: a model trained above a certain FLOP count triggers additional obligations. The problem is that the only mechanism for establishing whether a model crossed that line is the training organization’s own disclosure. Peigné et al. identify this as a structural verification void: enforcement rests on self-reporting because no technical primitive for verifying training runs exists.
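A minimal sketch of how such a threshold check depends entirely on the reported number. The 6 FLOPs per parameter per token rule of thumb for dense transformers is a common approximation, not something taken from Peigné et al.; the threshold value, model size, token count, and the "unreported" extra compute are all hypothetical figures chosen for illustration.

```python
# Hypothetical threshold; real regimes define their own value and accounting rules.
THRESHOLD_FLOP = 1e25


def estimated_training_flop(n_params: float, n_tokens: float) -> float:
    """Rough dense-transformer estimate: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def is_high_impact(reported_flop: float, threshold: float = THRESHOLD_FLOP) -> bool:
    """The designation is a function of whatever number gets reported."""
    return reported_flop >= threshold


# What the organization declares (hypothetical 70B-parameter, 15T-token run).
declared = estimated_training_flop(n_params=70e9, n_tokens=15e12)

# What was actually spent, including hypothetical unreported additional compute.
actual = declared + 4e24

print(is_high_impact(declared))  # classification based on the disclosure
print(is_high_impact(actual))    # classification based on the true total
```

The two calls disagree, and nothing in the check itself can tell which input is the honest one. That is the gap a verification primitive would have to fill.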

This is not an abstract governance concern. If a training organization misreports compute, or if a third party fine-tunes a model using additional compute that goes unreported, the governance framework has no independent way to detect the discrepancy. The model’s behavior at inference time is the only observable output, and behavior alone cannot reconstruct the training compute budget that produced it.

What ZK Verification Could Actually Prove About Training

Peigné et al. propose a zero-knowledge verification architecture specifically designed for frontier dense pre-training. The system combines three inputs: pre-committed training specifications (declared before the run begins), inter-node network observations captured during training, and on-the-fly Merkle commitments to intermediate model state. These are verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles, avoiding the performance penalties that general-purpose ZK systems would impose on the arithmetic precision that neural network training requires.

The protocol produces three distinct proof types. A genesis proof verifies initialisation against the committed specification. In-training step proofs run across the training run itself, providing ongoing attestation. Ex-ante attestations enforce policy-relevant claims as running invariants, meaning a regulator could verify that a training run adhered to declared constraints without learning proprietary details of the model.
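The roles of the three proof types can be loosely illustrated with ordinary hash commitments. This sketch is not zero-knowledge and is not the authors' protocol: it only shows a specification committed before the run, step records chained back to that commitment, and a declared constraint checked at every step. The spec fields, nonce, and state-root strings are hypothetical.

```python
import hashlib
import json


def commit_spec(spec: dict, nonce: bytes) -> str:
    """Hash commitment to a training spec; the nonce hides low-entropy specs."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(nonce + canonical).hexdigest()


def extend_chain(prev_digest: str, step: int, state_root: str) -> str:
    """Bind one step's state commitment to everything that came before it."""
    record = f"{prev_digest}|{step}|{state_root}".encode()
    return hashlib.sha256(record).hexdigest()


# Declared before the run begins (hypothetical fields).
spec = {"params": 70_000_000_000, "precision": "bf16", "max_steps": 3}
nonce = b"hypothetical-random-nonce"
genesis = commit_spec(spec, nonce)  # plays the role of the genesis anchor

head = genesis
for step, state_root in enumerate(["root-a", "root-b", "root-c"], start=1):
    # A declared constraint enforced as a running invariant.
    assert step <= spec["max_steps"]
    head = extend_chain(head, step, state_root)  # plays the role of a step record

print(genesis)
print(head)
```

In a real zero-knowledge system the verifier would check proofs about these relationships without ever seeing the spec or the state; here everything is in the clear, which is exactly what the proposed zkVM is meant to avoid.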

The authors estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, per arXiv:2606.05433. That timeline and overhead figure are the authors’ own projections, not independently validated as of June 2026.

Post-Training Alignment: The Moving Target Problem

The ZK verification proposal addresses training-time integrity. A separate problem emerges after training is complete.

Ishikawa et al. demonstrate that post-training alignment modifications can substantially alter model behavior through GRPO-based self-rewarded reinforcement learning. The HMX-feel experiment used rubric-based self-rewarded training to enhance LLMs’ expression of feelings, intentions, and self-awareness. The results cut both ways: trained models showed robustness to sycophancy-inducing questions but also exhibited degradation in truthful question-answering capability.
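For readers unfamiliar with GRPO, its core step is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in its commonly described form. The rubric scores are made up, and the use of population standard deviation and the epsilon value are assumptions; this is not a reconstruction of the HMX-feel training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores the model assigned to four of its own responses
# to the same prompt (the "self-rewarded" part).
rubric_scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rubric_scores)

for score, adv in zip(rubric_scores, advantages):
    print(f"score={score:.1f} advantage={adv:+.3f}")
```

Responses scoring above the group mean get a positive advantage and are reinforced; those below are suppressed. Because the reward comes from a rubric rather than from ground truth, whatever the rubric favors is what gets amplified, which is one way to read the trade-off the paper reports.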

The alignment processes applied to production LLMs produce a fixed checkpoint: a model whose behavior has been shaped and then frozen for deployment. The HMX-feel results show that behavior frozen at that checkpoint does not stay frozen once further post-training is applied. Methods like GRPO can reshape what the model does after the checkpoint that a governance framework would certify.

arXiv:2606.03070 (ASymPO) described failure modes in asynchronous RL post-training, specifically a scale-imbalance problem where stale responses evaluated under a current policy produce asymmetric positive and negative loss contributions. However, that paper has been formally withdrawn due to incorrect proofs, and its specific claims should not be treated as reliable. What the episode does indicate is that post-training RL is an active area of research where the engineering details are not yet settled.

When the Certified Checkpoint Isn’t What Runs

The structural problem is straightforward. If a governance framework certifies a model at a specific training checkpoint, and post-training methods can modify that model’s behavior without a new certification event, then the certified artifact and the running artifact diverge.
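The divergence can be sketched with a fingerprint over everything that shapes the deployed model's behavior. In this hypothetical example the base weights are byte-for-byte identical to the certified ones, yet the deployed bundle also carries an adapter, so the artifact as a whole no longer matches. The component names and byte contents are placeholders, not a real deployment format.

```python
import hashlib


def fingerprint(components: dict[str, bytes]) -> str:
    """Digest over every named component, in a stable order."""
    h = hashlib.sha256()
    for name in sorted(components):
        key = name.encode("utf-8")
        blob = components[name]
        # Length-prefix each field so component boundaries are unambiguous.
        h.update(len(key).to_bytes(4, "big") + key)
        h.update(len(blob).to_bytes(8, "big") + blob)
    return h.hexdigest()


# Hypothetical certified artifact and deployed bundle.
certified = {"base_weights": b"\x01\x02\x03"}
deployed = {"base_weights": b"\x01\x02\x03", "adapter": b"\x09\x09"}

base_unchanged = certified["base_weights"] == deployed["base_weights"]
artifact_matches = fingerprint(certified) == fingerprint(deployed)

print(base_unchanged)    # the audited weights are still there
print(artifact_matches)  # but the thing that runs is a different artifact
```

A check that only compares base weights would pass here. Whether a certification covers the base weights alone or the full served bundle is a policy choice, and the gap described above lives in that choice.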

Governance models designed around training compute thresholds evaluate deployment readiness against the resulting training checkpoint: you measure the thing you can count (FLOPs), certify the artifact that results, and evaluate the safety properties of that artifact. None of those steps observes what happens to the model afterward.

PSEBench, a 5,074-case benchmark for patient safety event triage built on Minnesota’s 29 Reportable Adverse Health Events, illustrates the difficulty even when the target is narrow. The benchmark evaluated 15 representative LLMs and found actionable gaps in evidence-grounded policy reasoning and principled abstention. If certifying LLM safety in a single regulated domain with a well-defined event taxonomy produces measurable failures across 15 models, the prospect of certifying safety across the full capability surface of a frontier model, after post-training modification, is a categorically harder problem.

From One-Time Audit to Continuous Inference Monitoring

The ability to adjust model alignment after deployment without repeating pre-training is useful from a capability standpoint. A regulator looking at the same mechanism would, if the technique generalizes, see an audit gap: the artifact that was certified is no longer the artifact that runs.

Closing this gap requires a shift in the governance model. A one-time certification against a frozen checkpoint assumes the model’s behavior is a fixed property of its weights. Post-training alignment methods break that assumption. The alternative is continuous inference-time monitoring: verifying not what the model was trained to do, but what it is actually doing in production.

The ZK verification architecture proposed by Peigné et al. offers a starting point for the training-side half of this problem. It does not address the inference side. A complete governance framework would need to verify both that training complied with declared specifications and that post-training modifications did not alter the safety properties that the training compliance established. The training-side verification itself remains, in the authors’ characterization, “currently impractical at frontier scale,” with an estimated 36-month timeline to a deployable proof of concept.

Frequently Asked Questions

Does the ZK verification scheme cover fine-tuning runs or only initial pre-training?

The Peigné et al. proposal targets dense pre-training specifically, where the compute budget is large and the inter-node training topology is observable. Fine-tuning through methods like LoRA, which trains small low-rank adapter matrices on top of frozen base weights at a far smaller compute budget, falls outside the three-proof architecture. A frontier model could complete pre-training within declared specifications and then undergo fine-tuning with zero verification coverage.

How does the training-side overhead compare to existing AI compliance costs?

The single-digit-percent figure is the authors’ projected overhead on the training run itself, a compute cost rather than a staffing cost. Current governance costs under frameworks like the EU AI Act are measured in personnel time for conformity assessment, documentation, and post-market surveillance. The ZK approach produces machine-checkable cryptographic proofs that any third party can validate independently, whereas current compliance produces documents that regulators must review manually, with no technical guarantee of completeness.

Can any governance regime track what happens to open-weight models after release?

Open-weight releases create a verification gap that neither training proofs nor inference monitoring can close. Once weights are published, any third party can apply fine-tuning or RL-based modification without the original organization’s knowledge. Ishikawa et al. used structured rubrics in their GRPO experiments, but far simpler gradient-based methods can also shift safety-relevant behavior. A certification regime anchored to the training organization’s output has no mechanism for tracking what downstream actors do with released weights.

What would inference monitoring need to measure that a one-time checkpoint audit cannot?

It would need per-domain behavioral benchmarks with explicit pass/fail criteria, run against every model revision rather than only at release checkpoints. PSEBench’s approach of testing evidence-grounded reasoning and principled abstention across 5,074 cases in a single regulated domain illustrates the granularity required. The operational requirement is a continuous regression suite that detects safety-property drift between the certified artifact and the running system, something current industry practice does not maintain.
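A minimal sketch of such a regression check, assuming per-domain pass counts are already available for both the certified baseline and the current revision. The domain names, counts, and the two-percentage-point tolerance are hypothetical and are not drawn from PSEBench results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainResult:
    domain: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def drifted_domains(
    baseline: list[DomainResult],
    current: list[DomainResult],
    tolerance: float = 0.02,
) -> list[str]:
    """Return domains whose pass rate fell by more than `tolerance`,
    plus any domain that has no certified baseline at all."""
    base = {r.domain: r.pass_rate for r in baseline}
    flagged = []
    for r in current:
        if r.domain not in base:
            flagged.append(r.domain)
        elif base[r.domain] - r.pass_rate > tolerance:
            flagged.append(r.domain)
    return flagged


# Hypothetical results at certification time and for the running revision.
certified_run = [
    DomainResult("principled_abstention", 940, 1000),
    DomainResult("evidence_grounding", 900, 1000),
]
production_run = [
    DomainResult("principled_abstention", 945, 1000),
    DomainResult("evidence_grounding", 850, 1000),
]

print(drifted_domains(certified_run, production_run))
```

Run on every model revision, a check like this turns "the certified artifact and the running system have diverged" from an assumption into a measurable event. Choosing the tolerance and the case set is the hard part, and this sketch does not address it.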

sources · 4 cited

  1. Zero knowledge verification for frontier AI training is possible. Primary source, accessed 2026-06-06.
  2. When AI Says It Feels. Primary source, accessed 2026-06-06.
  3. ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information. Primary source, accessed 2026-06-06.
  4. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage. Primary source, accessed 2026-06-06.