Can Reinforcement Learning Be Provably Safe Without Sacrificing Scale?

Two June 2026 arXiv preprints argue that formal safety guarantees no longer impose a capability tax, at least within their tested regime. PS2-RL from Kai S. Yun, Zeyang Li, and Navid Azizan at MIT maintains provable safety while remaining performant on robotic control up to 10 state dimensions. CSPO, an ICML 2026 Spotlight, tackles constraint recovery in the soft-constrained regime. Scale here means low-dimensional robotic control, not a humanoid.

Why hasn’t provably-safe RL scaled until now?

The bottleneck has been certificate synthesis, not learning. Methods that rely on explicit certificate functions require the direct synthesis and verification of control-invariant sets in the state space, and that construction scales poorly with dimension. The higher the state dimension, the harder it becomes to build the set the safety proof depends on. This is the curse of dimensionality applied to a step that sits outside the neural network the RL algorithm actually trains.
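To make the scaling problem concrete, here is a minimal arithmetic sketch. It assumes, purely for illustration, a synthesis method that discretizes the state space on a uniform grid; the article does not say which certificate methods the papers compare against, and `grid_cells` and the choice of 50 points per axis are hypothetical.

```python
def grid_cells(points_per_axis: int, state_dim: int) -> int:
    """Number of cells a uniform grid needs to cover a state space.

    Assumption for illustration only: the invariant set is synthesized and
    verified cell by cell, so cost grows with the number of cells.
    """
    return points_per_axis ** state_dim


# Cell count at a fixed (hypothetical) resolution of 50 points per axis.
for dim in (2, 4, 6, 10):
    print(f"state_dim={dim:2d}  cells={grid_cells(50, dim):.3e}")
```

The resolution stays fixed while the exponent grows, which is why a step that is cheap in two dimensions becomes impractical well before ten.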

The second problem is over-conservatism. Such methods often stay safe by refusing to act, which is why “provably safe” has often meant “provably slow.” A guarantee that only holds when the system barely moves is a guarantee nobody deploys.

How does PS2-RL build invariant sets without synthesizing them?

PS2-RL constructs its safety set implicitly, online, by forward-integrating a learned backup policy rather than synthesizing or verifying the set offline. The framework runs in two phases, according to the full paper. Phase I trains a backup policy using a safe-arrival value function that characterizes the optimal backup policy for invariant-set construction. Phase II trains the final PS2 policy end-to-end through a differentiable projection layer that enforces the safety guarantees induced by the learned backup policy.
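The idea of an implicit set defined by forward integration can be sketched on a toy system. The example below is not the paper's implementation: it assumes a one-dimensional double integrator, a hand-written braking rule standing in for the learned backup policy, and "at rest inside the position limits" as the base set. All names and parameter values are hypothetical.

```python
def brake_policy(v: float, u_max: float) -> float:
    """Hypothetical backup policy: decelerate as hard as the actuator allows."""
    if v > 0.0:
        return -u_max
    if v < 0.0:
        return u_max
    return 0.0


def in_implicit_safe_set(p: float, v: float, p_max: float = 1.0,
                         u_max: float = 1.0, dt: float = 0.001,
                         horizon: float = 5.0) -> bool:
    """Membership test by rollout, with no explicit set ever constructed.

    A state counts as a member if the backup policy, integrated forward with
    explicit Euler steps, keeps |p| <= p_max until the system is (nearly) at
    rest. Assumed dynamics: p' = v, v' = u, with |u| <= u_max.
    """
    for _ in range(int(horizon / dt)):
        if abs(p) > p_max:
            return False          # backup trajectory leaves the constraint set
        if abs(v) <= u_max * dt:
            return True           # reached the assumed base set: at rest, inside limits
        u = brake_policy(v, u_max)
        p += v * dt
        v += u * dt
    return False                  # did not come to rest within the horizon


print(in_implicit_safe_set(0.0, 1.0))   # braking distance is about 0.5
print(in_implicit_safe_set(0.8, 1.0))   # same speed, much closer to the limit
```

The membership query costs one rollout per state, so nothing is enumerated over the state space. The quality of the set depends entirely on the backup policy, which is the part PS2-RL learns in Phase I.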

The differentiable projection is the load-bearing detail. A safety layer that destroyed the policy network’s expressiveness would defeat its own purpose. The abstract claims that maximizing the volume of the implicit control-invariant set in Phase I yields a PS2 policy that is performant and scalable while maintaining provable safety. That policy is trained through the projection rather than constrained after the fact.
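A rough picture of what a projection layer does: the sketch below projects a proposed action onto a single affine constraint in closed form. This is a simplification chosen for illustration. The article does not give the exact form of PS2-RL's layer, so the half-space constraint, the function name, and the numbers are all assumptions.

```python
def project_onto_halfspace(u, a, b):
    """Euclidean projection of action u onto the set {w : a . w <= b}.

    Closed form: u - max(0, (a . u - b) / ||a||^2) * a, with a nonzero.
    The map is piecewise linear in u, so gradients can flow through it
    almost everywhere. Illustrative stand-in, not the paper's layer.
    """
    violation = sum(ai * ui for ai, ui in zip(a, u)) - b
    if violation <= 0.0:
        return list(u)            # already feasible: the action passes through unchanged
    scale = violation / sum(ai * ai for ai in a)
    return [ui - scale * ai for ui, ai in zip(u, a)]


# A proposed action that violates u[0] <= 1 is moved to the boundary;
# the unconstrained component is left alone.
print(project_onto_halfspace([2.0, 0.5], [1.0, 0.0], 1.0))
print(project_onto_halfspace([0.5, 3.0], [1.0, 0.0], 1.0))
```

Because feasible actions pass through untouched, a policy trained through such a layer is only altered where it would otherwise violate the constraint.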

PS2-RL is also algorithm-agnostic. Its authors state that it imposes no restrictions on the underlying RL algorithm and slots into existing training pipelines, which lowers the adoption barrier relative to certificate-synthesis methods that demand bespoke trainers.

How big is “scalable” in these results?

PS2-RL’s demonstrated ceiling is 10 state dimensions, which is genuine progress but well short of the state spaces humanoid and multi-agent systems operate in. The paper reports evaluation on robotic control tasks with state dimensions up to 10, a regime in which prior provably-safe methods struggle or become impractical.

PS2-RL builds its implicit invariant set by forward-integrating the system dynamics; the guarantee is therefore only as good as that dynamics model, and model mismatch remains a live limitation. If the real dynamics deviate from the model, the premise the guarantee rests on is what breaks, not the proof itself.

What does CSPO solve, and why isn’t it the same guarantee?

CSPO improves how fast a soft-constrained method recovers from violations, but it does not provide the per-step formal guarantee PS2-RL’s projection layer enforces. CSPO (Constraint-Sensitive Policy Optimization), accepted as a Spotlight at the 43rd ICML in 2026, is a first-order primal-dual method that augments the primal objective with a correction derived from the shortest signed distance to the safety boundary in constrained MDPs. CSPO targets the delayed-constraint-correction failure mode of primal-dual safe RL, where oscillatory behavior and prolonged violations persist, claiming faster safety recovery and higher reward preservation than state-of-the-art primal-dual and penalty-based baselines on navigation and locomotion benchmarks.
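The delayed-correction failure mode can be reproduced on a toy problem. The sketch below is a generic Lagrangian primal-dual loop, not CSPO: the article does not give the form of CSPO's signed-distance correction, so it is not reproduced here. The scalar "policy" `theta`, the reward and cost functions (both equal to `theta`), and the step sizes are hypothetical.

```python
def run_primal_dual(steps=200, lr_theta=0.1, lr_lambda=0.1, limit=1.0):
    """Plain primal-dual updates on L = J_r - lam * (J_c - limit).

    Toy assumption: reward J_r(theta) = theta and cost J_c(theta) = theta,
    so the constrained optimum sits exactly at theta = limit.
    """
    theta, lam = 0.0, 0.0
    thetas = []
    for _ in range(steps):
        # Primal ascent: dL/dtheta = 1 - lam.
        theta += lr_theta * (1.0 - lam)
        # Dual ascent: the multiplier only grows once the cost exceeds the limit.
        lam = max(0.0, lam + lr_lambda * (theta - limit))
        thetas.append(theta)
    return thetas


thetas = run_primal_dual()
violating_steps = sum(1 for t in thetas if t > 1.0)
print(f"peak cost {max(thetas):.2f}, "
      f"violating steps {violating_steps} of {len(thetas)}")
```

The multiplier reacts to violations only after they occur, so the cost overshoots the limit and then swings back rather than settling. That lag is the behavior a correction term on the primal objective is meant to shorten.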

The distinction is structural, not a matter of degree. CSPO operates in the soft-constrained CMDP regime: it improves constraint recovery and constrained return. Equating CSPO with PS2-RL would conflate a method that converges faster on a penalty with one that projects onto a provably invariant set. PS2-RL and CSPO make different safety claims, and the gap between them is the difference between “mostly safe, eventually” and “safe at every step the model holds.”

One peer-review caveat worth carrying through: CSPO and RAMAC (discussed below) have been accepted to ICML 2026, whereas PS2-RL is a fresh preprint whose claims have not yet passed peer review.

Is provable-and-scalable safe RL a real trend?

Three advances across late 2025 and June 2026 suggest the direction is active, but the methods address different failure modes and should not be lumped together. SPiDR claims provable guarantees for safe sim-to-real transfer via pessimistic domain randomization, validated on two real robotic platforms. RAMAC, also ICML 2026, tackles safe RL from the risk-aware angle: Conditional Value-at-Risk plus behavioral cloning with a diffusion/flow actor and distributional critic, which is statistical tail-risk control rather than a formal per-step guarantee.
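For readers unfamiliar with Conditional Value-at-Risk, a minimal sample-based estimator shows why it counts as tail-risk control rather than a per-step guarantee. This is the textbook empirical estimator, not RAMAC's distributional critic; the function name and the sample costs are made up for illustration.

```python
import math


def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled costs.

    Illustrative estimator only. It summarizes how bad the tail is on
    average; it says nothing about whether any single step is safe.
    """
    n = len(costs)
    # round() guards against floating-point noise such as 3.0000000000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(costs, reverse=True)[:k]
    return sum(worst) / k


episode_costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical per-episode costs
print(empirical_cvar(episode_costs, 0.8))          # average of the two worst episodes
```

An objective of this kind bounds the average severity of the worst outcomes, which is a statistical statement about a distribution and not a constraint enforced at every step.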

Read as a cluster, the signal is that “provable” and “scalable” are no longer treated as mutually exclusive the way the field assumed five years ago. Read carefully, the methods are complementary: SPiDR attacks the sim-to-real gap, RAMAC attacks tail risk, CSPO attacks constraint recovery in soft-constrained settings, and PS2-RL attacks per-step invariance. They are not converging on a single technique.

What does this mean for safety governance?

The governance argument below is editorial extrapolation. None of the papers discussed here addresses policy, governance, or auditing; the reasoning is the writer’s, not the authors’.

If a per-step safety certificate can be produced even within a limited regime, the line between an attestation and a verification sharpens. An attestation says “we trained it safely.” A verification is a checkable object: a constraint that holds at each step, under stated assumptions, that a third party can inspect. The difference matters for who bears the burden of trust. An attestation asks the auditor to trust the developer’s process; a verification asks the auditor to check a property of the artifact itself.

This is where the gap sits, as a matter of editorial argument rather than sourced regulatory fact. Safety practice that leans on training-process disclosures, red-teaming reports, and post-hoc evaluations is built around attestations because formal verification has been assumed not to scale. Whether that assumption still holds after PS2-RL is the open question, and it is an editorial one: the paper makes no governance claim, and the scaling result is bounded to 10 dimensions under strong model assumptions. The direction of travel is that auditable safety properties are closer to practical than the attestation default assumes, and frameworks built on the older assumption are the thing most likely to lag.

Frequently Asked Questions

What assumptions does PS2-RL make about the system dynamics?

PS2-RL assumes a known control-affine, Lipschitz dynamics model plus a pre-certified base set that the invariant construction starts from. The per-step guarantee rests on those structural assumptions, so unmodeled nonlinearity or an uncertified base set is where the proof loses its footing, not the projection layer itself.
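For reference, the two structural assumptions have standard textbook forms. The symbols below (state x, input u, drift f, input matrix g, constant L) are conventional notation chosen here, not taken from the paper, and the paper may state its Lipschitz condition on different terms.

```latex
% Control-affine dynamics: the input u enters linearly.
\dot{x} = f(x) + g(x)\,u

% Lipschitz continuity of the drift term (likewise for g), for some constant L > 0:
\lVert f(x) - f(y) \rVert \le L \,\lVert x - y \rVert \quad \text{for all } x, y
```

A system whose input enters nonlinearly, or whose dynamics change discontinuously (for example at contact), falls outside these forms as written.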

Which specific tasks did PS2-RL validate on, beyond the dimension count?

The reported experiments are unicycle lane-keeping and powerloop tracking on a 10-dimensional quadrotor, with 100 percent safety claimed across both training and deployment. Those are continuous-control tracking problems, not contact-rich manipulation or multi-agent settings, which is why the dimension headline should not be read as coverage of the regimes humanoid teams care about.

How does PS2-RL’s differentiable projection differ from a standard Control Barrier Function?

The projection is a differentiable control-invariant layer enforcing backup-control-barrier-function constraints, and the authors prove it preserves the policy network’s universal approximation property. A conventional CBF safety filter applied after training can clip or distort the learned policy and produce over-conservative behavior, whereas PS2-RL trains through the projection so the network learns to satisfy the constraint rather than being filtered by it.

What prerequisites does a team need before adding PS2-RL to an existing RL pipeline?

Because the framework is algorithm-agnostic, the RL algorithm itself can stay, but the team must supply a control-affine, Lipschitz dynamics model and a certified base set the invariant construction bootstraps from. Certifying that base set is an external step PS2-RL does not automate, so the practical adoption cost lives in the modeling and certification work, not the training loop.

What would extending this approach to humanoid-scale systems actually require?

A 30-degree-of-freedom humanoid has a state space of at least 60 dimensions once joint positions and velocities are both counted, roughly six times PS2-RL’s 10-dimensional ceiling, and the bottleneck moves to certifying the base set at that dimension. Pushing past the current regime likely requires either a cheaper base-set certification procedure or a way to build the implicit invariant set without bootstrapping from a certified region, neither of which the paper demonstrates.