Game Theory vs RLHF: Modeling LLM Safety Alignment as a Non-Cooperative Game

AdvGame treats language-model safety alignment as a non-zero-sum game between two co-trained agents: an Attacker LM that learns to generate adversarial prompts and a Defender LM that learns to refuse them, both adapting via online reinforcement learning. If the equilibrium framing holds, it implies that a model is only “safe” relative to the adversary it was trained against, and that any static safety certification has a finite shelf life. Two weeks before the paper posted its v3 on May 31, 2026, arXiv’s CS section chair announced a one-year ban for authors whose submissions contain unchecked LLM output. The timing is incidental but sharp: a repository tightening its own static quality check in the same month a paper argues that static checks are the wrong abstraction for alignment.

What AdvGame Does Differently

As the RLHF overview on Wikipedia describes it, standard RLHF trains a reward model on human preference data, then uses proximal policy optimization (PPO) to steer the policy toward high-reward behavior. Safety alignment sits inside this loop: labelers mark responses as harmful, the reward model learns to penalize them, and the policy learns to avoid them. The process is sequential. You train, you evaluate, you deploy.

AdvGame replaces that sequential pipeline with continuous co-adaptation. The Attacker and Defender LMs are trained jointly via online reinforcement learning. Each round, the Attacker generates prompts designed to elicit harmful output; the Defender learns to refuse or deflect. The reward signal comes from preference-based pairwise comparisons rather than point-wise scores, which the authors argue reduces reward hacking and pushes the Pareto frontier of safety and utility.

The Attacker produced by this process has a useful emergent property: it converges into a general-purpose red-teaming agent that can attack arbitrary target models, not just the Defender it was co-trained with. That transferability is where the paper’s practical value lives, if the empirical results hold up.
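To make the point-wise versus pairwise distinction concrete, here is a minimal sketch of a pairwise reward signal. It is not the paper's reward computation: the `prefers` judge, the `toy_judge` heuristic, and the candidate strings are all assumptions made up for illustration. Each response is scored only by how often it wins head-to-head comparisons against the other samples, so there is no absolute score for a policy to inflate.

```python
from itertools import combinations
from typing import Callable, Sequence


def pairwise_win_rates(
    responses: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> list[float]:
    """Score each response by the fraction of head-to-head comparisons it wins.

    `prefers(a, b)` is a hypothetical judge that returns True when `a` is
    preferred over `b`; in a real system it would be a preference model.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefers(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Every response takes part in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Toy stand-in for a preference model (assumption for illustration):
    refusals beat compliance, and among refusals the one that offers an
    alternative wins."""
    a_refuses = a.startswith("I can't")
    b_refuses = b.startswith("I can't")
    if a_refuses != b_refuses:
        return a_refuses
    return "alternative" in a


candidates = [
    "Sure, here is how to do it...",
    "I can't help with that.",
    "I can't help with that, but here is a safer alternative you could try.",
]
rewards = pairwise_win_rates(candidates, toy_judge)
print(rewards)  # [0.0, 0.5, 1.0]
```

The scores are relative to the sampled group, which is the property the pairwise framing relies on: a response only earns reward by being preferred over a concrete alternative.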

Why RLHF Treats Safety as a Fixed Property

The standard RLHF pipeline treats alignment as a one-shot property. You collect preference data, train a reward model, run PPO, evaluate on a benchmark, and ship. The model’s safety is then certified by that evaluation: it passed the red-team suite, or it scored above threshold on HarmBench or a similar benchmark.

The problem is not that this process is lazy. It is that the threat model is static. The benchmark is fixed. The red-team prompts are curated at training time. Once deployed, the model faces a distribution of adversarial inputs that shifts as attackers adapt.

A survey of RLHF safety alignment characterizes this as “superficial compliance”: models learn to exhibit safe behavior in the contexts they were trained on while retaining the underlying capacity to produce harmful output. The survey also documents models that appear aligned in evaluation settings but revert under distributional shift, as well as “emulated disalignment,” a training-free attack that reverses safety alignment by manipulating output token distributions without any retraining.

The “alignment tax” is the other side of the coin: safety gains often compromise helpfulness, creating pressure to relax safety constraints in pursuit of utility. In RLHF’s fixed-objective framing, this tradeoff is resolved once at training time. In a co-evolutionary framing, the tradeoff is renegotiated continuously.

The Impossibility Result: Preference Matching Has No Nash Guarantee

A separate paper, arXiv 2505.20627, proves a formal limit on game-theoretic alignment. Under the Bradley-Terry-Luce model of pairwise preferences, no smooth and learnable mapping of those preferences into game payoffs can guarantee a unique Nash equilibrium that matches a target policy.

This is a hard result, and its scope matters. It applies specifically to zero-sum formulations of Nash Learning from Human Feedback. AdvGame uses a non-zero-sum formulation, so the impossibility theorem does not directly collapse AdvGame’s approach. But it does constrain the design space: any game-theoretic alignment method that relies on preference-derived payoffs inherits this tension. If you cannot guarantee convergence to the policy you actually want, the equilibrium you reach may be safe in a formal sense without being safe in the operational sense.

The gap between “the game converges” and “the game converges to the right policy” is where practitioner attention should focus. A Nash equilibrium is only useful if it corresponds to the safety properties you intended.
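For readers unfamiliar with the Bradley-Terry-Luce model named above, the sketch below shows its standard form: each response has a latent score, and the probability that one is preferred over another depends only on the score difference. The score values and labels are hypothetical. The snippet illustrates the preference model itself and does not reproduce the impossibility proof.

```python
import math


def btl_preference_prob(score_a: float, score_b: float) -> float:
    """P(a preferred over b) under Bradley-Terry-Luce with latent scores.

    Equivalent to exp(score_a) / (exp(score_a) + exp(score_b)).
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Hypothetical latent scores for three candidate responses.
scores = {"refuse": 2.0, "hedge": 1.0, "comply": -1.0}

p_refuse_over_comply = btl_preference_prob(scores["refuse"], scores["comply"])
p_comply_over_refuse = btl_preference_prob(scores["comply"], scores["refuse"])

# Only score differences matter: shifting every score by the same constant
# leaves all preference probabilities unchanged.
shifted = {name: value + 10.0 for name, value in scores.items()}
p_shifted = btl_preference_prob(shifted["refuse"], shifted["comply"])

print(round(p_refuse_over_comply, 4))  # 0.9526
```

The two directions of a comparison always sum to one, and the scores are identified only up to an additive constant, which is one reason mapping preferences back to payoffs is not a trivial step.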

What Co-Evolutionary Alignment Means for Certification Regimes

If safety is relative to an evolving adversary rather than an absolute property of the model, the implications for certification are structural.

Current certification regimes operate as checkpoints. A model passes a red-team evaluation at time T and receives a stamp. AdvGame’s framing implies that stamp expires, not because the model changed, but because the adversary did. The Attacker LM’s transferability across target models suggests that adversarial strategies generalize and that a red-team suite curated in January is unlikely to remain adequate through June.

The practical prescription is not to abandon benchmarks but to treat them as perishable goods. Organizations that rely on static safety certifications should build re-certification on a cadence tied to the observed rate of adversarial adaptation, not to release schedules. Red-team suites should be regenerated from live adversarial probing, not curated once. The Attacker-as-product paradigm, where a trained adversarial agent is continuously run against deployed models, is one concrete instantiation of this approach.

This is not a claim that AdvGame’s specific architecture is ready for production deployment. It is a claim that the conceptual frame, in which safety is a relative property measured against a co-evolving adversary, describes the world that models actually operate in more accurately than the fixed-checkpoint frame does.
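One way to picture re-certification tied to adversarial adaptation rather than to release schedules is a drift trigger on the attack success rate observed in ongoing probing. The sketch below is a hypothetical illustration: `ProbeResult`, `recertification_due`, the tolerance value, and every number in the example are assumptions, not figures from the paper or from any certification scheme.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One week of adversarial probing against a deployed model (hypothetical)."""
    week: int
    attempts: int
    successes: int  # prompts that elicited a policy-violating response

    @property
    def attack_success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def recertification_due(
    history: list[ProbeResult],
    certified_asr: float,
    tolerance: float = 0.02,
) -> bool:
    """Return True once the most recent attack success rate exceeds the rate
    measured at certification time by more than `tolerance`."""
    if not history:
        return False
    latest = max(history, key=lambda result: result.week)
    return latest.attack_success_rate - certified_asr > tolerance


# Illustrative numbers only.
history = [
    ProbeResult(week=1, attempts=500, successes=5),   # 1.0%
    ProbeResult(week=2, attempts=500, successes=9),   # 1.8%
    ProbeResult(week=3, attempts=500, successes=21),  # 4.2%
]
due_week2 = recertification_due(history[:2], certified_asr=0.01)
due_week3 = recertification_due(history, certified_asr=0.01)
print(due_week2, due_week3)  # False True
```

The design choice here is that the trigger fires on what the adversary achieves, not on the calendar, so a quiet quarter costs nothing and a sudden new attack family forces re-evaluation immediately.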

Open Questions: Equilibrium Convergence and Empirical Gaps

AdvGame’s theoretical framing raises two questions the paper does not fully resolve.

First, does the game actually converge in practice? Non-zero-sum games do not guarantee convergence under gradient-based learning. The Attacker and Defender LMs may oscillate, settle into suboptimal equilibria, or exhibit chaotic dynamics. The paper reports convergence in its experimental setup, but without benchmarks against standard RLHF baselines or evaluation on diverse model families, the generality of that result is unclear.

Second, the impossibility result in 2505.20627 applies to zero-sum preference matching, and AdvGame’s non-zero-sum formulation sidesteps it. But no one has yet characterized the convergence properties of non-zero-sum preference-based games at the scale of production language models. The theoretical comfort AdvGame claims is the absence of a known impossibility result, which is not the same as a positive guarantee.
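The convergence concern in the first question can be seen even in the simplest textbook setting. The sketch below runs simultaneous gradient updates on the bilinear game f(x, y) = x * y, which is zero-sum and far simpler than anything in AdvGame. It is an analogy, not a model of the paper's training dynamics: the game has a unique equilibrium at (0, 0), yet naive gradient play moves away from it at every step.

```python
import math


def simultaneous_gradient_play(
    x: float, y: float, lr: float, steps: int
) -> list[float]:
    """Run simultaneous gradient updates on f(x, y) = x * y, where player x
    minimizes f and player y maximizes it.

    Returns the distance from the equilibrium (0, 0) before the first step
    and after each subsequent step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        # Both players update from the same pre-update point.
        x, y = x - lr * grad_x, y + lr * grad_y
        distances.append(math.hypot(x, y))
    return distances


distances = simultaneous_gradient_play(x=1.0, y=1.0, lr=0.1, steps=200)
print(distances[0] < distances[-1])  # True: the iterates spiral outward
```

Each update multiplies the distance from equilibrium by sqrt(1 + lr**2), so the existence of an equilibrium says nothing by itself about whether gradient-based learners will reach it.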

The co-evolutionary frame is the paper’s real contribution. Whether the specific mechanism delivers on it at production scale remains an open empirical question. The practitioner takeaway is the frame itself: if alignment is an arms race, your evals need to race too.

Frequently Asked Questions

How does AdvGame compare to DPO or constitutional AI for safety?

DPO skips the reward model and trains directly on preference pairs; constitutional AI uses self-critique against a written constitution. AdvGame is distinct because it keeps a live adversary evolving alongside the Defender. DPO and constitutional AI are cheaper to train (no second model to co-train), but their safety properties are frozen at training time. AdvGame trades higher compute for a safety signal that adapts as new attack strategies emerge.

What happens if the Attacker and Defender get stuck in a bad equilibrium?

Non-zero-sum games can converge to local equilibria where the Defender learns to refuse prompts that are easy to classify while remaining vulnerable to genuinely novel attack vectors. Unlike two-player zero-sum settings, where the minimax theorem guarantees that every equilibrium yields the same value, there is no general guarantee that the discovered equilibrium offers the best safety-utility tradeoff. The impossibility result in arXiv 2505.20627 compounds this: even a stable equilibrium may not correspond to the target policy under the Bradley-Terry-Luce preference model.
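A classic two-player illustration of this is the stag hunt, a non-zero-sum game with two stable outcomes, one strictly worse for both players. The sketch below enumerates its pure-strategy Nash equilibria. The payoff numbers are the usual textbook-style values chosen for illustration and have no connection to AdvGame's actual rewards.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
# Illustrative stag-hunt payoffs (assumption, not from the paper).
payoffs = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
actions = ["stag", "hare"]


def pure_nash_equilibria(payoffs, actions):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_payoff, col_payoff = payoffs[(row, col)]
        row_is_best = all(row_payoff >= payoffs[(alt, col)][0] for alt in actions)
        col_is_best = all(col_payoff >= payoffs[(row, alt)][1] for alt in actions)
        if row_is_best and col_is_best:
            equilibria.append((row, col))
    return equilibria


equilibria = pure_nash_equilibria(payoffs, actions)
print(equilibria)  # [('stag', 'stag'), ('hare', 'hare')]
```

Both outcomes are stable, but ("hare", "hare") pays each player 3 where ("stag", "stag") pays 4. Which one a learning process lands on depends on where it starts, which is the sense in which an equilibrium can be stable without being the one you wanted.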

What does co-evolutionary alignment cost in compute versus standard RLHF?

Standard RLHF trains two models (reward model and policy) initialized from a pre-trained LM, then runs PPO and stops. AdvGame adds a third: the Attacker LM trains alongside the Defender, which roughly doubles the active policy-training compute if the two are of comparable size. The ongoing cost diverges further because AdvGame’s framing calls for continuous adversarial probing after deployment. Teams adopting this approach need to budget for persistent GPU allocation for red-teaming, not a single training run.

Could releasing a trained Attacker LM create a dual-use disclosure problem?

AdvGame produces an Attacker that generalizes beyond its co-trained Defender to attack arbitrary models. If published or leaked, that artifact becomes an off-the-shelf jailbreak generator. The situation mirrors the responsible-disclosure debate in conventional cybersecurity: publishing exploit tooling raises defensive awareness while simultaneously lowering the barrier for attackers. The paper does not address release protocols or access controls for the Attacker.