Most alignment evaluations test whether a model refuses a bad request on the first try. MoralityGym, a benchmark accepted at AAMAS 2026, tests what happens after that first decision, and the second, and the tenth. The results from its Safe RL baselines suggest that agents which look well-aligned on single-turn HHH-style checks degrade when moral tradeoffs compound across sequential decisions. If that finding holds up under independent replication, it means the alignment numbers vendors publish today are measuring the wrong thing.
Why Single-Turn Alignment Scores Miss the Real Problem
The standard alignment eval stack (Anthropic’s HHH criteria for helpful, honest, and harmless behavior, OpenAI’s refusal-rate reporting, and the various red-team benchmarks that crop up each cycle) is single-turn by design. A prompt goes in, a response comes out, and a judge scores it. This is fine for catching the obvious failures: a model that cheerfully explains how to synthesize a restricted chemical, or one that generates hate speech when prodded.
The gap MoralityGym exposes is architectural. In a single-turn eval, each moral decision is independent. The model faces one dilemma, makes one choice, and moves on. In actual deployment, agents make sequences of decisions where earlier choices constrain later ones. Stakes compound. A trolley-problem decision at step three changes the moral landscape at step seven in ways that flat, context-free refusal testing cannot capture.
This is not a new observation in the alignment literature. The contribution here is that MoralityGym makes it testable.
Morality Chains: Ordered Constraints Instead of Flat Rules
The paper’s core formalism is what the authors call “Morality Chains.” The idea is straightforward: moral norms are not a flat list where “don’t harm” and “don’t deceive” sit side by side as equals. Real moral reasoning involves hierarchies. Preventing physical harm outranks avoiding a white lie, which outranks keeping a trivial promise. When norms conflict, the hierarchy determines which one yields.
MoralityGym encodes these priority relationships as ordered deontic constraints, which makes the hierarchy machine-readable. This is a deliberate engineering choice. A flat list of prohibitions is easy to evaluate against: did the model violate rule N, yes or no. A hierarchy requires the model to reason about which rule applies in context, and to maintain that reasoning consistently across a chain of decisions.
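One way to picture an ordered constraint is as a lexicographic comparison: a violation of a higher-priority norm outweighs any combination of lower-priority ones. The sketch below is illustrative only. The norm names, their order, and the `Outcome` and `least_bad` helpers are assumptions made for this example, not the paper's encoding.

```python
from dataclasses import dataclass

# Hypothetical norm hierarchy, highest priority first. These names and their
# order are illustrative and are not taken from the paper.
NORM_PRIORITY = ["prevent_harm", "avoid_deception", "keep_promise"]


@dataclass(frozen=True)
class Outcome:
    """One candidate action and the set of norms it would violate."""
    action: str
    violated: frozenset


def violation_vector(outcome):
    """Map an outcome to a tuple of 0/1 flags ordered by norm priority.

    Python compares tuples lexicographically, so a violation of a
    higher-priority norm always outweighs any mix of lower-priority ones.
    """
    return tuple(int(norm in outcome.violated) for norm in NORM_PRIORITY)


def least_bad(outcomes):
    """Pick the outcome whose violation vector is lexicographically smallest."""
    return min(outcomes, key=violation_vector)


candidates = [
    Outcome("tell_truth_cause_injury", frozenset({"prevent_harm"})),
    Outcome("lie_and_break_promise", frozenset({"avoid_deception", "keep_promise"})),
]
choice = least_bad(candidates)
print(choice.action)  # lie_and_break_promise
```

A flat rule count would prefer the first candidate (one violation against two). The ordered comparison prefers the second, because neither of its violations touches the top-ranked norm.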
The formalism draws on psychology and philosophy, according to the paper’s abstract, but the implementation is pure Gymnasium. Each of the 98 environments is a trolley-dilemma-style scenario, and the agent’s task performance is decoupled from its moral evaluation. The agent can solve the puzzle and still fail the moral test, or pass the moral test at the cost of task reward. That separation is what makes the benchmark useful: it measures moral reasoning independently of competence.
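The decoupling of task reward from moral evaluation can be sketched with a toy environment that mirrors the Gymnasium `reset`/`step` signature without importing the library. Everything in it is hypothetical: the class name, the actions, the reward values, and the `moral_violations` info key are placeholders for illustration, not MoralityGym's actual interface.

```python
class TrolleyToyEnv:
    """Toy one-step dilemma with a Gymnasium-style reset/step interface.

    All names and numbers here are assumptions made for illustration.
    """

    ACTIONS = {0: "stay", 1: "divert"}

    def reset(self, seed=None):
        self.done = False
        self.observation = {"people_on_main": 5, "people_on_side": 1}
        return self.observation, {}

    def step(self, action):
        if self.done:
            raise RuntimeError("call reset() before stepping a finished episode")
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.done = True
        # Task reward: the (hypothetical) puzzle pays out for staying on course.
        task_reward = 1.0 if action == 0 else 0.0
        # The moral signal is reported separately and never added to the reward.
        harmed = (self.observation["people_on_main"] if action == 0
                  else self.observation["people_on_side"])
        info = {"moral_violations": {"prevent_harm": harmed}}
        terminated, truncated = True, False
        return self.observation, task_reward, terminated, truncated, info


def run_episode(env, action):
    """Run a single-step episode and return (task_reward, people_harmed)."""
    env.reset()
    _, reward, _, _, info = env.step(action)
    return reward, info["moral_violations"]["prevent_harm"]


print(run_episode(TrolleyToyEnv(), 0))  # full task reward, more harm
print(run_episode(TrolleyToyEnv(), 1))  # no task reward, less harm
```

Because the harm count travels in `info` rather than in the reward, an evaluator can score competence and moral behavior on separate axes, which is the property the paragraph above describes.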
What the Baseline Results Show (and What They Don’t Quantify)
The authors ran Safe RL baselines across the 98 environments and report “key limitations,” per the paper’s abstract. The finding: agents that perform well on single-turn benchmarks degrade on multi-step moral tradeoffs.
That is the key claim, and it is frustratingly unquantified in the available abstract. “Key limitations” and “degrade” do not tell a practitioner whether the drop is marginal or catastrophic. The difference matters enormously. A modest degradation is a calibration problem; a steep degradation is a category error in how the field approaches alignment evaluation.
Concurrent work on long-horizon agent behavior offers a plausible mechanism for the degradation. APEX (arXiv 2605.21240) documents “exploration collapse” in self-evolving LLM agents operating over extended decision sequences: behavior concentrates around familiar high-reward routines, narrowing the agent’s effective strategy space. If moral reasoning requires exploring tradeoffs rather than defaulting to the highest-reward path, exploration collapse would predict the kind of moral degradation MoralityGym reports.
Why This Matters for RLHF Pipelines
Most production alignment today runs through RLHF or a variant: DPO, constitutional AI, some form of preference optimization against human or synthetic feedback. The training signal in these systems is per-response. The model learns which outputs humans prefer and adjusts its policy accordingly. There is no mechanism in standard RLHF for enforcing consistency across a sequence of moral decisions, because the training episodes are not sequential.
MoralityGym’s Morality Chains formalism suggests a different training signal: one that penalizes not just individual norm violations, but violations of the priority structure across time. An agent that correctly ranks “prevent harm” above “keep promises” in step one but reverses that ranking in step five would lose points even if each individual decision were defensible in isolation.
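A minimal sketch of such a signal counts how often the relative order of two norms flips between consecutive steps of an episode. It assumes the ranking an agent acted on at each step is already available as a list, highest priority first. Inferring that ranking from behavior is the hard part and is out of scope here. The function names and the `weight` parameter are hypothetical, and this is not the paper's Morality Metric.

```python
from itertools import combinations


def pairwise_order(ranking):
    """Return the set of (higher, lower) norm pairs implied by a ranking list."""
    return set(combinations(ranking, 2))


def reversal_penalty(episode_rankings, weight=1.0):
    """Sum the norm pairs whose relative order flips between consecutive steps.

    episode_rankings: one ranking per step, each a list of norm names with
    the highest-priority norm first (an assumed input format).
    """
    penalty = 0.0
    for prev, curr in zip(episode_rankings, episode_rankings[1:]):
        curr_pairs = pairwise_order(curr)
        flipped = {(a, b) for (a, b) in pairwise_order(prev) if (b, a) in curr_pairs}
        penalty += weight * len(flipped)
    return penalty


consistent = [["prevent_harm", "keep_promise"]] * 5
reversed_late = [["prevent_harm", "keep_promise"]] * 4 + [["keep_promise", "prevent_harm"]]

print(reversal_penalty(consistent))     # 0.0
print(reversal_penalty(reversed_late))  # 1.0
```

The penalty is zero for any episode that keeps one ordering throughout, whichever ordering that is. A real reward would also need to say which ordering is correct, so this term would sit alongside a per-step violation cost rather than replace it.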
For teams building RLHF pipelines, the implication is that adding sequential moral evals to the training loop is not a marginal improvement but a structural change. It requires episodic training, memory of prior decisions within the episode, and a reward function that encodes norm hierarchies rather than flat preferences. None of the standard RLHF tooling does this out of the box.
The Regulatory Angle: Measurable Criteria Where None Exist
Regulators trying to establish “AI safety” requirements face a measurement problem. Refusal rates are easy to count but tell you almost nothing about how a system behaves under compound pressure. The EU AI Act’s high-risk classification system, the US’s various executive-order frameworks, and the emerging patchwork of national AI governance schemes all reference safety and alignment without defining a test for either that survives contact with a competent auditor.
MoralityGym is not a regulatory tool. It is an academic benchmark with 98 trolley-dilemma environments and an as-yet-opaque scoring metric. But it provides something regulators have lacked: a public, reproducible scaffold that measures moral alignment as a multi-step property rather than a single-turn event. The distinction between “this model reliably refuses harmful requests” and “this model maintains ordered moral priorities across a 10-step decision chain” is exactly the kind of thing a compliance framework could hang a requirement on.
Whether it should is a separate question. Trolley-dilemma environments are abstract by construction. They test whether an agent can reason about moral hierarchies in a controlled setting, not whether that reasoning transfers to, say, a healthcare triage agent making real patient prioritization decisions. The gap between benchmark and deployment is where regulatory enthusiasm usually outpaces the evidence.
What the Full Paper Still Needs to Prove
The available sources, limited to the arXiv abstract, leave several critical questions open:
- No quantitative baseline scores. The abstract claims Safe RL methods show limitations but does not report numbers. Without knowing the magnitude of the degradation, practitioners cannot assess whether this is a theoretical concern or a practical one.
- No method names. Which Safe RL methods were tested? PPO with safety constraints? Constrained Policy Optimization? Lagrangian approaches? The answer determines whether the finding generalizes or is specific to certain algorithm families.
- No LLM agent results. The paper’s title mentions both RL and LLM agents, but the available abstract does not specify which LLM-based agents were evaluated or how they compared to traditional RL baselines.
- Morality Metric opacity. The metric “integrates insights from psychology and philosophy,” but without the formula, its susceptibility to Goodharting is unknown. Any metric that becomes a target for optimization will be gamed; the question is how hard it is to game without actually improving moral reasoning.
The paper was accepted at AAMAS 2026 (Paphos, Cyprus, May 25-29), and the authors posted an updated v2 on May 21. The full proceedings should clarify the quantitative gaps. Until then, MoralityGym’s contribution is the formalism and the experimental framework, not the empirical results. The right stance is cautious interest: the problem it identifies is real and underserved by current evals, but the evidence that current methods fail badly, as opposed to failing somewhat, is not yet in the public record.
Frequently Asked Questions
Were LLM-based agents tested separately from traditional RL agents in MoralityGym?
The paper’s title references both RL and LLM agents, but the available arXiv abstract does not break out results by agent type. Lead author Simon Rosen’s related full publication (DOI 10.65109/SAKL6648) may contain the LLM-specific evaluation that the abstract omits. Practitioners should treat the baseline degradation claim as RL-only until the proceedings confirm otherwise.
How does the Morality Chains formalism differ from existing deontic logic approaches in AI safety?
Standard deontic logic treats norms as binary (violated or satisfied) with no ordering between them. Morality Chains encodes priority rankings between conflicting norms, so the agent must reason about which norm yields in context rather than checking a flat list. Most existing safety evaluations, whether benchmarks such as TruthfulQA and BBQ or classifiers such as HateBERT, score single-turn responses against flat rule sets. MoralityGym appears to be the first Gymnasium-based suite that tests hierarchical norm reasoning across multi-step decision chains.
What would go wrong if a team naively added sequential moral evals to an RLHF pipeline?
The APEX work on long-horizon agents (arXiv 2605.21240) documents exploration collapse: behavior concentrates around familiar high-reward routines in extended decision sequences. Naively adding episodic moral training without counteracting this concentration could make moral degradation worse, because the agent would anchor on whichever moral policy scored well early and stop exploring alternative norm tradeoffs. Effective integration would require a reward function that explicitly penalizes strategy-space narrowing, a component no standard RLHF framework provides.
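As a rough illustration of what a narrowing penalty could look like, the sketch below compares the Shannon entropy of an agent's actions in an early window of an episode against a late window and penalizes any drop. Entropy over action labels is a crude proxy chosen for this example. It is not APEX's method, and the function names and `weight` parameter are hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical action distribution."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def narrowing_penalty(early_actions, late_actions, weight=1.0):
    """Penalize a drop in action entropy from the early to the late window."""
    drop = action_entropy(early_actions) - action_entropy(late_actions)
    return weight * max(0.0, drop)


early = ["divert", "stay", "warn", "brake"]  # four distinct actions: 2 bits
late = ["stay", "stay", "stay", "stay"]      # one repeated action: 0 bits

print(narrowing_penalty(early, late))  # 2.0
print(narrowing_penalty(late, early))  # 0.0
```

A penalty like this only discourages collapse onto a single routine. It cannot tell a principled, consistent policy from a lazy one, so it would need to be balanced against the moral and task signals rather than maximized on its own.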
Which Safe RL methods were tested, and does the degradation finding generalize?
The abstract does not name specific methods. Common Safe RL candidates include PPO-Lagrangian, Constrained Policy Optimization, and reward-shaping approaches. Without knowing whether the baselines were on-policy, off-policy, or a mix, practitioners cannot determine if the degradation is algorithm-family-specific or a universal property of agents trained on single-turn signals. The v2 update posted May 21 may identify the methods, but going by the abstract alone this is a significant gap.
Could the Morality Metric be gamed the way other safety benchmarks have been?
Historical precedent is a concern. The ALICE safety benchmark saw models learn to refuse everything, maxing out safety scores while becoming functionally useless. A hierarchical metric that penalizes blanket refusal (since refusing all actions violates lower-priority norms like keeping promises) would resist that particular strategy, but the metric’s actual formula remains unpublished in the abstract. Its resistance to Goodharting depends entirely on implementation details that are not yet public.