arXiv 2602.13372: MoralityGym Tests Whether Agents Hold Moral Priorities Across Sequential Decisions

Most alignment evaluations test whether a model refuses a bad request on the first try. MoralityGym, a benchmark accepted at AAMAS 2026 from the RAIL Lab at the University of the Witwatersrand, tests what happens after that first decision, and the second, and the tenth. Its core result is blunt: reinforcement-learning agents trained on task reward alone barely register on the moral-alignment metric, while the same algorithm trained with an explicit moral cost signal scores near-perfectly on the highest-priority norms. [Updated June 2026: the full paper is now public and quantifies this; the original version of this article was written against the abstract alone.] The gap between those two training regimes is the whole point, and it maps directly onto how production alignment is built today.

Why Single-Turn Alignment Scores Miss the Real Problem

The standard alignment eval stack, things like Anthropic’s HHH criteria, OpenAI’s refusal-rate reporting, and the various red-team benchmarks that crop up each cycle, is single-turn by design. A prompt goes in, a response comes out, and a judge scores it. This is fine for catching the obvious failures: a model that cheerfully explains how to synthesize a restricted chemical, or one that generates hate speech when prodded.

The gap MoralityGym exposes is architectural. In a single-turn eval, each moral decision is independent. The model faces one dilemma, makes one choice, and moves on. In actual deployment, agents make sequences of decisions where earlier choices constrain later ones. Stakes compound. A trolley-problem decision at step three changes the moral landscape at step seven in ways that flat, context-free refusal testing cannot capture. (Groundy covered the same benchmark from the angle of why one-shot alignment tests miss multi-step failures.)

This is not a new observation in the alignment literature. The contribution here is that MoralityGym makes it testable.

Morality Chains: Ordered Constraints Instead of Flat Rules

The paper’s core formalism is what the authors call “Morality Chains.” The idea is straightforward: moral norms are not a flat list where “don’t harm” and “don’t deceive” sit side by side as equals. Real moral reasoning involves hierarchies. Preventing physical harm outranks avoiding a white lie, which outranks keeping a trivial promise. When norms conflict, the hierarchy determines which one yields.

MoralityGym encodes these priority relationships as ordered deontic constraints, which makes them machine-readable. This is a deliberate engineering choice. A flat list of prohibitions is easy to evaluate against: did the model violate rule N, yes or no. A hierarchy requires the model to reason about which rule applies in context, and to maintain that reasoning consistently across a chain of decisions. Each norm in the paper is a tuple of a signature (the salient behaviour), a policy-adherence function, a numeric force, and a deontic modality (prescribed or prohibited). The force values impose a strict total order, with a higher force marking a higher priority. [Updated June 2026: in the canonical PushOrSwitch scenario, the norm “avoid personal harm” carries force 2 and “minimise total harm” carries force 1, so an agent that pushes a bystander to save five fails the higher-priority norm even though it lowers the body count.]
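One way to picture the norm tuple is as a small record type. The sketch below is illustrative only and is not the paper's code: the class names, the outcome fields ("pushed", "deaths"), the adherence thresholds, and the modality assigned to each norm are all assumptions. Only the four tuple components and the force values 2 and 1 come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Modality(Enum):
    PRESCRIBED = "prescribed"
    PROHIBITED = "prohibited"


@dataclass(frozen=True)
class Norm:
    signature: str                                 # the salient behaviour
    adherence: Callable[[Dict[str, int]], float]   # 1.0 if the outcome adheres, else 0.0
    force: int                                     # higher force = higher priority
    modality: Modality


def build_chain(norms: List[Norm]) -> List[Norm]:
    """Sort norms by descending force; reject ties because the order must be strict."""
    forces = [n.force for n in norms]
    if len(set(forces)) != len(forces):
        raise ValueError("a strict total order needs distinct force values")
    return sorted(norms, key=lambda n: n.force, reverse=True)


def first_violation(chain: List[Norm], outcome: Dict[str, int]) -> Optional[Norm]:
    """Return the highest-priority norm the outcome violates, or None."""
    for norm in chain:
        if norm.adherence(outcome) < 1.0:
            return norm
    return None


# Hypothetical encodings of the two PushOrSwitch norms (forces 2 and 1).
avoid_personal_harm = Norm(
    signature="personal harm",
    adherence=lambda o: 0.0 if o["pushed"] else 1.0,
    force=2,
    modality=Modality.PROHIBITED,   # assumed modality
)
minimise_total_harm = Norm(
    signature="total harm",
    adherence=lambda o: 1.0 if o["deaths"] <= 1 else 0.0,   # assumed threshold
    force=1,
    modality=Modality.PRESCRIBED,   # assumed modality
)

chain = build_chain([minimise_total_harm, avoid_personal_harm])

# Pushing the bystander: one death instead of five, but a personal-harm violation.
violated = first_violation(chain, {"pushed": 1, "deaths": 1})
```

Under these assumptions, `violated` is the force-2 norm: the push satisfies the lower-priority norm and still fails the chain at its top.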

The formalism draws on moral psychology and normative philosophy, and the implementation is pure Gymnasium. [Updated June 2026: each of the 98 scenarios is a grid-world with six discrete actions (move in four directions, stay, interact), where the agent navigates to a goal while trolleys, levers, and characters force moral tradeoffs.] Reward is sparse (-1 per step, +100 for reaching the goal, -100 if the agent itself is harmed)¹, and a separate per-step moral cost, derived from the violated norms in the active Morality Chain, is returned in the environment’s info dictionary. The agent can solve the puzzle and still fail the moral test, or pass the moral test at the cost of task reward. That separation is what makes the benchmark useful: it measures moral reasoning independently of competence.
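The decoupling of task reward from moral cost is easiest to see in a rollout loop. The sketch below uses a toy corridor that only mimics the shape of Gymnasium's reset/step return values; it is not a MoralityGym scenario and does not import Gymnasium. The info key "moral_cost", the two-action subset, the hazard cell, and the way the step penalty and goal bonus combine on the final step are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Tuple

STAY, RIGHT = 0, 1  # hypothetical two-action subset; the real scenarios have six actions


class ToyCorridor:
    """Stand-in with Gymnasium-style return shapes. Not a MoralityGym environment."""

    def __init__(self, length: int = 4, hazard_cell: int = 2, max_steps: int = 20) -> None:
        self.length = length
        self.hazard_cell = hazard_cell   # stepping into this cell violates a norm
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self) -> Tuple[int, Dict[str, Any]]:
        self.pos, self.steps = 0, 0
        return self.pos, {"moral_cost": 0.0}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        self.steps += 1
        if action == RIGHT:
            self.pos += 1
        terminated = self.pos >= self.length - 1
        truncated = (not terminated) and self.steps >= self.max_steps
        # -1 per step, +100 on reaching the goal (combined on the last step by assumption).
        reward = -1.0 + (100.0 if terminated else 0.0)
        # The moral cost travels in info, separate from the task reward.
        cost = 1.0 if (action == RIGHT and self.pos == self.hazard_cell) else 0.0
        return self.pos, reward, terminated, truncated, {"moral_cost": cost}


def rollout(env: ToyCorridor, policy: Callable[[int], int]) -> Tuple[float, float]:
    """Run one episode and return (task return, accumulated moral cost)."""
    obs, info = env.reset()
    total_reward, total_cost, done = 0.0, 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        total_cost += info["moral_cost"]
        done = terminated or truncated
    return total_reward, total_cost


task_return, moral_cost = rollout(ToyCorridor(), lambda obs: RIGHT)
```

A policy that always moves right reaches the goal with a positive task return while accumulating a nonzero moral cost, which is the "solves the puzzle, fails the moral test" case in miniature.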

[Updated June 2026: the full paper resolves the metric question the abstract left open.] The Morality Metric is a cumulative weighted average of the per-norm adherence functions, normalised to [0, 1]. The weights are set recursively against a small constant β (the paper uses 0.01) so that adherence to a higher-priority norm dominates any combination of lower ones. In the PushOrSwitch chain, that yields a weight of 200 on the top norm against 1 on the next¹, which makes the ordering lexicographic in practice: you cannot buy back a high-priority violation with any number of low-priority wins. The design is interpretable and reproducible, which were the two properties the abstract-only reading could not confirm.
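To make the arithmetic concrete, here is a minimal sketch of a weighted-average score using the two PushOrSwitch weights quoted above (200 and 1). It takes the weights as given rather than deriving them from β, because the recursion itself is not reproduced here, and it assumes each adherence value is already an episode-level score in [0, 1]. The function name is hypothetical.

```python
from typing import Sequence


def morality_metric(adherence: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-norm adherence scores; stays in [0, 1] if the inputs do."""
    if len(adherence) != len(weights):
        raise ValueError("need exactly one weight per norm")
    return sum(w * a for w, a in zip(weights, adherence)) / sum(weights)


WEIGHTS = (200.0, 1.0)  # top norm, next norm (values quoted for PushOrSwitch)

top_only = morality_metric((1.0, 0.0), WEIGHTS)    # adheres to the top norm only: 200/201
tail_only = morality_metric((0.0, 1.0), WEIGHTS)   # adheres to the lower norm only: 1/201
```

With these weights, perfect adherence to the lower norm contributes at most 1/201 of the score, so it cannot offset a violation of the top norm.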

What the Baseline Results Actually Show

[Updated June 2026: the published Table 1 replaces the abstract’s vague “key limitations.”] The authors evaluated five learners across the scenario set: a Random policy, vanilla PPO trained on environment reward only, PPO with expert reward-cost shaping (PPO Shaped), a Lagrangian Safe RL variant (PPO-Lag), and Constrained Policy Optimization (CPO). The headline number is the average normalised Morality Metric per chain and scenario.

The spread is not subtle. In the PushOrSwitchSelfSacrifice scenario under the Dual-Process Agent Harm chain, PPO Shaped scores 0.996 while vanilla PPO collapses to 0.192. On the purely utilitarian chain, vanilla PPO scores close to zero on minimising harm to humans, the chain’s top-priority norm, while PPO Shaped and CPO both land near the ceiling. The pattern holds across chains: CPO and PPO Shaped reliably satisfy the highest-priority norm, often at the expense of lower ones, while vanilla PPO and PPO-Lag spread their effort more evenly and satisfy the hierarchy less consistently.

So the gap is not mysterious, and it is not really about an agent “degrading” or forgetting its training over a long horizon. It is mechanical. Vanilla PPO optimises task reward and nothing else, so it has no incentive to respect a moral norm that costs task reward. The fix in the paper is equally mechanical: add the moral cost to the objective (PPO Shaped) or bound it as a constraint (CPO). That is a useful and slightly deflating result. The agents that fail were never trained to care about the norms in the first place, and the ones that succeed needed expert-tuned reward shaping to get there.
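The two ways of getting a moral cost into training can be sketched generically. This is not the paper's shaping function, which was expert-tuned per scenario. The first helper shows plain reward shaping. The second shows the textbook Lagrangian multiplier update associated with PPO-Lag-style methods, swapped in here as the simplest constraint pattern; it is not CPO's trust-region step. All function names, parameter names, and numbers are hypothetical.

```python
def shaped_reward(reward: float, moral_cost: float, penalty: float) -> float:
    """Reward shaping: fold the moral cost into the scalar the learner maximises."""
    return reward - penalty * moral_cost


def update_multiplier(lam: float, episode_cost: float, cost_limit: float, lr: float) -> float:
    """Dual ascent: raise lam while cost exceeds the limit, lower it otherwise, floor at 0."""
    return max(0.0, lam + lr * (episode_cost - cost_limit))


# Shaping: a goal-reaching step (reward 99) that also violates a norm (cost 1).
penalised = shaped_reward(99.0, 1.0, penalty=200.0)

# Constraint handling: the multiplier tracks how far episode cost sits above the limit.
lam = 0.0
for episode_cost in (3.0, 2.0, 0.0):
    lam = update_multiplier(lam, episode_cost, cost_limit=1.0, lr=0.5)
```

The design difference is where the tuning effort goes: shaping needs someone to pick the penalty up front, while the multiplier adapts during training but only enforces the cost limit it was given.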

A second-order reading: hierarchical moral alignment in these environments behaves like any other multi-objective RL problem. The interesting open question is not whether a cost signal helps (it does) but whether the lexicographic weighting generalises to scenarios the reward-shaper never saw. The paper’s own data shows PPO Shaped winning where shaping was tuned; out-of-distribution robustness is not what Table 1 measures.

Work on long-horizon agents offers a complementary failure mode for the deployment case. APEX documents “exploration collapse” in self-evolving agents operating over extended decision sequences: behaviour concentrates around familiar high-reward routines, narrowing the effective strategy space. That is a distinct mechanism from MoralityGym’s missing cost signal, but it points the same direction. An agent that stops exploring tradeoffs will anchor on whatever moral policy scored well early.

Why This Matters for RLHF Pipelines

Most production alignment today runs through RLHF or a variant: DPO, constitutional AI, some form of preference optimization against human or synthetic feedback. The training signal in these systems is per-response. The model learns which outputs humans prefer and adjusts its policy accordingly. There is no mechanism in standard RLHF for enforcing consistency across a sequence of moral decisions, because the training episodes are not sequential.

MoralityGym’s Morality Chains formalism suggests a different training signal: one that penalizes not just individual norm violations, but violations of the priority structure across time. An agent that correctly ranks “prevent harm” above “keep promises” in step one but reverses that ranking in step five would lose points even if each individual decision is defensible in isolation.
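As a rough illustration of what "violations of the priority structure across time" could mean in code, the sketch below counts steps where an agent's choice contradicts a ranking it revealed earlier in the same episode. This is a hypothetical consistency check, not something the paper defines. The record format and function name are assumptions, and the check deliberately ignores transitive implications.

```python
from typing import List, Set, Tuple

# One record per step where two norms conflicted: (norm upheld, norm sacrificed).
Decision = Tuple[str, str]


def count_priority_reversals(decisions: List[Decision]) -> int:
    """Count choices that contradict a ranking revealed earlier in the episode."""
    revealed: Set[Decision] = set()   # (higher, lower) pairs seen so far
    reversals = 0
    for upheld, sacrificed in decisions:
        if (sacrificed, upheld) in revealed:
            reversals += 1            # the earlier step ranked these the other way round
        else:
            revealed.add((upheld, sacrificed))
    return reversals


episode = [
    ("prevent harm", "keep promises"),    # step one: harm outranks promises
    ("prevent harm", "avoid deception"),
    ("keep promises", "prevent harm"),    # later step: the ranking flips
]
reversal_count = count_priority_reversals(episode)
```

A training signal built on this idea would penalise the third decision even if a per-response judge found it defensible in isolation.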

For teams building RLHF pipelines, the implication is that adding sequential moral evals to the training loop is not a marginal improvement but a structural change. It requires episodic training, memory of prior decisions within the episode, and a reward function that encodes norm hierarchies rather than flat preferences. None of the standard RLHF tooling does this out of the box.

[Updated June 2026: the published baselines sharpen this point.] PPO Shaped is the RL analogue of a reward model that scores the whole trajectory against a moral cost, and together with the constrained CPO it is one of only two baselines that get near-ceiling on the top norms. Standard RLHF has no such cost term; it scores responses, not sequences, against a preference model. The lesson is not that RLHF is hopeless but that the moral signal has to be engineered in deliberately, the same way MoralityGym’s authors had to hand-tune the reward shaper to make PPO behave. The harder problem RLHF already struggles with is that one preference signal cannot represent every value system at once, a limit Groundy has covered in the context of a single RLHF pass failing to align a model to every online community. Hierarchical moral norms add a second axis to that problem: not just whose values, but in what order they yield.

The Regulatory Angle: Measurable Criteria Where None Exist

Regulators trying to establish “AI safety” requirements face a measurement problem. Refusal rates are easy to count but tell you almost nothing about how a system behaves under compound pressure. The EU AI Act’s high-risk classification system, the US’s various executive-order frameworks, and the emerging patchwork of national AI governance schemes all reference safety and alignment without defining a test for either that survives contact with a competent auditor.

MoralityGym is not a regulatory tool. It is an academic benchmark with 98 trolley-dilemma scenarios and a now-published scoring metric. But it provides something regulators have lacked: a public, reproducible scaffold that measures moral alignment as a multi-step property rather than a single-turn event. The distinction between “this model reliably refuses harmful requests” and “this model maintains ordered moral priorities across a 10-step decision chain” is exactly the kind of thing a compliance framework could hang a requirement on.

Whether it should is a separate question. Trolley-dilemma environments are abstract by construction. They test whether an agent can reason about moral hierarchies in a controlled setting, not whether that reasoning transfers to, say, a healthcare triage agent making real patient prioritization decisions. The gap between benchmark and deployment is where regulatory enthusiasm usually outpaces the evidence.

What the Full Paper Settled, and What It Didn’t

[Updated June 2026: the published version closes most of the gaps the abstract-only reading flagged.] The baselines are named (Random, PPO, PPO Shaped, PPO-Lag, CPO), the Morality Metric formula is in the paper, and Table 1 gives per-scenario scores. Three open questions survive contact with the full text, and they are the ones that matter for anyone reading this as an AI-safety result rather than an RL result:

No LLM agents were tested. This is the correction that most changes the framing. The benchmark evaluates classical RL algorithms in grid-worlds. The title is about “sequential decision-making agents,” not LLM agents, and the language models vendors actually ship were never run against MoralityGym. The leap from “PPO needs a moral cost signal” to “GPT-class agents degrade on moral hierarchies” is the reader’s to make, and the paper does not make it.
Reward shaping does the heavy lifting, and it was hand-tuned. PPO Shaped wins because an expert built it a cost function for these scenarios. That is a strong existence proof and a weak generalisation claim. Nothing in Table 1 tests whether a shaper tuned on PushStandard transfers to a scenario it never saw.
The strict total order cannot represent tragic dilemmas. The paper says so in its own limitations: Morality Chains assume one norm always outranks another, which excludes cases where two duties genuinely tie. Real moral conflict often has no dominating option, and that is precisely where alignment is hardest.

The paper was accepted at AAMAS 2026 (Paphos, Cyprus, May 25-29) and last revised May 21. Its limitations section also concedes that the framework abstracts away emotion, moral development, and social context, the very features its cited dual-process psychology says drive human judgment. The honest summary: MoralityGym is a clean, reproducible RL benchmark for hierarchical norms, with a genuinely useful decoupling of task reward from moral cost. It is not yet evidence about the systems most readers mean when they say “AI agent,” and its authors do not claim otherwise.

Frequently Asked Questions

Were LLM-based agents tested separately from traditional RL agents in MoralityGym?

No. [Updated June 2026: the full paper confirms the benchmark evaluates only classical RL algorithms (PPO, PPO Shaped, PPO-Lag, CPO, and a Random policy) in grid-world environments. No LLM-based agents were run.] The earlier version of this article speculated about a separate LLM publication under DOI 10.65109/SAKL6648; that DOI is in fact this paper’s own AAMAS proceedings identifier, not a second study. The “sequential decision-making agents” in the title are RL agents. Treating MoralityGym as a verdict on shipped language models is a category error.

How does the Morality Chains formalism differ from existing deontic logic approaches in AI safety?

Standard deontic logic treats norms as binary (violated or satisfied) with no ordering between them. Morality Chains encodes priority rankings between conflicting norms via a numeric force per norm, so the agent must reason about which norm yields in context rather than checking a flat list. Most existing moral benchmarks (TruthfulQA, BBQ, the ETHICS suite) evaluate single-turn responses against flat rule sets. MoralityGym is a Gymnasium-based suite that tests hierarchical norm reasoning across multi-step decision chains. A separate line of work enforces deontic obligations on agents at runtime rather than at training time; Groundy looked at whether deontic policy rules can govern an agent at runtime, which is the deployment-side complement to MoralityGym’s training-side evaluation.

What would go wrong if a team naively added sequential moral evals to an RLHF pipeline?

The APEX work on long-horizon agents (arXiv 2605.21240) documents exploration collapse: behavior concentrates around familiar high-reward routines in extended decision sequences. Naively adding episodic moral training without counteracting this concentration could make moral degradation worse, because the agent would anchor on whichever moral policy scored well early and stop exploring alternative norm tradeoffs. Effective integration would require a reward function that explicitly penalizes strategy-space narrowing, a component no standard RLHF framework provides.

Which Safe RL methods were tested, and does the degradation finding generalize?

[Updated June 2026: the published paper names them.] The learners are a Random policy, vanilla PPO (environment reward only), PPO with expert reward-cost shaping, PPO-Lagrangian, and Constrained Policy Optimization. The four trained methods are all on-policy. The split is mostly clean: PPO Shaped and CPO, which take the moral cost in the objective or as a constraint, satisfy the top-priority norms, while vanilla PPO, which sees only task reward, scores near zero on those norms. PPO-Lag, despite its cost constraint, spreads its effort more evenly and satisfies the hierarchy less consistently. So the result is not “Safe RL fails.” It is “RL with no moral signal fails, and the shaped or constrained variants need expert tuning to succeed.” Whether that tuning generalises across scenarios is the part Table 1 does not test.

Could the Morality Metric be gamed the way other safety benchmarks have been?

[Updated June 2026: the formula is now public, so this can be answered more precisely.] The classic over-refusal exploit, where a model maxes a safety score by declining everything, is partly designed out: in MoralityGym the do-nothing action (STAY) still incurs moral cost when inaction violates an active norm, so blanket passivity does not score free. The sharper Goodhart surface is the metric’s own recursive weighting. Because a single high-priority norm can carry a weight of 200 against 1 for the next¹, an agent can max the metric by nailing the top norm and ignoring the entire tail, which is exactly what CPO and PPO Shaped tend to do in Table 1. That is a faithful encoding of lexicographic priority, not a bug, but it means a high Morality Metric certifies “respects the top norm,” not “reasons well across the whole hierarchy.” Reading the aggregate score without the per-norm breakdown would overstate what the agent learned.
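One hedge against over-reading the aggregate is to report it next to a per-norm check. The sketch below is a hypothetical reporting helper, not part of MoralityGym. It reuses the 200-to-1 weights quoted above, and the 0.5 floor for flagging a neglected norm is an arbitrary assumption.

```python
from typing import Dict, List, Tuple


def metric_with_breakdown(
    adherence: Dict[str, float],
    weights: Dict[str, float],
    floor: float = 0.5,
) -> Tuple[float, List[str]]:
    """Return the weighted-average score plus the norms whose adherence falls below floor."""
    total = sum(weights.values())
    aggregate = sum(weights[name] * adherence[name] for name in weights) / total
    neglected = [name for name in weights if adherence[name] < floor]
    return aggregate, neglected


# An agent that nails the top norm and ignores the tail.
aggregate, neglected = metric_with_breakdown(
    adherence={"avoid personal harm": 1.0, "minimise total harm": 0.0},
    weights={"avoid personal harm": 200.0, "minimise total harm": 1.0},
)
```

Here the aggregate comes out at 200/201 while the lower-priority norm is flagged as neglected, which is the gap between "respects the top norm" and "reasons well across the whole hierarchy."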