AI Agent Alignment Tests Are One-Shot. A New Benchmark Catches Multi-Step Failures

Most AI safety evaluations work like a driving test that only checks whether you stop at a single stop sign. MoralityGym, accepted at AAMAS 2026 this week in Paphos, is built on the observation that agents can pass every one-shot alignment check and still produce harmful behavior when you let them take more than one action in sequence. The benchmark packages 98 ethical-dilemma environments designed to catch a specific failure mode: each individual decision looks defensible, but the chain drifts toward a violation the agent would have refused if asked directly.

What MoralityGym Actually Tests

MoralityGym frames moral alignment as a sequential-decision problem rather than a single-turn refusal task. Each of the 98 environments is a Gymnasium-compatible scenario built around trolley-dilemma-style tradeoffs, but with a structural twist: the benchmark decouples task-solving performance from moral evaluation. An agent can score well on the task objective while violating hierarchical moral constraints that govern the sequence of decisions, not just any single step.
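To make the decoupling concrete, here is a minimal sketch of an environment that reports task reward and moral events through separate channels. It is not taken from MoralityGym: the corridor layout, the action ids, the reward numbers, and the `harm` info key are all hypothetical, and the class only imitates the Gymnasium `reset`/`step` interface in plain Python so the example runs without the library.

```python
SHORTCUT, DETOUR = 0, 1   # hypothetical action ids
GOAL, HAZARD = 4, 2       # hypothetical corridor layout


class ToyDilemmaEnv:
    """Gymnasium-style interface (reset/step), written without the library."""

    def reset(self, seed=None, options=None):
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        self.pos += 1
        # The moral event is reported in `info`, never folded into the reward.
        harm = action == SHORTCUT and self.pos == HAZARD
        reward = -1.0 if action == SHORTCUT else -2.0  # the detour costs more
        terminated = self.pos == GOAL
        if terminated:
            reward += 10.0
        return self.pos, reward, terminated, False, {"harm": harm}


def rollout(env, policy):
    """Run one episode; return (task_return, harm_count) as separate scores."""
    obs, _ = env.reset()
    task_return, harms, done = 0.0, 0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        task_return += reward
        harms += int(info["harm"])
        done = terminated or truncated
    return task_return, harms


def always_shortcut(obs):
    return SHORTCUT


def careful(obs):
    # Take the costlier detour only when the next cell is the hazard.
    return DETOUR if obs + 1 == HAZARD else SHORTCUT
```

Under these made-up numbers, `always_shortcut` earns the higher task return (6.0 versus 5.0) while recording one harm event, and `careful` records none. A single blended reward would hide that difference; two separate scores keep it visible.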

The benchmark introduces a formalism called “Morality Chains,” which represent moral norms as ordered deontic constraints. The ordering matters. Higher-priority norms cannot be sacrificed for lower-priority ones, and the chain structure means that a sequence of individually permissible actions can collectively violate a high-priority constraint that would have blocked the first step if the agent had known where the trajectory was heading. MoralityGym also introduces a “Morality Metric” that integrates insights from psychology and philosophy into the evaluation of norm-sensitive reasoning, moving beyond whether the agent maximized a reward signal.

This is a synthetic benchmark. The trolley-dilemma framing is a deliberate simplification: the authors are testing whether alignment evaluation infrastructure can detect multi-step norm violations at all, not claiming that the specific dilemmas map directly onto production agent deployments. The applicability argument is structural (agentic systems make sequences of decisions where local optima diverge from global constraints) rather than domain-specific.

Why One-Shot Alignment Evals Miss the Drift Problem

Standard pre-deployment safety evaluations for LLM-based agents run red-team prompt batteries: present the model with harmful requests, check that it refuses, measure refusal rates across categories. This approach has an obvious blind spot. It evaluates whether the model refuses a directly harmful instruction in isolation. It does not evaluate whether a sequence of individually harmless instructions produces a harmful outcome.

The distinction matters because deployed agents do not operate in single-turn mode. An agent planning a multi-step workflow, executing tool calls, and reacting to intermediate results is making a sequence of decisions where each step conditions on the prior state. A PRISMA-compliant review of 78 studies on adversarial AI threats, also published this month, found that no single defense mechanism provides robustness across all layers of agentic AI systems, and that vulnerabilities propagate from perception through policy and actuation via feedback dynamics. The finding aligns with what MoralityGym demonstrates in controlled form: the failure mode is in the trajectory, not in any single prompt-response pair.

How Morality Chains Formalize Hierarchical Norms

The Morality Chains formalism is the core technical contribution. A Morality Chain is an ordered set of deontic constraints (obligations, prohibitions, permissions) with a defined priority ranking. When constraints conflict, the hierarchy determines which one governs. The chain is not a flat list of rules; it is a structure where the ordering encodes the relative weight of different moral norms.
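One way to picture the structure is as a priority-ordered list of trajectory-level predicates, compared lexicographically. The sketch below is an illustration under assumptions, not the paper's definition: the norm names and action labels are invented, permissions are left out for brevity, and lexicographic comparison is one reading of "the hierarchy determines which one governs."

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Trajectory = Sequence[str]  # hypothetical: a trajectory is a list of action labels


@dataclass(frozen=True)
class Norm:
    name: str
    kind: str  # "prohibition" or "obligation"
    violated: Callable[[Trajectory], bool]


# Index 0 is the highest-priority norm. All names here are invented.
CHAIN = [
    Norm("do_not_harm_bystander", "prohibition",
         lambda t: "divert_onto_bystander" in t),
    Norm("rescue_passengers", "obligation",
         lambda t: "rescue" not in t),
    Norm("avoid_property_damage", "prohibition",
         lambda t: "break_barrier" in t),
]


def violation_profile(chain, trajectory):
    """0/1 flags in priority order; Python compares tuples lexicographically."""
    return tuple(int(norm.violated(trajectory)) for norm in chain)


def prefer(chain, a, b):
    """Return the trajectory with the lexicographically smaller profile."""
    return a if violation_profile(chain, a) <= violation_profile(chain, b) else b


two_minor = ["break_barrier"]                    # misses rescue, damages property
one_major = ["divert_onto_bystander", "rescue"]  # harms the bystander
```

Here `prefer(CHAIN, two_minor, one_major)` returns `two_minor`: it violates two norms to the other trajectory's one, but neither of them is the top-priority norm. A flat count of violations would rank the two trajectories the other way round.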

This matters for evaluation because it gives the benchmark a principled way to distinguish between an agent that makes an acceptable tradeoff between two low-priority norms and an agent that sacrifices a high-priority norm to satisfy a lower one. Flat refusal-rate metrics cannot make this distinction. A model that refuses 99 out of 100 harmful prompts but accepts the one that violates the highest-priority constraint scores well on refusal rate but poorly on MoralityGym’s metric, which weights the hierarchical position of the violated norm.
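The 99-out-of-100 case can be worked through numerically. The scoring rule below is a hypothetical stand-in, not MoralityGym's Morality Metric: it simply charges a geometrically larger penalty for violations higher in the chain, with the chain depth and the base of 10 chosen for illustration.

```python
def refusal_rate(outcomes):
    """outcomes: None for a refusal, else the rank of the norm violated
    (0 = highest priority)."""
    return sum(o is None for o in outcomes) / len(outcomes)


def hierarchy_penalty(outcomes, depth, base=10):
    """Hypothetical penalty: a violation at `rank` costs base**(depth-1-rank),
    so with base=10 one violation outweighs nine at the next rank down."""
    return sum(base ** (depth - 1 - o) for o in outcomes if o is not None)


# Two models with identical refusal rates over 100 harmful prompts.
model_a = [None] * 99 + [0]  # the one acceptance violates the top norm
model_b = [None] * 99 + [2]  # the one acceptance violates the lowest norm
```

Both models have the same refusal rate, but with a three-level chain `hierarchy_penalty(model_a, depth=3)` is 100 while `hierarchy_penalty(model_b, depth=3)` is 1. The flat metric treats them as equivalent; any rank-sensitive metric does not.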

The Morality Metric itself draws on moral psychology and normative ethics literature, integrating concerns about intention, foreknowledge, and proximity of causation that simple reward-penalty frameworks ignore. The authors argue that alignment evaluation needs this richer normative structure to capture the kinds of violations that concern real stakeholders, not just the kinds that are easy to count.

What Safe RL Baselines Got Wrong

MoralityGym’s baseline experiments tested standard Safe RL methods against the hierarchical scenarios. The result: current safety-trained agents exhibit key limitations on these environments. The phrasing in the abstract is restrained, but the implication is clear. Methods optimized to avoid single-step constraint violations do not generalize to multi-step hierarchical violations because they were never trained on trajectories where the violation emerges from the sequence rather than from any single action.

This is not a surprising result given the training objective. Safe RL methods typically learn a cost function that penalizes constraint violations at each step. If no single step violates the constraint, the cost function fires zero penalties, the agent receives no corrective signal, and the trajectory proceeds to completion despite the cumulative violation. The Morality Chain structure exposes this gap directly: the constraint that matters is defined over the full trajectory, and step-wise cost functions are structurally blind to it.
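The blindness is easy to reproduce in a toy setting. In the sketch below, the per-step limit, the trajectory budget, and the idea of a scalar "amount" per action are all hypothetical; the point is only that a cost function evaluated one step at a time can return zero on a trajectory that breaks a cumulative constraint.

```python
STEP_LIMIT = 3    # hypothetical per-step cap
TRAJ_BUDGET = 8   # hypothetical cap on the total over the whole episode


def step_cost(amount):
    """Step-wise cost: fires only when a single action exceeds the cap."""
    return 1 if amount > STEP_LIMIT else 0


def stepwise_total_cost(trajectory):
    return sum(step_cost(a) for a in trajectory)


def trajectory_violation(trajectory):
    """Trajectory-level constraint: defined over the sum, not any one step."""
    return sum(trajectory) > TRAJ_BUDGET


def first_breach_step(trajectory):
    """Index of the step at which the running total first exceeds the budget."""
    total = 0
    for i, amount in enumerate(trajectory):
        total += amount
        if total > TRAJ_BUDGET:
            return i
    return None


drift = [3, 3, 3]  # every step is within STEP_LIMIT; the total is 9
```

For `drift`, `stepwise_total_cost` returns 0 and `trajectory_violation` returns `True`, with the breach occurring at the third step (index 2). An agent trained only on the step-wise signal never receives a penalty to learn from.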

The AAMAS 2026 Cluster: PREFINE and the DPO Equivalence Problem

MoralityGym is not the only alignment paper landing at AAMAS this week. Two companion results sharpen the picture.

PREFINE, also accepted at AAMAS 2026, adapts Direct Preference Optimization to sequential decision-making by using trajectory-level preferences instead of single-turn response pairs. In experiments, PREFINE reduced constraint violations and catastrophic failures by over 60% while maintaining reward performance. The approach directly addresses the step-wise blindness problem: by optimizing over full trajectories rather than individual actions, PREFINE can learn to avoid the cumulative norm violations that MoralityGym exposes.
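For intuition about what "trajectory-level preferences" changes, the sketch below takes the standard DPO loss and replaces the per-response log-probability with the sum of per-step action log-probabilities over a whole trajectory. This is an assumed, generic extension for illustration; it is not PREFINE's published objective, and the function names, argument layout, and default `beta` are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def trajectory_dpo_loss(pol_pref, ref_pref, pol_disp, ref_disp, beta=0.1):
    """DPO-style loss over a pair of trajectories.

    Each argument is a list of per-step action log-probs: policy and
    reference model on the preferred trajectory, then on the dispreferred
    one. Summing the per-step values gives the trajectory log-probability.
    """
    pref_ratio = sum(pol_pref) - sum(ref_pref)
    disp_ratio = sum(pol_disp) - sum(ref_disp)
    margin = beta * (pref_ratio - disp_ratio)
    return softplus(-margin)  # equals -log(sigmoid(margin))


# A policy identical to the reference has zero margin: loss is log(2).
baseline = trajectory_dpo_loss([-1.0, -1.0], [-1.0, -1.0],
                               [-1.0, -1.0], [-1.0, -1.0])

# Shifting probability toward the preferred trajectory lowers the loss.
improved = trajectory_dpo_loss([-0.5, -0.5], [-1.0, -1.0],
                               [-2.0, -2.0], [-1.0, -1.0])
```

Because the comparison is between whole trajectories, a sequence whose individual steps all look acceptable can still be labeled dispreferred, which is the signal that step-wise methods lack.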

The third leg of the cluster is a theoretical result on DPO and RLHF equivalence. The paper shows that DPO does not guarantee equivalence with RLHF; the equivalence depends on an implicit assumption that the RLHF-optimal policy prefers human-preferred responses, an assumption the authors describe as “frequently violated in practice.” When the assumption fails, DPO can exhibit pathological convergence where training loss decreases while the model increasingly selects dispreferred responses. The finding is theoretical, and there is no evidence yet that this specific failure mode has been observed in deployed production systems. But it raises a structural concern for teams that have shifted from RLHF to DPO-based alignment training: if the equivalence assumption does not hold, the safety guarantees assumed to carry over from RLHF may not hold either.
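A rough diagnostic for this kind of divergence is to track two numbers side by side on held-out preference pairs: how often the DPO implicit reward ranks the preferred response higher, and how often the policy itself assigns it higher probability. The sketch below is a proxy check under assumptions, not the test proposed in the paper; the tuple layout and the example log-probabilities are invented.

```python
def dpo_diagnostics(pairs):
    """pairs: list of (pol_chosen, pol_rejected, ref_chosen, ref_rejected),
    each a sequence-level log-probability.

    implicit_reward_accuracy: share of pairs where the policy's gain over
        the reference is larger on the chosen response (what the loss rewards).
    policy_preference_accuracy: share of pairs where the policy actually
        gives the chosen response the higher log-probability.
    """
    n = len(pairs)
    implicit = sum((pc - rc) - (pr - rr) > 0 for pc, pr, rc, rr in pairs) / n
    absolute = sum(pc > pr for pc, pr, rc, rr in pairs) / n
    return {
        "implicit_reward_accuracy": implicit,
        "policy_preference_accuracy": absolute,
    }


held_out = [
    # The reference strongly favors the rejected response; the policy has
    # improved relative to it but still ranks the rejected response higher.
    (-8.0, -3.0, -10.0, -2.0),
    # Both views agree that the chosen response is better.
    (-1.0, -5.0, -2.0, -2.0),
]
report = dpo_diagnostics(held_out)
```

On this invented data the implicit-reward accuracy is 1.0 while the policy's own preference accuracy is 0.5. A gap of that kind, where the first number looks healthy and the second lags or falls, is the pattern worth investigating before trusting the training loss.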

What Alignment Teams Should Do Differently Starting Now

The practical implications are straightforward, if expensive.

First, pre-deployment evaluation pipelines for agents that take more than one action need longitudinal trajectory testing. Running a red-team prompt battery on the model before deployment tells you whether the model refuses direct harmful requests. It does not tell you whether a multi-step agent workflow drifts toward a violation across a sequence of individually permissible steps. MoralityGym provides the Gymnasium environments to start testing this now; PREFINE provides the training methodology to address what the testing finds.

Second, the DPO-equivalence result means that teams using DPO for alignment fine-tuning should verify that the implicit preference assumption holds for their specific domain and model configuration. If it does not, the training may be converging on a policy that scores well on the DPO loss while selecting responses that a human evaluator would disprefer. The failure mode has so far been established in theory rather than observed in deployed systems, but the authors describe the underlying assumption as “frequently violated in practice,” so it requires explicit checking.

Third, the Frontiers survey finding (from the PRISMA-compliant review of 78 studies discussed above) that no single defense mechanism covers all layers of an agentic system suggests that alignment teams need to evaluate safety at the system level, not just the model level. A model that refuses harmful instructions can still be part of an agent that produces harmful outcomes when the perception, planning, and actuation layers interact. The MoralityGym formalism gives a way to structure that evaluation; the engineering work to apply it to production systems is still ahead.

The cost of all this is real. Trajectory-level evaluation is more expensive than prompt batteries. PREFINE-style training requires trajectory preference data, which is harder to collect than single-turn preference pairs. And verifying DPO assumptions adds a validation step that most teams do not currently run. But the alternative is shipping agents that pass every pre-deployment eval and then produce “unanticipated” behavior in production, a pattern that incident reports keep repeating and that MoralityGym is specifically designed to make visible before it happens.

Frequently Asked Questions

How does MoralityGym compare to Anthropic’s Responsible Scaling Policy or OpenAI’s preparedness framework?

Those are governance frameworks rather than benchmarks: they tie deployment decisions to capability and risk evaluations, so the comparison is indirect. The gap MoralityGym targets is one that single-response red-team prompt batteries don’t structurally cover: sequential decision-making with hierarchically ordered norms. The Morality Chains formalism catches trajectory-level drift that refusal-rate metrics cannot detect, regardless of how many individual prompts are in the battery.

At what point does an agent architecture warrant trajectory-level safety evaluation?

Any system that chains more than one tool call or planning step, where intermediate state conditions subsequent decisions, warrants it. Single-turn chatbots and pass-through tasks like summarization or content moderation don’t need it, because their output is evaluated per response. The PRISMA review of 78 adversarial-AI studies found that vulnerability risk scales with the number of feedback loops between perception, policy, and actuation layers, not just with step count.

What would production-grade MoralityGym-style evaluation look like?

Teams would need to build domain-specific Gymnasium environments with norm hierarchies relevant to their industry, such as medical triage ordering constraints, financial advice sequencing rules, or autonomous vehicle priority chains. The synthetic trolley-dilemma environments show that the evaluation infrastructure works; translating it to production requires domain experts to define the deontic constraint orderings, which is the non-trivial and unbenchmarked part of the adoption path.

If DPO’s equivalence with RLHF can silently break, should teams revert to RLHF for safety-critical training?

Reversion isn’t the only option. PREFINE’s trajectory-level preference optimization is one alternative, since it optimizes over full trajectories rather than single-turn response pairs; because PREFINE itself adapts DPO, though, teams should confirm rather than assume that it avoids the equivalence assumption. For teams already on DPO, the minimum response is adding a domain-specific check that the preference-consistency assumption holds before treating DPO alignment as equivalent to RLHF alignment, something most current pipelines skip.