groundy

Embedding Attacks Bypass LLM Safety Alignment as New arXiv Paper Proposes Selective Geometry Defense

Two papers point to weaknesses in LLM safety alignment below the prompt level. One demonstrates a bypass via embedding perturbations, a surface neither standard evaluations nor regulatory certifications inspect.

Two recent papers expose the same uncomfortable structural problem in LLM safety alignment: standard evaluations probe one mathematical surface, and the attacks operate on another. ShaPO, updated to v2 on May 21, proposes a geometry-aware fix for alignment fragility that its authors argue standard evaluations cannot detect. Separately, Search-based Embedding Poisoning (SEP) demonstrates a 96.43% attack success rate against safety-aligned models by perturbing embedding vectors rather than prompts. The two results converge on a shared implication: regulatory certifications that treat alignment as a measurable, auditable control are resting on an incomplete threat model.

What ShaPO Actually Does

ShaPO (Selective Geometry Control) is a defense proposal, not an attack. The paper, submitted by Yonghui Yang on February 7, 2026 and revised May 21, identifies what it calls optimization-induced fragility in standard preference-based alignment methods like RLHF. The argument: RLHF trains a reward model to predict preferred outputs, then optimizes the LLM to satisfy it. This shapes surface-level behavior without guaranteeing robustness to perturbations in the underlying parameter geometry. When distribution shifts occur, the alignment degrades in ways that prompt-level testing will not catch.

ShaPO addresses this by restricting its optimization to alignment-critical parameter subspaces rather than applying uniform constraints across the entire model. It operates at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, and reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Both outperform standard preference optimization methods on safety benchmarks, according to the paper’s own evaluations.
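
To make the idea of restricting optimization to a subset of parameters concrete, here is a minimal sketch. It is not the paper's algorithm: the names `top_k_mask`, `masked_step`, and `sensitivity` are hypothetical, and a simple coordinate mask stands in for whatever criterion ShaPO uses to identify alignment-critical subspaces.

```python
from typing import List


def top_k_mask(scores: List[float], k: int) -> List[bool]:
    """Flag the k coordinates with the largest absolute score as 'critical'."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(scores))]


def masked_step(params: List[float], grads: List[float],
                mask: List[bool], lr: float = 0.1) -> List[float]:
    """Gradient step applied only where mask is True; other coordinates stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Hypothetical toy values. 'sensitivity' stands in for whatever signal marks a
# parameter as alignment-critical; the paper's criterion is not reproduced here.
params = [1.0, -2.0, 0.5, 3.0]
grads = [0.5, 0.5, -1.0, 2.0]
sensitivity = [0.9, 0.1, 0.8, 0.2]

mask = top_k_mask(sensitivity, k=2)
updated = masked_step(params, grads, mask)
print(mask)     # which coordinates were selected
print(updated)  # only the selected coordinates have moved
```

In this toy version, the two highest-sensitivity coordinates receive the update and the rest are left untouched, which is the contrast with a uniform constraint applied across the whole model.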

The key claim is that data-centric methods alone (more training examples, better filtering) cannot fix what is fundamentally an optimization geometry problem. The alignment objective shapes a manifold in parameter space, and standard training does not enforce robustness across that manifold. ShaPO attempts to.

The Embedding Attack Surface

Where ShaPO diagnoses a weakness and proposes a fix, SEP exploits a related one. The attack introduces carefully chosen perturbations into embedding vectors associated with high-risk tokens. The result: a 96.43% average bypass rate against safety alignment, while preserving benign functionality and evading conventional detection.

The mechanism SEP exploits is what its authors call “Semantic Shift.” Small perturbations to embedding vectors move the model’s internal representation of a token enough to bypass safety-trained refusal behavior, but not enough to visibly corrupt normal outputs. The attack targets models distributed through platforms like Hugging Face, where basic security scanning does not inspect embedding-layer manipulations.
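
A toy model shows why a small shift in one embedding row can change refusal behavior while leaving benign prompts untouched. Everything here is an assumption made for illustration: the two-dimensional vectors, the linear "refusal probe" (`W`, `B`), and the fixed nudge are invented, and the sketch does not reproduce SEP's search procedure or any real model.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical linear "refusal probe": refuse when w . e + b > 0 for any token.
W: Vector = [1.0, 0.0]
B = -0.5


def token_score(e: Vector) -> float:
    """Score a single token embedding against the toy probe."""
    return sum(w * x for w, x in zip(W, e)) + B


def refuses(prompt: List[str], table: Dict[str, Vector]) -> bool:
    """The toy model refuses if any token in the prompt crosses the boundary."""
    return any(token_score(table[t]) > 0 for t in prompt)


clean = {"weather": [0.0, 1.0], "today": [0.1, 0.3], "risky": [0.55, 0.2]}

# Copy the table and nudge a single row by a small vector (toy values).
shifted = {t: list(v) for t, v in clean.items()}
shifted["risky"] = [x + d for x, d in zip(clean["risky"], [-0.1, 0.0])]

benign = ["weather", "today"]
flagged = ["risky", "today"]

for name, table in (("clean", clean), ("shifted", shifted)):
    print(name, "benign refused:", refuses(benign, table),
          "| flagged refused:", refuses(flagged, table))
```

The benign prompt behaves identically under both tables, so a functionality check that only exercises ordinary prompts would see no difference between them.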

This is distinct from prompt-level jailbreaks or adversarial suffixes, which have dominated alignment-robustness research. Those attacks operate on input tokens. SEP operates on the embedding manifold, the same mathematical surface that alignment training shapes. The defenses and the attacks are fighting over the same territory, but most evaluation frameworks only probe the input layer.

Why Standard Safety Evaluations Miss This

Standard alignment evaluations test whether a model refuses harmful prompts. They probe input-space robustness: can you craft a prompt that gets the model to say something it shouldn’t? This is a necessary but insufficient test. It does not ask whether the model’s internal representations have been perturbed such that safety behavior degrades under distribution shift, or whether embedding-layer modifications have silently moved the model’s decision boundaries.

The ShaPO paper makes this case explicitly: robustness failures in LLM safety alignment cannot be addressed by data-centric methods alone because the fragility lives in the optimization landscape, not in the training data. Evaluations that only vary prompts are probing a different manifold than the one where the vulnerability exists.

SafeMed-R1: Domain-Specific Alignment Shows Thin Margins

A separate result underscores how shallow the margins remain even when alignment is domain-tailored. SafeMed-R1, a medical LLM aligned through clinician-audited supervision and red-team stress testing, reduces unsafe outputs by only 3-5% relative to baseline under adversarial testing. This is in a domain where alignment got focused medical attention, curated supervision, and dedicated red-teaming. The improvement is real but thin.

The SafeMed-R1 result is measured in a medical context and may not generalize to other domains. As a data point, though, it is consistent with the broader picture: alignment as currently practiced produces measurable but narrow gains against adversarial pressure, and those gains evaporate when the attack surface shifts away from what the alignment was trained to defend.

What This Means for EU AI Act Compliance

The EU AI Act’s phased enforcement timeline is underway, and comparable regulatory frameworks share a common assumption: that alignment is something a provider can attest to, a deployer can verify, and a regulator can audit.

These regulatory approaches presume that alignment is a deployable, certifiable control. If architecture-aware attacks like SEP can defeat alignment by operating on a surface that standard evaluations do not inspect, the regulatory baseline collapses to provider self-attestation. Deployers relying on those attestations have no visibility into the model internals being attacked.

No regulator has formally responded to either paper. The regulatory claims are interpretive. But the structural argument is straightforward: certifications that test input-space robustness without probing embedding-level integrity are certifying a subset of the attack surface.

What Deployers Can Actually Do

The research suggests two practical shifts for teams deploying safety-aligned models.

First, embedding-level integrity checks. If models are sourced from public repositories like Hugging Face, SEP demonstrates that the embedding layer is an attack surface. Checksums on model weights are necessary but not sufficient: they confirm that a downloaded file matches what was published, not that the published weights are clean, and scans based on aggregate weight statistics may miss targeted perturbations that shift only specific embeddings. This is a hard problem without a clean solution in current tooling.
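
As a rough sketch of what a row-level check could look like, the snippet below compares a candidate embedding table against a trusted reference copy. It assumes such a reference exists, which is the hard part in practice. The helper names (`row_digest`, `table_digest`, `changed_rows`) and the tiny tables are hypothetical.

```python
import hashlib
import struct
from typing import List

Table = List[List[float]]


def row_digest(row: List[float]) -> str:
    """SHA-256 over the little-endian float64 bytes of one embedding row."""
    return hashlib.sha256(struct.pack(f"<{len(row)}d", *row)).hexdigest()


def table_digest(table: Table) -> str:
    """Single SHA-256 over all rows, analogous to a whole-file checksum."""
    h = hashlib.sha256()
    for row in table:
        h.update(struct.pack(f"<{len(row)}d", *row))
    return h.hexdigest()


def changed_rows(reference: Table, candidate: Table) -> List[int]:
    """Indices of rows whose digest differs from the trusted reference."""
    return [i for i, (r, c) in enumerate(zip(reference, candidate))
            if row_digest(r) != row_digest(c)]


def mean(table: Table) -> float:
    values = [x for row in table for x in row]
    return sum(values) / len(values)


# Toy tables: the candidate differs from the reference in one row only,
# in a way that leaves the overall mean essentially unchanged.
reference = [[0.10, -0.20], [0.30, 0.40], [-0.50, 0.60], [0.70, -0.80]]
candidate = [list(r) for r in reference]
candidate[2] = [-0.49, 0.59]

print("mean shift:", abs(mean(reference) - mean(candidate)))
print("whole-table digests match:", table_digest(reference) == table_digest(candidate))
print("changed rows:", changed_rows(reference, candidate))
```

The whole-table digest only says that something changed. The per-row comparison says where, which is the information needed to map a change back to specific tokens.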

Second, red-teaming that includes distribution-shift probes, not just prompt-level jailbreaks. Standard adversarial evaluation suites do not test for the class of vulnerability ShaPO describes. Building internal evaluations that vary the optimization conditions under which the model is tested, rather than just varying the prompts, would close part of the gap. The research community has not yet standardized these evaluations, so deployers building them are working from first principles.

Both steps shift cost and complexity back onto deployers, who lack the model-internal visibility that providers have. That asymmetry is the structural problem the papers expose, and it is not one that current regulatory frameworks address.

Frequently Asked Questions

Do these attacks work against closed proprietary models, or only open-weight ones?

SEP requires write access to model weights, so it applies to open-weight distributions on hubs like Hugging Face, not to closed APIs. The ShaPO diagnosis of optimization-induced fragility under distribution shift, however, applies to any aligned model regardless of access level. The regulatory gap spans both: providers of closed models cannot certify robustness on surfaces they do not probe either.

How does embedding poisoning differ from adversarial suffix attacks in detectability?

Adversarial suffix attacks append tokens to prompts and are visible in input logs. SEP modifies embedding vectors before deployment, so the compromise persists across all interactions without any special prompt. A model with poisoned embeddings produces unsafe outputs from ordinary requests, leaving no trace in the input-level monitoring that teams typically rely on for safety auditing.

What should readers weigh given that neither paper has peer review?

Neither result has been peer reviewed, so the reported figures, including SEP's 96.43% success rate, should be treated as preliminary. arXiv announced in November 2025 that it will no longer accept unvetted CS review articles due to a rise in AI-generated submissions, and it separates from Cornell University to become an independent nonprofit on July 1, 2026. The SEP result, hosted on a GitHub Pages site with no institutional affiliation, falls outside even arXiv’s lightweight screening. The structural argument about the alignment gap is independently testable regardless.

What tooling exists today for verifying embedding-layer integrity?

Model-hub security scanning checks for malware signatures and aggregate weight anomalies but does not parse individual embedding vectors for semantic perturbations. Verification would require comparing layer-level activations against a trusted reference distribution under controlled inputs. No standardized framework or tool exists for this, so teams sourcing open-weight models must either build a custom verification pipeline or accept an unmonitored attack surface.
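
A custom pipeline of the kind described could start from something like the sketch below, which compares activations from a trusted reference and a candidate model on a fixed set of probe inputs. The "activation" here is just the mean of token embeddings, a stand-in for a real layer output. The tables, probes, and the 0.99 threshold are all illustrative assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def activation(token_ids: List[int], table: Dict[int, Vector]) -> Vector:
    """Toy layer output: the mean of the embeddings for the given tokens."""
    dim = len(next(iter(table.values())))
    return [sum(table[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drifted_probes(probes: List[List[int]], reference: Dict[int, Vector],
                   candidate: Dict[int, Vector], threshold: float = 0.99) -> List[int]:
    """Indices of probes whose candidate activation diverges from the reference."""
    return [i for i, p in enumerate(probes)
            if cosine(activation(p, reference), activation(p, candidate)) < threshold]


# Toy data: the candidate differs from the reference only in token 2's embedding.
reference = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
candidate = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, -1.0]}
probes = [[0, 1], [2], [0, 2]]

drifted = drifted_probes(probes, reference, candidate)
print("probes with activation drift:", drifted)
```

Only probes that include the modified token are flagged, which illustrates why coverage of the probe set matters: a probe set that never touches the affected tokens would report no drift.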

sources · 4 cited

  1. Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control primary accessed 2026-05-28
  2. Search-based Embedding Poisoning (SEP) primary accessed 2026-05-28
  3. Large language model - Wikipedia community accessed 2026-05-28
  4. SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models primary accessed 2026-05-28