Embedding-Level Attacks Bypass LLM Safety Alignment, New arXiv Papers Report

Two recent papers expose the same uncomfortable structural problem in LLM safety alignment: standard evaluations probe one mathematical surface, and the attacks operate on another. ShaPO, updated to v2 on May 21, 2026, proposes a geometry-aware fix for alignment fragility that its authors argue standard evaluations cannot detect. Separately, Search-based Embedding Poisoning (SEP) demonstrates a 96.43% average attack success rate across six safety-aligned models by perturbing embedding vectors rather than prompts. The two results converge on a shared implication: regulatory certifications that treat alignment as a measurable, auditable control are resting on an incomplete threat model.

[Updated June 2026] SEP is no longer the unaffiliated project page this article first described. The work is now on arXiv as 2509.06338 (Yuan et al., submitted September 2025) and is under review at ICLR 2026, with authors at the University of Queensland and Huazhong University of Science and Technology. It is still a preprint, not a peer-reviewed result, but it has cleared arXiv moderation and entered formal review.

What ShaPO Actually Does

ShaPO (Selective Geometry Control) is a defense proposal, not an attack. The paper, submitted by Yonghui Yang on February 7, 2026 and revised May 21, identifies what it calls optimization-induced fragility in standard preference-based alignment methods like RLHF. The argument: RLHF trains a reward model to predict preferred outputs, then optimizes the LLM to satisfy it. This shapes surface-level behavior without guaranteeing robustness to perturbations in the underlying parameter geometry. When distribution shifts occur, the alignment degrades in ways that prompt-level testing will not catch.

ShaPO addresses this by restricting its optimization to alignment-critical parameter subspaces rather than applying uniform constraints across the entire model. It operates at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, and reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Both outperform standard preference optimization methods on safety benchmarks, according to the paper’s own evaluations.
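The general idea of confining updates to a chosen set of parameters can be illustrated with a masked gradient step. This is a minimal sketch, not ShaPO's algorithm: the `masked_sgd_step` helper, the mask, and the toy numbers are hypothetical, and the paper's procedure for identifying alignment-critical subspaces is not reproduced here.

```python
from typing import List


def masked_sgd_step(params: List[float], grads: List[float],
                    mask: List[bool], lr: float = 0.1) -> List[float]:
    """Apply a gradient step only at coordinates where mask is True.

    Hypothetical helper for illustration; coordinates outside the mask
    are returned unchanged.
    """
    if not (len(params) == len(grads) == len(mask)):
        raise ValueError("params, grads and mask must have equal length")
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
# Assumed selection of "alignment-critical" coordinates (made up here).
mask = [True, False, False, True]
updated = masked_sgd_step(params, grads, mask)
```

Only the first and last coordinates move; the rest of the model is left as it was, which is the contrast with a uniform constraint applied everywhere.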

The key claim is that data-centric methods alone (more training examples, better filtering) cannot fix what is fundamentally an optimization geometry problem. The alignment objective shapes a manifold in parameter space, and standard training does not enforce robustness across that manifold. ShaPO attempts to.

The Embedding Attack Surface

Where ShaPO diagnoses a weakness and proposes a fix, SEP exploits a related one. The attack introduces carefully chosen perturbations into embedding vectors associated with high-risk tokens. The result: a 96.43% average bypass rate against safety alignment, while preserving benign functionality and evading conventional detection.

The mechanism SEP exploits is what its authors call embedding semantic shift. Small perturbations to embedding-layer outputs move the model’s internal representation of a token enough to bypass safety-trained refusal behavior, but not enough to visibly corrupt normal outputs. [Updated June 2026] The paper is precise about the threat model, and the precision matters: SEP does not modify model weights and does not modify the input text. It injects perturbations into the embedding layer’s activations at inference, which is the access an adversary gains by controlling an open-source deployment rather than by tampering with a distributed checkpoint. Models served from public hubs are exposed because the runtime around them, not the weight file, is where the perturbation lands, and basic hub security scanning does not inspect that runtime.
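A toy linear model makes the geometry concrete: the input token is identical in both cases, yet a modest shift in its embedding crosses the decision boundary. This is an illustrative sketch only. The refusal direction, the decision rule, and the vectors are invented, and it does not reproduce SEP's search procedure or any real model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Assumed unit-length direction that the toy model reads as "refuse".
REFUSAL_DIRECTION = [0.6, 0.8, 0.0, 0.0]


def toy_refuses(embedding):
    """Toy decision rule: refuse when the projection is positive."""
    return dot(embedding, REFUSAL_DIRECTION) > 0.0


def perturb(embedding, margin=0.05):
    """Shift the vector just past the toy decision boundary."""
    step = dot(embedding, REFUSAL_DIRECTION) + margin
    return [e - step * d for e, d in zip(embedding, REFUSAL_DIRECTION)]


original = [0.3, 0.4, 6.0, 3.0]   # same input token in both cases
perturbed = perturb(original)
delta = [p - o for p, o in zip(perturbed, original)]
relative_change = norm(delta) / norm(original)
```

In this toy setup the decision flips while the vector moves by less than a tenth of its length, and nothing in the input text or an input log would differ between the two runs.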

This is distinct from prompt-level jailbreaks or adversarial suffixes, which have dominated alignment-robustness research. Those attacks operate on input tokens. SEP operates on the embedding manifold, the same mathematical surface that alignment training shapes. The defenses and the attacks are fighting over the same territory, but most evaluation frameworks only probe the input layer.

Why Standard Safety Evaluations Miss This

Standard alignment evaluations test whether a model refuses harmful prompts. They probe input-space robustness: can you craft a prompt that gets the model to say something it shouldn’t? This is a necessary but insufficient test. It does not ask whether the model’s internal representations have been perturbed such that safety behavior degrades under distribution shift, or whether embedding-layer modifications have silently moved the model’s decision boundaries.

The ShaPO paper makes this case explicitly: robustness failures in LLM safety alignment cannot be addressed by data-centric methods alone because the fragility lives in the optimization landscape, not in the training data. Evaluations that only vary prompts are probing a different manifold than the one where the vulnerability exists.

A Convergent Diagnosis From Three Directions

ShaPO is not alone in locating alignment fragility in the optimizer rather than the dataset, and that is the part worth dwelling on. Two other recent preprints arrive at the same place from different starting points. Independent groups, optimizing for different goals, keep landing on parameter and representation geometry as the surface that breaks.

Aligned but Fragile (arXiv 2605.29396, May 2026) shows that a first-order safety-aligned model can shed its refusal behavior under perturbations as mundane as parameter noise or post-training quantization. Its proposed fix is a hybrid: standard first-order alignment followed by zeroth-order refinement applied only to the layers it identifies as robustness-critical, rather than retraining the whole network. The practical sting is in the failure case, not the fix. A model that passes its safety evals at full precision can lose those guarantees once a deployer runs the routine 4-bit quantization step that most edge and on-device deployments depend on. The alignment was real. It just did not survive a transformation nobody thought of as a safety event.
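To see why quantization counts as a perturbation of the parameters, consider a simplified symmetric round trip through 4-bit integers. This is a sketch under stated assumptions: the `quantize_dequantize` helper and the weights are hypothetical, real quantizers use per-group scales and other refinements, and the paper's own experiments are not reproduced here.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric uniform quantization round trip.

    Simplified assumption for illustration, not any specific
    library's scheme.
    """
    levels = 2 ** (bits - 1) - 1          # 7 positive levels for 4-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


weights = [0.70, -0.04, 0.03, -0.33]
restored = quantize_dequantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The two small-magnitude weights collapse to zero after the round trip. Whether a given model's refusal behavior depends on values that small is an empirical question, which is why the quantized artifact needs its own evaluation.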

An earlier line of work, ALKALI and its GRACE method (arXiv 2506.08885, June 2025), names the geometric failure directly. The authors call it latent camouflage: an unsafe completion that mimics the internal geometric structure of a safe one and slides past preference-based defenses like DPO. GRACE (Geometric Representation-Aware Contrastive Enhancement) couples preference learning with a latent-space regularizer that pushes safe and adversarial embeddings apart while pulling unsafe behaviors together, reporting up to a 39% reduction in attack success across 21 open and closed models without retraining the base weights. Its companion metric, the Adversarial Vulnerability Quality Index (AVQI), scores how cleanly a model’s latent space separates safe from unsafe clusters. That is a diagnostic an input-space refusal test cannot produce, because it reads the representation rather than the response.
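A latent-separation diagnostic can be sketched with a simple ratio of between-cluster distance to within-cluster spread. This is a stand-in and not the definition of the Adversarial Vulnerability Quality Index: the `separation_score` function and the 2-D points are hypothetical, chosen only to show what reading the representation rather than the response looks like.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_score(safe, unsafe):
    """Centroid distance divided by mean distance to own centroid.

    Higher means the two clusters are more cleanly separated.
    Illustrative stand-in metric, not the published index.
    """
    c_safe, c_unsafe = centroid(safe), centroid(unsafe)
    within = sum(distance(v, c_safe) for v in safe)
    within += sum(distance(v, c_unsafe) for v in unsafe)
    spread = within / (len(safe) + len(unsafe))
    between = distance(c_safe, c_unsafe)
    return between / spread if spread > 0 else math.inf


safe = [[0.0, 0.0], [0.0, 2.0]]
unsafe_far = [[10.0, 0.0], [10.0, 2.0]]    # well separated from safe
unsafe_near = [[1.0, 0.0], [1.0, 2.0]]     # "camouflaged" next to safe
```

On these toy points the far cluster scores higher than the near one, so an unsafe cluster that sits close to the safe cluster shows up as a low score even if every sampled response looked acceptable.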

Read together, ShaPO, Aligned but Fragile, and GRACE describe one failure mode in three vocabularies: optimization geometry, optimizer fragility, and latent-representation separation. None of the three has cleared peer review, and their headline numbers are not comparable to each other (different model sets, threat models, and judge configurations). But agreement across independent setups is itself a signal, and the attack literature is converging on the same coordinates from the offensive side. SEP perturbs embedding-layer outputs; separate work on activation steering repurposes those same internal vectors as a control channel. Defenders and attackers are now fighting over representation geometry, while the standard evaluation harness is still standing at the input layer, varying prompts.

SafeMed-R1: Domain-Specific Alignment Shows Thin Margins

A separate result underscores how shallow the margins remain even when alignment is domain-tailored. SafeMed-R1, a medical LLM aligned through clinician-audited supervision and red-team stress testing, reduces unsafe outputs by only 3-5% relative to baseline under adversarial testing. This is in a domain where alignment got focused medical attention, curated supervision, and dedicated red-teaming. The improvement is real but thin.

The SafeMed-R1 result is measured in a medical context and may not generalize to other domains. As a data point, though, it is consistent with the broader picture: alignment as currently practiced produces measurable but narrow gains against adversarial pressure, and those gains evaporate when the attack surface shifts away from what the alignment was trained to defend.

What This Means for EU AI Act Compliance

The EU AI Act’s phased enforcement timeline is underway. The Act and comparable regulatory frameworks share a common assumption: that alignment is something a provider can attest to, a deployer can verify, and a regulator can audit.

These regulatory approaches presume that alignment is a deployable, certifiable control. If architecture-aware attacks like SEP can defeat alignment by operating on a surface that standard evaluations do not inspect, the regulatory baseline collapses to provider self-attestation. Deployers relying on those attestations have no visibility into the model internals being attacked.

No regulator has formally responded to either paper. The regulatory claims are interpretive. But the structural argument is straightforward: certifications that test input-space robustness without probing embedding-level integrity are certifying a subset of the attack surface. The same gap shows up when safety lives at inference rather than in the weights: a control that a provider can attest to at training time says little about what the deployed runtime actually enforces.

What Deployers Can Actually Do

The research suggests three practical shifts for teams deploying safety-aligned models.

First, embedding-level integrity checks. If models are sourced from public repositories like Hugging Face and served in-house, SEP demonstrates that the embedding layer is an attack surface. Checksums on model weights are necessary but do not catch SEP-style perturbations, which land on embedding-layer activations at inference and leave the weight file untouched. Verifying what the serving runtime does to those activations is a hard problem without a clean solution in current tooling.
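For reference, the weight checksum itself is easy to compute with the Python standard library. The sketch below assumes the publisher provides a SHA-256 digest for the weight file; the function names are hypothetical. It confirms that the file on disk matches what was published and says nothing about what the serving runtime does to activations afterwards.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_match(path, expected_hex):
    """True when the file's SHA-256 equals the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `weights_match("model.safetensors", published_digest)` would gate loading, where both the file name and `published_digest` are placeholders.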

Second, red-teaming that includes distribution-shift probes, not just prompt-level jailbreaks. Standard adversarial evaluation suites do not test for the class of vulnerability ShaPO describes. Building internal evaluations that vary the optimization conditions under which the model is tested, rather than just varying the prompts, would close part of the gap. The research community has not yet standardized these evaluations, so deployers building them are working from first principles.

Third, re-run safety evaluations after every deployment-time transformation, not just at acceptance. Aligned but Fragile shows that quantization, parameter noise, and similar post-training steps can strip alignment a model demonstrably had at full precision. A safety eval that passes before 4-bit quantization is not evidence the quantized artifact is safe. Latent-space diagnostics in the style of AVQI, which score how well a model’s internal representations separate safe from unsafe behavior, are one way to catch this class of regression that a prompt-only suite will miss, though no productized tool ships such a check today.
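One way to operationalize re-evaluation is a small harness that measures refusal rate after every transformation in the deployment pipeline. This is a sketch under stated assumptions: models are represented as plain callables from prompt to response, and `is_refusal`, the transform names, and the floor value are all hypothetical stand-ins for a team's real evaluation suite.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]


def refusal_rate(model: Model, prompts: List[str],
                 is_refusal: Callable[[str], bool]) -> float:
    if not prompts:
        raise ValueError("at least one prompt is required")
    return sum(1 for p in prompts if is_refusal(model(p))) / len(prompts)


def evaluate_after_each(model: Model,
                        transforms: Dict[str, Callable[[Model], Model]],
                        prompts: List[str],
                        is_refusal: Callable[[str], bool],
                        floor: float) -> List[str]:
    """Return the stages whose refusal rate falls below the floor.

    Transforms are applied cumulatively, in insertion order, after an
    untransformed "baseline" stage.
    """
    failures = []
    current = model
    stages = [("baseline", lambda m: m)] + list(transforms.items())
    for name, transform in stages:
        current = transform(current)
        if refusal_rate(current, prompts, is_refusal) < floor:
            failures.append(name)
    return failures


# Toy stand-ins: the base model always refuses, the "quantized" one never does.
def base_model(prompt: str) -> str:
    return "I can't help with that."


def broken_quantize(model: Model) -> Model:
    return lambda prompt: "Sure, here is how."


failed = evaluate_after_each(
    base_model,
    {"quantize": broken_quantize},
    ["probe 1", "probe 2"],
    lambda response: response.startswith("I can't"),
    floor=0.9,
)
```

In this toy run the baseline passes and the quantized stage is flagged, which is the regression an acceptance-time-only evaluation would not surface.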

All three steps shift cost and complexity back onto deployers, who lack the model-internal visibility that providers have. That asymmetry is the structural problem the papers expose, and it is not one that current regulatory frameworks address.

Frequently Asked Questions

Do these attacks work against closed proprietary models, or only open-weight ones?

[Updated June 2026] SEP does not modify model weights at all, a detail the original version of this answer got wrong. It perturbs the embedding layer’s activations at inference, which means the adversary needs control of the deployment runtime, not a tampered checkpoint. That threat model fits self-hosted open-source deployments rather than a closed API you do not operate. The ShaPO diagnosis of optimization-induced fragility under distribution shift, however, applies to any aligned model regardless of access level. The regulatory gap spans both: providers of closed models cannot certify robustness on surfaces they do not probe either.

How does embedding poisoning differ from adversarial suffix attacks in detectability?

Adversarial suffix attacks append tokens to prompts and are visible in input logs. SEP perturbs embedding-layer activations at inference inside a compromised deployment, so the manipulation persists across interactions without any special prompt. A model fed poisoned embeddings produces unsafe outputs from ordinary requests, leaving no trace in the input-level monitoring teams typically rely on for safety auditing. Catching it would require watching internal signals instead of outputs, the same shift behind work on reading per-layer entropy rather than the final response.

What should readers weigh given that neither paper has been peer reviewed?

[Updated June 2026] Treat both as preprints, but note that the bar has shifted. SEP is now on arXiv (2509.06338) and under review at ICLR 2026, with authors at the University of Queensland and Huazhong University of Science and Technology, so the earlier characterization of it as an unaffiliated project page outside arXiv’s screening no longer holds. ShaPO remains an arXiv preprint with public code. arXiv itself has tightened the screening that gets work this far: it stopped accepting unvetted CS review and survey articles in late 2025, began requiring first-time submitters to clear an endorsement check in January 2026, and becomes an independent nonprofit, separate from Cornell, on July 1, 2026. Clearing moderation is not peer review. The structural argument about the alignment gap is independently testable regardless of either paper’s review status.

What tooling exists today for verifying embedding-layer integrity?

Model-hub security scanning checks for malware signatures and aggregate weight anomalies but does not parse individual embedding vectors for semantic perturbations. Verification would require comparing layer-level activations against a trusted reference distribution under controlled inputs. No standardized framework or tool exists for this, so teams sourcing open-weight models must either build a custom verification pipeline or accept an unmonitored attack surface.
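A custom verification pipeline of the kind described might start by comparing activations captured under fixed probe inputs against a trusted reference. The sketch below is an assumption-laden illustration: the probe names, the vectors, and the 0.99 cosine threshold are made up, and capturing activations from a real serving stack (and trusting the capture path itself) is not shown.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def drifted_probes(reference, observed, min_cosine=0.99):
    """Names of probes whose observed activation drifted from the reference.

    Both arguments map a probe name to an activation vector recorded
    under the same controlled input.
    """
    return sorted(
        name for name, ref in reference.items()
        if cosine(ref, observed[name]) < min_cosine
    )


reference = {"probe_a": [1.0, 0.0, 0.0], "probe_b": [0.0, 1.0, 0.0]}
observed = {"probe_a": [1.0, 0.0, 0.0], "probe_b": [0.0, 0.8, 0.6]}
flagged = drifted_probes(reference, observed)
```

Here only `probe_b` is flagged. A single cosine threshold is a crude choice; a production check would more likely compare against a distribution of reference activations, as the answer above suggests.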