Jailbreak Suffixes Hit Harder at Specific Token Positions, New GCG Variant Shows

Greedy Coordinate Gradient (GCG) jailbreaks have been treated as a content problem: find the right adversarial tokens, stick them at the end of the prompt, and optimize until the model complies. SlotGCG, accepted at ICLR 2026 and submitted June 4, demonstrates that where those tokens sit matters as much as what they say. The attack achieves a 14% higher success rate than standard GCG and a 42% higher rate against defended models, simply by searching for the most vulnerable position inside the prompt rather than defaulting to the suffix.

What SlotGCG changes about jailbreak assumptions

Standard GCG-based attacks append adversarial tokens to the end of a user prompt and optimize them via gradient-guided token substitution. The suffix placement is treated as a fixed constraint: the attacker chooses the content, and the position is assumed to be the tail of the input. SlotGCG relaxes that constraint. It introduces a set of candidate “slots” throughout the prompt and evaluates which insertion point yields the highest likelihood of a jailbreak before running optimization.
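A minimal sketch of what "candidate slots" could mean in practice, assuming a slot is simply a gap between two prompt tokens (including the gap before the first token and after the last). The function names `candidate_slots` and `insert_at_slot` and the placeholder tokens are hypothetical illustrations, not the paper's code, and real attacks operate on tokenizer IDs rather than whitespace-split words.

```python
from typing import List


def candidate_slots(prompt_tokens: List[str]) -> List[int]:
    """Return every gap index: 0 is before the first token, len() is the tail."""
    return list(range(len(prompt_tokens) + 1))


def insert_at_slot(prompt_tokens: List[str], adv_tokens: List[str], slot: int) -> List[str]:
    """Place the adversarial tokens at the given gap without altering the prompt."""
    if not 0 <= slot <= len(prompt_tokens):
        raise ValueError(f"slot {slot} is outside 0..{len(prompt_tokens)}")
    return prompt_tokens[:slot] + adv_tokens + prompt_tokens[slot:]


# Toy prompt and placeholder adversarial tokens (both invented for illustration).
prompt = "Explain how the device works".split()
adv = ["<adv1>", "<adv2>"]

slots = candidate_slots(prompt)
suffix_variant = insert_at_slot(prompt, adv, len(prompt))  # standard GCG placement
mid_variant = insert_at_slot(prompt, adv, 2)               # one non-suffix slot
```

Under this framing, standard GCG is the special case where the slot is always `len(prompt)`.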

The core insight: different positions within the same prompt have wildly different vulnerability profiles. A suffix that barely works at the tail can become highly effective when inserted earlier, near the system prompt boundary or between instruction segments. This is not a property of the tokens themselves; it is a property of how the model attends to tokens at different positional offsets.

How Vulnerable Slot Score finds the weakest slot

SlotGCG’s contribution is a lightweight metric called the Vulnerable Slot Score (VSS). For each candidate insertion position, VSS estimates how much a perturbation at that location would push the model toward compliance. The attacker ranks slots by VSS, selects the highest-scoring ones, and then runs standard GCG optimization at those positions.
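The rank-then-select step can be sketched as follows. The article does not give the VSS formula, so the scoring function here is a caller-supplied placeholder and the scores are invented toy numbers; only the ranking and top-k selection are illustrated. `rank_slots` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def rank_slots(slots: Sequence[int], score_fn: Callable[[int], float], top_k: int) -> List[int]:
    """Sort slots by descending score and keep the top_k highest."""
    ordered = sorted(slots, key=score_fn, reverse=True)
    return list(ordered[:top_k])


# Invented per-slot scores standing in for a real vulnerability metric.
toy_scores = {0: 0.12, 1: 0.80, 2: 0.35, 3: 0.05, 4: 0.61}

# Only the selected slots would go on to receive the expensive optimization.
selected = rank_slots(list(toy_scores), toy_scores.__getitem__, top_k=2)
```

Passing the scorer as a callable keeps the selection logic independent of how the score is computed, which matches the claim that the slot search can wrap other optimization-based attacks.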

The paper reports a preprocessing overhead of approximately 200ms per prompt. An attacker pays that cost once per target prompt, a small price next to the GCG optimization that follows. A defender who wanted to run an equivalent scan over every position of every incoming request would pay it on each request, and at production throughput that cost compounds quickly.

Critically, the slot-search mechanism is attack-agnostic. It can wrap any optimization-based adversarial method, not just GCG, which means the positional-vulnerability signal is reusable across attack families.

Why suffix-only defenses miss the strongest attacks

Most input-level defenses deployed today assume that adversarial content occupies a predictable region, typically the prompt suffix. Perplexity-based filters flag sequences with statistically unlikely token patterns, which is effective against standard GCG suffixes because the optimized tokens are, by construction, low-probability under normal language models. The filtering adds under 50ms latency and has been proposed as an opt-in feature in open-source guardrail tooling.
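For readers unfamiliar with the mechanism, a perplexity filter can be sketched in a few lines, assuming per-token log-probabilities (natural log) have already been obtained from some reference language model. The log-prob values and the threshold below are invented for illustration; production filters tune the threshold on real traffic.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_flagged(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a sequence whose perplexity exceeds the chosen threshold."""
    return perplexity(token_logprobs) > threshold


# Toy values: fluent text gets moderate log-probs, optimized gibberish very low ones.
natural_text = [-2.0] * 10
adversarial_suffix = [-9.0] * 10
THRESHOLD = 1000.0  # hypothetical cutoff

natural_flagged = is_flagged(natural_text, THRESHOLD)
suffix_flagged = is_flagged(adversarial_suffix, THRESHOLD)
```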

But perplexity filtering depends on the adversarial tokens forming a contiguous, detectable anomaly. When the same tokens are relocated to a position where the model is already more vulnerable, fewer adversarial tokens may be needed, or they may blend more naturally with surrounding context. SlotGCG exploits exactly this gap: the tokens that trigger a jailbreak at the optimal position may not trigger a perplexity threshold designed for suffix-shaped anomalies.

The defense assumption that “adversarial content lives at the tail” is now explicitly broken. Suffix-focused input sanitization and fixed-region perplexity checks are defending the wrong boundary.

Prior evidence that position matters

SlotGCG is not the first signal that token placement affects jailbreak success. Prior work catalogued in the Promptfoo vulnerability database showed that on Vicuna-7b, attack success rate jumps from 83% (suffix-only) to 99% when prefix configurations are allowed. Relocating an existing adversarial suffix to the beginning of the prompt also increases cross-model transferability: attacks optimized against Llama-2 transfer more effectively to Vicuna when placed at the prompt head.

Models confirmed vulnerable to positional attack variants include DeepSeek-7b-chat, Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Llama-2-7b-chat-hf, and Vicuna-7b-v1.5, all documented in the same Promptfoo entry. These are 7B-scale open-weight models; whether the positional vulnerability signal holds at 70B or proprietary-API scale remains untested as of June 2026.

What deployers should do now

The Promptfoo analysis recommends three concrete mitigations that align with what SlotGCG exposes:

1. Test adversarial tokens at all positions, not just suffixes. Red-team evaluations that only append adversarial strings to the prompt tail are incomplete. A position-sweep adds coverage for exactly the vulnerability SlotGCG exploits.
2. Monitor internal refusal directions, not just attention scores at later layers. Positional attacks likely disrupt refusal behavior at specific intermediate layers, not only at the output head. Observing the model’s internal activation directions during adversarial probing provides a more robust signal than attention-pattern heuristics.
3. Incorporate prefix-based and mid-prompt adversarial examples into safety fine-tuning data. If safety training only sees suffix-position attacks, the model remains vulnerable at other positions. Training-data diversity across insertion points is the structural fix.
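The first mitigation, a position sweep, can be sketched as a small evaluation loop. This assumes the team already has a known test string and a judge that decides whether a model response counts as a failure; here the judge is a toy stub (`stub_judge`, hypothetical) that never calls a model, so the output only demonstrates the bookkeeping.

```python
from typing import Callable, Dict, List


def position_sweep(
    prompt_tokens: List[str],
    test_tokens: List[str],
    judge: Callable[[List[str]], bool],
) -> Dict[int, bool]:
    """Evaluate the same test tokens at every gap in the prompt."""
    results = {}
    for slot in range(len(prompt_tokens) + 1):
        candidate = prompt_tokens[:slot] + test_tokens + prompt_tokens[slot:]
        results[slot] = judge(candidate)
    return results


def stub_judge(tokens: List[str]) -> bool:
    """Toy stand-in for a real model call plus refusal check."""
    return len(tokens) > 1 and tokens[1] == "<adv>"


prompt = ["summarize", "this", "report"]
report = position_sweep(prompt, ["<adv>"], stub_judge)

# Slots that a suffix-only evaluation would never have exercised.
suffix_slot = len(prompt)
missed_by_suffix_only = [s for s, hit in report.items() if hit and s != suffix_slot]
```

In a real harness the judge would wrap the deployed model and its refusal classifier, and the per-slot results would feed the red-team report.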

For teams running perplexity filters in production, the practical implication is straightforward: a filter that only scans the last N tokens of a prompt is bypassable by an attacker who runs VSS, finds the optimal slot, and places adversarial tokens there instead. The defense must either scan the full input or accept that position-aware attacks will evade region-limited checks.
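The coverage gap can be illustrated with a toy comparison between a tail-only check and a sliding-window scan over the whole input. The log-probabilities, window size, and threshold are invented for illustration, and the function names are hypothetical. The sketch only shows the region-coverage problem; it does not model the case where relocated tokens blend in well enough to pass a full scan too.

```python
import math
from typing import Sequence


def window_perplexity(logprobs: Sequence[float], start: int, size: int) -> float:
    """Perplexity of the tokens in logprobs[start:start + size]."""
    window = logprobs[start:start + size]
    if not window:
        raise ValueError("empty window")
    return math.exp(-sum(window) / len(window))


def tail_only_flag(logprobs: Sequence[float], n: int, threshold: float) -> bool:
    """Check only the last n tokens, as a suffix-oriented filter would."""
    start = max(0, len(logprobs) - n)
    return window_perplexity(logprobs, start, n) > threshold


def full_scan_flag(logprobs: Sequence[float], n: int, threshold: float) -> bool:
    """Slide an n-token window across the entire input."""
    last_start = max(0, len(logprobs) - n)
    return any(window_perplexity(logprobs, s, n) > threshold for s in range(last_start + 1))


# Toy input: 5 benign tokens, 4 low-probability tokens mid-prompt, 11 benign tokens.
logprobs = [-2.0] * 5 + [-9.0] * 4 + [-2.0] * 11

tail_result = tail_only_flag(logprobs, n=4, threshold=1000.0)
full_result = full_scan_flag(logprobs, n=4, threshold=1000.0)
```

With these toy numbers the tail-only check sees only benign tokens and passes the input, while the sliding window catches the anomalous span at position 5.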

SlotGCG does not introduce a new class of attack; GCG remains the underlying optimization method. What it demonstrates is that the positional assumption baked into both attacks and defenses is a lever that matters. The 200ms cost of finding the weakest slot is a small price for an attacker and a nontrivial line item for a defender who now has to cover more surface area with the same budget.

Frequently Asked Questions

Does SlotGCG work against closed-source models like GPT-4 or Claude?

The paper evaluates only open-weight models in the 7B parameter range. Computing VSS requires gradient access to the model’s internals, so an attacker with API-only access to a proprietary model cannot run the slot-ranking step at all. Whether closed-source models exhibit similar positional vulnerability profiles remains an open question.

How does VSS differ from brute-forcing GCG at every possible position?

Running full GCG optimization at each candidate slot would multiply attack cost by the number of slots. VSS is a gradient-based saliency proxy that ranks positions cheaply before any optimization begins. Only the top-ranked slots then receive expensive GCG runs. The reported 200ms overhead is the cost of this lightweight ranking pass, not a full optimization sweep.

What does full-input positional scanning cost a defender at production throughput?

A service handling 1,000 requests per second would need roughly 200 VSS-equivalent scans in flight at any moment (1,000 requests per second times 0.2 seconds per scan). Teams currently relying on tail-only perplexity checks at under 50ms per request face at least a 4x increase in per-request defense latency if they extend coverage to all positions.
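The arithmetic behind those figures, as a short sketch. The concurrency estimate is Little's law (average work in flight equals arrival rate times time per item), and it assumes each scan takes the full 200ms and that scans for different requests do not share work. The helper names are hypothetical.

```python
def concurrent_scans(requests_per_second: int, scan_ms: int) -> float:
    """Little's law: average scans in flight = arrival rate x time per scan."""
    return requests_per_second * scan_ms / 1000


def latency_ratio(new_ms: float, old_ms: float) -> float:
    """How many times slower the new per-request check is than the old one."""
    return new_ms / old_ms


in_flight = concurrent_scans(requests_per_second=1000, scan_ms=200)
slowdown = latency_ratio(new_ms=200, old_ms=50)
```

Because the existing check is described as taking under 50ms, the computed ratio is a lower bound on the slowdown.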

Can an attacker reuse the same vulnerable slot across different models?

Adversarial tokens placed at the prompt head transfer from Llama-2 to Vicuna, suggesting some positional correlation across architectures in the same weight class. However, VSS scores are model-specific because they depend on each model’s internal gradient directions. The highest-scoring slot on Llama-2-7b may not be the highest on Qwen2.5-7B, even though both are vulnerable to positional attacks in general.