The defense model for most deployed LLM safety systems still assumes the attack is a recognizable string. A preprint posted to arXiv on June 25, 2026 by Prarabdh Shukla demonstrates why that assumption is eroding: by casting jailbreak selection as a multi-armed bandit problem, the paper shows that automated loops can converge on high-performing attack variants without a human red-teamer selecting them. The practical consequence is that the gap between a script-runner and a capable adversary narrows significantly.
What the paper actually does
The paper frames the jailbreak selection problem this way: given a pool of candidate jailbreaks and a target model, instead of exhaustively evaluating all n jailbreaks across T queries (O(nT) query complexity), a bandit algorithm runs a short exploration phase across a small sample of queries, learns which jailbreaks perform best, then exploits that knowledge on the remaining queries. The reduction to O(T) complexity matters because the attacker pays a fixed exploration cost regardless of how large the jailbreak library is.
The paper tests five algorithms from the bandit family: EXP3, Thompson Sampling, LinUCB, LinearCB, and SquareCB. The evaluation runs against 15 open-weight LLMs.
The benchmark itself is worth noting. FrankensteinBench, assembled for this paper, contains 11,279 malicious queries drawn from seven existing benchmarks, with automated query enhancement applied on top. The query-complexity classifier the authors built to sort enhanced from unenhanced queries reached 89.2% accuracy on held-out validation, per the preprint. That classifier is itself an automatable component, which matters for the skill-floor argument.
What the numbers actually say (and what they don’t)
The headline figure is 97% attack success rate, averaged across 15 open-weight models under the Transfer Attack scenario. Single-shot success rates against individual models are lower and vary by algorithm and target.
The contrast with static jailbreak selection is direct. A blocklist catches known strings; an adaptive bandit selects whichever variant the static list doesn’t block. The paper demonstrates the approach outperforms naive uniform sampling and brute-force strategies.
The query enhancement finding adds a separate attack surface: adding complexity to queries raises ASR by up to 26 percentage points on average across models. That lift is automatable through the same complexity classifier, meaning an attacker who skips jailbreak selection entirely and focuses on query reformulation still gets a measurable gain.
Why the skill floor drops
The key design decision in the bandit approach is that the exploration phase does not require knowing which jailbreak will work in advance. The algorithm treats each candidate jailbreak as an “arm,” observes a noisy reward signal (did the model comply?), then updates its selection probabilities accordingly. An attacker who cannot read the target model’s weights and has no red-team intuition still converges on a working attack, given a jailbreak library and a few dozen queries.
The paper formalizes two attack scenarios to capture different operational contexts. In a Transfer Attack, the bandit policy is learned offline on an exploration set, then applied unchanged to the actual target queries. In a Continual Attack, the policy updates throughout exploitation as well, interleaving exploration and exploitation on the same target. The domain- and model-transfer ablations show minimal ASR loss when the exploration phase runs on a non-target domain or a proxy model. An attacker without direct access to the target model can learn on a cheaper proxy, then transfer the policy.
The skill-floor implication is direct: the bandit loop performs the per-model, per-query tuning that previously required a red-teamer to read outputs, recognize which jailbreak category the model was responsive to, and iterate manually. That iteration is now algorithmic.
Why static blocklists fail against this threat model
The standard jailbreak blocklist defense enumerates known attack strings or templates and rejects prompts that match. This works when the attack is a fixed template. The bandit framing breaks that assumption by operating over a variable jailbreak pool, selecting different attacks for different queries, and learning which variants the current target is not blocking.
The bandit is not smarter than the jailbreaks it has available; it finds the best one in the pool, and the best one is rarely the one a blocklist was written for. The paper’s evaluation confirms the bandit consistently outperforms naive uniform sampling from the same jailbreak pool.
Query enhancement compounds this. If the attack surface is the semantic content of the query rather than its template structure, template matching catches nothing. A complexity-enhanced query that paraphrases a harmful request into a multi-part scenario with plausible framing shares no structural signature with the original malicious prompt.
What should defenders move to?
Output-side monitoring is the implied direction: evaluate what the model says rather than what the user asked. A response that provides synthesis instructions for a controlled substance is harmful regardless of how the prompt was structured to elicit it. Input-side filtering and output-side monitoring address different threat models, and the bandit attack lives almost entirely on the input side.
Per-response monitoring has real costs, it requires a reliable harm classifier operating on outputs, which is a different engineering problem from prompt filtering, and it adds latency. But a defense that can be defeated by rotating through a jailbreak pool is not a defense; it is a list.
The Transfer Attack and Continual Attack framing also suggests that rate-limiting and anomaly detection on query patterns could add friction. An exploration phase generates a burst of queries across diverse jailbreak templates; that distribution looks different from normal usage. Whether this signal is detectable at production scale is not addressed in the paper, but it is a direction that blocklist-based defenses cannot support at all.
Where this fits in the prior literature
Bandit-based jailbreaking predates this paper. h4rm3l (arXiv:2408.04811), submitted in August 2024, combined bandit-based few-shot program synthesis with a composable attack DSL; it generated 2,656 novel jailbreaks with ASR exceeding 90% on claude-3-haiku and GPT-4o. The current paper’s contribution is the selection-plus-enhancement pipeline and FrankensteinBench as a consolidated evaluation framework, not automated jailbreaking as a concept.
RAILS (arXiv:2601.03420), published in January 2026, demonstrated gradient-free and prior-free jailbreaks with transfer properties, filling the niche of attacks that work without model internals access. Jailbreak Mimicry (arXiv:2510.22085) automated discovery of narrative-based attack frames. The current paper’s contribution sits in the selection layer above these: given any jailbreak library, learn which item to apply per query.
The cumulative picture is a layered attack stack. Prior work solved the generation problem (create novel jailbreaks) and the transfer problem (attacks that work without target internals). This paper addresses the selection problem (given a menu, pick the best item per query). A mature automated attacker now has all three layers available off-the-shelf.
What the paper doesn’t cover
The paper’s stated limitations are hard boundaries on inference. Proprietary models were not evaluated due to access and cost constraints; results for GPT-4o, Claude, and Gemini are entirely unverified. The attacks are single-turn and English-only, which excludes multi-turn context manipulation and non-English populations.
The benchmark’s composition limits generalization. FrankensteinBench queries were assembled from existing safety benchmarks and further enhanced specifically to elicit harmful outputs; those harm rates don’t transfer to real-world traffic distributions where malicious queries are rare.
The paper also does not address countermeasures in any detail. Whether the query burst pattern from an exploration phase is detectable, what throughput the bandit approach requires to converge, and what happens when the target model is updated mid-attack are open questions. The research establishes that automated selection is viable; the defender’s engineering response remains an open problem.
Frequently Asked Questions
Does the bandit approach remain effective against safety-aligned models, or does stronger alignment narrow the gap?
The gap widens rather than narrows. On safety-aligned models specifically, Thompson Sampling reached up to 49% ASR while the best static jailbreak in the same pool yielded only 6%. Alignment suppresses fixed-template attacks more than it suppresses adaptive selection, because the bandit concentrates queries on the rare variants that still get through.
What happens to attack success rates if the attacker’s jailbreak library is mostly weak?
Bandit algorithms held above 50% ASR even when restricted to a weak jailbreak pool; uniform sampling from the same pool dropped to 20%. The bandit’s relative advantage is largest precisely when pool quality is uneven, because it routes queries away from the arms that consistently fail and toward the few that do not.
What is the actual scale of the evaluation behind these results?
The evaluation covers 70 distinct jailbreaks spanning six high-stakes domains, models ranging from 270 million to 120 billion parameters, and approximately 12 million query-response pairs. That breadth is why the 97% figure is reported as a broad average rather than a result tuned against a single cooperative target.
Which bandit variant underperformed, and what does that reveal about the method’s limits?
Contextual bandit algorithms, which incorporate per-query feature information rather than treating all queries identically, underperformed the simpler stateless variants. The result suggests the compliance reward signal is too noisy for per-query context to reduce variance below what a stateless algorithm already achieves, which sets a ceiling on how much per-query personalization helps in this setting.
What does FrankensteinBench’s baseline harm rate mean for real-world risk estimates?
Without any jailbreak applied, FrankensteinBench shows a 44% baseline ASR, reflecting deliberate curation from existing safety benchmarks rather than natural traffic. In production, malicious queries arrive at far lower base rates, so these ASR figures describe attack potential against a concentrated harmful-query set and do not translate to expected harm rates in deployed services.