Why Audio Jailbreaks Slip Past the Safety Training Built for Text LLMs

Audio jailbreaks work because the safety training was built for text. The guardrails that refuse harmful prompts in a chat window were tuned on text corpora, evaluated on text benchmarks, and deployed on the assumption that whatever modality the user chooses, the underlying language model’s refusal behavior holds. A taxonomy paper on arXiv, submitted May 28, shows why that assumption breaks: audio-channel attacks exploit four distinct surfaces that text-side RLHF never trained against, and the benchmarks now quantifying the gap suggest vendors shipping voice interfaces carry an alignment debt they have not acknowledged.

The four attack surfaces text RLHF never saw

The taxonomy paper organizes audio jailbreak attacks into four categories: semantic, acoustic, signal-level, and embedding-layer. Each targets a different stage of the audio-to-token pipeline, and text-side RLHF touches none of them.

Semantic attacks manipulate the content of the spoken prompt, using narrative framing or roleplay to push the model past its refusal boundary. These are close cousins of text jailbreaks, but the audio channel introduces variability in delivery, accent, and pacing that text-trained guardrails do not account for.

Acoustic attacks exploit properties of the sound itself: pitch, cadence, stress patterns, background noise layered under speech. The model processes these as features, not noise, because its audio encoder was trained to extract meaning from acoustic variation.

Signal-level attacks operate on the raw waveform, injecting perturbations at sample-level precision that are inaudible or near-inaudible to human listeners but alter the encoder’s representation enough to shift the model’s output. AudioJailbreak (accepted by IEEE TDSC) demonstrated that crafted audio perturbations can jailbreak OpenAI’s GPT-4o-Audio and bypass Meta’s Llama-Guard-3, even in a weak-adversary scenario where the attacker cannot fully manipulate the user’s prompt. The same paper found that advanced text jailbreak attacks “cannot be easily ported to end-to-end LALMs via text-to-speech techniques,” meaning the attack surface is genuinely audio-native, not a text attack wearing a speech costume.
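One way to make "inaudible or near-inaudible" concrete is to measure how small a perturbation is relative to the signal it rides on. The sketch below is an illustration, not a method from AudioJailbreak: the tone, the sample rate, the bound EPSILON, and the alternating-sign placeholder perturbation are all assumptions chosen for the example. A real attack would optimize the perturbation against the encoder, which this code does not do.

```python
import math

def snr_db(clean, perturbed):
    """Signal-to-noise ratio in dB, treating (perturbed - clean) as noise."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((p - c) ** 2 for c, p in zip(clean, perturbed)) / len(clean)
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)

def linf_norm(clean, perturbed):
    """Largest per-sample change, the usual budget for sample-level perturbations."""
    return max(abs(p - c) for c, p in zip(clean, perturbed))

SAMPLE_RATE = 16_000   # assumed sample rate (Hz)
EPSILON = 0.001        # assumed per-sample perturbation bound

# One second of a 440 Hz tone at amplitude 0.5 stands in for speech.
clean = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
         for n in range(SAMPLE_RATE)]

# Placeholder perturbation at the bound; NOT an optimized adversarial signal.
perturbed = [x + (EPSILON if n % 2 == 0 else -EPSILON)
             for n, x in enumerate(clean)]

print(f"L-inf norm: {linf_norm(clean, perturbed):.4f}")
print(f"SNR: {snr_db(clean, perturbed):.1f} dB")  # about 51 dB for these values
```

Reporting both the per-sample bound and the SNR is a common way to describe a perturbation budget. Neither number alone establishes that a human listener cannot hear the change.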

Embedding-layer attacks target the intermediate representations between the audio encoder and the language model backbone, manipulating the latent space directly.

The structural point: text-side RLHF and refusal training operate at the language-model layer. They never see the audio encoder, the signal processing, or the embedding space. The guardrails sit downstream of the attack surface.

What the benchmarks agree on: no model is safe everywhere

Five concurrent benchmarking efforts converge on the same conclusion from different angles.

AJailBench built a dataset of 1,495 adversarial audio prompts across 10 policy-violating categories and found that no single large audio model is robust across all safety dimensions. Models adopt varied strategies, from strict denial to permissiveness, each reflecting a different robustness-usability trade-off.

Jailbreak-AudioBench demonstrated that query-based audio editing exposes more powerful jailbreak threats than static audio prompts, with vulnerability varying significantly across model architectures.

JALMBench, the largest audio jailbreak benchmark to date, appeared at ICLR 2026 with 11,316 text samples and 245,355 audio samples spanning more than 1,000 hours, across 12 LALMs, 8 attack methods, and 5 defenses. Its finding is the most measured: text-based safety alignment “can partially transfer to audio inputs,” and interleaved audio-text strategies enable more robust cross-modal generalization. But general-purpose moderation methods “only slightly improve security.” The transfer is partial, not total.

Why Acoustic Best-of-N and Narrative Framing stand out

The taxonomy paper identifies two attack classes with particularly unfavorable cost-to-success ratios for defenders.

Acoustic Best-of-N attacks generate many audio variants of the same adversarial prompt and keep whichever variant the model fails to refuse. The paper reports “strong worst-case audio-space vulnerabilities” across the ten evaluated LALMs. Generating variants is computationally cheap, and the attacker needs only one success, while the defender must refuse every variant. That makes the asymmetry structural.
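The asymmetry can be put in numbers under a simplifying assumption that is not from the paper: each variant independently slips past refusal with the same small probability p. The chance that at least one of N variants succeeds is then 1 - (1 - p)^N. The function names and the value p = 0.01 below are hypothetical, chosen only to illustrate the shape of the curve.

```python
import math

def best_of_n_success(p, n):
    """Probability that at least one of n independent variants succeeds."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n

def variants_needed(p, target):
    """Smallest n for which best_of_n_success(p, n) reaches the target."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < target < 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Hypothetical per-variant success probability of 1%.
p = 0.01
for n in (1, 10, 100, 500):
    print(f"N={n:>3}: {best_of_n_success(p, n):.3f}")

print("variants for a 50% chance:", variants_needed(p, 0.5))  # 69
```

Under this toy model a model that refuses 99% of variants still loses more often than not once the attacker tries 69 of them. Real variants are correlated, so the independence assumption overstates or understates the true rate depending on the model.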

Narrative Framing operates at the semantic layer, wrapping the harmful request in a story, dialogue, or hypothetical that bypasses the refusal trigger without any acoustic manipulation. It is a low-effort attack requiring no signal processing and no perturbation optimization, just carefully structured speech. The taxonomy flags it as effective precisely because it demands the least technical sophistication from the attacker.

AJailBench’s Audio Perturbation Toolkit represents the high-effort end: Bayesian optimization over perturbation configurations across time, frequency, and amplitude domains, constrained by a semantic consistency check to ensure the perturbed audio still carries the original jailbreak intent. This automates the search for effective perturbations in a way that scales.

The range from “narrative framing with a microphone” to “Bayesian-optimized signal perturbation” means defenders face a wide threat surface with no single mitigation.

The defense dilemma: robustness vs. benign usability

Every defense approach in the taxonomy trades one cost for another.

The taxonomy paper evaluates three defense families (guard-based, training-free, and training-based) across the ten open-source LALMs and finds that making the model safer also makes it refuse legitimate requests more often. This is the same robustness-usability tension that appears in text jailbreak defense, compounded in audio by the additional variability of spoken input: accents, background noise, emotional prosody, and overlapping speech all increase the chance that a legitimate prompt sounds “adversarial” to an over-sensitive guard.
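The trade-off is easiest to see with a guard that assigns each prompt a risk score and blocks anything at or above a threshold. The sketch below uses synthetic scores invented for illustration, not data from the taxonomy paper or any benchmark, and it makes the simplifying assumption that any adversarial prompt passing the guard counts as a successful attack.

```python
def rates_at_threshold(adversarial_scores, benign_scores, threshold):
    """Return (attack pass rate, false-refusal rate) for a score-threshold guard.

    The guard blocks a prompt when its score is >= threshold.
    """
    passed = sum(s < threshold for s in adversarial_scores)
    wrongly_blocked = sum(s >= threshold for s in benign_scores)
    return passed / len(adversarial_scores), wrongly_blocked / len(benign_scores)

# Synthetic guard scores (hypothetical). Benign audio with accents or
# background noise is modeled as scoring higher than clean benign audio.
adversarial = [0.35, 0.55, 0.62, 0.71, 0.80, 0.88, 0.93, 0.97]
benign = [0.05, 0.10, 0.18, 0.25, 0.33, 0.41, 0.58, 0.66]

for threshold in (0.3, 0.5, 0.7):
    attack_rate, refusal_rate = rates_at_threshold(adversarial, benign, threshold)
    print(f"threshold={threshold}: attacks passing={attack_rate:.3f}, "
          f"benign refused={refusal_rate:.3f}")
```

Lowering the threshold drives the attack pass rate toward zero while the benign refusal rate climbs, and raising it does the reverse. Where the two score distributions overlap, no threshold gets both to zero.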

JALMBench confirms the pattern. General-purpose moderation methods “only slightly improve security,” suggesting text-trained guards do not generalize well enough to the audio channel to justify their false-positive cost. The most promising direction in JALMBench’s findings is interleaved audio-text training, which improves cross-modal generalization, but this is a training-time intervention, not a deployable patch for models already in production.

What voice-interface vendors must now do per channel

The operational takeaway: a vendor that aligned a text model and then wrapped it in a voice interface has not aligned the voice interface. The alignment coverage is per-channel, and the speech channel has attack surfaces the text channel does not.

Red-teaming must cover all four attack categories (semantic, acoustic, signal, embedding) separately for each deployed modality. Benchmarks like AJailBench, Jailbreak-AudioBench, and JALMBench provide the evaluation frameworks; the taxonomy paper provides the attack classification to structure the testing.
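A coverage requirement like this can be tracked as a simple matrix of deployed channels against the four attack categories. The sketch below is one hypothetical way to do that. The channel names and the list of completed runs are invented for the example, and only the four category names come from the taxonomy.

```python
from itertools import product

# The four attack categories named in the taxonomy.
ATTACK_CATEGORIES = ("semantic", "acoustic", "signal", "embedding")

def coverage_gaps(channels, completed_runs):
    """Return the (channel, category) pairs with no completed red-team run."""
    required = set(product(channels, ATTACK_CATEGORIES))
    return sorted(required - set(completed_runs))

# Hypothetical deployment: two voice channels, only partly tested.
channels = ("phone_line", "voice_app")
completed_runs = [
    ("voice_app", "semantic"),
    ("voice_app", "acoustic"),
    ("phone_line", "semantic"),
]

gaps = coverage_gaps(channels, completed_runs)
for channel, category in gaps:
    print(f"untested: {channel} / {category}")
```

Treating each cell as a separate obligation makes the gap visible: a team that has only run narrative-framing tests has covered one column, not the channel.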

The per-channel alignment cost compounds with every modality a vendor ships. Text, audio, image, video: each has its own attack surface, its own encoder, its own embedding space. A flat “we aligned the base model” claim omits every channel-specific gap.

Frequently Asked Questions

How does GPT-4o’s audio jailbreak resistance compare to open-source LALMs?

Jailbreak-AudioBench recorded GPT-4o-Audio’s attack success rate at 0.7% on standard jailbreak audio, rising only to 8.4% under query-based audio editing. By contrast, Qwen2-Audio jumped from 13.3% to 48.8% and SALMONN-7B from 31.6% to 85.1% under the same conditions. The gap suggests architecture-level differences in how end-to-end audio models process adversarial signals, not just differences in training data volume or safety fine-tuning intensity.
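The quoted figures read differently depending on whether the change is measured in percentage points or as a ratio. The short calculation below uses only the six numbers cited above. The dictionary layout and variable names are illustrative.

```python
# Attack success rates (%) quoted above: (standard audio, query-based editing).
RESULTS = {
    "GPT-4o-Audio": (0.7, 8.4),
    "Qwen2-Audio": (13.3, 48.8),
    "SALMONN-7B": (31.6, 85.1),
}

summary = {}
for model, (before, after) in RESULTS.items():
    point_gain = round(after - before, 1)   # absolute change, percentage points
    ratio = round(after / before, 1)        # relative change, multiplier
    summary[model] = (point_gain, ratio)
    print(f"{model:<13} +{point_gain} points, x{ratio}")
```

In absolute terms GPT-4o-Audio moves least (7.7 points against 35.5 and 53.5), but in relative terms its rate grows about twelvefold, the largest multiplier of the three. A low baseline does not by itself show that audio editing has little effect on a model.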

Do benchmark attack success rates predict real-world over-the-air attacks?

Several of the five benchmarking efforts generate adversarial audio by converting text prompts through TTS pipelines, producing clean digital-path samples that skip the acoustic channel entirely. Over-the-air attacks must survive microphone capture, ambient noise, room reverberation, and codec compression before reaching the audio encoder. None of these benchmarks models that degradation, so the published attack success rate (ASR) figures function as upper bounds on theoretical exposure rather than field predictions. AudioJailbreak’s weak-adversary scenario, which assumes the attacker cannot fully manipulate the prompt, is the closest approximation to realistic conditions tested so far.

What does interleaved audio-text training require that standard post-hoc guards do not?

JALMBench identifies interleaved audio-text strategies as the most promising path to cross-modal safety generalization, but this is a training-time intervention: paired audio and text refusal examples must be baked into the model’s alignment phase through RLHF or DPO. It cannot be retrofitted as a guard or filter on an already-deployed model. Teams would need to produce or source a multimodal safety dataset with aligned audio-text pairs, then repeat the full alignment run. For models already in production, the cost is a retrain cycle, not a configuration change.

How much protection does text-side RLHF actually provide against each attack category?

JALMBench’s evaluation across 245,355 audio samples found that general-purpose moderation methods improve audio-channel security only slightly. The partial transfer is uneven across attack categories: semantic attacks that mirror text patterns (like roleplay or hypothetical framing) are the most likely to be caught by text-trained refusals, because the harmful content passes through the language model layer where RLHF operates. Acoustic, signal-level, and embedding-layer attacks manipulate representations upstream of that layer, so text-side alignment provides negligible coverage against them. The transfer is category-dependent, not a uniform percentage reduction.