Video Jailbreaks Hit Multimodal LLMs by Splitting Payloads Across Clips

Splitting harmful intent across clips

A paper accepted to ACL 2026 demonstrates that distributing a harmful request across several short video clips, each benign in isolation, consistently raises jailbreak success on eight tested multimodal LLMs. The attack exploits a structural assumption in current moderation pipelines: that safety can be evaluated per frame or per clip. It cannot. As of the paper’s evaluation date, increasing clip count and contextual diversity raised attack success across all eight models in the benchmark.

How the multi-clip attack works

The MCV SafetyBench dataset contains 2,920 videos, each assembled from multiple short clips depicting diverse contexts related to a harmful query. No single clip contains the full payload. The harmful intent only becomes legible when the model processes the sequence as a whole and synthesises meaning across clip boundaries.

This is a compositional trick, and it is effective because most deployed content-safety systems evaluate individual frames or short segments independently. A clip of someone handling laboratory glassware, followed by a clip of a chemical reaction, followed by a clip describing a procedure: each passes moderation on its own. The assembled sequence does not.
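The gap can be illustrated with a toy sketch. Everything in it is hypothetical: the concept tags, the `RISKY_COMBINATIONS` rule, and both function names are illustrative assumptions, not the paper's method or any real moderation API. It only shows why independent per-clip checks can pass a sequence that a combined check would flag.

```python
from typing import FrozenSet, List, Set

# Hypothetical: an upstream tagger has reduced each clip to a set of concept tags.
Clip = Set[str]

# Hypothetical policy: no single tag is harmful, but this combination is.
RISKY_COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"lab_glassware", "chemical_reaction", "procedure_steps"}),
]


def per_clip_verdict(clips: List[Clip]) -> bool:
    """Flag the video only if some single clip contains a full risky combination."""
    return any(
        combo <= clip for clip in clips for combo in RISKY_COMBINATIONS
    )


def sequence_verdict(clips: List[Clip]) -> bool:
    """Flag the video if the union of tags across all clips contains a risky combination."""
    combined: Set[str] = set().union(*clips)
    return any(combo <= combined for combo in RISKY_COMBINATIONS)


# The three-clip example from the text: each clip carries one benign-looking tag.
video = [{"lab_glassware"}, {"chemical_reaction"}, {"procedure_steps"}]
passes_per_clip = not per_clip_verdict(video)   # every clip passes on its own
flagged_as_sequence = sequence_verdict(video)   # the assembled sequence does not
```

A real system would replace the tag sets with model-derived representations; the structure of the failure is the same.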

The paper reports three structural findings as of its evaluation date. Video is more vulnerable than image. Dynamic video is more vulnerable than static. And clips spanning more diverse contexts produce higher attack success than homogeneous sequences. All three held across the eight tested models. The full PDF (27 pages, 20 figures) contains per-model attack success rates in its results tables; the abstract discloses only the aggregate trend (arXiv:2606.02111).

Per-frame moderation was already weak

The multi-clip finding is consistent with a growing body of evidence that multimodal safety inherits structural weaknesses from every modality it ingests.

JailBreakV-28K, a 2024 benchmark spanning 28,000 test cases across 10 open-source MLLMs, found that jailbreak prompts transferred from text-only LLMs achieved notably high attack success rates. The text-processing pipeline is the vector. Add a second modality and the attack surface grows, but the underlying weakness is the same: the model processes and reasons about input it cannot refuse to attend to.

Compositional adversarial attacks on vision-language models, presented at ICLR 2024, demonstrated that adversarial images need access only to the vision encoder (such as CLIP), not the language model itself. For closed-source models where the LLM weights are inaccessible, the attacker does not need them. The vision encoder is enough. This lowers the barrier to entry for cross-modality jailbreaks considerably.

A 2024 survey on multimodal jailbreak attacks and defenses organises the threat landscape across four lifecycle levels: input, encoder, generator, and output. The multi-clip video attack lands squarely at the input level. The model receives the clips, processes them in sequence, and composes a response that no single clip warranted refusing.

VideoJail confirms the gap with specific numbers

Separate from MCV SafetyBench, VideoJail provides concrete attack-success figures that quantify how far current defenses lag. Using a jigsaw-based approach to bypass closed-source models’ harmful-visual-content detection, VideoJail reports an average attack success rate (ASR) of 96.53% on LLaVA-Video, 96.00% on Qwen2-VL, and 92.13% on Gemini-1.5-flash (per the paper’s evaluation results on OpenReview).

These are single-method numbers from a different attack technique, not directly comparable to MCV SafetyBench’s multi-clip approach. But they confirm the same structural problem: video moderation on current MLLMs is not robust, and with average attack success above 92% on all three models, the shortfall is large rather than marginal.

The cost of cross-clip reasoning

The MCV SafetyBench paper (arXiv:2606.02111) proposes a defense: use the relative robustness of the image modality as a mitigation path for video-based jailbreaks. The logic is straightforward. If video is more vulnerable than image, and dynamic video more vulnerable than static, then flattening the video input into representative frames and evaluating those frames as images might recover some safety margin.

This works only as a fallback. The whole finding of the paper is that compositional meaning across clips defeats per-clip evaluation. Any defense that discards temporal information to reduce the problem to image-level moderation has already conceded the attack’s central insight: the harmful payload lives in the relationship between clips, not inside any one of them.
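A minimal sketch of the frame-flattening fallback, assuming the clip boundaries are known and that the middle frame of each clip is an acceptable representative. The article's account of the paper does not say how frames should be chosen, so the midpoint rule and the function name `representative_frame_indices` are assumptions made for illustration.

```python
from typing import List


def representative_frame_indices(clip_lengths: List[int]) -> List[int]:
    """Return the global frame index of the middle frame of each clip.

    clip_lengths holds the number of frames in each clip, in playback order.
    """
    indices: List[int] = []
    offset = 0  # global index of the first frame of the current clip
    for length in clip_lengths:
        if length <= 0:
            raise ValueError("clip lengths must be positive")
        indices.append(offset + length // 2)
        offset += length
    return indices


# Hypothetical video made of three clips of 10, 20, and 30 frames.
frames_to_check = representative_frame_indices([10, 20, 30])
```

Each selected frame would then go to an image-level safety check. The frames are still judged one at a time, so the sketch inherits the limitation described above.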

Effective defense against multi-clip attacks requires cross-clip semantic reasoning: processing the full temporal sequence, building a representation of the composed intent, and evaluating that representation against safety criteria. This is computationally expensive. It is also architecturally different from what most deployed moderation systems do today, which is to evaluate segments independently and aggregate per-segment verdicts.
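One way to picture the architectural difference is a sliding window over clip descriptions, where the safety scorer sees several consecutive clips at once instead of one. This is a sketch under stated assumptions, not a published defense: `cross_clip_scan`, the caption inputs, the `window` and `threshold` parameters, and the keyword-based `toy_score` are all hypothetical stand-ins for a real captioning model and safety classifier.

```python
from typing import Callable, List


def cross_clip_scan(
    captions: List[str],
    score_text: Callable[[str], float],
    window: int,
    threshold: float,
) -> bool:
    """Score each run of `window` consecutive captions as one joined text.

    Returns True if any window scores at or above `threshold`.
    With window=1 this reduces to independent per-clip evaluation.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not captions:
        return False
    last_start = max(len(captions) - window, 0)
    for start in range(last_start + 1):
        joined = " ".join(captions[start:start + window])
        if score_text(joined) >= threshold:
            return True
    return False


def toy_score(text: str) -> float:
    """Hypothetical scorer: fraction of three illustrative keywords present."""
    keywords = ("glassware", "reaction", "procedure")
    return sum(k in text for k in keywords) / len(keywords)


captions = ["handling glassware", "a reaction in a flask", "a narrated procedure"]
per_clip = cross_clip_scan(captions, toy_score, window=1, threshold=0.9)
cross_clip = cross_clip_scan(captions, toy_score, window=3, threshold=0.9)
```

With overlapping windows, each clip is re-read in up to `window` different evaluations, which is one concrete source of the extra inference cost.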

What to do now

For teams deploying video-capable multimodal models, the practical implications are specific.

First, per-clip and per-frame moderation is insufficient against adversarial inputs, as of June 2026. This is not a theoretical gap: the MCV SafetyBench benchmark demonstrates it across eight models, and VideoJail reports average attack success rates above 92% on three widely used MLLMs.

Second, cross-clip and cross-segment evaluation needs to be part of the safety pipeline. This raises inference cost, possibly substantially. There is no published benchmark yet for the cost-quality trade-off of cross-clip reasoning defenses, which is itself a gap.

Third, image-level safety checks as a fallback are better than nothing, but they are a fallback. They miss exactly the class of attacks MCV SafetyBench introduces. Teams should treat them as a stopgap, not a solution, and invest in testing against multi-clip adversarial inputs specifically.

The MCV SafetyBench dataset itself is a usable evaluation tool. 2,920 adversarial videos with controlled clip counts and context diversity give safety teams a concrete benchmark for measuring whether their cross-clip reasoning actually works, rather than assuming it does.
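A team running such an evaluation might tabulate attack success rate by clip count along these lines. The record layout (a clip count paired with a success flag) and the function name are assumptions for illustration, not the dataset's actual schema, and the sample records are made up, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def asr_by_clip_count(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map clip count to attack success rate (successful attacks / attempts).

    Each result is (clip_count, attack_succeeded) for one adversarial video.
    """
    attempts: Dict[int, int] = defaultdict(int)
    successes: Dict[int, int] = defaultdict(int)
    for clip_count, attack_succeeded in results:
        attempts[clip_count] += 1
        successes[clip_count] += int(attack_succeeded)
    return {n: successes[n] / attempts[n] for n in sorted(attempts)}


# Made-up records for illustration only; not figures from MCV SafetyBench.
sample = [(2, False), (2, True), (4, True), (4, True)]
table = asr_by_clip_count(sample)
```

A defense that works should flatten this table: attack success should stay low as clip count rises. The same grouping could be repeated for context diversity.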

Frequently Asked Questions

Do multi-clip attacks also work on image-only or text-only models?

No. The technique depends on temporal ordering across clips. Image-only models receive a single static input, so there is no sequence to compose across. Text-only models do process tokens sequentially, but multi-clip splitting falls under what the 2024 multimodal jailbreak survey (arXiv:2411.09259) calls the Any-to-Text attack configuration, which here means models that ingest video and output text. The technique has not been benchmarked against audio or other sequential modalities.

How does multi-clip splitting differ from the ICLR 2024 compositional image attacks?

The ICLR 2024 attacks manipulate the vision encoder’s embedding space directly, requiring access to a component like CLIP but not the LLM weights. Multi-clip video attacks require no access to any model component. The attacker simply arranges benign clips in a sequence that composes into harmful instructions during temporal processing. This makes multi-clip attacks simpler to mount, since no gradient or embedding access is needed, but also potentially easier to audit after the fact, because the composed intent is fully reconstructible from the clip sequence. Adversarially optimised image embeddings, by contrast, are designed to evade detection at the pixel level.

Would hardening text-only LLM safety reduce video jailbreak success?

Partially. JailBreakV-28K showed that text-based jailbreaks transfer to multimodal models, so stronger text-layer refusal behavior would close that transfer vector. But multi-clip attacks exploit the temporal composition layer in video processing, not the text pipeline. The two weaknesses are architecturally independent: closing text-pipeline gaps would not address the finding that distributing harmful intent across clips bypasses per-clip safety evaluation entirely.

Has cross-clip reasoning been deployed in any production moderation system?

As of June 2026, no production moderation system is known to perform full cross-clip semantic reasoning on user-submitted video. The 2024 multimodal jailbreak survey notes that most deployed defenses operate at the input or encoder level, evaluating segments independently. Moving to generator-level or cross-segment evaluation remains an open research problem. The MCV SafetyBench paper’s proposed fallback, flattening video to image-level checks, is something platforms already do, but that approach concedes the attack’s core mechanism.

Could the same payload-splitting technique target audio or other sequential modalities?

No benchmark exists for audio. MCV SafetyBench tests video only, and the 2024 multimodal jailbreak survey does not include audio-specific attack configurations. In principle, any modality involving sequential processing could be vulnerable: distribute a harmful request across benign segments that combine only at the reasoning level. Audio streams, sensor feeds, and mixed-modal inputs all fit this pattern. But without published benchmarks, this remains conjecture, not a demonstrated finding.