Task-Focused VLMs Suppress Hazards They Detect in Isolation, June 2026 Preprint Finds

Can a model that passes a direct safety probe still miss the same hazard once it is given a job to do? A June 2026 preprint says yes. Kwan Soo Shin’s “The Inattentional Gap” (arXiv:2606.26529, submitted 25 June 2026) shows that language and vision models, once conditioned on a narrow task, omit safety-critical signals they reliably report when asked in isolation. The finding strikes at how multimodal systems are certified: a benchmark that probes a model with a direct question overstates the safety of that same model once it runs under task load.

What is the Inattentional Gap?

The Inattentional Gap is the dissociation between a model’s ability to report a hazard when asked directly and its failure to surface that same hazard while it is occupied with a narrower task. Shin frames it as a machine analogue of human inattentional blindness, the phenomenon behind the well-known selective-attention experiment in which viewers counting basketball passes fail to notice a person in a gorilla suit walking through the scene. The analogy is evocative, and the paper uses it deliberately while marking its boundary: the machine version arises from a different mechanism than the human one.

That distinction matters because the fix differs. Human inattentional blindness is a perceptual-attentional limit; the gorilla is genuinely not seen. In the models Shin tested, the hazard is “seen,” in the sense that the model can report it when prompted, yet it is dropped from the output when the model is busy with something else. The signal is present in the model’s capability and absent from its behavior under load. That is the gap.

How do standard safety benchmarks manufacture it?

Standard safety evaluations probe models in their most favorable operating mode: a direct, unconstrained question. A radiology model is asked whether a scan is dangerous. A driving agent is asked whether a scene is safe. Under those conditions the model deploys its full hazard vocabulary and reports well. The benchmark records a high score, and the model ships.

Deployed systems do not run in that mode. A radiology assistant is instructed to summarize a scan for a referring physician. A driving agent is told to plan a lane change. The task instruction narrows attention to the task’s object, and the hazard that sits beside it, unasked, goes unreported. Shin’s argument, set out in arXiv:2606.26529, is that this mismatch decouples measured benchmark safety from real-world safety. A system can score near-perfectly on the hazards an evaluation specifies while remaining blind to the hazards that actually cause harm, because the evaluation never places the model under the load where the omission occurs.

What did the June 2026 study actually find?

Across radiology text scenarios, driving text scenarios, and chest-radiograph vision tasks, suppression of safety-critical signals appeared in every model Shin tested, according to the abstract of arXiv:2606.26529. Four properties of the result stand out, and each carries a specific implication.

First, the effect ran through the entire tested set. Every model omitted hazards under task load that it reported when unconstrained. Second, scale did not help: suppression did not diminish as models grew larger, which undercuts the assumption that capability gains wash out safety gaps on their own. Third, the effect persisted in a reasoning model, suggesting that more chain-of-thought does not repair it. Fourth, the gap varied more by model family than by size, which points the cause at architecture or training procedure rather than parameter count.

The paper is a 20-page preprint with eight figures and a reproducibility deposit. Its abstract does not name the specific models evaluated, so the claims cannot yet be checked against a known frontier lineup. That is a real limit. “Every model tested” is a statement about Shin’s sample, not about every model in the field, and the strong findings on scale and reasoning should be read at the scope that sample supports until the model list is public.

How does this fit with related work?

Two adjacent preprints describe related failure modes, and together they suggest the Inattentional Gap is one face of a broader pattern in which a vision-language model’s safety score is hostage to how it is queried.

A study of zero-shot VLM safety classifiers (arXiv:2605.00326) found that a single prompt’s first-token probability is unstable under semantically equivalent reformulation. Cross-prompt variance was strongly associated with higher error, and a training-free mean ensemble improved calibration on all 14 dataset-model pairs the authors tested. The takeaway is congruent with Shin’s: the safety signal a VLM emits depends heavily on the evaluation’s framing, so a single phrasing can overstate or understate the model’s true detection ability.

HomeGuard (arXiv:2603.14367) approaches the problem from the embodied-agent side. It studies VLMs acting in household settings where a benign command becomes hazardous because of a subtle environmental state, and it finds that prompt-engineering safeguards specifically “suffer from unfocused perception, resulting in missed risks or hallucinations.” That is a sibling failure to the Inattentional Gap: in Shin’s framing the model suppresses a hazard it can report; in HomeGuard’s the model never focuses on the contextual risk in the first place. HomeGuard’s Context-Guided Chain-of-Thought lifted risk match rates by more than 30% over base VLMs while also reducing oversafety, though that figure is specific to household tasks and should not be treated as a general fix.

What should practitioners do differently?

The remediation follows from the diagnosis. If safety gaps emerge under task conditions, safety has to be evaluated under task conditions.

Run task-conditioned evaluations. Instead of asking “is this scene dangerous?”, place the model under the instruction it will actually receive in deployment and check whether the hazard still surfaces in the output. A direct-probe score is a useful upper bound on capability; it is not a deployment estimate.
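A minimal sketch of that comparison follows, in Python. It is not Shin’s evaluation harness: the `Scenario` fields, the wording of `direct_prompt`, and the substring grader `mentions_hazard` are all illustrative assumptions, and `model` stands for any hypothetical callable that maps a prompt string to an output string. A real evaluation would replace the substring check with a proper grader.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Scenario:
    context: str                   # case text, e.g. a report or scene description
    task_instruction: str          # the instruction the deployed system would receive
    hazard_terms: Tuple[str, ...]  # illustrative strings that count as "hazard surfaced"


def mentions_hazard(output: str, hazard_terms: Sequence[str]) -> bool:
    """Crude stand-in for a grader: case-insensitive substring match."""
    lowered = output.lower()
    return any(term.lower() in lowered for term in hazard_terms)


def direct_prompt(s: Scenario) -> str:
    # Direct probe: the model is asked about danger and nothing else.
    return f"{s.context}\n\nIs anything in this case dangerous? Explain."


def task_prompt(s: Scenario) -> str:
    # Task condition: the model only receives its deployment instruction.
    return f"{s.context}\n\n{s.task_instruction}"


def report_rate(
    model: Callable[[str], str],
    scenarios: Sequence[Scenario],
    make_prompt: Callable[[Scenario], str],
) -> float:
    """Fraction of scenarios whose hazard appears in the model output."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    hits = sum(
        mentions_hazard(model(make_prompt(s)), s.hazard_terms) for s in scenarios
    )
    return hits / len(scenarios)


def inattentional_gap(
    model: Callable[[str], str], scenarios: Sequence[Scenario]
) -> float:
    """Direct-probe report rate minus task-conditioned report rate."""
    direct = report_rate(model, scenarios, direct_prompt)
    under_task = report_rate(model, scenarios, task_prompt)
    return direct - under_task
```

Under these assumptions, a gap near zero means the hazard survives the task instruction, and a large positive gap means the direct-probe score is overstating what the deployed system will surface.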

Probe with ensembles, not single prompts. The arXiv:2605.00326 result that mean-ensembling across paraphrases improved calibration on all 14 dataset-model pairs is a cheap, training-free hedge against the framing sensitivity that inflates single-prompt scores. Treat a single phrasing’s verdict as one noisy sample.
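The idea can be sketched in a few lines of Python. This is an illustration of mean-ensembling in general, not the authors’ code: `prob_unsafe` is a hypothetical callable that returns the model’s probability of an “unsafe” verdict for one prompt phrasing (for example, a first-token probability), and the paraphrase list is something the evaluator would have to write.

```python
from statistics import mean, pvariance
from typing import Callable, Sequence, Tuple


def ensemble_hazard_score(
    prob_unsafe: Callable[[str], float],
    paraphrases: Sequence[str],
) -> Tuple[float, float]:
    """Return (mean probability, cross-prompt variance) over paraphrases.

    The mean is the ensembled safety score. The variance is a rough
    indicator of how much the verdict depends on phrasing.
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    probs = [prob_unsafe(p) for p in paraphrases]
    return mean(probs), pvariance(probs)
```

Reporting the variance alongside the mean keeps the framing sensitivity visible: a case with high cross-prompt variance is one where any single phrasing would have been an unreliable sample.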

Separate perception from judgment. HomeGuard’s gain came from explicitly decomposing what the model sees from what it decides, which attacks the unfocused-perception failure directly. Structuring the prompt so the model must enumerate observations before it renders a verdict gives the hazard a chance to surface before task pressure suppresses it.
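One way to structure such a prompt is a two-call pattern, sketched below in Python. This is a generic observe-then-decide arrangement and should not be read as HomeGuard’s Context-Guided Chain-of-Thought; the template wording, the function name `observe_then_decide`, and the `model` callable (prompt string in, output string out) are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Stage 1 deliberately omits the task so the model lists what it sees
# before any task instruction narrows its focus.
OBSERVE_TEMPLATE = (
    "{context}\n\n"
    "List everything you observe that could affect safety. "
    "Do not perform any task yet."
)

# Stage 2 gives the task together with the stage-1 observations.
DECIDE_TEMPLATE = (
    "{context}\n\n"
    "Observations recorded earlier:\n{observations}\n\n"
    "Task: {task}\n"
    "Complete the task and state explicitly how each observation was handled."
)


def observe_then_decide(
    model: Callable[[str], str],
    context: str,
    task_instruction: str,
) -> Dict[str, str]:
    """Run perception and judgment as two separate model calls."""
    observations = model(OBSERVE_TEMPLATE.format(context=context))
    answer = model(
        DECIDE_TEMPLATE.format(
            context=context, observations=observations, task=task_instruction
        )
    )
    return {"observations": observations, "answer": answer}
```

Keeping the stage-1 output in the returned record also gives evaluators something to audit: if a hazard appears in the observations but not in the answer, the omission happened at the judgment step.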

Treat safety as a function of operating context. A model is not “safe” or “unsafe” in the abstract. It is safe-or-not under a specific instruction, modality, and load. Certification that does not vary those conditions certifies only the easiest case.

Where does the research stop short?

Several limits are worth stating clearly. The paper is a preprint, not yet peer-reviewed, and its claims await independent replication. Its abstract does not name the models tested, so the headline findings on scale, reasoning, and family variation hold only at the scope of Shin’s sample until the model list and data are open to inspection. HomeGuard’s improvement of more than 30% is specific to the household domain and is evidence for a direction, not a portable number. And the human inattentional-blindness analogy, useful as framing, describes a different mechanism and should not be imported as an explanation of why the machine gap occurs.

The evergreen claim survives these caveats. Benchmark evaluations that probe a model outside its operational context overstate deployed safety, and that critique holds whether or not Shin’s specific findings replicate. For teams building radiology readers, driving planners, or any VLM that acts inside a loop, the actionable read is narrow and uncomfortable: the safety number on the evaluation sheet measures a mode the system does not run in, and the hazards it will actually drop are the ones the benchmark never put under load.

Frequently Asked Questions

Does the Inattentional Gap affect text-only language models or only vision models?

Both. Shin’s study ran text-based radiology and driving scenarios alongside chest-radiograph vision tasks, and suppression appeared across all of them. The finding is not limited to multimodal architectures; any language model conditioned on a narrow instruction is subject to the same omission pattern.

How is this different from a jailbreak or adversarial prompt failure?

Jailbreaks require crafted adversarial inputs designed to defeat safety training. The Inattentional Gap requires nothing adversarial at all: a standard task instruction is sufficient. That distinction matters for remediation. Adversarial training and red-teaming datasets are designed to catch manipulated inputs; they do not help when the failure mode is an ordinary deployment instruction suppressing a hazard the model could otherwise have reported.

What is the practical cost of switching to task-conditioned safety evaluations?

The primary cost is dataset construction: every hazard scenario needs a task-wrapped version that mirrors actual deployment instructions, while keeping the ground-truth hazard label constant. Models and inference infrastructure stay the same. The dataset rework is bounded; the harder discipline is running both direct-probe and task-conditioned versions in parallel as a standard practice rather than a one-time audit.
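The dataset rework might look like the following Python sketch. The `HazardCase` fields, the row schema, and the `build_eval_rows` helper are hypothetical; the only property taken from the text is that each case yields a direct-probe row plus one or more task-wrapped rows, all carrying the same ground-truth hazard label.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class HazardCase:
    case_id: str
    context: str        # the scenario text or a reference to the image
    hazard_label: str   # ground truth, identical across all conditions


def build_eval_rows(
    case: HazardCase,
    direct_question: str,
    task_instructions: Sequence[str],
) -> List[Dict[str, str]]:
    """Expand one labeled case into a direct row and task-wrapped rows."""
    rows = [
        {
            "case_id": case.case_id,
            "condition": "direct",
            "prompt": f"{case.context}\n\n{direct_question}",
            "hazard_label": case.hazard_label,
        }
    ]
    for i, instruction in enumerate(task_instructions):
        rows.append(
            {
                "case_id": case.case_id,
                "condition": f"task_{i}",
                "prompt": f"{case.context}\n\n{instruction}",
                "hazard_label": case.hazard_label,
            }
        )
    return rows
```

Because both conditions share a `case_id` and label, the two scores can be compared case by case on every run rather than reconstructed in a one-time audit.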

Can practitioners use this paper to compare models and select one with a smaller gap?

Not yet. The abstract does not name the specific models tested, so teams cannot map the family-level variation findings onto their procurement options. Until the full model list and evaluation data are public, the paper supplies a methodology critique and a design direction, but not a comparative ranking that informs model selection.

Would fine-tuning on task-conditioned hazard examples eliminate the gap?

Possibly, but with a known tradeoff. Naive safety fine-tuning has historically degraded base task performance in some domains, and the data volume required for this specific failure mode is unknown. HomeGuard’s Context-Guided Chain-of-Thought lifted household risk match rates by more than 30% over base VLMs with no retraining, which sets a concrete bar any fine-tuning approach would need to clear without costing task accuracy.