A Theory of Time-Sensitive Language Generation Says Sparse Hallucination Beats Mode Collapse

arXiv:2605.11302¹, revised May 19, proves something most practitioners already suspect but few have formalized: if you want a language model to generate timely, comprehensive output about events past its training data, you must accept some hallucination. The alternative is not safety. The alternative is the model refusing to say anything useful at all.

What the paper proves

The authors work within a framework introduced by Kleinberg and Wei for language generation under a global preference ordering on strings. They add a timeliness constraint: higher-ranked strings must be produced before a rank-dependent deadline. Under this setting, they prove an impossibility result. Timely generation cannot be achieved by “eventually consistent generators,” the class that underpins most prior formal work on language generation guarantees. Consistency, the property that the generator never outputs a string it does not “know” to be correct, is incompatible with the deadline requirement.

This is not a narrow negative result. The eventually consistent generator class is broad enough to cover the formal models used in prior work on generation with quality guarantees. The impossibility holds for that entire class.

The loophole: vanishing hallucination

The paper then relaxes the consistency requirement. Instead of demanding zero hallucination, it allows a hallucination rate that vanishes over time, described by the authors as “perhaps the mildest natural relaxation of consistency.” Under this relaxation, the generator can achieve optimal density with respect to any superlinear deadline function. That is, if the deadline grows faster than linearly in the string’s rank, the generator can cover the space of high-value outputs in time, provided it is allowed to hallucinate at a rate that converges to zero.
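To make “a hallucination rate that vanishes over time” concrete, here is a toy sketch. It is not the paper’s construction. It assumes a hypothetical schedule in which the generator hallucinates only at perfect-square steps (1, 4, 9, ...), so the running fraction of hallucinated outputs after t steps is floor(sqrt(t))/t. That fraction tends to zero even though hallucinations never stop entirely. The function names and the schedule are illustrative assumptions.

```python
import math


def is_hallucination_step(t: int) -> bool:
    """Hypothetical schedule: hallucinate only when t is a perfect square."""
    root = math.isqrt(t)
    return root * root == t


def running_hallucination_rate(steps: int) -> float:
    """Fraction of the first `steps` outputs that were hallucinated."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    hallucinated = sum(is_hallucination_step(t) for t in range(1, steps + 1))
    return hallucinated / steps


if __name__ == "__main__":
    # The rate shrinks as the horizon grows, but never reaches exactly zero.
    for steps in (10, 100, 10_000):
        print(steps, running_hallucination_rate(steps))
```

The design point is that “vanishing” constrains the long-run fraction, not the count: the number of hallucinated outputs still grows without bound under this schedule.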

This is the core result. Sparse, decaying hallucination is not a bug to be patched. It is the condition under which timely generation becomes possible at all.

Why the tightness matters

The superlinear deadline requirement is not an artifact of the proof technique. The paper shows that if deadlines are only linear in rank, timely generation remains impossible even with the vanishing hallucination relaxation. The result is tight. There is no gap between what the positive result achieves (superlinear deadlines work) and what the negative result rules out (linear deadlines do not). The boundary between possible and impossible generation is exactly at the transition from linear to superlinear deadlines.
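The linear versus superlinear distinction can be illustrated with the ratio of deadline to rank. In the toy sketch below, which is not taken from the paper, a deadline function d is treated as superlinear when d(r)/r keeps growing with rank r, and as linear when that ratio stays bounded. The two deadline functions, d(r) = 3r and d(r) = r·log2(r + 1), are assumptions chosen for illustration, and a finite numerical check like this only suggests the asymptotic behavior rather than proving it.

```python
import math
from typing import Callable


def linear_deadline(rank: int) -> float:
    """Assumed example of a linear deadline: d(r) = 3r."""
    return 3.0 * rank


def superlinear_deadline(rank: int) -> float:
    """Assumed example of a superlinear deadline: d(r) = r * log2(r + 1)."""
    return rank * math.log2(rank + 1)


def slack_ratios(deadline: Callable[[int], float], ranks: list[int]) -> list[float]:
    """Return d(r) / r for each rank; bounded for linear, growing for superlinear."""
    return [deadline(r) / r for r in ranks]


if __name__ == "__main__":
    ranks = [1, 10, 100, 1_000, 10_000]
    print("linear:     ", slack_ratios(linear_deadline, ranks))
    print("superlinear:", slack_ratios(superlinear_deadline, ranks))
```

The linear ratios stay fixed at the constant factor, while the superlinear ratios keep climbing, which is the extra per-rank slack the positive result relies on.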

Tight results are rare in this corner of formal language theory, and this one sharpens the practical takeaway: the deadline structure of the generation task determines whether any amount of controlled hallucination helps. Fast deadlines, of the kind you would want in a real-time assistant, are the hardest case.

The RLHF connection

The practical force of the paper lands on RLHF training pipelines. In standard RLHF, the reward function combines a preference model score with a KL-divergence penalty against the initial pretrained model, as Hugging Face’s RLHF explainer documents.² Without the KL penalty, the policy drifts into output that scores well on the reward model but is essentially gibberish. That drift is a form of reward hacking.
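The reward structure described here has the form r = r_θ − λ·r_KL, where r_θ is the preference model score and r_KL is the KL penalty. The sketch below shows one common way to compute it, assuming a sample-based KL estimate that sums the per-token log-probability differences between the policy and the frozen reference model. The function name, the argument names, and the default coefficient of 0.1 are illustrative assumptions, not values from any particular pipeline.

```python
from typing import Sequence


def kl_penalized_reward(
    preference_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    kl_coef: float = 0.1,
) -> float:
    """Combine a preference score with a KL penalty against the reference model.

    Both log-prob sequences are for the same sampled tokens, one value per token.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Sample-based KL estimate: sum of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return preference_score - kl_coef * kl_estimate


if __name__ == "__main__":
    # The policy assigns much higher probability to these tokens than the
    # reference does, so part of the preference score is taken back.
    print(kl_penalized_reward(2.0, [-0.1, -0.2], [-1.1, -1.2]))
```

When the policy and reference agree on every token, the penalty is zero and the reward is just the preference score; the further the policy drifts, the more of that score is clawed back.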

This result¹ suggests a second-order version of this problem. When safety tuning penalizes hallucination aggressively, the model’s available strategy space shrinks toward refusal and safe non-answers. The formal term for this outcome is mode collapse, and the paper argues it is the worse failure mode. A model that occasionally hallucinates but covers the relevant space is more useful than a model that never hallucinates and says nothing of substance.

Kalai’s complementary angle

The timing is not accidental. Kalai³ argues that hallucinations originate as errors in binary classification and persist because evaluation benchmarks reward guessing over admitting uncertainty. Kalai frames this as a socio-technical misalignment in how leaderboards score models. Under binary grading, a wrong guess and a refusal typically both receive a zero or a default low score, while a lucky guess earns full credit, so guessing is never worse in expectation than admitting uncertainty.
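The incentive to guess can be checked with simple expected-value arithmetic. The sketch below is an illustration under assumed scoring rules, not a reproduction of any real benchmark: a correct answer scores 1, an abstention scores 0, and a wrong answer costs `wrong_penalty` points (0 under plain binary grading). The function names are hypothetical.

```python
def expected_guess_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Guess only when it beats the assumed abstention score of 0."""
    return expected_guess_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    # Binary grading: even a 5% chance of being right makes guessing worthwhile.
    print(should_guess(0.05))
    # With a penalty of 3 points per wrong answer, the break-even point is
    # p_correct = 3 / (1 + 3) = 0.75, so a coin-flip guess no longer pays.
    print(should_guess(0.5, wrong_penalty=3.0))
    print(should_guess(0.9, wrong_penalty=3.0))
```

In general the break-even confidence under these assumed rules is wrong_penalty / (1 + wrong_penalty), which is zero when wrong answers cost nothing.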

The two papers share a structural observation: the current evaluation and training regime selects against the failure mode (sparse hallucination) that formal theory suggests is the preferable one, and selects for the failure mode (reliable non-answers) that theory identifies as the dead end.

What this means for safety tuning

This result¹ reframes the conversation from minimizing hallucination to characterizing an acceptable hallucination budget. If sparse, vanishing hallucination is the price of timely generation, then the engineering problem shifts from “how do we eliminate hallucination?” to “what rate of hallucination is acceptable given the deadline structure of the task?”

This is a different optimization target. It requires specifying what “timely” means for a given deployment, what the deadline function looks like, and whether the task’s deadline structure is linear or superlinear. It also requires admitting that some hallucination is structurally necessary, which is a hard sell in the current regulatory and public-relations environment.

The formal theory also connects to the broader project of a scientific theory of deep learning. Simon et al.⁴ argue that a “learning mechanics” is emerging, characterized by falsifiable quantitative predictions about training dynamics. Impossibility results like these are exactly the kind of structural bound that learning mechanics needs: hard limits on what any architecture can achieve, independent of parameter count or training data.

Open questions

The gap between formal limits and production systems remains wide. The paper’s generator model abstracts away architecture, retrieval, grounding, and the distinction between factual recall and parametric knowledge. Whether the vanishing-hallucination bound applies to a retrieval-augmented system, where the model can check an external source before committing to an answer, is an open question. RAG systems arguably change the deadline structure by providing a verification step, which could shift the task from pure generation to generation-plus-verification, a setting the paper does not address.

The other open question is measurement. The paper defines hallucination rate formally, but measuring it in a production model requires a ground truth that does not exist for open-ended generation. Until the field agrees on a hallucination metric that maps to the formal definition, the practical impact of these bounds will remain indirect: they inform how teams should think about the tradeoff, but they do not yet provide a dial that can be turned during training.

Frequently Asked Questions

How does Kalai’s explanation of why models hallucinate differ from the 2605.11302 framework?

Kalai traces hallucination to a binary classification error (the model must decide “do I know this?” and gets it wrong) and argues that it persists because leaderboards reward guessing over admitting uncertainty. The 2605.11302 paper treats hallucination as a rate parameter in a generative process, not a classification failure. The views are complementary: Kalai explains the causal origin, while 2605.11302 proves that some nonzero rate is mathematically necessary for timely coverage regardless of origin.

What would happen if evaluation benchmarks penalized refusal more harshly than wrong guesses?

Penalizing refusal more harshly would shift training pressure away from mode collapse and toward more aggressive generation, potentially aligning benchmark incentives with the formal result. The risk is that such a redesign could push the absolute hallucination rate well beyond what the vanishing-rate bound requires, since there would be no natural calibration mechanism to keep it decaying toward zero. Kalai’s framing suggests the benchmark design space is underexplored: the current penalty asymmetry is a historical accident, not an engineering optimum.

What does a superlinear deadline mean concretely for a chat-based assistant versus a single-turn interface?

A superlinear deadline means the time allowed to produce a response grows faster than linearly in that response’s rank. In a multi-turn assistant, this maps naturally: the model can address less obvious subtopics across successive turns. Single-turn interfaces, where the user expects a complete answer in one shot, impose something closer to a linear deadline, the hardest case in the paper’s taxonomy and the one where even vanishing hallucination cannot rescue timely generation. This suggests single-turn factual QA is structurally the worst deployment for the tradeoff the paper describes.

Could retrieval augmentation actually satisfy the superlinear deadline condition?

RAG splits generation into a retrieve-then-generate pipeline, effectively inserting a verification step the paper’s framework does not model. If retrieval latency is small relative to the deadline, the verification step could convert a linear deadline into an effectively superlinear one, precisely the condition where the positive result holds. However, retrieval adds a fixed time cost per subtopic, and for queries requiring many distinct facts, that cost accumulates linearly, potentially canceling the superlinear benefit. The interaction between retrieval latency and deadline structure is unanalyzed in the current work.