What MIDI Tests and Why It Matters
Current LLMs do not reliably interpret idioms in low-resource languages, and the MIDI benchmark, submitted to arXiv on June 1, 2026, quantifies the gap for the first time across 18 languages. The degradation is not subtle: state-of-the-art models that handle figurative English with reasonable competence fail sharply when the same kind of reasoning is required in languages with less training data.
The MIDI Dataset
The MIDI dataset (Multilingual Idioms in Sentences and Conversations) covers 18 languages split into three tiers: 3 high-resource, 3 medium-resource, and 12 low-resource. Idioms are embedded in two settings, isolated sentences and multi-turn conversations, and every example was curated by native speakers. That last detail matters. Idioms are culture-specific fixed expressions whose meaning cannot be derived from the literal meanings of their individual words (7ESL), so a benchmark that skips native-speaker validation is testing something other than what it claims.
This is the first benchmark designed to test idiom comprehension in conversational context across this many languages. Prior work, notably Liu et al. (ACL 2023 Findings), found that multilingual language models showed significant deficiency in figurative language for every non-English language tested: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili, and Yoruba. Performance tracked the availability of pre-training data. But that work tested idioms in sentence-level isolation. MIDI extends the probe to conversational context, which is where real-world LLM deployments actually operate: chat interfaces, content moderation pipelines, translation tools.
The Literal-vs-Figurative Surprise
The paper reports that across all resource tiers, literal interpretations of idioms are substantially harder for models than figurative ones. This is the inverse of what many people assume about how LLMs process language.
The intuition might be that literal meanings should come easily because they correspond more directly to surface-level word semantics. In practice, the benchmark shows the opposite. When a model encounters an idiom used in its figurative sense, the surrounding context provides scaffolding for the correct interpretation. When the same idiom is used literally (someone actually kicking a bucket, not dying), the model has to override a strong learned association, and it often fails.
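One way to check for this inversion in your own model outputs is to score literal and figurative items separately. The sketch below assumes a hypothetical record format with `usage`, `gold`, and `pred` fields; the field names and the toy rows are illustrative and are not taken from the MIDI release.

```python
from collections import defaultdict

def accuracy_by_usage(records):
    """Return {usage: accuracy} for records with 'usage', 'gold', 'pred' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["usage"]] += 1
        correct[r["usage"]] += int(r["pred"] == r["gold"])
    return {usage: correct[usage] / total[usage] for usage in total}

# Toy rows invented for illustration only (hypothetical schema).
records = [
    {"usage": "figurative", "gold": "B", "pred": "B"},
    {"usage": "figurative", "gold": "A", "pred": "A"},
    {"usage": "literal", "gold": "C", "pred": "A"},
    {"usage": "literal", "gold": "D", "pred": "D"},
]
scores = accuracy_by_usage(records)
print(scores)
```

If the literal score sits well below the figurative one on real data, that would be consistent with the pattern the paper reports.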
Low-Resource Languages Bear the Brunt
The gradient across MIDI’s three resource tiers is consistent: high-resource performance leads, medium-resource follows, low-resource lags. This tracks the distribution of LLM training data. Models are trained on corpora heavily weighted toward high-resource languages, and downstream performance for underrepresented languages reflects that imbalance. Prior work reinforces the pattern: Liu et al. (2023) found the same gradient, with figurative-language competence scaling with pre-training data volume.
Conversational Context Helps, But Not Enough
One of MIDI’s contributions is testing idioms in conversational context rather than only isolated sentences. The paper finds that providing conversational context does improve model performance on idiom interpretation. More context gives the model more signal to disambiguate figurative from literal usage.
The improvement does not eliminate the gap between high- and low-resource languages. Context narrows it; context does not close it. A model that misinterprets an idiom in a low-resource language because it has seen too few examples in training does not get rescued by a few extra turns of dialogue. The deficit is in the model’s learned representation, not in the available context window.
The paper shows this for the specific conversational settings MIDI tests. Whether longer context, retrieval-augmented generation, or few-shot prompting would change the result is not addressed.
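The "narrows but does not close" claim can be read as a comparison of tier gaps between the two settings. The following is a minimal sketch under stated assumptions: the `acc` dictionary layout and the `tier_gap` helper are hypothetical, and the numbers are placeholders, not results from the paper.

```python
def tier_gap(acc, setting, high="high", low="low"):
    """Accuracy gap between two resource tiers within one evaluation setting."""
    return acc[setting][high] - acc[setting][low]

# Placeholder accuracies, NOT figures reported by MIDI.
acc = {
    "sentence": {"high": 0.80, "low": 0.50},
    "conversation": {"high": 0.85, "low": 0.60},
}

gap_sentence = tier_gap(acc, "sentence")
gap_conversation = tier_gap(acc, "conversation")

narrowed = gap_conversation < gap_sentence   # context helps the low tier more
closed = abs(gap_conversation) < 1e-9        # context removes the gap entirely

print(f"sentence gap: {gap_sentence:.2f}, conversation gap: {gap_conversation:.2f}")
print(f"narrowed={narrowed}, closed={closed}")
```

With these placeholder values the gap shrinks but stays positive, which is the shape of result the article describes.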
Memorization vs Reasoning: What the Probes Reveal
The MIDI authors go beyond benchmark accuracy. They run controlled tests and interventions on hidden representations to separate memorization from genuine reasoning about idiomatic meaning.
This distinction matters. If a model correctly interprets an idiom because it has memorized that specific phrase-meaning pair during training, that is a different kind of competence than reasoning from context. The former is brittle: change the wording slightly or embed the idiom in an unfamiliar domain, and performance drops. The latter would generalize.
Current models rely heavily on memorization rather than genuine reasoning for idiom interpretation, according to the paper’s probe results. For high-resource languages with abundant training examples, the memorization strategy works well enough. For low-resource languages where the model has seen fewer examples, it breaks down. The model is pattern-matching against training data, not understanding the cultural or contextual logic of the idiom.
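The paper's probing and intervention procedure is not detailed here, so the following does not reconstruct it. It is a simpler behavioral check in the same spirit, assuming a hypothetical results table in which each idiom was tested once in its canonical wording and once in a lightly reworded variant. The `brittleness` name and the data are invented for illustration.

```python
def brittleness(results):
    """Share of canonically-correct items that fail once the wording is perturbed.

    `results` maps an idiom id to (correct_canonical, correct_perturbed).
    """
    solved = [pair for pair in results.values() if pair[0]]
    if not solved:
        return 0.0
    dropped = sum(1 for _canonical, perturbed in solved if not perturbed)
    return dropped / len(solved)

# Invented outcomes for four hypothetical idioms.
results = {
    "idiom_01": (True, True),
    "idiom_02": (True, False),
    "idiom_03": (True, False),
    "idiom_04": (False, False),
}
score = brittleness(results)
print(f"brittleness: {score:.2f}")
```

A high value would suggest the correct answers depend on the exact memorized wording; a low value would be more consistent with reasoning from context. This is a coarse proxy and cannot by itself separate the two.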
Who Pays the Cost of Misinterpretation
The practical consequence falls on exactly the communities least represented in LLM training data.
Consider a content moderation system that flags or filters text based on LLM interpretation. If that system misinterprets an idiom in a low-resource language, it either lets harmful content through or censors benign speech. The speakers of that language bear the cost: worse moderation, worse translation, worse tooling than English speakers get, and fewer resources to verify or correct the model’s output.
Translation is the same problem made visible. Idioms are resistant to word-for-word translation because their meaning is not compositional (7ESL). An LLM that translates an idiom literally does not produce a “close enough” result. It produces nonsense. The burden of catching that nonsense falls on the user, who in a low-resource-language community is less likely to have access to alternative tools or fluent bilingual support.
The MIDI paper does not directly evaluate content moderation systems or commercial translation products; it benchmarks idiom comprehension in controlled settings. But the mechanism it identifies, insufficient training data producing unreliable idiom handling, applies directly to any system that relies on LLMs to interpret text in those languages.
What Would Close the Gap
The MIDI results point to a structural problem, not a prompting problem. Context helps but does not fix the underlying deficit in low-resource-language representations. The memorization-vs-reasoning probes suggest that even when models get the right answer, they are often doing so for the wrong reason.
Closing the gap would require more idiomatic training data in low-resource languages, curated by native speakers. The MIDI dataset itself is a step in that direction: it provides a standardized evaluation across 18 languages with native-speaker-validated examples. But evaluation does not fix training. The hard part remains collecting, curating, and incorporating enough high-quality idiomatic data in languages where digital text corpora are small.
Led by Fajri Koto with 19 authors across multiple institutions, the MIDI paper is filed under Computation and Language and Artificial Intelligence on arXiv. As of June 2026, it is a preprint and has not yet been peer-reviewed.
Frequently Asked Questions
How does MIDI differ from FLORES or translated MMLU?
General multilingual benchmarks like FLORES and translated MMLU test broad language competence but do not isolate figurative language. A model can score well on BLEU for FLORES translation while still failing on idiom interpretation, because BLEU rewards surface-level n-gram overlap and idioms require non-compositional understanding. MIDI targets this blind spot by combining idioms with conversational context across 18 languages, a combination no prior benchmark offers.
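To see why an overlap-based metric can miss an idiom error, consider clipped n-gram precision, the core ingredient of BLEU. The sketch below is a simplified illustration and not a full BLEU implementation (no brevity penalty, no smoothing, no geometric mean across orders). The sentences are invented; the hypothesis renders the Spanish idiom "estirar la pata" (to die) word for word.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

# Reference conveys the idiom's meaning; the hypothesis translates it literally.
reference = "my grandfather passed away last night after a long illness"
literal = "my grandfather stretched the leg last night after a long illness"

p1 = clipped_precision(literal, reference, 1)
p2 = clipped_precision(literal, reference, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

Most of the sentence overlaps with the reference, so both precisions stay well above zero even though the one phrase that carries the meaning is wrong.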
Does the literal-vs-figurative finding apply beyond idioms?
MIDI tests only idiomatic fixed expressions. Whether the same inversion holds for metaphor, irony, or sarcasm is unknown. Liu et al. 2023 found figurative-language deficiency across non-English languages but did not include a literal-usage condition. The inversion could therefore be specific to idioms, where the figurative meaning is strongly associated in training data, or it could be a general property of non-compositional language. Future work would need to test which.
Could RAG or few-shot prompting fix the low-resource gap?
The paper does not test retrieval-augmented generation, few-shot prompting, or extended context as mitigations. The memorization-vs-reasoning probes, however, suggest the core problem lives in the model’s weights, not in the available context window. If the model has never internalized an idiom during training, a few in-context examples may fall short, especially in languages where even retrieval sources are sparse.
Which deployed systems are affected beyond translation and moderation?
Any pipeline that passes user text through an LLM for intent classification, sentiment analysis, or summarization is exposed to the same idiom failures in low-resource languages. The paper does not evaluate these tasks, but the representation deficit it documents is plausibly task-agnostic: if the model’s internal encoding of an idiom is wrong, every downstream classifier built on that encoding inherits the error. Customer-support chatbots, social-listening tools, and automated ticket-routing systems in multilingual markets all carry this risk.