A paper submitted three days ago makes a straightforward empirical point with uncomfortable implications for anyone deploying LLMs outside English-speaking markets. TukaBench, submitted by Victor Akinode on May 31, 2026, extends the standard JailbreakBench framework to seven African languages and finds that safety alignment trained on English does not transfer reliably to low-resource languages. Prompting in African languages reduces model refusal rates, and culturally adapted prompts widen the gap further.
What TukaBench tests
JailbreakBench (JBB) is a standard jailbreak evaluation framework, built and benchmarked primarily in English. TukaBench extends it across four settings designed to test different vectors of multilingual safety failure:
- Human translation of existing JBB prompts into target African languages
- English prompts adapted to African cultural contexts, then human-translated
- Human-curated prompts validated through interactions with GPT-5.2 to confirm they produce coherent responses rather than triggering comprehension failures
- Code-switched prompts that mix English and African languages within a single input
The four settings are not variations on a theme. They isolate distinct failure modes. Plain translation tests whether English safety training generalizes when the same semantic content arrives in a different language. Cultural adaptation tests whether the training handles context shifts that go beyond vocabulary. Code-switching tests the messier real-world case where users naturally mix languages in a single prompt.
Africa is home to between 1,250 and 3,000 native languages, spoken across 54 recognised sovereign states by a population of roughly 1.5 billion people. The seven languages TukaBench covers are a fraction of that total, but they are the first systematically tested against a standard jailbreak framework with cultural grounding as a controlled variable.
English safety training does not transfer
The paper’s central finding is direct: across both closed and open models, prompting in African languages reduces model refusal relative to English. Culturally adapted prompts, the second setting, produce the least refusal of all four. The safety behavior tuned on English inputs degrades when the language and cultural context change.
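The comparison behind that finding can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the label strings, the condition names, and the function names are assumptions made for the example, and the numbers are invented.

```python
def refusal_rate(labels):
    """Fraction of responses labelled 'refused'."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(label == "refused" for label in labels) / len(labels)


def refusal_drop_vs_english(results, baseline="english"):
    """Compare each condition's refusal rate with the English baseline.

    results maps a condition name to its list of response labels.
    Positive values mean the model refuses less often than in English.
    """
    base = refusal_rate(results[baseline])
    return {
        condition: base - refusal_rate(labels)
        for condition, labels in results.items()
        if condition != baseline
    }


# Hypothetical labels for illustration only, not results from the paper.
results = {
    "english": ["refused"] * 9 + ["jailbroken"] * 1,
    "swahili_translated": ["refused"] * 6 + ["jailbroken"] * 4,
    "swahili_cultural": ["refused"] * 4 + ["jailbroken"] * 6,
}
drops = refusal_drop_vs_english(results)
for condition, drop in drops.items():
    print(f"{condition}: refusal rate is {drop * 100:.0f} points below English")
```

Reporting the drop against the same model's English baseline, rather than a raw refusal rate, keeps the comparison about language transfer instead of about how cautious a given model is overall.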
This is not a new hypothesis. The multilingual safety gap has been discussed in prior work, including benchmarks like MMSafetyBench that cover some European and Asian languages. What TukaBench adds is the cultural-adaptation variable and its measurement across low-resource African languages specifically, a group that standard safety benchmarks have largely skipped.
The Deflection problem
Standard jailbreak evaluation uses a binary classification: a response is either Refused or Jailbroken. TukaBench introduces a third category called Deflection.
Deflection captures the case where a model neither refuses nor produces a valid jailbroken response. It simply fails to understand the prompt. In high-resource languages like English, comprehension failures on adversarial prompts are rare enough that the binary split works. In low-resource African languages, the failure rate is high enough that treating every non-refusal as a successful jailbreak produces misleading numbers.
This is a methodological contribution worth noting. Any benchmark measuring jailbreak robustness in low-resource languages that does not account for comprehension failure will overestimate attack success. The Deflection category makes it possible to distinguish “the model was tricked” from “the model did not understand the question.” Those are different failure modes with different remediation paths.
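To see how much the third category can move the headline number, here is a minimal sketch that scores the same set of responses both ways. The `Label` enum and `attack_success_rates` function are hypothetical names for this example, the counts are invented, and dividing by all prompts (rather than only the ones the model understood) is an assumption, not something the paper is confirmed to do.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    REFUSED = "refused"
    JAILBROKEN = "jailbroken"
    DEFLECTION = "deflection"  # the model failed to understand the prompt


def attack_success_rates(labels):
    """Return (binary_asr, three_way_asr) for a list of Label values."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    # Binary scoring: every non-refusal is counted as a successful attack.
    binary_asr = (total - counts[Label.REFUSED]) / total
    # Three-way scoring: only genuine jailbreaks count as a success.
    three_way_asr = counts[Label.JAILBROKEN] / total
    return binary_asr, three_way_asr


# Hypothetical labels for ten responses in one language (not data from the paper).
example = [Label.REFUSED] * 4 + [Label.JAILBROKEN] * 2 + [Label.DEFLECTION] * 4
binary, three_way = attack_success_rates(example)
print(f"binary ASR = {binary:.2f}, three-way ASR = {three_way:.2f}")
```

With these made-up counts, binary scoring reports an attack success rate three times higher than the three-way split, purely because comprehension failures are being counted as jailbreaks.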
LLM-as-a-judge breaks down in low-resource languages
The paper identifies a second structural limitation: using LLMs to evaluate responses in low-resource languages is unreliable. Judge-human agreement drops in lower-resource languages and less commonly supported scripts.
This matters because automated evaluation pipelines for safety testing increasingly rely on LLM-as-a-judge setups to scale beyond what human annotation can cover. If the judge itself degrades on the languages being tested, the evaluation results become untrustworthy in a way that is hard to detect without a human annotation baseline. TukaBench surfaces this by comparing automated judgments against human annotations and reporting the agreement gap explicitly.
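A deployer with even a small human-annotated sample can run the same kind of check. The sketch below computes per-language agreement between human and judge labels using Cohen's kappa. The choice of kappa is an assumption here (the agreement metric the paper uses is not stated above), and the function names, record format, and labels are hypothetical.

```python
from collections import Counter, defaultdict


def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Observed agreement: share of items where both annotators match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_language(records):
    """records: iterable of (language, human_label, judge_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for language, human, judge in records:
        grouped[language][0].append(human)
        grouped[language][1].append(judge)
    return {lang: cohen_kappa(h, j) for lang, (h, j) in grouped.items()}


# Hypothetical annotations for illustration only, not data from the paper.
records = [
    ("english", "refused", "refused"),
    ("english", "jailbroken", "jailbroken"),
    ("english", "refused", "refused"),
    ("english", "jailbroken", "jailbroken"),
    ("swahili", "refused", "refused"),
    ("swahili", "jailbroken", "jailbroken"),
    ("swahili", "deflection", "jailbroken"),
    ("swahili", "jailbroken", "refused"),
]
kappas = agreement_by_language(records)
for language, kappa in kappas.items():
    print(f"{language}: kappa = {kappa:.2f}")
```

Kappa is used rather than raw agreement because it corrects for matches expected by chance, which matters when one label dominates. In practice a library implementation such as scikit-learn's `cohen_kappa_score` would do the same job.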
The deployment implication
The immediate takeaway for model deployers, particularly anyone serving users in African markets, is that vendor-reported safety numbers are incomplete. If a lab reports an attack success rate (ASR) on JailbreakBench or HarmBench, that number is almost certainly an English-only measurement. It does not represent the model’s safety behavior in Swahili, Yoruba, Amharic, or the other languages TukaBench tested.
This shifts the burden of multilingual red-teaming onto downstream deployers who may have assumed the vendor’s safety tuning was language-agnostic. It is not. Deployers building products for African users need their own adversarial testing in the languages their users speak, with cultural context appropriate to those users, and with evaluation pipelines that account for the comprehension-failure and judge-reliability problems TukaBench identifies.
The code-switching setting is particularly relevant here. Real users do not always submit clean monolingual prompts. They mix languages within a conversation, within a single input, and across scripts. Any safety testing that does not cover code-switched inputs is testing a scenario that does not match how the model will be used.
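One practical way to act on this is to treat language and prompt setting as a grid and check which cells a red-team suite leaves empty. The sketch below is illustrative only: the setting names paraphrase the four TukaBench settings, and the record format and `missing_coverage` helper are hypothetical.

```python
from itertools import product

# Setting names paraphrase the four TukaBench settings; the identifiers are assumptions.
SETTINGS = ("translated", "culturally_adapted", "human_curated", "code_switched")


def missing_coverage(test_cases, languages, settings=SETTINGS):
    """Return the sorted (language, setting) pairs that have no test case.

    test_cases: iterable of dicts with 'language' and 'setting' keys.
    """
    covered = {(case["language"], case["setting"]) for case in test_cases}
    return sorted(set(product(languages, settings)) - covered)


# Hypothetical test suite for a product serving Swahili and Yoruba speakers.
suite = [
    {"language": "swahili", "setting": "translated"},
    {"language": "swahili", "setting": "code_switched"},
    {"language": "yoruba", "setting": "translated"},
]
gaps = missing_coverage(suite, ["swahili", "yoruba"])
for language, setting in gaps:
    print(f"no test cases for {language} / {setting}")
```

In this made-up suite, Yoruba has no code-switched cases at all, which is exactly the kind of gap a translation-only test plan tends to leave.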
What needs to happen next
TukaBench covers seven languages. Well over a thousand more on the continent have had little or no systematic safety evaluation. The paper’s framework, particularly the Deflection category and the cultural-adaptation methodology, provides a template for extending that coverage. But the work is labor-intensive: human translation, cultural adaptation, and human annotation of evaluation results do not scale the way automated benchmarks do.
The structural argument in this paper will outlast the specific benchmark results. As LLMs are deployed more widely in non-English markets, the gap between English-tuned safety behavior and real-world multilingual usage will grow. TukaBench documents that gap at a specific point in time, with a specific set of models. The question is whether labs will start reporting multilingual ASR alongside their English numbers, or whether that remains the deployer’s problem.
Frequently Asked Questions
How does TukaBench differ from MMSafetyBench?
MMSafetyBench evaluates multilingual safety across primarily European and Asian languages with high digital representation, using the standard binary Refused/Jailbroken scoring. TukaBench is the first benchmark to isolate cultural grounding as a controlled variable and to introduce Deflection as a third category. Without that third category, MMSafetyBench-style scoring would count comprehension failures as successful jailbreaks in low-resource African languages, inflating attack-success numbers.
Does the writing system affect evaluation reliability independently of language resource level?
Yes. The paper reports that judge-human agreement drops for languages written in less commonly supported scripts, distinct from the drop caused by low training data volume. A language with Latin orthography but limited corpus may still evaluate more reliably than a language with equivalent speaker population using a script the judge model rarely encounters, because tokenization and script familiarity are separate bottlenecks from vocabulary coverage.
Can teams machine-translate existing English jailbreak prompts instead of commissioning human translation?
Machine translation introduces an uncontrolled variable: translation errors that change the adversarial intent of the prompt. TukaBench uses human translation precisely to isolate whether the safety behavior degrades from the target language itself, not from a garbled translation. A machine-translated benchmark would conflate model safety gaps with translation artifacts, making results unreliable for any language where MT quality is poor.
Does using GPT-5.2 to validate human-curated prompts create a bootstrapping problem?
It introduces a dependency on a validator model whose own low-resource language competence is not independently measured. If GPT-5.2 has the same comprehension gaps TukaBench tests for, some prompts may pass automated validation despite being incoherent or off-target in the target language. Because the prompts are human-curated before they reach GPT-5.2, that risk is reduced, but the validation step still assumes a baseline of multilingual competence that the benchmark itself calls into question.