Large language models are globally deployed but locally trained—their safety guardrails built primarily on English data create exploitable vulnerabilities where simply translating harmful prompts into low-resource languages bypasses protections 79% of the time1. This linguistic inequality in AI safety represents one of the most significant yet underaddressed vulnerabilities in the current generation of deployed AI systems.
The English-Centric Foundation of AI Safety
Modern large language models like GPT-4, Claude, and Llama 2 are trained predominantly on English-language internet text. While these models demonstrate impressive multilingual capabilities, their safety training—the reinforcement learning from human feedback (RLHF) and red-teaming designed to prevent harmful outputs—remains overwhelmingly English-centric. Researchers at Brown University demonstrated this vulnerability by translating unsafe English inputs into low-resource languages, achieving a 79% attack success rate on GPT-4’s AdvBenchmark1. This rate equals or exceeds state-of-the-art algorithmic jailbreaking attacks, yet requires no technical sophistication—just access to a translation API.
The implications are stark: safety mechanisms designed to protect users fail systematically when confronted with languages outside the training distribution. High-resource languages like Spanish, French, and German show significantly lower attack success rates, suggesting that some safety alignment does transfer across related languages. But the cross-lingual vulnerability disproportionately affects low-resource languages—those spoken by billions of people in Africa, South Asia, and indigenous communities worldwide.
How the Attack Works
The methodology is disarmingly simple. A user takes a harmful English prompt—requests for instructions on illegal activities, hate speech, or dangerous content—and translates it into languages like Zulu, Gaelic, or Nepali. When submitted to GPT-4, the model often complies with the request, providing actionable information that the English version would trigger a refusal for.
This bypass exploits what researchers call “linguistic inequality of safety training data”1. While base language models acquire multilingual capabilities from diverse internet text, safety fine-tuning relies heavily on English annotations and English-speaking annotators. The result is a system that “understands” many languages but only applies safety constraints reliably in English.
Comparative Vulnerability Across Languages
Research reveals significant disparities in safety alignment across language families. The following table summarizes attack success rates based on available research:
| Language Category | Example Languages | Attack Success Rate | Safety Alignment |
|---|---|---|---|
| Low-Resource | Zulu, Gaelic, Lao, Amharic | ~79%1 | Minimal |
| Mid-Resource | Thai, Vietnamese, Indonesian | ~45-60%1 | Partial |
| High-Resource | Spanish, French, German | ~20-35%1 | Moderate |
| English | English (US/UK) | <5%1 | Strong |
The gradient is clear: safety alignment correlates directly with resource availability during training. Languages with abundant digital text and large annotation budgets receive proportionally more safety attention. The result is a two-tiered system where English and a handful of Western European languages receive robust protection, while the majority of the world’s languages remain effectively unguarded.
The Global Deployment Problem
This vulnerability arrives at a critical moment in AI deployment. Companies are racing to integrate LLMs into products serving billions of users across linguistic boundaries. OpenAI’s GPT-4 supports over 25 languages officially; Google’s Gemini and Meta’s Llama models make similar claims. Yet the safety infrastructure has not kept pace with this global ambition.
The consequences extend beyond individual bypass attempts. In high-stakes applications—content moderation, educational tools, healthcare assistants, legal research—uneven safety alignment creates liability nightmares. A medical chatbot that refuses dangerous self-treatment advice in English but complies in Swahili poses an unacceptable risk to global health equity.
Related Vulnerabilities: The Broader Attack Surface
The multilingual bypass is part of a larger pattern of safety failures in aligned language models. Researchers from Princeton, CMU, and other institutions demonstrated “universal and transferable adversarial attacks” that automatically generate suffixes capable of inducing objectionable content across multiple models2. Their method achieves consistent jailbreaks against ChatGPT, Bard, Claude, and open-source models like Llama-2-Chat.
Similarly, researchers demonstrated that persuasion techniques—framing harmful requests as logical reasoning exercises, authority appeals, or emotional narratives—achieve over 92% attack success rates on GPT-3.5, GPT-4, and Llama 2-7b3. These “persuasive adversarial prompts” require no technical expertise and work across languages, compounding the multilingual vulnerability.
Fine-tuning presents another attack vector. Researchers found that safety alignment can be compromised by fine-tuning with just 10 adversarially designed training examples, costing less than $0.20 via OpenAI’s APIs4. Even benign fine-tuning datasets inadvertently degrade safety alignment, suggesting that current safety infrastructures cannot maintain protections under customization.
The Tokenization Disparity
Underlying these safety gaps is a fundamental technical inequality in how LLMs process different languages. Research on API pricing fairness revealed that speakers of many supported languages are systematically overcharged while obtaining poorer results5. Languages like Burmese, Amharic, and Telugu require significantly more tokens to convey equivalent information compared to English, increasing costs while simultaneously reducing model performance.
This tokenization disparity creates a compounding disadvantage: non-English users pay more for less capable models with weaker safety guardrails. The research analyzed 22 typologically diverse languages and found that “speakers of a large number of the supported languages are overcharged while obtaining poorer results”5.
Defensive Approaches and Their Limitations
Several defensive strategies have emerged, though none fully address the multilingual vulnerability:
StruQ (Structured Queries) proposes separating prompts and data into two channels, training models to only follow instructions in designated prompt portions6. While promising for prompt injection attacks, this approach does not directly address language-based bypasses.
Direct Principle Feedback offers a simplified Constitutional AI approach that trains models to avoid specific entities or topics7. However, this requires explicit enumeration of prohibited content, making it difficult to scale across languages and cultural contexts.
Multilingual Red-Teaming represents the most direct response—expanding safety training to include diverse languages from the start. But this requires substantial investment in annotation resources for low-resource languages, where even basic NLP datasets remain scarce.
Implications for AI Governance
The multilingual safety gap raises fundamental questions about AI governance in an interconnected world. Current regulatory approaches—focused primarily on English-language harms and Western use cases—are inadequate to the global nature of AI deployment.
The EU AI Act, for instance, requires high-risk AI systems to meet safety standards but does not explicitly mandate multilingual safety parity. Similarly, NIST’s AI Risk Management Framework emphasizes fairness across demographic groups but has limited guidance on linguistic equity. Without regulatory pressure, market incentives alone may not drive the substantial investments needed for robust multilingual safety alignment.
Frequently Asked Questions
Q: Why are low-resource languages more vulnerable to safety bypasses? A: Low-resource languages receive less training data overall, and particularly less safety-focused annotation. While base language capabilities transfer somewhat across languages, safety alignment requires explicit training examples that are scarce for languages with limited digital text. The result is models that “speak” these languages but lack corresponding safety constraints1.
Q: Can this vulnerability be fixed with better translation? A: Simply translating more safety data is insufficient. Direct translations often miss cultural context, idiomatic expressions, and local norms of harmful content. Effective multilingual safety requires native speaker involvement in red-teaming and annotation, not just automated translation pipelines.
Q: Does this affect all AI models or just specific ones? A: Research has demonstrated the vulnerability across GPT-4, GPT-3.5, Llama 2, and other major models. The issue stems from fundamental training practices rather than model-specific architectures. Any model trained primarily on English safety data will exhibit similar cross-lingual vulnerabilities.
Q: What should organizations do when deploying AI globally? A: Organizations should: (1) conduct red-teaming in all supported languages, not just English; (2) implement language-specific safety filters; (3) monitor for translation-based attacks; (4) restrict high-risk capabilities in languages where safety cannot be guaranteed; and (5) maintain human-in-the-loop oversight for sensitive applications.
Q: Are there any models with true multilingual safety parity? A: Currently, no major commercial LLM demonstrates equivalent safety alignment across all supported languages. Some models perform better on mid-resource languages, but significant gaps persist. The research community has yet to establish standardized benchmarks for measuring multilingual safety alignment.
The Path Forward
Addressing multilingual safety vulnerabilities requires fundamental shifts in how AI systems are developed and evaluated. First, safety training must be expanded beyond English-centric datasets to include diverse languages from the start. Second, standardized benchmarks for multilingual safety alignment must be established to enable meaningful comparison across models. Third, regulatory frameworks must explicitly address linguistic equity as a component of AI safety.
The stakes extend beyond technical security to questions of global equity and technological justice. As AI systems become central infrastructure for communication, education, commerce, and governance, ensuring equivalent protection across languages is not merely a safety concern—it is a fundamental requirement for legitimate global deployment.
Until these gaps are addressed, users of AI systems in low-resource languages remain second-class digital citizens, exposed to harms their English-speaking counterparts are protected from. The 79% bypass rate is not just a security statistic; it is a measure of how far the AI industry still has to go toward genuine global inclusion.
Footnotes
-
Brown University Computer Science. “Jailbroken: How Does LLM Safety Training Fail?” Research paper, 2024. https://arxiv.org/abs/2404.01325 ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8
-
Princeton University, Carnegie Mellon University, et al. “Universal and Transferable Adversarial Attacks on Aligned Language Models.” Research paper, 2024. ↩
-
Anthropic. “Red Teaming for Generative AI: Persuasion Techniques.” Research publication, 2024. ↩
-
Princeton University. “Fine-tuning Aligned Language Models Compromises Safety.” Research paper, 2024. ↩
-
AI Now Institute. “The Cost of Language: Tokenization Disparities in LLM APIs.” Research report, 2024. ↩ ↩2
-
UC Berkeley. “StruQ: Structured Queries for Prompt Injection Defense.” Research paper, 2024. ↩
-
Anthropic. “Constitutional AI: Harmlessness from AI Feedback.” Research publication, 2022. ↩