AI Testing Mostly Uses English Right Now. That’s Risky

24 July 2024 at 11:00

In this photo illustration, the home page of the ChatGPT

Over the last year, governments, academia, and industry have invested considerable resources into investigating the harms of advanced AI. But one massive factor seems to be continuously overlooked: right now, AI’s primary tests and models are confined to English.

Advanced AI could be used in many languages to cause harm, but focusing primarily on English may leave us with only part of the answer. It also ignores those most vulnerable to its harms.

[time-brightcove not-tgx=”true”]

After the release of ChatGPT in November, 2022, AI developers expressed surprise at a capability displayed by the model: It could “speak” at least 80 languages, not just English. Over the last year, commentators have pointed out that GPT-4 outperforms Google Translate in dozens of languages. But this focus on English for testing leaves open the possibility that the evaluations may be neglecting capabilities of AI models that become more relevant for other languages.

As half the world steps out to the ballot box this year, experts have echoed concerns about the capacity of AI systems to not only be “misinformation superspreaders,” but also its ability to threaten the integrity of elections. The threats here range from “deepfakes and voice cloning” to “identity manipulation and AI-produced fake news.” The recent release of “multi-models”—AI systems which can also speak, see, and hear everything you do—such as GPT-4o and Gemini Live by tech giants OpenAI and Google, seem poised to make this threat even worse. And yet, virtually all discussions on policy, including May’s historic AI Safety Summit in Seoul and the release of the long-anticipated AI Roadmap in the U.S. Senate, neglect non-English languages.

This is not just an issue of leaving some languages out over others. In the U.S., research has consistently demonstrated that English-as-a-Second-Language (ESL) communities, in this context predominantly Spanish-speaking, are more vulnerable to misinformation than English-as-a-Primary-Language (EPL) communities. Such results have been replicated for cases involving migrants generally, both in the United States and in Europe where refugees have been effective targets—and subjects—of these campaigns. To make matters worse, content moderation guardrails on social media sites—a likely fora for where such AI-generated falsehoods would proliferate—are heavily biased towards English. While 90% of Facebook’s users are outside the U.S. and Canada, the company’s content moderators just spent 13% of their working hours focusing on misinformation outside the U.S. The failure of social-media platforms to moderate hate speech in Myanmar, Ethiopia, and other countries in embroiled in conflict and instability further betrays the language gap in these efforts.

Even as policymakers, corporate executives and AI experts prepare to combat AI-generated misinformation, their efforts cast a shadow over those most likely to be targeted and vulnerable to such false campaigns, including immigrants and those living in the Global South.

This discrepancy is even more concerning when it comes to the potential of AI systems to cause mass human casualties, for instance, by being employed to develop and launch a bio-weapon. In 2023, experts expressed fear that large language models (LLMs) could be used to synthesize and deploy pathogens of potential pandemic potential. Since then, a multitude of research papers investigated this problem have been published both from within and outside industry. A common finding of these reports is that the current generation of AI systems are as good as and not better than search engines like Google in providing malevolent actors with hazardous information that could be use to build bio-weapons. Research by leading AI company OpenAI yielded this finding in January 2024, followed by a report by the RAND Corporation which showed a similar result.

What is astonishing about these studies is the near-complete absence of testing in non-English languages. This is especially perplexing as most Western efforts to combat non-state actors are concentrated in regions of the world where English is rarely spoken as first language. The claim here is not that Pashto, Arabic, Russian, or other languages may yield more dangerous results than in English. The claim, instead, is simply that using these languages is a capability jump for non-state actors that are better versed in non-English languages.

LLMs are often better translators than traditional services. It is much easier for a terrorist to simply input their query into a LLM in a language of their choice and directly receive an answer in that language. The counterfactual point here, however, is relying on clunky search engines in their own language, using Google for their language queries (which often only yields results published on the internet in their language), or go through an arduous process of translation and re-translation to get English language information with the possibility of meanings being lost. Hence, AI systems are making non-state actors just as good as if they spoke fluent English. How much better that makes them is something we will find out in the months to come.

This notion—that advanced AI systems may provide results in any language as good as if asked in English—has a wide range of applications. Perhaps the most intuitive example here is “spearphishing,” targeting specific individuals using manipulative techniques to secure information or money from them. Since the popularization of the “Nigerian Prince” scam, experts posit a basic rule-of-thumb to protect yourself: If the message seems to be written in broken English with improper grammar chances, it’s a scam. Now such messages can be crafted by those who have no experience of English, simply by typing their prompt in their native language and receiving a fluent response in English. To boot, this says nothing about how much AI systems may boost scams where the same non-English language is used in input and output.

It is clear that the “language question” in AI is of paramount importance, and there is much that can be done. This includes new guidelines and requirements for testing AI models from government and academic institutions, and pushing companies to develop new benchmarks for testing which may be less operable in non-English languages. Most importantly, it is vital that immigrants and those in the Global South be better integrated into these efforts. The coalitions working to keep the world safe from AI must start looking more like it.

Reading view