When AI doesn’t speak your language
If you want to send a text message in Mongolian, it can be tough – it’s a script that most software doesn’t recognize. But for some people in Inner Mongolia, an autonomous region in northern China, that’s a good thing.
When authorities in Inner Mongolia announced in 2020 that Mongolian would no longer be the language of instruction in schools, ethnic Mongolians, who make up about 18% of the population, feared the loss of their language, one of the last remaining markers of their distinctive identity. The news and then plans for protest flowed across WeChat, China’s largest messaging service. Parents were soon marching by the thousands in the streets of the local capital, demanding that the decision be reversed.
With the remarkable exception of the so-called Zero Covid protests of 2022, demonstrations of any size are incredibly rare in China, partially because online surveillance prevents large numbers of people from openly discussing sensitive issues in Mandarin, much less planning public marches. But because automated surveillance technologies struggle with Mongolian, protesters were able to coordinate with relative freedom.
Most of the world’s writing systems have been digitized using a centralized standard code (known as Unicode), but the Mongolian script was encoded so sloppily that it is barely usable. Instead, people use a jumble of competing, often incompatible programs when they need to type in Mongolian. WeChat has a Mongolian keyboard, but it’s unwieldy and users often prefer to send each other screenshots of text instead. The constant exchange of images is inconvenient, but it has the unintended benefit of being much more complicated for authorities to monitor and censor.
All but 60 of the world’s roughly 7,000 languages are considered “low-resource” by artificial intelligence researchers. Mongolian belongs to the vast majority of languages barely represented on the internet whose speakers deal with many challenges resulting from the predominance of English on the global internet. As technology improves, automated processes across the internet — from search engines to social media sites — may start to work a lot better for under-resourced languages. This could do a lot of good, giving those language speakers access to all kinds of tools and markets, but it will likely also reduce the degree to which languages like Mongolian fly under the radar of censors. The tradeoff for languages that have historically hovered on the margins of the internet is between safety and convenience on one hand, and freedom from censorship and intrusive eavesdropping on the other.
Back in Inner Mongolia, when parents were posting on WeChat about their plans to protest, it became clear that the app’s algorithms couldn’t make sense of the jpegs of Mongolian cursive, said Soyonbo Borjgin, a local journalist who covered the protests. The images and the long voice messages that protesters would exchange were protected by the Chinese state’s ignorance — there were no AI resources available to monitor them, and overworked police translators had little chance of surveilling all possibly subversive communication.
China’s efforts to stifle the Mongolian language within its borders have only intensified since the protests. Keen on the technological dimensions of the battle, Borjgin began looking into a machine learning system that was being developed at Inner Mongolia University. The system would allow computers to read images of the Mongolian script, after being fed and trained on digital reams of printed material that had been published when Mongolian still had Chinese state support. While reporting the story, Borjgin was told by the lead researcher that the project had received state money. Borjgin took this as a clear signal: The researchers were getting funding because what they were doing amounted to a state security project. The technology would likely be used to prevent future dissident organizing.
Until recently, AI has only worked well for the vanishingly small number of languages with large bodies of texts to train the technology on. Even national languages with hundreds of millions of speakers, like Bangla, have largely remained outside the priorities of tech companies. Last year, though, both Google and Meta announced projects to develop AI for under-resourced languages. But while newer AI models are able to generate some output in a wide set of languages, there’s not much evidence to suggest that it’s high quality.
Gabriel Nicholas, a research fellow at the Center for Democracy and Technology, explained that once tech companies have established the capacity to process a new language, they have a tendency to congratulate themselves and then move on. A market dominated by “big” languages gives them little incentive to keep investing in improvements. Hellina Nigatu, a computer science PhD student at the University of California, Berkeley, added that low-resource languages face the risk of “constantly trying to catch up” — or even losing speakers — to English.
Researchers also warn that even as the accuracy of machine translation improves, language models miss out on important, culturally specific details that can have real-world consequences. Companies like Meta, which partially rely on AI to review social media posts for things like hate speech and violence, have run into problems when they try to use the technology for under-resourced languages. Because they’ve been trained on just the few texts available, their AI systems too often have an incomplete picture of what words mean and how they’re used.
Arzu Geybulla, an Azerbaijani journalist who specializes in digital censorship, said that one problem with using AI to moderate social media content in under-resourced languages is the “lack of understanding of cultural, historical, political nuances in the way the language is being used on these platforms.” In Azerbaijan, where violence against Armenians is regularly celebrated online, the word “Armenian” itself is often used as a slur to attack dissidents. Because the term is innocuous in most other contexts, it’s easy for AI and even non-specialist human moderators to overlook its use. She also noted that AI used by social media platforms often lumps the Azerbaijani language together with languages spoken in neighboring countries: Azerbaijanis frequently send her screenshots of automated replies in Russian or Turkish to the hate speech reports they’d submitted in Azerbaijani.
But Geybulla believes improving AI for monitoring hate speech and incitement in Azerbaijani will lock in an essentially defective system. “I’m totally against training the algorithm,” she told me. “Content moderation needs to be done by humans in all contexts.” In the hands of an authoritarian government, sophisticated AI for previously neglected languages can become a tool for censorship.
According to Geybulla, Azerbaijan currently has such “an old school system of surveillance and authoritarianism that I wouldn’t be surprised if they still rely on Soviet methods.” Given the government’s demonstrated willingness to jail people for what they say online and to engage in mass online astroturfing, she believes that improving automated flagging for the Azerbaijani language would only make the repression worse. Instead of strengthening these easily abusable technologies, she argues that companies should invest in human moderators. “If I can identify inauthentic accounts on Facebook, surely someone at Facebook can do that too, and faster than I do,” she said.
Different languages require different approaches when building AI. Indigenous languages in the Americas, for instance, show forms of complexity that are hard to account for without either large amounts of data — which they currently do not have — or diligent expert supervision.
One such expert is Michael Running Wolf, founder of the First Languages AI Reality initiative, who says developers underestimate the challenge of American languages. While working as a researcher on Amazon’s Alexa, he began to wonder what was keeping him from building speech recognition for Cheyenne, his mother’s language. Part of the problem, he realized, was computer scientists’ unwillingness to recognize that American languages might present challenges that their algorithms couldn’t understand. “All languages are seen through the lens of English,” he told me.
Running Wolf thinks Anglocentrism is mostly to blame for the neglect that Indigenous languages have faced in the tech world. “The AI field, like any other space, is occupied by people who are set in their ways and unintentionally have a very colonial perspective,” he told me. “It’s not as if we haven’t had the ability to create AI for Indigenous languages until today. It’s just no one cares.”
American languages were put in this position deliberately. Until well into the 20th century, the U.S. government’s policy position on Indigenous American languages was eradication. From 1860 to 1978, tens of thousands of children were forcibly separated from their parents and kept in boarding schools where speaking their mother tongues brought beatings or worse. Nearly all Indigenous American languages today are at immediate risk of extinction. Running Wolf hopes AI tools like machine translation will make Indigenous languages easier to learn to fluency, making up for the current lack of materials and teachers and reviving the languages as primary means of communication.
His project also relies on training young Indigenous people in machine learning — he’s already held a coding boot camp on the Lakota reservation. If his efforts succeed, he said, “we’ll have Indigenous peoples who are the experts in natural language processing.” Running Wolf said he hopes this will help tribal nations to build up much-needed wealth within the booming tech industry.
The idea of his research allowing automated surveillance of Indigenous languages doesn’t scare Running Wolf so much, he told me. He compared their future online to their current status in the high school basketball games that take place across North and South Dakota. Indigenous teams use Lakota to call plays without their opponents understanding. “And guess what? The non-Indigenous teams are learning Lakota so that they know what the Lakota are doing,” Running Wolf explained. “I think that’s actually a good thing.”
The problem of surveillance, he said, is “a problem of success.” He hopes for a future in which Indigenous computer scientists are “dealing with surveillance risk because the technology’s so prevalent and so many people speak Chickasaw, so many people speak Lakota or Cree, or Ute — there’s so many speakers that the NSA now needs to have the AI so that they can monitor us,” referring to the U.S. National Security Agency, infamous for its snooping on communications at home and abroad.
Not everyone wishes for that future. The Cheyenne Nation, for instance, wants little to do with outsiders, he told me, and isn’t currently interested in using the systems he’s building. “I don’t begrudge that perspective because that’s a perfectly healthy response to decades, generations of exploitation,” he said.
Like Running Wolf, Borjgin believes that in some cases, opening a language up to online surveillance is a sacrifice necessary to keep it alive in the digital era. “I somewhat don’t exist on the internet,” he said. Because their language has such a small online culture, he said, “there’s an identity crisis for Mongols who grew up in the city,” pushing them instead towards Mandarin.
Despite the intense political repression that some of China’s other ethnic minorities face, Borjgin said, “one thing I envy about Tibetan and Uyghur is once I ask them something they will just google it with their own input system and they can find the result in one second.” Even though he knows that it will be used to stifle dissent, Borjgin still supports improving the digitization of the Mongol script: “If you don’t have the advanced technology, if it only stays to the print books, then the language will be eradicated. I think the tradeoff is okay for me.”