Wikipedia is the most ambitious multilingual project after the Bible: There are editions in over 340 languages, and a further 400 even more obscure ones are being developed and tested. Many of these smaller editions have been swamped with automatically translated content as AI has become increasingly accessible. Volunteers working on four African languages, for instance, estimated to MIT Technology Review that between 40% and 60% of articles in their Wikipedia editions were uncorrected machine translations. And after auditing the Wikipedia edition in Inuktitut, an Indigenous language closely related to Greenlandic that’s spoken in Canada, MIT Technology Review estimates that more than two-thirds of pages longer than a few sentences contain passages created this way.

This is beginning to cause a wicked problem. AI systems, from Google Translate to ChatGPT, learn to “speak” new languages by scraping huge quantities of text from the internet. Wikipedia is sometimes the largest source of online linguistic data for languages with few speakers—so any errors on those pages, grammatical or otherwise, can poison the wells that AI is expected to draw from. That can make the models’ translations of these languages particularly error-prone, creating a sort of linguistic doom loop: people keep adding poorly translated Wikipedia pages using those tools, and AI models keep training on those poorly translated pages. It’s a complicated problem, but it boils down to a simple concept: Garbage in, garbage out.
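
The feedback loop is easier to see with a toy model. The sketch below is purely illustrative: none of the numbers come from the article, and the update rule is a deliberately crude caricature of how translation errors can compound once model output feeds back into training data.

```python
# Toy simulation of the "doom loop" described above: each generation, a
# translation model is trained on the current corpus, a share of new Wikipedia
# pages is then written with that model, and those pages flow back into the
# next training corpus. Every number here is invented for illustration.

def simulate_doom_loop(
    generations: int = 10,
    human_error_rate: float = 0.05,      # error rate of human-written pages
    machine_written_share: float = 0.6,  # share of new pages made with the model
    error_amplification: float = 2.0,    # model output is noisier than its training data
) -> list[float]:
    corpus_error_rate = human_error_rate
    history = [corpus_error_rate]
    for _ in range(generations):
        # The model roughly inherits (and amplifies) the corpus's error rate.
        model_error_rate = min(1.0, corpus_error_rate * error_amplification)
        # New pages are a mix of human-written and machine-translated text.
        new_pages_error_rate = (
            machine_written_share * model_error_rate
            + (1 - machine_written_share) * human_error_rate
        )
        # Simplification: each generation, the corpus becomes a 50/50 blend of
        # the old corpus and the newly added pages.
        corpus_error_rate = 0.5 * corpus_error_rate + 0.5 * new_pages_error_rate
        history.append(corpus_error_rate)
    return history

if __name__ == "__main__":
    for gen, rate in enumerate(simulate_doom_loop()):
        print(f"generation {gen}: ~{rate:.0%} of corpus text contains errors")
```

With these made-up parameters, the share of error-bearing text climbs from 5% to nearly 30% over ten generations; the point is only the direction of travel once a model's output becomes its own training data, not the magnitude.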

“These models are built on raw data,” says Kevin Scannell, a former professor of computer science at Saint Louis University who now builds computer software tailored for endangered languages. “They will try and learn everything about a language from scratch. There is no other input. There are no grammar books. There are no dictionaries. There is nothing other than the text that is inputted.”

There isn’t perfect data on the scale of this problem, particularly because a lot of AI training data is kept confidential and the field continues to evolve rapidly. But back in 2020, Wikipedia was estimated to make up more than half the training data that was fed into AI models translating some languages spoken by millions across Africa, including Malagasy, Yoruba, and Shona. In 2022, a research team from Germany that looked into what data could be obtained by online scraping even found that Wikipedia was the sole easily accessible source of online linguistic data for 27 under-resourced languages.
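
As a concrete (and purely illustrative) picture of what “more than half the training data” means in practice, the sketch below tallies each source’s share of a scraped corpus by text volume. The URLs and texts are placeholders, and this is not the methodology of the 2020 estimate or of the German team’s 2022 audit.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (URL, text) records standing in for one language's scraped web
# corpus. Real audits of this kind work over crawl dumps, not a hard-coded list.
sample_corpus = [
    ("https://mg.wikipedia.org/wiki/Antananarivo", "placeholder Wikipedia article text " * 40),
    ("https://example.org/news/article-123", "placeholder news article text " * 25),
    ("https://mg.wikipedia.org/wiki/Madagasikara", "placeholder Wikipedia article text " * 30),
]

def source_shares(corpus: list[tuple[str, str]]) -> dict[str, float]:
    """Return each source's share of the corpus, weighted by text volume."""
    chars_by_source: Counter[str] = Counter()
    for url, text in corpus:
        domain = urlparse(url).netloc
        # Group every language edition of Wikipedia under a single label.
        label = "wikipedia.org" if domain.endswith("wikipedia.org") else domain
        chars_by_source[label] += len(text)
    total = sum(chars_by_source.values()) or 1
    return {source: count / total for source, count in chars_by_source.items()}

if __name__ == "__main__":
    for source, share in sorted(source_shares(sample_corpus).items(),
                                key=lambda item: item[1], reverse=True):
        print(f"{source}: {share:.1%}")
```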

This could have significant repercussions in cases where Wikipedia is poorly written—potentially pushing the most vulnerable languages on Earth toward the precipice as future generations begin to turn away from them.

“Wikipedia will be reflected in the AI models for these languages,” says Trond Trosterud, a computational linguist at the University of Tromsø in Norway, who has been raising the alarm about the potentially harmful outcomes of badly run Wikipedia editions for years. “I find it hard to imagine it will not have consequences. And, of course, the more dominant position that Wikipedia has, the worse it will be.”

  • realitista@lemmus.org · 10 hours ago

    So it hasn’t sent the languages themselves into a doom spiral but rather the AI models for those languages into a doom spiral.

    • sleepundertheleaves@infosec.pub · 2 hours ago (edited)

      Unfortunately, it’s likely to harm speakers of those languages as well. For these languages, there isn’t enough training data online because their speakers don’t have good access to the Internet: poverty, lack of education, living in isolated regions with limited connectivity, all the factors that play into the “digital divide” between people who can access the Internet (and all its benefits) and people who can’t.

      If people can’t access AI tools in their native language because LLMs for those languages were trained on recursive slop, but devices and operating systems are incorporating more and more AI anyway, it’s just going to worsen that digital divide and be another factor encouraging young people to give up their native languages entirely.

      Also, there’s the damage that bad AI-generated Wikipedia articles are doing to speakers of those languages already, which the article discusses.