Africa has thousands of languages. Can we train AI for all of them?

How can you teach a language to read if there is nothing to read? This is a problem facing developers across the African continent as they seek to train AI to understand and respond to prompts in local languages.

To train a language model, you need data. For languages like English, developers have articles, books, and manuals easily accessible on the Internet. However, for most of Africa’s languages (estimated to be between 1,500 and 3,000 of them), very few written resources are available. Vukosi Maribate, a computer science professor at the University of Pretoria in South Africa, uses the number of Wikipedia articles available to illustrate the amount of data available. There are over 7 million articles in English. Tigrinya, spoken by about 9 million people in Ethiopia and Eritrea, has 335 words, but Akan, Ghana’s most widely spoken mother tongue, has none.

Of these thousands of languages, only 42 are currently supported by the language model. Of Africa’s 23 scripts and alphabets, only three are available: Latin, Arabic, and Ge’Ez (used in the Horn of Africa). This underdevelopment “comes from a financial perspective,” says Chinasa T. Okoro, founder of TechnēculturĎ, a research institute working to advance global equity in AI. “Even though there are more people who speak Swahili than Finnish, Finland is a better market for companies like Apple and Google.”

Okoro warns that unless more language models are developed, the implications for the entire continent could be dire. “We’re going to continue to see people being excluded from opportunities,” she told CNN. As the continent looks to develop its own AI infrastructure and capabilities, those who do not speak one of these 42 languages risk being left behind.

To avoid this, Okoro says, AI developers across the continent “need to rethink the way they approach model development in the first place.”

This is what Marivate did. Mr Maribate heads the South African arm of the African Next Voices project and has made recordings in 18 languages in South Africa, Kenya and Nigeria. Over two years, the three teams collected 9,000 hours of recordings from people of different ages and locations, creating a dataset that AI developers across the continent can use to train their models.

The researchers sometimes gave native speakers scripts to read, but mostly they gave prompts, recorded their responses, and transcribed them. In the case of Isindebele, a language spoken in South Africa and Zimbabwe, I had great difficulty finding written material, so I relied on government manuals for goat herders to create prompts.

African Next Voices does not collect enough data to train large-scale language models (LLMs) like ChatGPT and Gemini, which can cover thousands of topics in detail. But Maribate said he focused his recordings on specific topics he considered most important, such as health and agriculture.

Using small datasets to create generalized models results in high error rates, while small, focused datasets can have high accuracy within the limited scope of specialized models, explained Nyalleng Moorosi, a researcher at the Distributed AI Research Institute (DAIR) who is not affiliated with the African Next Voices project.

For her, it’s a matter of “prioritizing error.” “If someone just wants to know what’s going on in downtown Nairobi, mistakes there can be tolerated,” Murosi says, but mistakes in models dealing with topics such as banking or health care can have serious consequences.

“We need to make sure that the people building these models understand the culture enough to understand the consequences and understand the weight of these errors,” Murosi told CNN.

Nyalleng Moorosi, Researcher at the Distributed AI Institute.

Words and symbols have multiple meanings, she says. For example, St George’s Cross has associations with right-wing politics in the UK, but this is not obvious to someone from Ghana or Lesotho. This problem is especially noticeable in languages with fewer resources. “There is a lot of contextual knowledge, but very little documentation,” she says.

A DAIR investigation found that social media websites were unable to recognize and remove hate speech related to ethnic violence in Ethiopia, in part because automated systems and human moderators were unfamiliar with the slang terminology used.

Without this cultural understanding, Murosi says, it is impossible for “AI systems to behave and make decisions that are consistent with our beliefs and values.”

While many Africans speak multiple languages, including African and European languages that are already supported by language models, Moorosi believes the goal should be “to make AI accessible in all languages, even those with a single speaker.” All languages deserve expression or preservation.

But lack of data is not the only challenge facing AI developers in Africa. Most African languages have not been codified through dictionaries or grammatical studies. In Kinyarwanda, the language of Rwanda, there are three common ways to spell the country’s name: uRwanda, Urwanda, and uRwanda. Without spelling rules, even the most basic text processing becomes difficult.

Another problem is the lack of data centers. The African Union has warned that only 10% of the continent’s data center needs will be met in 2024, creating a bottleneck for Africa’s AI hopes.

The concern for Marivate is that if models are not created for these small languages, they will “die.” If developers were to create a dataset for a language that might not even have a writing system, “the model would have to change,” he added.

The African Next Voices project has just finished collecting and transcribing data. Maribate said that while he is not currently working on a new language, he is already thinking about which language to develop next.

Source link

What's Hot

See the images shortlisted for the 2026 Wildlife Photographer of the Year People’s Choice Award.

Trump Project Vault Stockpile Contains All Critical Minerals

AMD Earnings Report Q4 2025

See the images shortlisted for the 2026 Wildlife Photographer of the Year People’s Choice Award.

Grieving Iranians cower in silence next to protesters’ graves

Who is Peter Mandelson and why has his relationship with Epstein rocked the British establishment?

US House passes $1.2 trillion spending package to end government shutdown | Politics News

Trump-Petro talks: How cold have U.S.-Colombia relations been? |Donald Trump News

Modi, Trump announce India-US ‘trade deal’: What we know and what we don’t | Explainer News

Xcode moves to agent coding with deeper OpenAI and Anthropic integration

Intel will start manufacturing GPUs, a market dominated by Nvidia

Lotus Health wins $35 million for AI doctors to examine patients for free

What's Hot

Africa has thousands of languages. Can we train AI for all of them?

Related Posts

Subscribe to Updates