Nine years ago, Google’s CEO Sundar Pichai first pledged that artificial intelligence would make information “universally accessible” to everyone, regardless of language.
He has continued to repeat that promise ever since, fuelling expectations around the world that technology would finally bridge linguistic divides and provide equal access to knowledge for all.
Yet for those who speak any of Africa’s more than 2,000 languages, that promise remains distant.
Millions across the continent still find that the advanced AI tools transforming agriculture, education, and daily life cannot understand or communicate in their own languages.
According to research, ChatGPT – which has 800 million weekly active users worldwide – recognises only 10 to 20 per cent of sentences written in Hausa, which is spoken by over 94 million Nigerians.
The same goes for other widely spoken African languages such as Yoruba, Igbo, Swahili, and Somali, all of which remain severely underrepresented in mainstream AI models despite having tens of millions of speakers.
But why have so many African languages been overlooked by today’s most powerful AI tools, and what does this reveal about who gets to shape the digital future?
‘Low resource’ languages
One of the main reasons for African languages’ exclusion from AI is what researchers call the “low-resource” problem.
In this context, “low-resource” refers to the scarcity of online materials such as websites, books, and transcripts available in those languages.
Most large language models (LLMs) rely on huge volumes of such digital data to learn and generate text, yet the vast majority of that data is in English or a handful of other widely spoken Western languages – the so-called “high-resource” group.
“Our measure for progress and research agenda is based on what works for Western languages,” says Hellina Hailu Nigatu, an AI researcher focused on LLMs at the University of California, Berkeley.
The lack of training data leaves AI models like ChatGPT or Gemini struggling to recognise, generate or even meaningfully “see” African languages, no matter how many people speak them.
“African languages are categorised as ‘low-resource’ and are usually excluded, or even when they are included, systems perform poorly on them,” she tells TRT World.
This classification, which divides the world’s languages into “high-resource” and “low-resource” categories, has become the industry’s preferred framework for discussing the disparity.
Commercial incentives, systemic bias and the cost barrier
Another reason for underrepresentation is the priorities of global AI research and development.
Research shows that large language model (LLM) outputs lean towards “Western stereotypes”.
The standards are set mostly by Western tech companies and academic institutions, which concentrate on the languages with the largest online footprints and direct most funding towards a small group of “high-resource” languages.
As a result, African languages are rarely prioritised for investment or innovation.
Commercial incentives also play a major role. Since the immediate economic returns from African language markets are limited, companies have little motivation to dedicate time and resources to improving AI support for these languages.
This structural bias is reinforced by the datasets used to train AI models.
Even when African languages are included, the systems often adopt Western cultural assumptions, sometimes misrepresenting local contexts or perpetuating stereotypes.
The findings align with broader research on algorithmic bias.
“What we see in research is that adopting LLMs to multiple languages without careful consideration risks importing biases from English to these multilingual contexts, or misses contextual notions of bias that do not exist in English,” Nigatu says.
There is also a technical challenge in the way AI models process text, one that puts many African languages at an additional disadvantage.
Research has found that using non-Latin scripts in popular AI tools actually costs more than using English or French.
This is because the software breaks sentences down into smaller units called “tokens”, and it takes more tokens to write the same sentence in languages that do not use the Latin alphabet.
This means users who can least afford it end up paying more to process the same amount of text, and often receive less reliable results.
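As a rough illustration (a minimal sketch, not drawn from the research cited above), the snippet below uses OpenAI’s open-source tiktoken library to count how many tokens the same short greeting consumes in English and in Amharic, which is written in the Ge’ez script. The library, the “cl100k_base” encoding and the sample sentences are assumptions chosen for demonstration, not details taken from the article.

```python
# Rough, illustrative comparison of token counts across scripts.
# Assumes the open-source `tiktoken` library (pip install tiktoken) and the
# "cl100k_base" encoding used by several recent OpenAI models; both are
# assumptions for this sketch, not details from the article.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Amharic (Ge'ez script, approximate greeting)": "ሰላም፣ እንዴት ነህ?",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    # Non-Latin scripts often split into many more tokens per character,
    # which translates directly into higher per-token costs for the same request.
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```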
Nigatu stresses that these barriers reflect entrenched inequalities about who is shaping these systems in the first place.
As she points out, it matters greatly “who is doing the research, i.e., how involved are speakers of these languages in what is done for their language.”

Africa’s digital self-determination
Against this backdrop of systematic exclusion, a groundbreaking initiative is under way to improve African language representation in artificial intelligence.
The African Next Voices project, funded by a $2.2 million grant from the Gates Foundation, represents the largest AI-ready language data creation initiative for multiple African languages to date.
Rather than waiting for Silicon Valley's attention, researchers across the continent have taken matters into their own hands.
Language specialists have already recorded 9,000 hours of speech across 18 languages in Nigeria, Kenya, and South Africa, transforming these recordings into digitised datasets that developers can incorporate into large language models.
The first tranche of this data, released this month, marks a watershed moment in the democratisation of AI development.
"It's really exciting to see the improvements this is going to bring to the modelling of these specific languages, and how it's also going to help the entire community that is working across language technologies for Africa," says Ife Adebara, chief technology officer at the non-profit organisation Data Science Nigeria, who co-leads the Nigerian arm of the project.
Her team focuses on languages including Hausa, Yoruba, Igbo, and Naija, collectively spoken by hundreds of millions of people yet virtually absent from mainstream AI systems.
The methodology behind African Next Voices reveals a fundamentally different approach to language data collection. Instead of scraping existing digital content as Western tech companies do, researchers engage directly with diverse communities.
Lilian Wanzare, a computational linguist at Maseno University in Kenya who leads the Kenyan component, explains how her team shows individuals images and asks them to describe what they see in their native languages, including Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.
Their approach prioritises authentic, everyday language use over formal or literary texts.
"There's a huge push towards localised data sets, because the impact is in capturing the people within their local settings," Wanzare says.
In South Africa, Vukosi Marivate, a computer scientist at the University of Pretoria, leads efforts to collect data for seven languages: Setswana, isiZulu, isiXhosa, Sesotho, Sepedi, isiNdebele, and Tshivenda.
His team works with a consortium of organisations to create AI language models that technology businesses can then improve upon.
Beyond the technical achievements, African Next Voices embodies a philosophical shift in how AI development should proceed.
While many tech firms treat African languages as afterthoughts to be addressed only after profitable markets are saturated, this initiative positions them as primary subjects worthy of dedicated resources and expertise.
The project's methodology documentation will be shared alongside the data, enabling researchers elsewhere to replicate this work for other marginalised languages globally.
Organisations like Masakhane have already built strong networks focused on natural language processing, showing what is possible when African languages are developed by Africans, for Africans.
By taking the initiative themselves, these communities are showing that the future of artificial intelligence can be shaped on their own terms, rather than waiting for Silicon Valley to decide who gets a voice.