African researchers create ‘largest dataset of indigenous languages on the continent’

African researchers have created what is believed to be the largest dataset of indigenous languages on the continent, aiming to ensure millions of people are not excluded from advances in artificial intelligence such as chatbots.

Although Africa is home to more than a quarter of the world’s languages, most of them are missing from the development of AI.

The majority of tools, such as ChatGPT, are trained primarily on English, European and Chinese languages, which benefit from vast amounts of online text.

But many African languages are spoken more often than written, leaving little material to train AI systems and limiting their usefulness for speakers across the continent.

Professor Vukosi Marivate, who teaches at the University of Pretoria, told the BBC: “We think in our own languages, dream in them and interpret the world through them.

“If technology doesn’t reflect that, a whole group risks being left behind. We’re going through this AI revolution, imagining all that can be done with it.

“Now imagine there’s a part of the population that just doesn’t have that access because all the information is in English.”

The African Next Voices project brought together linguists and computer scientists to develop AI-ready datasets in 18 African languages, with plans to expand in the future.

Over two years, the team recorded 9,000 hours of speech across Kenya, Nigeria and South Africa, covering scenarios in farming, health and education.

The collection included Kikuyu and Dholuo in Kenya, Hausa and Yoruba in Nigeria, and isiZulu and Tshivenda in South Africa.

Professor Marivate added: “You need some basis to start off with, and that’s what African Next Voices is, and then people will build on top of that and add their own innovations.”

Kenyan computational linguist Lilian Wanzare said: “We gathered voices from different regions, ages and backgrounds so it’s as inclusive as possible. Big tech can’t always see those nuances.”

The project was supported by a $2.2 million (£1.6 million) grant from the Gates Foundation.

Its data will be open access, allowing developers to create tools that translate, transcribe and respond in African languages.

For farmer Kelebogile Mosime, who manages a 21-hectare vegetable site in Rustenburg, South Africa, AI in local languages already makes a difference. She uses AI-Farmer, an app that recognises Sesotho, isiZulu and Afrikaans, to help with crop problems.

She said: “Daily, I see the benefits of being able to use my home language, Setswana, on the app when I run into problems on the farm. I ask anything and get a useful answer.

“For somebody in the rural areas like me, who is not exposed to technology, it’s useful. I can ask about different options for insect control; it’s also been useful with diagnosing sick plants.”

Pelonomi Moiloa, chief executive of South African start-up Lelapa AI, also told the BBC: “English is the language of opportunity.

“For many South Africans who don’t speak it, it’s not just inconvenient – it can mean missing out on essential services like healthcare, banking or even government support.

“Language can be a huge barrier. We’re saying it shouldn’t be.”

Professor Marivate added: “Language is access to imagination.

“It’s not just words – it’s history, culture, knowledge. If indigenous languages aren’t included, we lose more than data; we lose ways of seeing and understanding the world.”
