Data scientists in Cameroon joined forces to form team LangTech as a part of the SAS Global Hackathon. In the face of rapid digitalization and modernization, they sought a way to preserve indigenous African languages.
There are over 1,000 African languages, but those with fewer than 100,000 speakers are considered "lost". Cameroon is home to at least 250 languages, with many considered living and others deemed lost or extinct.
The number of native speakers of local languages in Cameroon is shrinking because younger generations aren't encouraged to learn them. Sparse documentation and a lack of formal structure for these traditional languages also make it difficult for their speakers to participate in international collaboration.
Some languages have little documentation and this makes those languages unstructured – and when those languages aren't structured or documented, it's challenging to digitalize the language.
— Swi Innocent Che, Team LangTech co-leader
To reverse the erosion of languages in Cameroon, these data scientists joined together to find an innovative solution.
Preserving languages with Natural Language Processing (NLP)
Team LangTech worked to preserve local African languages by leveraging the power of Natural Language Processing (NLP). NLP works on unstructured text, as opposed to structured information that can be neatly arranged into rows and columns. Once these traditional languages are digitized, websites and smartphone apps can offer them alongside English and French.
In the video below, the team explains the analysis they performed on the data and how they built the use case with unlimited access to the SAS® Viya® platform on Microsoft Azure cloud environments.
The team began by collecting data: they traveled to villages with voice recorders to find locals who wanted to help expand the libraries of data. Key phrases in the participants' native languages were then identified from the recordings and transcribed using AI and NLP.
LangTech gathered the terminology and loaded it into SAS® Visual Analytics as Excel files. From there they obtained key profiling information for each data column: the uniqueness percentage, the percentage of null values, the pattern count, the overall data length, and the minimum and maximum lengths. The team then built models in a Python Jupyter notebook, with Keras as the principal NLP library. This resulted in a demo based on four of the many living indigenous languages of Cameroon.
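To make the profiling metrics mentioned above more concrete, here is a minimal Python sketch of how such figures could be computed for a single column of transcribed phrases. It is not the team's code and not how SAS® Visual Analytics computes its profile. The function names `to_pattern` and `profile_column` are hypothetical, and the definition of "pattern" used here (letters mapped to `A`/`a`, digits to `9`) is an assumption. The sample values are English placeholders, not data from the project.

```python
from typing import Optional, Sequence


def to_pattern(value: str) -> str:
    """Reduce a string to a coarse shape (assumed definition of 'pattern'):
    uppercase letter -> 'A', other letter -> 'a', digit -> '9'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # spaces, punctuation and combining marks stay as-is
    return "".join(out)


def profile_column(values: Sequence[Optional[str]]) -> dict:
    """Compute simple profiling metrics for one column of text values.

    None and blank strings are counted as nulls (an assumption).
    """
    total = len(values)
    present = [v for v in values if v is not None and v.strip() != ""]
    lengths = [len(v) for v in present]
    return {
        "rows": total,
        "null_pct": 100.0 * (total - len(present)) / total if total else 0.0,
        "unique_pct": 100.0 * len(set(present)) / len(present) if present else 0.0,
        "pattern_count": len({to_pattern(v) for v in present}),
        "total_length": sum(lengths),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
    }


if __name__ == "__main__":
    # Placeholder column: English stand-ins for transcribed phrases.
    sample = ["hello", "water", None, "hello", "Thank you"]
    for metric, value in profile_column(sample).items():
        print(f"{metric}: {value}")
```

A profile like this gives a quick read on how complete and how varied a column of collected phrases is before any modeling starts. Here the uniqueness percentage is measured over non-null values only, which is one of several reasonable conventions.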
The team says they are eager to stay curious, build upon project results and reach their language preservation goal by collecting additional data and seeking access to more computing power.
“I am naturally curious, but the way I stay curious is I ask, ‘What’s next?’ When you’re curious, you want to know more,” said Che. “You want to better the solution that exists, to look for something new. Innovation is all about creating something new, so I think curiosity leads to innovation.”
Their efforts could improve local services from international agencies, increase national government representation and lead to better customer service from utility and telecommunications providers.