Lawrence Muchemi
2023
Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks
Barack Wanjawa | Lilian Wanzare | Florence Indede | Owen McOnyango | Edward Ombui | Lawrence Muchemi
Journal for Language Technology and Computational Linguistics, Vol. 36 No. 2
Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya (three dialects: Lumarachi, Lulogooli and Lubukusu). Data collection was done by researchers who were deployed to the various data collection sources such as communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items, comprising 4,442 texts of 5.6 million words and 1,152 speech files totalling 177 hours. Based on this data, other datasets were also developed, such as Part-of-Speech tagging sets of 50,000 and 93,000 tagged words for Dholuo and the Luhya dialects respectively. We developed 7,537 Question-Answer pairs from 1,445 Swahili texts and also created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. Additionally, we developed two proof-of-concept systems: a Kiswahili speech-to-text system and a machine learning system for the Question Answering task. The former achieved a word error rate of 18.87%, and the latter an Exact Match (EM) score of 80%. These initial results are promising for the usability of Kencorpus by the machine learning community.
Kencorpus is one of the few public domain corpora for these three low-resource languages and forms a basis for learning and sharing experiences in similar work, especially for low-resource languages. Challenges in developing the corpus included deficiencies in the data sources, data cleaning challenges, relatively short project timelines and the Coronavirus disease (COVID-19) pandemic, which restricted movement and hence the ability to get the data in a timely manner.
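
The abstract reports the two proof-of-concept systems in terms of word error rate and Exact Match. As a reading aid, the sketch below shows how these two metrics are commonly computed. It is not the authors' evaluation code: the function names, the whitespace tokenization, the case-insensitive comparison, and the sample Swahili strings are all illustrative assumptions.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference answer.

    Assumption: answers are compared after trimming and lowercasing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one answer pair is required")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical transcript pair: one substituted word out of three.
    print(word_error_rate("habari ya asubuhi", "habari za asubuhi"))
    # Hypothetical answers: one of two matches its reference.
    print(exact_match(["Nairobi", "Kisumu"], ["nairobi", "Mombasa"]))
```

Under these definitions, a word error rate of 18.87% corresponds to roughly 19 word-level edits per 100 reference words, and an Exact Match of 80% means four in five predicted answers were identical to the reference answer. Published evaluations often apply additional text normalization, so scores from this sketch may not be directly comparable.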