Abdulrahman Ndegwa Karatu

2022

Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists
Kathleen Siminyu | Kibibi Mohamed Amran | Abdulrahman Ndegwa Karatu | Mnata Resani | Mwimbi Makobo Junior | Rebecca Ryakitimbo | Britone Mwasaru
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Language technologies, particularly speech technologies, are becoming more pervasive for access to digital platforms and resources. This brings to the forefront concerns of their inclusivity, first in terms of language diversity. Additionally, research shows speech recognition to be more accurate for men than for women and more accurate for individuals younger than 30 years of age than those older. In the Global South where languages are low resource, these same issues should be taken into consideration in data collection efforts to not replicate these mistakes. It is also important to note that in varying contexts within the Global South, this work presents additional nuance and potential for bias based on accents, related dialects and variants of a language. This paper documents i) the designing and execution of a Linguists Engagement for purposes of building an inclusive Kiswahili Speech Recognition dataset, representative of the diversity among speakers of the language ii) the unexpected yet key learning in terms of socio-linguistcs which demonstrate the importance of multi-disciplinarity in teams developing datasets and NLP technologies iii) the creation of a test dataset intended to be used for evaluating the performance of Speech Recognition models on demographic groups that are likely to be underrepresented.

Co-authors

Kathleen Siminyu 1

Venues

ComputEL1

Fix author