Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists

Kathleen Siminyu, Kibibi Mohamed Amran, Abdulrahman Ndegwa Karatu, Mnata Resani, Mwimbi Makobo Junior, Rebecca Ryakitimbo, Britone Mwasaru


Abstract
Language technologies, particularly speech technologies, are becoming more pervasive for access to digital platforms and resources. This brings to the forefront concerns of their inclusivity, first in terms of language diversity. Additionally, research shows speech recognition to be more accurate for men than for women and more accurate for individuals younger than 30 years of age than those older. In the Global South where languages are low resource, these same issues should be taken into consideration in data collection efforts to not replicate these mistakes. It is also important to note that in varying contexts within the Global South, this work presents additional nuance and potential for bias based on accents, related dialects and variants of a language. This paper documents i) the designing and execution of a Linguists Engagement for purposes of building an inclusive Kiswahili Speech Recognition dataset, representative of the diversity among speakers of the language ii) the unexpected yet key learning in terms of socio-linguistcs which demonstrate the importance of multi-disciplinarity in teams developing datasets and NLP technologies iii) the creation of a test dataset intended to be used for evaluating the performance of Speech Recognition models on demographic groups that are likely to be underrepresented.
Anthology ID:
2022.computel-1.3
Volume:
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz
Venue:
ComputEL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
13–19
Language:
URL:
https://aclanthology.org/2022.computel-1.3
DOI:
10.18653/v1/2022.computel-1.3
Bibkey:
Cite (ACL):
Kathleen Siminyu, Kibibi Mohamed Amran, Abdulrahman Ndegwa Karatu, Mnata Resani, Mwimbi Makobo Junior, Rebecca Ryakitimbo, and Britone Mwasaru. 2022. Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 13–19, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists (Siminyu et al., ComputEL 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.computel-1.3.pdf
Video:
 https://aclanthology.org/2022.computel-1.3.mp4