Kathleen Siminyu
2025
On the Tolerance of Repetition Before Performance Degradation in Kiswahili Automatic Speech Recognition
Kathleen Siminyu | Kathy Reid | Rebecca Ryakitimbo | Britone Mwasaru | Chenai Chair
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)
State-of-the-art end-to-end automatic speech recognition (ASR) models require large speech datasets for training. The Mozilla Common Voice project crowd-sources read speech to address this need. However, this approach often results in many audio utterances being recorded for each written sentence. Using Kiswahili speech data, this paper first explores how much audio repetition in utterances is permissible in a training set before model degradation occurs, then examines the extent to which audio augmentation techniques can be employed to increase the diversity of speech characteristics and improve accuracy. We find that repetition up to a ratio of 1 sentence to 8 audio recordings improves performance, but performance degrades at a ratio of 1:16. We also find small improvements from frequency mask, time mask and tempo augmentation. Our findings provide guidance on training set construction for ASR practitioners, particularly those working in under-served languages.
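The repetition finding in this abstract suggests a simple way to build a training set: cap the number of recordings kept per written sentence. The following is a minimal sketch under stated assumptions, not the authors' procedure. The function name `cap_repetition`, the default cap of 8, the random sampling strategy and the example clip names are all illustrative; the `path` and `sentence` keys are assumed to mirror a per-clip metadata table.

```python
import random
from collections import defaultdict


def cap_repetition(clips, max_per_sentence=8, seed=0):
    """Keep at most `max_per_sentence` recordings for each written sentence.

    `clips` is a list of dicts with (assumed) keys "path" and "sentence".
    Recordings beyond the cap are dropped at random, using a fixed seed
    so that the selection is reproducible.
    """
    rng = random.Random(seed)

    # Group recordings by the sentence that was read aloud.
    by_sentence = defaultdict(list)
    for clip in clips:
        by_sentence[clip["sentence"]].append(clip)

    kept = []
    for group in by_sentence.values():
        rng.shuffle(group)
        kept.extend(group[:max_per_sentence])
    return kept


# Hypothetical metadata: one sentence recorded 20 times, another recorded once.
clips = [{"path": f"clip_{i}.mp3", "sentence": "Habari ya asubuhi"} for i in range(20)]
clips.append({"path": "clip_20.mp3", "sentence": "Asante sana"})

subset = cap_repetition(clips, max_per_sentence=8)
print(len(subset))  # 8 recordings of the first sentence plus 1 of the second
```

Random sampling is only one possible choice here; a practitioner might instead prefer to keep recordings from as many distinct speakers as possible within the cap.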
2022
Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively Mitigating Demographic Bias Through Collaboration with Linguists
Kathleen Siminyu | Kibibi Mohamed Amran | Abdulrahman Ndegwa Karatu | Mnata Resani | Mwimbi Makobo Junior | Rebecca Ryakitimbo | Britone Mwasaru
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Language technologies, particularly speech technologies, are becoming more pervasive for access to digital platforms and resources. This brings to the forefront concerns of their inclusivity, first in terms of language diversity. Additionally, research shows speech recognition to be more accurate for men than for women and more accurate for individuals younger than 30 years of age than for those older. In the Global South, where languages are low resource, these same issues should be taken into consideration in data collection efforts so as not to replicate these mistakes. It is also important to note that in varying contexts within the Global South, this work presents additional nuance and potential for bias based on accents, related dialects and variants of a language. This paper documents i) the designing and execution of a Linguists Engagement for purposes of building an inclusive Kiswahili Speech Recognition dataset, representative of the diversity among speakers of the language, ii) the unexpected yet key learnings in terms of sociolinguistics, which demonstrate the importance of multi-disciplinarity in teams developing datasets and NLP technologies, and iii) the creation of a test dataset intended to be used for evaluating the performance of Speech Recognition models on demographic groups that are likely to be underrepresented.
2020
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Wilhelmina Nekoto | Vukosi Marivate | Tshinondiwa Matsila | Timi Fasubaa | Taiwo Fagbohungbe | Solomon Oluwole Akinola | Shamsuddeen Muhammad | Salomon Kabongo Kabenamualu | Salomey Osei | Freshia Sackey | Rubungo Andre Niyongabo | Ricky Macharm | Perez Ogayo | Orevaoghene Ahia | Musie Meressa Berhe | Mofetoluwa Adeyemi | Masabata Mokgesi-Selinga | Lawrence Okegbemi | Laura Martinus | Kolawole Tajudeen | Kevin Degila | Kelechi Ogueji | Kathleen Siminyu | Julia Kreutzer | Jason Webster | Jamiil Toure Ali | Jade Abbott | Iroro Orife | Ignatius Ezeani | Idris Abdulkadir Dangana | Herman Kamper | Hady Elsahar | Goodness Duru | Ghollah Kioko | Murhabazi Espoir | Elan van Biljon | Daniel Whitenack | Christopher Onyefuluchi | Chris Chinenye Emezue | Bonaventure F. P. Dossou | Blessing Sibanda | Blessing Bassey | Ayodele Olabiyi | Arshath Ramkilowan | Alp Öktem | Adewale Akinfaderin | Abdallah Bashir
Findings of the Association for Computational Linguistics: EMNLP 2020
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.
AI4D - African Language Dataset Challenge
Kathleen Siminyu | Sackey Freshia
Proceedings of the Fourth Widening Natural Language Processing Workshop
As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and PoS taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, curation and uncovering of African language datasets through a competitive challenge, particularly datasets that are annotated or prepared for use in a downstream NLP task.
Co-authors
- Britone Mwasaru 2
- Rebecca Ryakitimbo 2
- Jade Abbott 1
- Mofetoluwa Adeyemi 1
- Orevaoghene Ahia 1
- Adewale Akinfaderin 1
- Solomon Oluwole Akinola 1
- Jamiil Toure Ali 1
- Abdallah Bashir 1
- Blessing Bassey 1
- Musie Meressa Berhe 1
- Chenai Chair 1
- Idris Abdulkadir Dangana 1
- Kevin Degila 1
- Bonaventure F. P. Dossou 1
- Goodness Duru 1
- Hady Elsahar 1
- Chris Chinenye Emezue 1
- Murhabazi Espoir 1
- Ignatius Ezeani 1
- Taiwo Fagbohungbe 1
- Timi Fasubaa 1
- Sackey Freshia 1
- Salomon Kabongo Kabenamualu 1
- Herman Kamper 1
- Ghollah Kioko 1
- Julia Kreutzer 1
- Ricky Macharm 1
- Mwimbi Makobo Junior 1
- Vukosi Marivate 1
- Laura Martinus 1
- Tshinondiwa Matsila 1
- Kibibi Mohamed Amran 1
- Masabata Mokgesi-Selinga 1
- Shamsuddeen Hassan Muhammad 1
- Abdulrahman Ndegwa Karatu 1
- Wilhelmina Nekoto 1
- Rubungo Andre Niyongabo 1
- Perez Ogayo 1
- Kelechi Ogueji 1
- Lawrence Okegbemi 1
- Ayodele Olabiyi 1
- Christopher Onyefuluchi 1
- Iroro Orife 1
- Salomey Osei 1
- Arshath Ramkilowan 1
- Kathy Reid 1
- Mnata Resani 1
- Freshia Sackey 1
- Blessing Kudzaishe Sibanda 1
- Kolawole Tajudeen 1
- Jason Webster 1
- Daniel Whitenack 1
- Elan van Biljon 1
- Alp Öktem 1