Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Michael Leventhal, Yacouba Diarra, Nouhoum Coulibaly, Panga Azazia Kamaté


Abstract
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
Anthology ID:
2026.africanlp-main.18
Volume:
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Everlyn Asiko Chimoto, Constantine Lignos, Shamsuddeen Muhammad, Idris Abdulmumin, Clemencia Siro, David Ifeoluwa Adelani
Venues:
AfricaNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
190–196
URL:
https://aclanthology.org/2026.africanlp-main.18/
Cite (ACL):
Michael Leventhal, Yacouba Diarra, Nouhoum Coulibaly, and Panga Azazia Kamaté. 2026. Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 190–196, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara (Leventhal et al., AfricaNLP 2026)
PDF:
https://aclanthology.org/2026.africanlp-main.18.pdf