Seydou Diallo
2026
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Mamadou K. Keita | Sebastien Diarra | Christopher M Homan | Seydou Diallo
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Supporting effective text generation and chat interfaces for low-resource languages (LRLs) remains a challenge for state-of-the-art large language models (LLMs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer that uses retrieval-augmented generation (RAG) with n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU for task definition, we used InstructLR to create three multi-domain instruction benchmarks: **ZarmaInstruct-50k**, **BambaraInstruct-50k**, and **FulfuldeInstruct-50k**.
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo | Yacouba Diarra | Panga Azazia Kamaté | Aboubacar Ouattara | Mamadou K. Keita | Adam Bouno Kampo
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, built from one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards: the best Word Error Rate (WER) was 46.76%, the best Character Error Rate (CER), 13.00%, was achieved by a different model, and several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario, covering the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound on performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
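The two metrics in this abstract can be illustrated with a minimal sketch. It assumes the standard definitions (edit distance between hypothesis and reference, divided by the reference length, counted in words for WER and in characters for CER); the benchmark's actual scoring and text normalization are not described here, and the function names and placeholder tokens below are hypothetical. The sketch also shows why WER can exceed 100%: inserted (hallucinated) words count as errors, but the denominator stays fixed at the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder tokens, not real Bambara: one substitution plus four inserted
# words against a three-word reference gives 5 errors / 3 words.
print(wer("a b c", "a x c d e f g"))
```

With these placeholder inputs the rate is 5/3, about 167%, which is the kind of figure a heavily hallucinating model can produce.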
2025
GAIfE: Using GenAI to Improve Literacy in Low-resourced Settings
Allahsera Auguste Tapo | Nouhoum Coulibaly | Seydou Diallo | Sebastien Diarra | Christopher M Homan | Mamadou K. Keita | Michael Leventhal
Findings of the Association for Computational Linguistics: NAACL 2025
Illiteracy is a predictor of many negative social and personal outcomes. Illiteracy rates are particularly high in countries with underresourced languages, where few books exist that are suitable for children to learn to read from. We present GAIfE (Generative AI for Education), a toolchain and workflow developed through empirical methods that demonstrates how existing tools can be adapted to address low literacy for an underresourced language. We used GAIfE (a play on the Bambara word for “book”) to construct materials for developing children’s reading competence in Bambara, the vehicular language of Mali. Our approach to generating content, and to editing it after generation to correct the skew introduced by the Global-North-centric bias of available LLMs, enabled us to rapidly increase the amount of Bambara content available online tenfold. We did so while maintaining high standards for the material's attractiveness (to sustain engagement), for its accurate representation of Malian culture and of the physical and social environment, and for language quality. Using our materials, pilot reading programs achieved a 67% reduction in the number of children unable to read Bambara. Our approach demonstrates the power of bias-aware application of generative AI to this problem domain, as well as the potential impact this technology could have on reducing illiteracy and improving learning outcomes through native-language education.