Michael Samwel Mollel


2026

Despite remarkable progress in multilingual machine translation (MT), the majority of African—especially East African—languages remain significantly underrepresented in both benchmark datasets and state-of-the-art (SOTA) MT models. This persistent exclusion from mainstream technologies not only limits equitable access but also constrains the development of tools that accurately reflect the region’s linguistic and cultural diversity. Recent advances in open-source large language models have demonstrated strong multilingual MT capabilities through data-efficient adaptation strategies, yet little work has explored their potential for low-resource African languages. We introduce AfriMMT-EA, the first highly multilingual benchmark and MT dataset for East African languages. Our dataset comprises 54 local languages across five East African countries. We use these data to fine-tune two multilingual versions of Gemma-3 and compare the models’ performance on these languages with that of larger off-the-shelf baselines. We release our data and models in the interest of advancing MT for these low-resource languages and their communities.
Automatic Speech Recognition (ASR) systems are gaining increasing attention in both academia and industry. Despite their remarkable performance in high-resource languages, their efficacy is less pronounced in low-resource settings. We present the first ASR system for Sukuma, one of the most severely under-resourced Tanzanian languages, and provide an open-source Sukuma speech corpus comprising 7.47 hours of carefully transcribed audio. The data, sourced primarily from Bible readings, was rigorously annotated to ensure phonetic and orthographic consistency, making it the most linguistically reliable resource currently available for the Sukuma language. To establish baselines, we train lightweight ASR and Text-to-Speech (TTS) models that demonstrate the feasibility of building end-to-end speech systems for this underrepresented language. This work addresses the challenges of developing language and communication tools for speakers of less-represented languages, particularly the scarcity of representative datasets and benchmarks, and highlights future research directions for linguistically challenging languages such as Sukuma. We make our data and code publicly available to facilitate reproducibility and further research.

2025

Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robotics, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with that on native accents, and find a performance degradation of more than 10%. Additionally, we explore the medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.