Emmanuel Bolarinwa


2026

Text-to-Speech (TTS) technology has the potential to improve exam accessibility for visually impaired learners, but existing systems often underperform in underrepresented languages such as Yoruba. This study evaluates how well current Yoruba TTS models deliver standardized exam content to five visually impaired students through a web-based interface. Before testing, four Yoruba TTS systems were compared; only Facebook’s mms-tts-yor and YarnGPT produced intelligible Yoruba speech. Students received exam questions in three formats: human voice, Braille, and TTS. All preferred Braille for its clarity and independence, and some valued human narration, while TTS was the least favored because its output sounded robotic and unclear. These results reveal a significant gap between TTS capabilities and the needs of users in low-resource languages. The paper highlights the urgency of developing tone-aware, user-centered TTS solutions to ensure equitable access to digital education for visually impaired speakers of underrepresented languages.

2025

Code-switching (CS) presents a significant challenge for Automatic Speech Recognition (ASR) systems, particularly in low-resource settings. While multilingual ASR models like OpenAI Whisper Large v3 are designed to handle multiple languages, their high computational demands make them less practical for real-world deployment in resource-constrained environments. In this study, we investigate the effectiveness of fine-tuning both monolingual and multilingual ASR models for Yoruba-English CS speech. Our results show that unadapted monolingual ASR models outperform Whisper Large v3 in a zero-shot setting on CS speech. Fine-tuning significantly reduces word error rate (WER) for both monolingual and multilingual models, with monolingual models achieving a WER reduction of over 20% on CS and Yoruba speech while maintaining lower computational costs. However, we observe a trade-off, as fine-tuning leads to some degradation in English recognition, particularly for multilingual models. Our findings highlight that while multilingual models benefit from fine-tuning, monolingual models provide a computationally efficient and competitive alternative for CS-ASR, making them a viable choice for resource-constrained environments.
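For readers unfamiliar with the metric, the sketch below shows how WER is conventionally computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. It is a minimal illustration, not the evaluation code used in the study. The function names (`edit_distance`, `wer`, `relative_reduction`) are hypothetical. The abstract does not say whether the 20% figure is an absolute or a relative reduction, so the relative version here is only one possible reading.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    # prev[j] holds the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, tuned: float) -> float:
    """Relative WER reduction (assumed interpretation, see lead-in)."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline - tuned) / baseline
```

With the placeholder strings `"a b c d"` as reference and `"a x c"` as hypothesis, one substitution and one deletion against four reference words give a WER of 0.5. Whitespace tokenization is a simplification; real evaluations usually normalize case, punctuation, and diacritics first, which matters for a tonal language such as Yoruba.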

Despite rapid advancements in multimodal large language models (MLLMs), their ability to process low-resource African languages in document-based visual question answering (VQA) tasks remains limited. This paper evaluates three state-of-the-art MLLMs (GPT-4o, Claude-3.5 Haiku, and Gemini-1.5 Pro) on WAEC/NECO standardized exam questions in Yoruba, Igbo, and Hausa. We curate a dataset of multiple-choice questions from exam images and compare model accuracies across two prompting strategies: (1) using English prompts for African language questions, and (2) using native-language prompts. While GPT-4o achieves over 90% accuracy for English, performance drops below 40% for African languages, highlighting severe data imbalance in model training. Notably, native-language prompting improves accuracy for most models, yet no system approaches human-level performance, which reaches over 50% in Yoruba, Igbo, and Hausa. These findings emphasize the need for diverse training data, fine-tuning, and dedicated benchmarks that address the linguistic intricacies of African languages in multimodal tasks, paving the way for more equitable and effective AI systems in education.
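The comparison described above amounts to computing multiple-choice accuracy separately for each language and prompting strategy. A minimal sketch of that bookkeeping follows. The `Record` fields, the condition labels, and the function name are all hypothetical and are not taken from the paper's code or dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    """One model answer to one exam question (hypothetical schema)."""
    language: str      # e.g. "yoruba", "igbo", "hausa"
    prompt_style: str  # e.g. "english_prompt" or "native_prompt"
    predicted: str     # option letter returned by the model
    gold: str          # correct option letter


def accuracy_by_condition(
    records: Iterable[Record],
) -> Dict[Tuple[str, str], float]:
    """Accuracy for each (language, prompt_style) pair."""
    correct: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for rec in records:
        key = (rec.language, rec.prompt_style)
        total[key] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if rec.predicted.strip().upper() == rec.gold.strip().upper():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}
```

Keeping the two prompting strategies as separate keys, rather than pooling them, is what allows the English-prompt and native-prompt accuracies to be compared per language.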