Kelechi Ogueji


2024

Enhancing Alignment using Curriculum Learning & Ranked Preferences
Pulkit Pattnaik | Rishabh Maheshwary | Kelechi Ogueji | Vikas Yadav | Sathwik Tejaswi Madhusudhan
Findings of the Association for Computational Linguistics: EMNLP 2024

Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (one chosen and one rejected response per prompt) to align LLMs to human preferences. In practice, multiple responses of varying quality may exist for a given prompt. We propose to utilize these responses to create multiple preference pairs per prompt. Our work focuses on aligning LLMs by systematically curating multiple preference pairs and presenting them in a meaningful order that facilitates curriculum learning, thereby enhancing the prominent DPO technique. We order the preference pairs from easy to hard according to various criteria, emulating curriculum learning. Our method, referred to as Curri-DPO, consistently shows performance gains on MT-Bench, Vicuna bench, and WizardLM, highlighting its effectiveness over the standard DPO setting that utilizes a single preference pair. More specifically, Curri-DPO achieves a score of 7.43 on MT-Bench with Zephyr-7B, outperforming the majority of existing LLMs of similar parameter size. Curri-DPO also achieves the highest win rates on the Vicuna, WizardLM, and UltraFeedback test sets (90.7%, 87.1%, and 87.9%, respectively) in our experiments, with notable gains of up to 7.5% over standard DPO. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences.
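
The core data step, turning several rated responses to one prompt into an ordered set of preference pairs, can be sketched in a few lines. The snippet below is a minimal illustration that assumes each response already carries a scalar quality score (e.g., from a reward model); the function name, the score-gap difficulty criterion, and the example data are illustrative choices, not the paper's exact curation procedure.

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    score: float  # assumed scalar quality rating, e.g. from a reward model

def build_curriculum_pairs(prompt: str, responses: list[Response]):
    """Create (chosen, rejected) pairs for one prompt and order them easy-to-hard.

    Pairs with a large score gap are treated as "easy" (the preference is
    unambiguous) and come first; near-ties come last. This is one plausible
    difficulty criterion, not necessarily the paper's.
    """
    ranked = sorted(responses, key=lambda r: r.score, reverse=True)
    pairs = []
    for i, chosen in enumerate(ranked):
        for rejected in ranked[i + 1:]:
            gap = chosen.score - rejected.score
            pairs.append({
                "prompt": prompt,
                "chosen": chosen.text,
                "rejected": rejected.text,
                "difficulty": -gap,  # smaller score gap => harder pair
            })
    return sorted(pairs, key=lambda p: p["difficulty"])

if __name__ == "__main__":
    demo = build_curriculum_pairs(
        "Explain gravity to a child.",
        [Response("Good answer", 8.5), Response("OK answer", 6.0), Response("Bad answer", 2.0)],
    )
    for p in demo:
        print(round(-p["difficulty"], 1), p["chosen"], ">", p["rejected"])
```

The resulting pairs would then be presented to a standard DPO trainer in this order (easiest first), which is the curriculum aspect the method builds on.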

2022

AfriTeVA: Extending “Small Data” Pretraining Approaches to Sequence-to-Sequence Models
Odunayo Jude Ogundepo | Akintunde Oladipo | Mofetoluwa Adeyemi | Kelechi Ogueji | Jimmy Lin
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Pretrained language models represent the state of the art in NLP, but the successful construction of such models often requires large amounts of data and computational resources. Thus, the paucity of data for low-resource languages impedes the development of robust NLP capabilities for these languages. There has been some recent success in pretraining encoder-only models solely on a combination of low-resource African languages, exemplified by AfriBERTa. In this work, we extend the approach of “small data” pretraining to encoder–decoder models. We introduce AfriTeVa, a family of sequence-to-sequence models derived from T5 that are pretrained on 10 African languages from scratch. With a pretraining corpus of only around 1GB, we show that it is possible to achieve competitive downstream effectiveness for machine translation and text classification, compared to larger models trained on much more data. All the code and model checkpoints described in this work are publicly available at https://github.com/castorini/afriteva.
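
For readers who want to try such a checkpoint, the sketch below shows the usual way a T5-derived sequence-to-sequence model is loaded with Hugging Face Transformers. The model identifier and the task prefix are assumptions made for illustration; consult the linked repository for the actual released checkpoint names.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative identifier; see https://github.com/castorini/afriteva for the released checkpoints.
MODEL_ID = "castorini/afriteva_base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Encode an input and generate, as with any T5-style seq2seq model.
inputs = tokenizer("translate English to Yoruba: Good morning.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```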

Intriguing Properties of Compression on Multilingual Models
Kelechi Ogueji | Orevaoghene Ahia | Gbemileke Onilude | Sebastian Gehrmann | Sara Hooker | Julia Kreutzer
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real-world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning. Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties. In contrast to prior findings, we find that compression may improve model robustness over dense models. We additionally observe that under certain sparsification regimes, compression may aid, rather than disproportionately impact, the performance of low-resource languages.
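
As a concrete illustration of the kind of sparsification studied here, the snippet below applies global unstructured magnitude pruning to the linear layers of a toy PyTorch model. It is a minimal sketch of the general technique under assumed settings (the 90% sparsity level and the toy architecture), not the paper's experimental framework.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a transformer encoder's feed-forward linear layers.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Prune weights jointly across all linear layers.
parameters_to_prune = [
    (module, "weight") for module in model.modules() if isinstance(module, nn.Linear)
]

# Global L1 (magnitude) pruning: zero out the 90% smallest weights model-wide.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.9,
)

# Make the pruning masks permanent and report the resulting weight sparsity.
for module, name in parameters_to_prune:
    prune.remove(module, name)

weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
zeros = sum((w == 0).sum().item() for w in weights)
total = sum(w.numel() for w in weights)
print(f"Weight sparsity: {zeros / total:.1%}")
```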

What a Creole Wants, What a Creole Needs
Heather Lent | Kelechi Ogueji | Miryam de Lhoneux | Orevaoghene Ahia | Anders Søgaard
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a group of low-resource languages, creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma, despite these languages having sizable and vibrant communities. We demonstrate, through conversations with creole experts and surveys of creole-speaking communities, how the things needed from language technology can change dramatically from one language to another, even when the languages are considered to be very similar to each other, as with creoles. We discuss the prominent themes arising from these conversations, and ultimately demonstrate that useful language technology cannot be built without involving the relevant community.

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
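
To make the idea of supplementing a human audit with automatic analyses concrete, the sketch below runs a few simple heuristic checks (empty lines, exact duplicates, lines dominated by digits or punctuation) over a corpus sample. These heuristics are illustrative stand-ins, not the audit criteria used in the paper.

```python
import re
from collections import Counter

def audit_sample(lines):
    """Count common quality problems in a sample of a web-mined corpus."""
    stats = Counter()
    seen = set()
    for line in lines:
        text = line.strip()
        if not text:
            stats["empty"] += 1
            continue
        if text in seen:
            stats["duplicate"] += 1
        seen.add(text)
        letters = len(re.findall(r"[^\W\d_]", text))
        if letters / len(text) < 0.5:
            stats["mostly_non_letters"] += 1  # menus, URLs, number lists, etc.
        if len(text.split()) < 3:
            stats["very_short"] += 1
    return stats

print(audit_sample(["Bonjour le monde", "123 / 456 / 789", "", "Bonjour le monde"]))
```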

2021

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages
Kelechi Ogueji | Yuxin Zhu | Jimmy Lin
Proceedings of the 1st Workshop on Multilingual Representation Learning

Pretrained multilingual language models have been shown to work well on many languages for a variety of downstream NLP tasks. However, these models are known to require a lot of training data. This consequently leaves out a huge percentage of the world’s languages as they are under-resourced. Furthermore, a major motivation behind these models is that lower-resource languages benefit from joint training with higher-resource languages. In this work, we challenge this assumption and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than 1 GB of text. Our model, named AfriBERTa, covers 11 African languages, including the first language model for 4 of these languages. Evaluations on named entity recognition and text classification spanning 10 languages show that our model outperforms mBERT and XLM-R in several languages and is very competitive overall. Results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Code, data and models are released at https://github.com/keleog/afriberta.
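
A minimal sketch of querying such a masked language model through the Transformers fill-mask pipeline follows. The model identifier and the Hausa example sentence are illustrative assumptions; the released checkpoint names are listed in the linked repository.

```python
from transformers import pipeline

# Illustrative identifier; see https://github.com/keleog/afriberta for the released checkpoints.
fill_mask = pipeline("fill-mask", model="castorini/afriberta_large")

# Ask the tokenizer for its mask token rather than hard-coding one.
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"Ina so in tafi {mask} gobe.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```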

MasakhaNER: Named Entity Recognition for African Languages
David Ifeoluwa Adelani | Jade Abbott | Graham Neubig | Daniel D’souza | Julia Kreutzer | Constantine Lignos | Chester Palen-Michel | Happy Buzaaba | Shruti Rijhwani | Sebastian Ruder | Stephen Mayhew | Israel Abebe Azime | Shamsuddeen H. Muhammad | Chris Chinenye Emezue | Joyce Nakatumba-Nabende | Perez Ogayo | Aremu Anuoluwapo | Catherine Gitau | Derguene Mbaye | Jesujoba Alabi | Seid Muhie Yimam | Tajuddeen Rabiu Gwadabe | Ignatius Ezeani | Rubungo Andre Niyongabo | Jonathan Mukiibi | Verrah Otiende | Iroro Orife | Davis David | Samba Ngom | Tosin Adewumi | Paul Rayson | Mofetoluwa Adeyemi | Gerald Muriuki | Emmanuel Anebi | Chiamaka Chukwuneke | Nkiruka Odu | Eric Peter Wairagala | Samuel Oyerinde | Clemencia Siro | Tobius Saul Bateesa | Temilola Oloyede | Yvonne Wambui | Victor Akinode | Deborah Nabagereka | Maurice Katusiime | Ayodele Awokoya | Mouhamadane MBOUP | Dibora Gebreyohannes | Henok Tilaye | Kelechi Nwaike | Degaga Wolde | Abdoulaye Faye | Blessing Sibanda | Orevaoghene Ahia | Bonaventure F. P. Dossou | Kelechi Ogueji | Thierno Ibrahima DIOP | Abdoulaye Diallo | Adewale Akinfaderin | Tendai Marengereke | Salomey Osei
Transactions of the Association for Computational Linguistics, Volume 9

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.
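
For experimentation, the corpus is commonly loaded through the Hugging Face datasets library, sketched below. The dataset and configuration names ("masakhaner", "yor") are assumptions based on the public release; check the dataset card for the exact list of language configurations.

```python
from datasets import load_dataset

# One configuration per language; "yor" (Yorùbá) is used here as an assumed example.
masakhaner = load_dataset("masakhaner", "yor")

example = masakhaner["train"][0]
print(example["tokens"])    # whitespace-tokenized words
print(example["ner_tags"])  # integer labels following the BIO scheme
```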

2020

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Wilhelmina Nekoto | Vukosi Marivate | Tshinondiwa Matsila | Timi Fasubaa | Taiwo Fagbohungbe | Solomon Oluwole Akinola | Shamsuddeen Muhammad | Salomon Kabongo Kabenamualu | Salomey Osei | Freshia Sackey | Rubungo Andre Niyongabo | Ricky Macharm | Perez Ogayo | Orevaoghene Ahia | Musie Meressa Berhe | Mofetoluwa Adeyemi | Masabata Mokgesi-Selinga | Lawrence Okegbemi | Laura Martinus | Kolawole Tajudeen | Kevin Degila | Kelechi Ogueji | Kathleen Siminyu | Julia Kreutzer | Jason Webster | Jamiil Toure Ali | Jade Abbott | Iroro Orife | Ignatius Ezeani | Idris Abdulkadir Dangana | Herman Kamper | Hady Elsahar | Goodness Duru | Ghollah Kioko | Murhabazi Espoir | Elan van Biljon | Daniel Whitenack | Christopher Onyefuluchi | Chris Chinenye Emezue | Bonaventure F. P. Dossou | Blessing Sibanda | Blessing Bassey | Ayodele Olabiyi | Arshath Ramkilowan | Alp Öktem | Adewale Akinfaderin | Abdallah Bashir
Findings of the Association for Computational Linguistics: EMNLP 2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.