Matimba Shingange


2024

Correcting FLORES Evaluation Dataset for Four African Languages
Idris Abdulmumin | Sthembiso Mkhwanazi | Mahlatse Mbooi | Shamsuddeen Hassan Muhammad | Ibrahim Said Ahmad | Neo Putini | Miehleketo Mathebula | Matimba Shingange | Tajuddeen Gwadabe | Vukosi Marivate
Proceedings of the Ninth Conference on Machine Translation

This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the dataset’s overall quality and reliability. For each language, we provide a concise summary of the errors encountered and corrected and also present some statistical analysis that measures the difference between the existing and corrected datasets. We believe that our corrections enhance the linguistic accuracy and reliability of the data and, thereby, contribute to a more effective evaluation of NLP tasks involving the four African languages. Finally, we recommend that future translation efforts, particularly in low-resource languages, prioritize the active involvement of native speakers at every stage of the process to ensure linguistic accuracy and cultural relevance.

2023

Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora
Richard Lastrucci | Jenalea Rajab | Matimba Shingange | Daniel Njini | Vukosi Marivate
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

This paper introduces two multilingual government-themed corpora in various South African languages. The corpora were collected by gathering South African government speeches (ZA-gov-multilingual), as well as the South African Government newspaper (Vuk’uzenzele), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper, we highlight the process of gathering, cleaning, and making available the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences, we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.