Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks

Felermino D. M. A. Ali, Henrique Lopes Cardoso, Rui Sousa-Silva


Abstract
This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique’s most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types (originally clean text, post-corrected OCR, and back-translated data) and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and the potential for future improvements. All models, code, and datasets are available in the https://huggingface.co/LIACC repository under the CC BY 4.0 license.
Anthology ID:
2024.emnlp-main.824
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14842–14857
URL:
https://aclanthology.org/2024.emnlp-main.824/
DOI:
10.18653/v1/2024.emnlp-main.824
Cite (ACL):
Felermino D. M. A. Ali, Henrique Lopes Cardoso, and Rui Sousa-Silva. 2024. Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14842–14857, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks (Ali et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.824.pdf