Felermino D. M. A. Ali
2024
Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks
Felermino D. M. A. Ali | Henrique Lopes Cardoso | Rui Sousa-Silva
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique’s most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types (originally clean text, post-corrected OCR, and back-translated data) and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and the potential for future improvements. All models, code, and datasets are available in the https://huggingface.co/LIACC repository under the CC BY 4.0 license.
Network-based Approach for Stopwords Detection
Felermino D. M. A. Ali | Gabriel de Jesus | Henrique Lopes Cardoso | Sérgio Nunes | Rui Sousa-Silva
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2