A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann


Abstract
Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at https://github.com/Helsinki-NLP/lm-vs-mt.
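The abstract mentions evaluating downstream performance in probing and fine-tuning scenarios. The snippet below is a minimal, generic sketch of a linear-probing setup over a frozen multilingual encoder, not the authors' released code; the checkpoint name and the toy task data are placeholders chosen only for illustration.

```python
# Hypothetical probing sketch: only the linear classifier is trained,
# the multilingual encoder stays frozen. Checkpoint and data are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

checkpoint = "xlm-roberta-base"  # placeholder multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
encoder.eval()  # frozen: no gradient updates during probing

def embed(sentences):
    """Mean-pool the last hidden states of the frozen encoder."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (batch, dim)

# Toy sentiment-style probe over two languages; real probing tasks would
# use the evaluation datasets described in the paper.
train_texts = ["This film was wonderful.", "This film was terrible."]
train_labels = [1, 0]
probe = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
print(probe.predict(embed(["An excellent movie."])))
```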
Anthology ID: 2024.emnlp-main.888
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15882–15894
URL: https://aclanthology.org/2024.emnlp-main.888
DOI: 10.18653/v1/2024.emnlp-main.888
Cite (ACL): Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, and Jörg Tiedemann. 2024. A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15882–15894, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives (Li et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.888.pdf