Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only

Jinliang Lu, Yu Lu, Jiajun Zhang


Abstract
Recent studies have revealed the remarkable cross-lingual capability of multilingual pre-trained language models (mPLMs), even when pre-trained without parallel corpora (mono-mPLMs). Intuitively, semantic alignments may underlie this capability, but they remain under-explored. In this work, we investigate the alignment properties of mono-mPLMs from the token perspective and find that the alignments correspond to the geometric similarity of the embedding spaces across languages. However, mono-mPLMs tend to damage this geometric similarity at higher layers due to the lack of cross-lingual interactions, which limits their cross-lingual transfer capability. To address this issue, we introduce token-level and semantic-level code-switched masked language modeling, which employs self-induced token alignments to explicitly improve cross-lingual interactions over the layers of mono-mPLMs without relying on parallel sentences. We evaluate our method on various natural language understanding tasks and unsupervised machine translation tasks. The results demonstrate that our method outperforms strong baselines and achieves performance comparable to mPLMs trained with parallel corpora.
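
To make the idea of code-switched masked language modeling with self-induced alignments concrete, the following is a minimal, hypothetical sketch (not the paper's exact procedure): token alignments are induced from cosine similarity between the embedding tables of two languages, and a fraction of tokens in a monolingual sentence is then substituted with their aligned counterparts before masking. Function names, the nearest-neighbor alignment rule, and the substitution ratio are illustrative assumptions.

import numpy as np

def induce_alignments(src_emb, tgt_emb, src_vocab, tgt_vocab):
    # Normalize embeddings and align each source token to its most
    # similar target token by cosine similarity (illustrative rule only).
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                    # (|V_src|, |V_tgt|) cosine similarities
    nearest = sim.argmax(axis=1)         # index of the closest target token
    return {src_vocab[i]: tgt_vocab[j] for i, j in enumerate(nearest)}

def code_switch(tokens, alignments, ratio=0.15, rng=None):
    # Replace a fraction of tokens with their aligned counterparts to build
    # a code-switched sentence for masked language modeling.
    rng = rng or np.random.default_rng(0)
    switched = []
    for tok in tokens:
        if tok in alignments and rng.random() < ratio:
            switched.append(alignments[tok])
        else:
            switched.append(tok)
    return switched

In this sketch, the code-switched sentence would then be fed to the standard masked language modeling objective, so cross-lingual interactions are introduced without any parallel sentences.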
Anthology ID:
2023.findings-emnlp.190
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2891–2907
URL:
https://aclanthology.org/2023.findings-emnlp.190
DOI:
10.18653/v1/2023.findings-emnlp.190
Cite (ACL):
Jinliang Lu, Yu Lu, and Jiajun Zhang. 2023. Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2891–2907, Singapore. Association for Computational Linguistics.
Cite (Informal):
Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only (Lu et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.190.pdf