Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Riccardo Bassani, Anders Søgaard, Tejaswini Deoskar


Abstract
Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
Anthology ID:
2021.mrl-1.3
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
MRL
Publisher:
Association for Computational Linguistics
Pages:
32–40
URL:
https://aclanthology.org/2021.mrl-1.3
DOI:
10.18653/v1/2021.mrl-1.3
Cite (ACL):
Riccardo Bassani, Anders Søgaard, and Tejaswini Deoskar. 2021. Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 32–40, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization (Bassani et al., MRL 2021)
PDF:
https://aclanthology.org/2021.mrl-1.3.pdf
Video:
https://aclanthology.org/2021.mrl-1.3.mp4
Data:
TyDi QA