The Effects of Corpus Choice and Morphosyntax on Multilingual Space Induction

Vinit Ravishankar, Joakim Nivre


Abstract
To study the inductive biases of language models, numerous studies have used linguistically motivated tasks as a proxy, on the assumption that strong performance on such a task implies an inductive bias towards the corresponding linguistic phenomenon. In this study, we analyse the inductive biases of language models with respect to natural language phenomena in the context of building multilingual embedding spaces. We sample corpora from two sources in 15 languages and train language models on pseudo-bilingual variants of each corpus, created by duplicating each corpus and shifting the token indices in one half of the resulting corpus. We evaluate the cross-lingual capabilities of these LMs and show that, while correlations with language families tend to be weak, other corpus-level characteristics, such as type-token ratio, tend to be more strongly correlated with cross-lingual performance. Finally, we show that multilingual spaces can be built, albeit less effectively, even when additional destructive perturbations are applied to the training corpora, implying that models that are effectively bag-of-words also have an inductive bias sufficient for inducing multilingual spaces.
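The pseudo-bilingual construction and the type-token ratio mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that each corpus is duplicated and that token indices are shifted for half of the result; the specific choices below (shifting by the vocabulary size, appending the shifted copy after the original, and the function names `make_pseudo_bilingual` and `type_token_ratio`) are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Corpus = List[List[int]]  # a corpus as a list of sentences of token ids


def make_pseudo_bilingual(corpus: Corpus, vocab_size: int) -> Corpus:
    """Duplicate the corpus and shift the token ids of the copy.

    Assumption: the shift equals the vocabulary size, so the copy uses a
    disjoint id range and behaves like a second "language" with the same
    structure as the original.
    """
    shifted = [[tok + vocab_size for tok in sent] for sent in corpus]
    # Assumption: the shifted copy is simply appended after the original.
    return corpus + shifted


def type_token_ratio(corpus: Corpus) -> float:
    """Number of distinct token ids divided by the total number of tokens."""
    tokens = [tok for sent in corpus for tok in sent]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy corpus with a vocabulary of four token ids (0..3).
corpus = [[0, 1, 2], [2, 3, 1, 0]]
bilingual = make_pseudo_bilingual(corpus, vocab_size=4)
ttr = type_token_ratio(corpus)
```

Under these assumptions the two halves of `bilingual` share no token ids, so any alignment a model learns between them has to come from distributional structure rather than from shared vocabulary.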
Anthology ID:
2022.findings-emnlp.304
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4130–4139
URL:
https://aclanthology.org/2022.findings-emnlp.304
DOI:
10.18653/v1/2022.findings-emnlp.304
Cite (ACL):
Vinit Ravishankar and Joakim Nivre. 2022. The Effects of Corpus Choice and Morphosyntax on Multilingual Space Induction. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4130–4139, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
The Effects of Corpus Choice and Morphosyntax on Multilingual Space Induction (Ravishankar & Nivre, Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.304.pdf