Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings

Réka Cserháti, Gábor Berend


Abstract
In this work, we analyze the performance and properties of cross-lingual word embedding models created by mapping-based alignment methods. We use several measures of corpus and embedding similarity to predict BLI scores of cross-lingual embedding mappings over three types of corpora, three embedding methods, and 55 language pairs. Our experimental results corroborate that the amount of common content in the training corpora, rather than mere size, is essential. This phenomenon manifests in two ways: i) despite the smaller corpus sizes, using only the comparable parts of Wikipedia for training the monolingual embedding spaces to be mapped is often more efficient than relying on all the contents of Wikipedia, and ii) the smaller, and thus less diversified, Spanish Wikipedia almost always works much better as a training corpus for bilingual mappings than the ubiquitously used English Wikipedia.
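The mapping-based alignment and BLI evaluation mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the paper's experimental setup: it uses toy random vectors in place of trained embeddings, a standard orthogonal Procrustes solution for the mapping, and precision@1 by cosine nearest neighbor for BLI. All names and the seed-dictionary construction are illustrative assumptions.

```python
# Illustrative sketch only: toy data, not the paper's corpora or embeddings.
import numpy as np

rng = np.random.default_rng(0)

# Toy "source" embeddings; the "target" space is a rotated, slightly noisy copy.
n, d = 200, 50
X = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation matrix
Y = X @ Q + 0.01 * rng.standard_normal((n, d))

# Hypothetical seed dictionary: the first 100 word pairs align by index.
seed = 100
U, _, Vt = np.linalg.svd(X[:seed].T @ Y[:seed])
W = U @ Vt  # orthogonal map minimizing ||X W - Y||_F over the seed pairs

def normalize(M):
    """Row-normalize so dot products become cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# BLI: map held-out source words, retrieve nearest target by cosine,
# and score precision@1 against the index-aligned gold translations.
mapped = normalize(X[seed:] @ W)
targets = normalize(Y)
preds = (mapped @ targets.T).argmax(axis=1)
p_at_1 = (preds == np.arange(seed, n)).mean()
print(f"precision@1: {p_at_1:.2f}")
```

With near-isometric toy spaces the mapping recovers the rotation almost exactly; with real corpora of differing content, the spaces are less isometric, which is where the paper's corpus-overlap measures become predictive.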
Anthology ID:
2021.mrl-1.9
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
MRL
Publisher:
Association for Computational Linguistics
Pages:
96–106
URL:
https://aclanthology.org/2021.mrl-1.9
DOI:
10.18653/v1/2021.mrl-1.9
Cite (ACL):
Réka Cserháti and Gábor Berend. 2021. Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 96–106, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings (Cserháti & Berend, MRL 2021)
PDF:
https://aclanthology.org/2021.mrl-1.9.pdf
Video:
https://aclanthology.org/2021.mrl-1.9.mp4
Data
WikiMatrix