Are Language-Agnostic Sentence Representations Actually Language-Agnostic?

Yu Chen, Tania Avgustinova


Abstract
With the emergence of pre-trained multilingual models, multilingual embeddings have been widely applied in various natural language processing tasks. Language-agnostic models provide a versatile way to convert linguistic units from different languages into a shared vector representation space. Previous work on multilingual sentence embeddings has reportedly reached low error rates in cross-lingual similarity search tasks. In this paper, we apply pre-trained embedding models to the cross-lingual similarity search task in diverse scenarios and observe large discrepancies in results compared to the original paper. Our findings on cross-lingual similarity search with newly constructed multilingual datasets show not only a correlation with observable language similarities but also a strong influence from factors such as translation paths, which limits the interpretation of the language-agnostic property of the LASER model.
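The cross-lingual similarity search task evaluated in the paper can be sketched as follows: for each source-language sentence embedding, retrieve the nearest target-language embedding by cosine similarity, and count a retrieval as an error when it is not the aligned translation. This is a minimal illustration with synthetic stand-in vectors, not LASER's actual evaluation code; the function name and toy data are assumptions for the example.

```python
import numpy as np

def similarity_search_error_rate(src, tgt):
    """For each row of src, retrieve the most cosine-similar row of tgt;
    an error is any retrieval that is not the same-index (aligned)
    translation. Returns the error rate in [0, 1]."""
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src))))

# Toy stand-in embeddings: row i of src and tgt play the role of a
# sentence and its translation in a shared embedding space.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(5, 8))
src = tgt + 0.01 * rng.normal(size=(5, 8))  # near-identical pairs
print(similarity_search_error_rate(src, tgt))  # 0.0 for this toy data
```

In the paper's setting, `src` and `tgt` would hold LASER embeddings of parallel sentences in two languages, and the reported error rates come from exactly this nearest-neighbour retrieval.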
Anthology ID:
2021.ranlp-1.32
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Note:
Pages:
274–280
URL:
https://aclanthology.org/2021.ranlp-1.32
Cite (ACL):
Yu Chen and Tania Avgustinova. 2021. Are Language-Agnostic Sentence Representations Actually Language-Agnostic?. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 274–280, Held Online. INCOMA Ltd.
Cite (Informal):
Are Language-Agnostic Sentence Representations Actually Language-Agnostic? (Chen & Avgustinova, RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-1.32.pdf