Colex2Lang: Language Embeddings from Semantic Typology

Yiyi Chen, Russa Biswas, Johannes Bjerva


Abstract
In semantic typology, colexification refers to words with multiple meanings, either related (polysemy) or unrelated (homophony). Studies of cross-linguistic colexification have yielded insights into, e.g., psychology, historical linguistics and cognitive science (Xu et al., 2020; Brochhagen and Boleda, 2022; Schapper and Koptjevskaja-Tamm, 2022). While NLP research so far has mainly focused on integrating syntactic typology (Naseem et al., 2012; Ponti et al., 2019; Chaudhary et al., 2019; Üstün et al., 2020; Ansell et al., 2021; Oncevay et al., 2022), we here investigate the potential of incorporating semantic typology, of which colexification is an example. We propose a framework for constructing a large-scale synset graph and learning language representations with node embedding algorithms. We demonstrate that cross-lingual colexification patterns provide a distinct signal for modelling language similarity and predicting typological features. Our representations achieve a 9.97% performance gain in predicting lexico-semantic typological features and, as expected, contain a weaker syntactic signal. This study is the first attempt to learn language representations and model language similarities using semantic typology at a large scale, setting a new direction for multilingual NLP, especially for low-resource languages.
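As a rough illustration of the idea of a colexification graph, the following minimal Python sketch builds concept-pair edges from a toy lexicon and compares languages by the overlap of their colexification patterns. Everything here is an assumption made for illustration: the lexicon, the language names (`lang_a`, `lang_b`, `lang_c`), and the helper functions are hypothetical, and a plain Jaccard overlap stands in for the node embedding algorithms mentioned in the abstract; it is not the authors' implementation.

```python
from itertools import combinations

# Hypothetical toy lexicon: language -> word -> set of concepts the word expresses.
# These entries are illustrative only and are not taken from the paper's data.
TOY_LEXICON = {
    "lang_a": {"w1": {"TREE", "WOOD"}, "w2": {"HAND", "ARM"}, "w3": {"MOON"}},
    "lang_b": {"w1": {"TREE", "WOOD"}, "w2": {"HAND"}, "w3": {"MOON", "MONTH"}},
    "lang_c": {"w1": {"HAND", "ARM"}, "w2": {"MOON", "MONTH"}, "w3": {"TREE"}},
}


def colexifications(words):
    """Return the set of unordered concept pairs that share a single word."""
    pairs = set()
    for concepts in words.values():
        # Sorting makes each pair canonical, e.g. ("ARM", "HAND").
        for a, b in combinations(sorted(concepts), 2):
            pairs.add((a, b))
    return pairs


def build_graph(lexicon):
    """Map each colexified concept pair (an edge) to the languages attesting it."""
    edges = {}
    for lang, words in lexicon.items():
        for pair in colexifications(words):
            edges.setdefault(pair, set()).add(lang)
    return edges


def jaccard(lexicon, lang_x, lang_y):
    """Jaccard overlap of two languages' colexification patterns (a stand-in
    for similarity between learned language embeddings)."""
    x = colexifications(lexicon[lang_x])
    y = colexifications(lexicon[lang_y])
    union = x | y
    return len(x & y) / len(union) if union else 0.0


if __name__ == "__main__":
    graph = build_graph(TOY_LEXICON)
    for pair, langs in sorted(graph.items()):
        print(pair, sorted(langs))
    print(jaccard(TOY_LEXICON, "lang_a", "lang_b"))
```

In this toy setting, two languages count as similar when they colexify the same concept pairs; a learned embedding over the full graph would replace the set overlap with vector similarity.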
Anthology ID:
2023.nodalida-1.67
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
673–684
URL:
https://aclanthology.org/2023.nodalida-1.67
Cite (ACL):
Yiyi Chen, Russa Biswas, and Johannes Bjerva. 2023. Colex2Lang: Language Embeddings from Semantic Typology. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 673–684, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Colex2Lang: Language Embeddings from Semantic Typology (Chen et al., NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.67.pdf