IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

Kelly Marchisio, Neha Verma, Kevin Duh, Philipp Koehn


Abstract
The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces—their degree of “isomorphism.” We address the root cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into the skipgram loss function, successfully increasing the relative isomorphism of trained word embedding spaces and improving their ability to be mapped to a shared cross-lingual space. The result is improved bilingual lexicon induction in general data conditions, under domain mismatch, and with training algorithm dissimilarities. We release IsoVec at https://github.com/kellymarchisio/isovec.
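The abstract describes adding a global isomorphism measure to the skipgram objective. As a minimal sketch of the idea—not the paper's actual implementation—one such measure is the orthogonal Procrustes loss over a seed dictionary: the residual after optimally rotating the source embeddings onto the target embeddings. The function names, the choice of Procrustes loss, and the weighting scheme below are illustrative assumptions.

```python
import numpy as np

def procrustes_iso_loss(X, Y):
    """Global isomorphism measure (illustrative): orthogonal Procrustes loss.

    X, Y: (n, d) arrays of embeddings for n seed translation pairs.
    Solves W* = argmin_{W orthogonal} ||XW - Y||_F via SVD of X^T Y,
    then returns the squared residual ||XW* - Y||_F^2. A smaller value
    means the two spaces are closer to isomorphic (up to rotation).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt  # optimal orthogonal map from X's space to Y's space
    return float(np.linalg.norm(X @ W - Y) ** 2)

def combined_loss(skipgram_loss, X, Y, lam=0.5):
    """Hypothetical combined objective: skipgram loss plus a weighted
    isomorphism penalty, in the spirit of training embeddings to be
    mappable rather than only distributionally accurate."""
    return skipgram_loss + lam * procrustes_iso_loss(X, Y)
```

If the target space is an exact rotation of the source space, the penalty vanishes and the combined loss reduces to the skipgram loss alone; the weight `lam` trades off monolingual quality against relative isomorphism.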
Anthology ID:
2022.emnlp-main.404
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6019–6033
URL:
https://aclanthology.org/2022.emnlp-main.404
DOI:
10.18653/v1/2022.emnlp-main.404
Cite (ACL):
Kelly Marchisio, Neha Verma, Kevin Duh, and Philipp Koehn. 2022. IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6019–6033, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces (Marchisio et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.404.pdf