Concept Space Alignment in Multilingual LLMs

Qiwei Peng, Anders Søgaard


Abstract
Multilingual large language models (LLMs) seem to generalize somewhat across languages. We hypothesize this is a result of implicit vector space alignment. Evaluating such alignment, we see that larger models exhibit very high-quality linear alignments between corresponding concepts in different languages. Our experiments show that multilingual LLMs suffer from two familiar weaknesses: generalization works best for languages with similar typology, and for abstract concepts. For some models, e.g., the Llama-2 family of models, prompt-based embeddings align better than word embeddings, but the projections are less linear – an observation that holds across almost all model families, indicating that some of the implicitly learned alignments are broken somewhat by prompt-based methods.
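The evaluation idea described above, fitting a linear map between paired concept embeddings from two languages and measuring how well it retrieves the corresponding concepts, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the matrices X and Y stand in for paired concept embeddings extracted from the same multilingual LLM, and the orthogonal Procrustes fit plus precision@1 retrieval metric are common choices for this kind of alignment evaluation, not the paper's released code.

# Minimal sketch: fit a linear map between paired concept embeddings of two
# languages and score retrieval accuracy. X, Y, fit_linear_map and
# precision_at_1 are illustrative names, not taken from the paper's code.
import numpy as np

def fit_linear_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # Orthogonal Procrustes solution: W minimizing ||XW - Y||_F with W orthogonal.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def precision_at_1(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    # Fraction of source concepts whose nearest target neighbour (by cosine
    # similarity after projection) is the gold-paired concept.
    P = X @ W
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nearest = (P @ Yn.T).argmax(axis=1)
    return float((nearest == np.arange(len(X))).mean())

# Toy usage with random stand-ins for LLM-derived concept embeddings:
# Y is a rotated, lightly noised copy of X, so a linear map should recover it.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))
Q = np.linalg.qr(rng.normal(size=(768, 768)))[0]
Y = X @ Q + 0.01 * rng.normal(size=(500, 768))
W = fit_linear_map(X, Y)
print(f"precision@1: {precision_at_1(X, Y, W):.3f}")

Replacing the Procrustes fit with an unconstrained least-squares map (e.g. np.linalg.lstsq) allows non-orthogonal projections; comparing the two is one way to probe how linear the implicitly learned alignment is.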
Anthology ID: 2024.emnlp-main.315
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5511–5526
URL: https://aclanthology.org/2024.emnlp-main.315
Cite (ACL): Qiwei Peng and Anders Søgaard. 2024. Concept Space Alignment in Multilingual LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5511–5526, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Concept Space Alignment in Multilingual LLMs (Peng & Søgaard, EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.315.pdf