LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation

Zhuoyuan Mao, Tetsuji Nakagawa


Abstract
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from slow inference and high computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
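To make the distillation idea concrete, below is a minimal PyTorch sketch of embedding-space distillation, where a lightweight student is trained to match teacher embeddings projected down to the student's dimension. All names, dimensions, and the single MSE loss here are illustrative assumptions, not the paper's exact architecture or objectives.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions: a large teacher (e.g., 768-d, LaBSE-sized)
# and a low-dimensional lightweight student.
teacher_dim, student_dim = 768, 128

# Linear projection mapping frozen teacher embeddings into the
# student's embedding space; trained jointly with the student.
proj = nn.Linear(teacher_dim, student_dim)

def feature_distillation_loss(student_emb: torch.Tensor,
                              teacher_emb: torch.Tensor) -> torch.Tensor:
    """Pull student sentence embeddings toward down-projected teacher
    embeddings of the same sentences (mean-squared error)."""
    target = proj(teacher_emb.detach())  # teacher is fixed during distillation
    return F.mse_loss(student_emb, target)

# Toy usage: random tensors stand in for real encoder outputs.
student_emb = torch.randn(4, student_dim, requires_grad=True)
teacher_emb = torch.randn(4, teacher_dim)
loss = feature_distillation_loss(student_emb, teacher_emb)
loss.backward()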
Anthology ID:
2023.eacl-main.138
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1886–1894
URL:
https://aclanthology.org/2023.eacl-main.138
DOI:
10.18653/v1/2023.eacl-main.138
Cite (ACL):
Zhuoyuan Mao and Tetsuji Nakagawa. 2023. LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1886–1894, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation (Mao & Nakagawa, EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.138.pdf
Video:
https://aclanthology.org/2023.eacl-main.138.mp4