LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation

Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models suffer from slow inference and high computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.


Introduction
Language-agnostic sentence embedding models (Artetxe and Schwenk, 2019b; Yang et al., 2020; Reimers and Gurevych, 2020; Feng et al., 2022; Mao et al., 2022) align multiple languages in a shared embedding space, facilitating parallel sentence alignment, which extracts parallel sentences for training translation systems (Schwenk et al., 2021). Among them, LaBSE (Feng et al., 2022) achieves state-of-the-art parallel sentence alignment accuracy over 109 languages. However, the 471M parameters of LaBSE lead to computationally heavy inference, and its 768-dimensional sentence embeddings (LaBSE embeddings) incur computation overhead in downstream tasks (e.g., kNN search). This limits its application on resource-constrained devices. Therefore, we explore training a lightweight model to generate low-dimensional sentence embeddings while retaining the performance of LaBSE.
We first investigate the performance of dimension-reduced LaBSE embeddings and show that they perform comparably with the original LaBSE embeddings. Subsequently, we experiment with various architectures to see whether such effective low-dimensional embeddings can be obtained from a lightweight encoder. We observe that the thin-deep (Romero et al., 2015) architecture is empirically superior for learning language-agnostic sentence embeddings. Diverging from previous work, we show that low-dimensional embeddings based on a lightweight model are effective for parallel sentence alignment across 109 languages.
LaBSE benefits from multilingual language model pre-training, but no multilingual pre-trained models are available for the lightweight architectures. Thus, we propose two knowledge distillation methods to further enhance the lightweight models by forcing them to extract helpful information from LaBSE. We present three lightweight models improved with distillation: LEALLA-small, LEALLA-base, and LEALLA-large, with 69M, 107M, and 147M parameters, respectively. Their fewer model parameters and their 128-d, 192-d, and 256-d sentence embeddings are expected to accelerate downstream tasks, while performance drops of merely up to 3.0, 1.3, and 0.3 P@1 (or F1) points are observed on three benchmarks of parallel sentence alignment. In addition, we show the effectiveness of each loss function through an ablation study.
Background: LaBSE

LaBSE (Feng et al., 2022) fine-tunes dual-encoder models (Guo et al., 2018; Yang et al., 2019) to learn language-agnostic embeddings from a large-scale pre-trained language model (Conneau et al., 2020). LaBSE is trained with parallel sentences, and each sentence of a pair is encoded separately by a 12-layer Transformer encoder. The 768-d encoder outputs are used to compute the training loss and serve as sentence embeddings for downstream tasks. Specifically, assume that the sentence embeddings for the parallel sentences in a batch are $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ denotes the number of sentence pairs within the batch. LaBSE is trained with the bidirectional additive margin softmax (AMS) loss:

$$\mathcal{L}_{ams} = \frac{1}{N} \sum_{i=1}^{N} \left( \ell(x_i, y_i) + \ell(y_i, x_i) \right), \quad (1)$$

where the loss for a specific sentence pair in a single direction is defined as:

$$\ell(x_i, y_i) = -\log \frac{e^{\phi(x_i, y_i) - m}}{e^{\phi(x_i, y_i) - m} + \sum_{j=1, j \neq i}^{N} e^{\phi(x_i, y_j)}}. \quad (2)$$

$m$ is a margin for optimizing the separation between translations and non-translations, and $\phi(x_i, y_i)$ is the cosine similarity between $x_i$ and $y_i$.
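To make the objective concrete, here is a minimal PyTorch sketch of the bidirectional AMS loss over a batch of parallel sentence embeddings. The margin value and the absence of any extra similarity scaling are illustrative assumptions, not the exact hyperparameters of LaBSE.

```python
import torch
import torch.nn.functional as F

def ams_loss(x: torch.Tensor, y: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Bidirectional additive margin softmax loss.

    x, y: (N, d) embeddings of parallel sentences; row i of x and row i of y
    are translations of each other. The margin value is an assumed example.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    sim = x @ y.t()  # (N, N) cosine similarity matrix; the diagonal holds translation pairs
    # Subtract the additive margin from the positive (diagonal) logits only.
    logits = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)
    # Forward (x -> y) and backward (y -> x) directions, each averaged over the batch.
    return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
```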

Light Language-agnostic Embeddings
To address the efficiency issue of LaBSE, we probe lightweight models for learning language-agnostic embeddings with the following experiments: (1) We directly reduce the dimension of LaBSE embeddings to explore the optimal embedding dimension; (2) We shrink the model size in various ways to explore the optimal architecture.

Evaluation Settings
We employ the Tatoeba (Artetxe and Schwenk, 2019b), United Nations (UN) (Ziemski et al., 2016), and BUCC (Zweigenbaum et al., 2018) benchmarks for evaluation; they assess model performance for parallel sentence alignment. Following Feng et al. (2022), we report P@1 for Tatoeba and UN and F1 scores for BUCC. Refer to Appx. A for details of the benchmarks.

Dimension Reduction for LaBSE

Previous models such as LASER (Artetxe and Schwenk, 2019b), SBERT (Reimers and Gurevych, 2020), EMS (Mao et al., 2022), and LaBSE use 768-d or 1024-d sentence embeddings, and whether a low-dimensional space can align parallel sentences over tens of languages with solid accuracy (>80%) remains unknown. Thus, we start with dimension reduction experiments for LaBSE to explore the optimal dimension of language-agnostic sentence embeddings.
We add an extra dense layer on top of LaBSE to transform the dimension of LaBSE embeddings from 768 to lower values. We experiment with seven lower dimensions ranging from 512 down to 32. We fine-tune for 5k steps to fit the newly added dense layer, while the other parameters of LaBSE are kept fixed. Refer to Appx. B for training details.
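As a concrete illustration of this setup, the sketch below fits a single trainable projection on top of frozen 768-d LaBSE embeddings using the ams_loss function sketched in the Background section. The batch iterator, optimizer settings, and output dimension are placeholders rather than the paper's actual configuration.

```python
import torch

class DimReducer(torch.nn.Module):
    """A single dense layer mapping frozen 768-d LaBSE embeddings to a lower dimension."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.proj = torch.nn.Linear(768, out_dim)

    def forward(self, emb_768: torch.Tensor) -> torch.Tensor:
        return self.proj(emb_768)

# Toy stand-in for frozen LaBSE outputs of parallel sentence batches: pairs of (N, 768) tensors.
parallel_embedding_batches = [(torch.randn(32, 768), torch.randn(32, 768)) for _ in range(10)]

reducer = DimReducer(out_dim=128)
optimizer = torch.optim.AdamW(reducer.parameters(), lr=1e-4)  # placeholder optimizer settings

for src_emb, tgt_emb in parallel_embedding_batches:
    # ams_loss: the bidirectional AMS loss sketched in the Background section.
    loss = ams_loss(reducer(src_emb), reducer(tgt_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```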
As shown in Fig. 1, the performance drops by more than 5 points on Tatoeba, UN, and BUCC when the dimension is reduced to 32. Meanwhile, sentence embeddings with a dimension of at least 128 perform only slightly worse than 768-d LaBSE embeddings, with a performance drop of fewer than 2 points, showing that low-dimensional sentence embeddings can align parallel sentences in multiple languages. Refer to Appx. D for detailed results.
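For reference, alignment accuracy on an aligned test set such as Tatoeba can be scored as P@1 via nearest-neighbor search under cosine similarity. The sketch below assumes the i-th rows of the two arrays are translations of each other and that the embeddings are already L2-normalized.

```python
import numpy as np

def precision_at_1(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """P@1 for aligned pairs: the fraction of source sentences whose nearest
    target embedding (by cosine similarity) is their gold translation.

    src_emb, tgt_emb: (N, d) L2-normalized embeddings, aligned row by row.
    """
    sim = src_emb @ tgt_emb.T        # (N, N) cosine similarities
    nearest = sim.argmax(axis=1)     # index of the retrieved target for each source
    return float((nearest == np.arange(len(src_emb))).mean())
```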

Exploring the Optimal Architecture
Although we revealed the effectiveness of low-dimensional embeddings above, they are generated from LaBSE with its 471M parameters. Thus, we explore whether such low-dimensional sentence embeddings can be obtained from an encoder with fewer parameters. We first reduce the number of layers (#1 and #2 in Table 1) and the size of hidden states (#3 and #4) to observe the performance. Subsequently, inspired by the effectiveness of FitNets (Romero et al., 2015) and MobileBERT (Sun et al., 2020), and taking advantage of the low-dimensional sentence embeddings shown above, we experiment with thin-deep architectures with 24 layers (#5-#8), leading to fewer encoder parameters. Refer to Appx. B for training details. We report the results in Table 1. First, architectures with fewer layers (#1 and #2) perform worse than LaBSE on all three tasks and can only decrease the parameter count by less than 15%. Second, increasing the number of layers (#5 and #7) improves the performance of the 12-layer models (#3 and #4) with a limited parameter increase of less than 10%. Compared with LaBSE (#0), low-dimensional embeddings from the thin-deep architectures (#5-#8) obtain solid results on the three benchmarks with performance drops of only 3.4 points at most. So far, we have shown that the thin-deep architecture is effective for learning language-agnostic sentence embeddings.
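A back-of-the-envelope parameter count illustrates why the thin-deep design pays off: most parameters sit in the multilingual embedding table, so shrinking the hidden size saves far more than adding layers costs. The sketch below is not the paper's exact accounting; the ~501k wordpiece vocabulary, the 4x feed-forward width, and the equality of hidden size and embedding dimension are assumptions.

```python
def encoder_params(num_layers: int, hidden: int, ffn: int,
                   vocab: int = 501_153, max_pos: int = 512) -> int:
    """Rough parameter count for a BERT-style encoder (weights + biases),
    ignoring the pooler and any output projection."""
    embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden    # token/position/segment tables + LayerNorm
    attention = 4 * hidden * hidden + 4 * hidden                # Q, K, V, and output projections
    feed_forward = 2 * hidden * ffn + hidden + ffn              # two dense layers
    layer_norms = 4 * hidden                                    # two LayerNorms per layer
    return embeddings + num_layers * (attention + feed_forward + layer_norms)

# Assumed configurations (feed-forward width = 4 x hidden size):
print(encoder_params(12, 768, 3072) // 10**6)  # ~470M: LaBSE-like wide, shallow encoder
print(encoder_params(24, 256, 1024) // 10**6)  # ~147M: thin-deep, 256-d
print(encoder_params(24, 192, 768) // 10**6)   # ~107M: thin-deep, 192-d
print(encoder_params(24, 128, 512) // 10**6)   # ~69M:  thin-deep, 128-d
```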

Knowledge Distillation from LaBSE
Besides the large model capacity, multilingual language model pre-training benefits LaBSE for parallel sentence alignment. As no multilingual pre-trained language models are available for the lightweight architectures we investigated in Section 3.3, we instead explore extracting helpful knowledge from LaBSE.

Methodology
Feature distillation and logit distillation have been proven to be effective paradigms for knowledge distillation (Hinton et al., 2015; Romero et al., 2015; Yim et al., 2017; Tang et al., 2019). In this section, we propose methods applying both paradigms to language-agnostic sentence embedding distillation. We use LaBSE as a teacher to train students with the thin-deep architectures discussed in Section 3.3.

Feature Distillation We propose applying feature distillation to language-agnostic sentence embedding distillation, which enables the lightweight sentence embeddings to approximate the LaBSE embeddings via an extra dense layer. We employ an extra trainable dense layer on top of the lightweight models to unify the embedding dimensions of LaBSE and the lightweight models at 768-d, as illustrated in Fig. 2. The loss function is defined as:

$$\mathcal{L}_{fd} = \frac{1}{N} \sum_{i=1}^{N} \left( \left\| x_i^t - f(x_i^s) \right\|_2^2 + \left\| y_i^t - f(y_i^s) \right\|_2^2 \right), \quad (3)$$

where $x^t$ (or $y^t$) and $x^s$ (or $y^s$) are the embeddings produced by LaBSE and the lightweight model, respectively, and $f(\cdot)$ is a trainable dense layer transforming the dimension from $d$ ($d < 768$) to 768.

Logit Distillation We also propose applying logit distillation to language-agnostic sentence embedding distillation to extract knowledge from the sentence similarity matrix, as shown in Fig. 2. Logit distillation forces the student to establish similarity relationships between the given sentence pairs that resemble those of the teacher. We propose the following mean squared error (MSE) loss:

$$\mathcal{L}_{ld} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \frac{\phi(x_i^t, y_j^t)}{T} - \frac{\phi(x_i^s, y_j^s)}{T} \right)^2, \quad (4)$$

where $T$ is a distillation temperature and the other notations follow those in Eq. 2 and 3.

Combined Loss Finally, we combine the two knowledge distillation loss functions with the AMS loss (Eq. 1) to jointly train the lightweight model:

$$\mathcal{L} = \alpha \mathcal{L}_{ams} + \beta \mathcal{L}_{fd} + \gamma \mathcal{L}_{ld}, \quad (5)$$

where $\alpha$, $\beta$, and $\gamma$ are weight hyperparameters, tuned with the development data.
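The following PyTorch sketch mirrors the three objectives above. The MSE forms, the placement of the temperature, and the default weights are illustrative assumptions rather than the released implementation; the teacher embeddings are assumed to be precomputed so that no gradient flows into LaBSE.

```python
import torch
import torch.nn.functional as F

def cosine_matrix(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """(N, N) cosine similarity matrix between two sets of embeddings."""
    return F.normalize(x, dim=-1) @ F.normalize(y, dim=-1).t()

def feature_distillation_loss(x_s, y_s, x_t, y_t, proj: torch.nn.Linear) -> torch.Tensor:
    """Match projected student embeddings (d -> 768) to the 768-d teacher embeddings (Eq. 3)."""
    return F.mse_loss(proj(x_s), x_t) + F.mse_loss(proj(y_s), y_t)

def logit_distillation_loss(x_s, y_s, x_t, y_t, temperature: float = 1.0) -> torch.Tensor:
    """Match the student's sentence similarity matrix to the teacher's (Eq. 4)."""
    sim_s = cosine_matrix(x_s, y_s) / temperature
    sim_t = cosine_matrix(x_t, y_t) / temperature
    return F.mse_loss(sim_s, sim_t)

def combined_loss(x_s, y_s, x_t, y_t, proj, alpha=1.0, beta=1.0, gamma=1.0) -> torch.Tensor:
    """Weighted sum of AMS, feature distillation, and logit distillation (Eq. 5).
    ams_loss is the bidirectional AMS loss sketched in the Background section."""
    return (alpha * ams_loss(x_s, y_s)
            + beta * feature_distillation_loss(x_s, y_s, x_t, y_t, proj)
            + gamma * logit_distillation_loss(x_s, y_s, x_t, y_t))
```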

Experiments
Training We train three models, LEALLA-small, LEALLA-base, and LEALLA-large, using the thin-deep architectures of #8, #7, and #6 in Table 1, respectively.
LEALLA-large performs comparably with LaBSE: its average performance difference on the three tasks is below 0.3 points. LEALLA-base and LEALLA-small obtain strong performance for high-resource languages on UN and BUCC, with performance decreases of less than 0.9 and 2.3 points, respectively. They also achieve solid results on Tatoeba, with downgrades of 1.3 and 3.0 points compared with LaBSE.
The solid performance of LEALLA on Tatoeba demonstrates that it is effective for aligning parallel sentences in more than 109 languages. Moreover, all LEALLA models perform better than or comparably with previous studies other than LaBSE.
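Since the LEALLA models are released on TensorFlow Hub, a minimal usage sketch follows. The module handle and the raw-string input signature shown here are assumptions to verify against the actual model page, not documented API guarantees.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Hypothetical handle; check the LEALLA collection on TensorFlow Hub for the exact path.
encoder = hub.KerasLayer("https://tfhub.dev/google/LEALLA/LEALLA-base/1")

sentences = tf.constant([
    "Hello, world!",
    "Bonjour le monde !",
])
embeddings = encoder(sentences)   # expected shape: (2, 192) for LEALLA-base
print(embeddings.shape)
```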
Ablation Study We inspect the effectiveness of each loss component in an ablative manner. First, we compare settings with and without the distillation loss functions. As shown in Fig. 3, adding $\mathcal{L}_{fd}$ or $\mathcal{L}_{ld}$ improves LEALLA trained only with $\mathcal{L}_{ams}$ on the Tatoeba and UN tasks. By further combining $\mathcal{L}_{fd}$ and $\mathcal{L}_{ld}$, LEALLA consistently achieves superior performance. Second, we separately train LEALLA with each loss. Referring to the results reported in Table 3, LEALLA trained only with $\mathcal{L}_{fd}$ yields solid performance for the "small" and "base" models compared with $\mathcal{L}_{ams}$, showing that the distillation loss benefits parallel sentence alignment. $\mathcal{L}_{fd}$ and $\mathcal{L}_{ld}$ perform much worse for the "small" model, which may be attributed to the difference in the capacity gaps between the teacher model (LaBSE) and the student models ("small" vs. "base"). Refer to Appx. F for all detailed results in this section.

Conclusion
We presented LEALLA, a lightweight model for generating low-dimensional language-agnostic sentence embeddings. Experimental results showed that LEALLA could yield solid performance for 109 languages after distilling knowledge from LaBSE. Future work can focus on reducing the vocabulary size of LaBSE to shrink the model further and exploring the effectiveness of lightweight model pre-training for parallel sentence alignment.

Limitations
In this study, we used the same training data as LaBSE (refer to Fig. 7).

A Evaluation Benchmarks
Tatoeba (Artetxe and Schwenk, 2019b) supports evaluation across 112 languages and contains up to 1,000 sentence pairs between each language and English. The languages of Tatoeba that are not included in the training data of LaBSE and LEALLA serve as the evaluation for unseen languages. UN (Ziemski et al., 2016) is composed of 86,000 aligned bilingual documents for en-ar, en-es, en-fr, en-ru, and en-zh. Following Feng et al. (2022), we evaluate model performance for es, fr, ru, and zh on the UN task. There are about 9.5M sentence pairs for each language paired with English after deduping. The BUCC shared task (Zweigenbaum et al., 2018) is a benchmark for mining parallel sentences from comparable corpora. We conduct the evaluation using the BUCC2018 tasks for en-de, en-fr, en-ru, and en-zh, following the setting of Reimers and Gurevych (2020). For the results of LaBSE reported in Ta

C Discussion about Feature Distillation
We additionally investigate two other patterns for feature distillation, illustrated in Fig. 4. When the lightweight model first approximates the 768-d teacher embeddings at an intermediate representation, on top of which the low-dimensional sentence embeddings are generated for the AMS loss, we denote it as "Distillation-first". When feature distillation is applied directly to the low-dimensional embeddings that also compute the AMS loss, it is denoted as "Synchronized". "Synchronized" requires a fixed dense layer to conduct the dimension reduction for the LaBSE embeddings, for which we utilize the pre-trained model introduced in Section 3.2. We denote these two patterns of feature distillation as $\mathcal{L}_{df}$ and $\mathcal{L}_{syn}$.

Table 4: Results of comparisons among three feature distillation objectives. $\mathcal{L}_{df}$ and $\mathcal{L}_{syn}$ indicate the "Distillation-first" and "Synchronized" objectives in Fig. 4.
As reported in Table 4, $\mathcal{L}_{ams} + \mathcal{L}_{fd}$ ($\mathcal{L}_{fd}$ is the feature distillation loss introduced in the main text) consistently outperforms $\mathcal{L}_{ams} + \mathcal{L}_{df}$ and $\mathcal{L}_{ams} + \mathcal{L}_{syn}$ in all three LEALLA models. $\mathcal{L}_{ams} + \mathcal{L}_{df}$ and $\mathcal{L}_{ams} + \mathcal{L}_{syn}$ perform comparably on Tatoeba with the models trained without a distillation loss. $\mathcal{L}_{ams} + \mathcal{L}_{df}$ obtains performance gains for high-resource languages on UN and BUCC compared with $\mathcal{L}_{ams}$ alone, but still underperforms $\mathcal{L}_{ams} + \mathcal{L}_{fd}$.
$\mathcal{L}_{df}$ forces the lightweight model to approximate the teacher embeddings first in an intermediate part of the model, on top of which the low-dimensional sentence embeddings are generated for computing the AMS loss, while $\mathcal{L}_{fd}$ (Eq. 3) is calculated after computing the AMS loss. As the AMS loss directly reflects the evaluation tasks, we suppose $\mathcal{L}_{fd}$ is a more flexible objective for feature distillation. In addition, $\mathcal{L}_{syn}$ is not beneficial because it depends on a dimension-reduced LaBSE, which is a less robust teacher compared with LaBSE.
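To make the distinction explicit, here is a hedged PyTorch sketch of the two alternative objectives, assuming an MSE formulation analogous to Eq. 3; the argument names and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def distillation_first_loss(intermediate_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    """'Distillation-first': an intermediate 768-d student representation approximates the
    768-d teacher embedding; the low-dimensional embedding for the AMS loss is produced
    on top of this representation afterwards."""
    return F.mse_loss(intermediate_s, x_t)

def synchronized_loss(x_s: torch.Tensor, x_t: torch.Tensor,
                      fixed_reducer: torch.nn.Linear) -> torch.Tensor:
    """'Synchronized': the student's low-dimensional embedding approximates a dimension-reduced
    teacher embedding produced by a fixed (frozen) dense layer, i.e., the pre-trained
    reduction layer from Section 3.2."""
    with torch.no_grad():
        target = fixed_reducer(x_t)   # frozen projection of the 768-d teacher embedding
    return F.mse_loss(x_s, target)
```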

D Results of Dimension-reduction Experiments
We report all the results of Section 3.2 in Table 6.

E Results of Thin-deep and MobileBERT-like Architectures
In addition to the results on Tatoeba, UN, and BUCC for models #0-#8, we provide the results of a further smaller thin-deep architecture (#9) and MobileBERT-like (Sun et al., 2020) thin-deep architectures (#10-#12). The 64-d thin-deep architecture (#9) contains only 33M parameters. However, its performance on the three evaluation benchmarks downgrades by up to 7.4 points compared with #5-#8, which suggests that 128-d may be a lower bound for universal sentence embeddings aligning parallel sentences across 109 languages. Moreover, #10-#12 show the results of MobileBERT-like architectures whose feed-forward hidden size is identical to the hidden size. They have fewer parameters than #5-#8 but perform worse than #5-#8, respectively (e.g., compare #10 with #6). Therefore, we did not employ MobileBERT-like architectures for LEALLA.

F Results of Ablation Study
We report all the results of the ablation study (Section 4.2) in Table 8.