Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching

Previous studies have shown that cross-lingual knowledge distillation can significantly improve the performance of pre-trained models on cross-lingual similarity matching tasks. However, the student model must be large for this to work; otherwise, its performance drops sharply, making it impractical to deploy on memory-limited devices. To address this issue, we delve into cross-lingual knowledge distillation and propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model. In our framework, contrastive learning, bottleneck, and parameter-recurrent strategies are combined to prevent performance from being compromised during the compression process. Experimental results demonstrate that our method can compress the size of XLM-R and MiniLM by more than 50%, while performance is reduced by only about 1%.


Introduction
On the internet, it is common for one system to store texts in dozens of languages. Cross-lingual similar-text matching in such multilingual systems is a great challenge in many scenarios, e.g., search engines, recommendation systems, and question-answering systems (Cer et al., 2017; Hardalov et al., 2020; Asai et al., 2021).
In the monolingual scenario, benefiting from the robust performance of pre-trained language models (PLMs) (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), etc.), significant success has been achieved on text-similarity matching tasks. For example, Reimers and Gurevych (2019) proposed the SBERT model trained on similar text pairs and achieved state-of-the-art performance in supervised similarity matching, and contrastive methods such as SimCSE (Gao et al., 2021) have extended this success to unsupervised scenarios. Drawing on the success in the monolingual scenario, researchers began to introduce pre-training technology into cross-lingual scenarios and proposed a series of multilingual pre-trained models, e.g., mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020), etc. Due to the vector collapse issue, however, the performance of these cross-lingual models on similarity matching tasks is still not satisfactory. Reimers and Gurevych (2020) injected the similarity matching ability of SBERT into a cross-lingual model through knowledge distillation, which alleviated the collapse issue and improved performance on cross-lingual matching tasks.
Although cross-lingual matching tasks have achieved positive results, existing cross-lingual models are huge and challenging to deploy on devices with limited memory. Following Reimers and Gurevych (2020), we try to distill the SBERT model into an XLM-R with fewer layers. However, as shown in Figure 1, performance degrades significantly as the number of layers decreases. This phenomenon indicates that cross-lingual capability is highly dependent on model size, and simply reducing the number of layers causes a serious performance loss.
In this work, we propose a multi-stage distillation compression framework to build a small-size but high-performance model for cross-lingual similarity matching tasks. In this framework, we design three strategies to avoid semantic loss during compression, i.e., multilingual contrastive learning, parameter recurrent, and embedding bottleneck. We further investigate the effectiveness of the three strategies through ablation studies. We also separately explore the performance impact of reducing the embedding size and the encoder size. Experimental results demonstrate that our method effectively reduces the size of the multilingual model with minimal semantic loss. Our code is publicly available.
The main contributions of this paper can be summarized as follows:
• We validate that cross-lingual capability requires a larger model size and explore the semantic performance impact of shrinking the embedding or encoder size.
• A multi-stage distillation framework is proposed to compress cross-lingual models, in which three strategies are combined to reduce semantic loss.
• Extensive experiments examine the effectiveness of the three strategies and the multiple stages used in our framework.
Related work

Multilingual models

Existing multilingual models can be divided into two categories: multilingual general models and cross-lingual representation models. In the first category, transformer-based pre-trained models have been massively adopted in multilingual NLP tasks (Huang et al., 2019; Chi et al., 2021; Luo et al., 2021; Ouyang et al., 2021). mBERT (Devlin et al., 2019) was pre-trained on a Wikipedia corpus covering 104 languages and achieved strong performance on downstream tasks. XLM (Conneau and Lample, 2019) presented the translation language modeling (TLM) objective to improve cross-lingual transferability by leveraging parallel data. XLM-R (Conneau et al., 2020) was built on RoBERTa (Liu et al., 2019) using the CommonCrawl corpus. In the second category, LASER (Artetxe and Schwenk, 2019) used an encoder-decoder architecture based on a Bi-LSTM network and was trained on a parallel corpus obtained by neural machine translation. The Multilingual Universal Sentence Encoder (mUSE) (Chidambaram et al., 2019) adopted a bi-encoder architecture and was trained with an additional translation-ranking task. LaBSE (Feng et al., 2020) turned the pre-trained BERT into a bi-encoder and was optimized with the masked language model (MLM) and TLM objectives. Recently, a lightweight bilingual sentence representation method based on a dual-transformer architecture was presented.

Knowledge distillation
However, multilingual models do not necessarily have cross-lingual capabilities; especially in the first category, the vector spaces of different languages are not aligned. Knowledge distillation (Hinton et al., 2015) uses knowledge from a teacher model to guide the training of a student model, which can compress the model and align its vector space at the same time.
For model compression, knowledge distillation aims to transfer knowledge from a large model to a small one. BERT-PKD (Sun et al., 2019) extracted knowledge from both the last layer and intermediate layers at the fine-tuning stage. DistilBERT (Sanh et al., 2019) performed distillation at the pre-training stage to halve the depth of BERT. TinyBERT (Jiao et al., 2020) distilled knowledge from BERT at both the pre-training and fine-tuning stages. MobileBERT (Sun et al., 2020) distilled BERT into a model with smaller dimensions at each layer. MiniLM (Wang et al., 2020) conducted deep self-attention distillation.
Unlike previous works presenting general distillation frameworks, we focus on compressing multilingual pre-trained models while aligning their cross-lingual vector spaces. We take inspiration from Reimers and Gurevych (2020), who successfully aligned the vector space of a multilingual model through cross-lingual knowledge distillation (X-KD). Our framework combines the advantages of X-KD for aligning vectors and introduces three strategies and an assistant model to prevent performance from being compromised during compression.

Figure 2: Overview of the model architecture and the multi-stage distillation. It consists of four stages and aims to obtain a small multilingual student model. For convenience, we take the English SBERT as the teacher model and XLM-R as the assistant model. ⟨s_i, t_i⟩ is a pair of parallel sentences in two different languages, N is the batch size, and MSE is the mean squared error loss function.

Method
In this section, we introduce our method in detail. We first present the model architecture and then introduce the multi-stage distillation strategy used for model training. An overview of our approach is shown in Figure 2.

Model architecture
Given a large monolingual model as teacher T and a small multilingual model as student S, our goal is to transfer semantic similarity knowledge from T to S and simultaneously compress the size of S using m parallel sentence pairs P = {⟨s_1, t_1⟩, ⟨s_2, t_2⟩, …, ⟨s_m, t_m⟩}.

Teacher model
In this work, we use SBERT (Reimers and Gurevych, 2019) as the teacher model, which has been proven to perform well on monolingual semantic similarity tasks. SBERT adopts a siamese network structure to fine-tune a BERT (Devlin et al., 2019) encoder and applies mean pooling to the encoder output to derive sentence embeddings.
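The mean pooling step can be sketched as follows. This is a minimal NumPy illustration in the SBERT style, averaging token vectors while ignoring padding; the function name and shapes are ours, not from the paper:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked mean pooling: average token vectors, ignoring padding.

    token_embeddings: (seq_len, hidden) array of encoder outputs.
    attention_mask:   (seq_len,) array with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0) # sum over real tokens only
    count = mask.sum()                             # number of real tokens
    return summed / np.maximum(count, 1e-9)        # avoid division by zero
```

The resulting fixed-size vector serves as the sentence embedding that the teacher, assistant, and student models all expose for distillation.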

Assistant model
Mirzadeh et al. (2020) showed that when the gap between student and teacher is large, the performance of the student model decreases. We want a small student model with cross-lingual capabilities, while the teacher is a large monolingual model. To bridge this gap, we introduce an assistant model A (Mirzadeh et al., 2020), a large multilingual model with cross-lingual ability.

Student model
Inspired by ALBERT (Lan et al., 2020), we design the student model with the parameter recurrent and embedding bottleneck strategies. Since no multilingual ALBERT is available, we design the student from scratch.

Parameter recurrent. We choose the first M layers of the assistant model as a recurring unit (RU), whose role is to initialize the student model with layers from the assistant model. Concretely, the RU is defined as RU = {L_1, L_2, …, L_M}, where L_i is the i-th transformer layer.

Embedding bottleneck. Multilingual pre-trained models usually require a large vocabulary V to support many languages, which leads to a large embedding layer. We add a bottleneck layer (He et al., 2016; Lan et al., 2020; Sun et al., 2020) of size B between the embedding layer and the hidden layer of size H. In this way, the number of embedding parameters is reduced from |V| × H to |V| × B + B × H.
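A back-of-the-envelope sketch of the savings from this factorization: the 250k vocabulary below is XLM-R's published figure, while bias terms and exact checkpoint sizes are simplifications of our own, not numbers from the paper.

```python
def embedding_params(vocab_size, hidden, bottleneck=None):
    """Parameter count of the embedding block.

    Without a bottleneck, the embedding table maps V -> H directly,
    costing V*H parameters. With a bottleneck of size B, the table maps
    V -> B and a linear projection maps B -> H, costing V*B + B*H
    (bias terms omitted).
    """
    if bottleneck is None:
        return vocab_size * hidden
    return vocab_size * bottleneck + bottleneck * hidden

# Illustrative XLM-R-base-like sizes: V = 250k, H = 768, B = 128.
full = embedding_params(250_000, 768)        # 192,000,000 parameters
small = embedding_params(250_000, 768, 128)  # 32,098,304 parameters
```

With these numbers the embedding block shrinks by roughly 83%, consistent in spirit with the embedding-size reductions reported in the experiments.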

Multi-stage distillation
Multi-stage distillation is key to endowing the small student model with cross-lingual matching ability.

Stage 1. Teaching assistant
As shown in Stage 1 of Figure 2, we use the teacher model and a parallel corpus to align the vector spaces of different languages, enabling the assistant's cross-lingual ability (Reimers and Gurevych, 2020), through the loss

L_1 = (1/N) Σ_{i=1}^{N} [ MSE(h_T^{s_i}, h_A^{s_i}) + MSE(h_T^{s_i}, h_A^{t_i}) ],   (2)

where N is the batch size, s_i and t_i denote the parallel sentences in a mini-batch, and h_T and h_A are the mean-pooled sentence embeddings of the teacher and assistant.
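Under the assumption that the loss follows Reimers and Gurevych (2020) with a mean-squared-error reduction over the batch, stage 1 can be sketched as below; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def xkd_loss(h_teacher_src, h_assist_src, h_assist_tgt):
    """Cross-lingual distillation loss: push the assistant's embeddings of
    both the source sentence and its translation toward the (frozen)
    teacher's embedding of the source sentence.

    Each argument is an (N, dim) batch of mean-pooled sentence embeddings.
    """
    mse = lambda a, b: float(((a - b) ** 2).mean())
    return mse(h_assist_src, h_teacher_src) + mse(h_assist_tgt, h_teacher_src)
```

Because the teacher sees only the source language, minimizing this loss forces translations in all languages to land near the same point in the teacher's embedding space.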

Stage 2. Align student embedding
As shown in Stage 2 of Figure 2, we align the student's bottlenecked embedding layer with the assistant's embedding space through the loss

L_2 = (1/N) Σ_{i=1}^{N} [ MSE(e_S^{s_i}, e_A^{s_i}) + MSE(e_S^{t_i}, e_A^{t_i}) ],   (3)

where e_S and e_A denote the outputs of the student and assistant embedding layers.

Stage 3. Teaching student
In Stage 3, the student model is trained to imitate the output of the assistant model with the loss

L_3 = (1/N) Σ_{i=1}^{N} [ MSE(h_A^{s_i}, h_S^{s_i}) + MSE(h_A^{t_i}, h_S^{t_i}) ],   (4)

where h_A and h_S are the mean-pooled sentence embeddings of the assistant and student.

Stage 4. Multilingual contrastive learning
After the above three stages, we obtain a small multilingual sentence embedding model. However, as shown in Figure 1, when the model size decreases, its cross-lingual performance drops sharply. Therefore, in this stage, we propose a multilingual contrastive learning (MCL) task to further improve the performance of the small student model. Assuming the batch size is N, for a translation sentence pair (s_i, t_i) in a batch, the mean-pooled sentence embeddings of the student model are (h_S^{s_i}, h_S^{t_i}). The MCL task takes the parallel pair (h_S^{s_i}, h_S^{t_i}) as the positive sample and the other pairs in the same batch, {(h_S^{s_i}, h_S^{t_j}) | j ∈ [1, N], j ≠ i}, as negative samples. Because the MCL task must be combined with knowledge distillation, unlike previous work (Feng et al., 2020), it does not directly apply the temperature-scaled cross-entropy loss function.
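The MCL idea can be sketched in code. Since the paper's exact equation (5) is not reproduced here, this is only an illustrative variant under our own assumptions: parallel pairs are pulled together, and each negative student pair is pushed toward the teacher's monolingual similarity for the corresponding source sentences. All names are ours.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mcl_loss(stu_src, stu_tgt, tea_src):
    """Illustrative multilingual contrastive loss.

    stu_src, stu_tgt: (N, dim) student embeddings of s_i and t_i.
    tea_src:          (N, dim) teacher embeddings of the source sentences s_i.
    """
    n = len(stu_src)
    # Positive term: pull each parallel pair (s_i, t_i) together.
    pos = sum(1.0 - cos(stu_src[i], stu_tgt[i]) for i in range(n)) / n
    # Negative term: the teacher's s_i / s_j similarity sets the target
    # for the student's s_i / t_j similarity.
    neg, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            neg += (cos(stu_src[i], stu_tgt[j]) - cos(tea_src[i], tea_src[j])) ** 2
            pairs += 1
    return pos + neg / max(pairs, 1)
```

When the student's pairs already match the teacher's fine-grained geometry, both terms vanish, so the loss only acts where the compressed model has drifted.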
Here, we introduce the implementation of the MCL task. For each negative pair (s_i, t_j) in the parallel corpus, the MCL task first unifies (s_i, t_j) into the source language (s_i, s_j), then uses the fine-grained distance between h_T^{s_i} and h_T^{s_j} in the teacher model to push apart the semantically different pair (h_S^{s_i}, h_S^{t_j}) in the student model. For positive examples, the MCL task pulls the semantically similar pair (h_S^{s_i}, h_S^{t_i}) together. The MCL loss is given in (5), where φ is the distance function; following prior work (Feng et al., 2020), we set φ(x, y) = cosine(x, y). We also add a knowledge distillation task for multilingual sentence representation learning. In Stage 4, the total loss is the sum of the MCL loss and the knowledge distillation loss.

Experimental results

Evaluation setup
Dataset. The semantic text similarity (STS) task requires models to assign a semantic similarity score between 0 and 5 to a pair of sentences. Following Reimers and Gurevych (2020), we evaluate our method on two multilingual STS tasks, i.e., STS2017 (Cer et al., 2017) and STS2017-extend (Reimers and Gurevych, 2020), which contain three monolingual tasks (EN-EN, AR-AR, ES-ES) and six cross-lingual tasks (EN-AR, EN-ES, EN-TR, EN-FR, EN-IT, EN-NL).
Metric. Spearman's rank correlation ρ is reported in our experiments. Specifically, we first compute the cosine similarity between the two sentence embeddings of each pair, then calculate the Spearman rank correlation ρ between the cosine scores and the gold scores.
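This evaluation procedure can be sketched as follows: a simple rank-correlation implementation that assumes no tied scores (real evaluations typically call scipy.stats.spearmanr), with function names of our own.

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties, which is typical for real-valued cosine scores."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(len(v))
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def sts_score(emb_a, emb_b, gold):
    """Cosine similarity per sentence pair, then Spearman against gold scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosines = (a * b).sum(axis=1)   # row-wise cosine similarities
    return spearman(cosines, gold)
```

Because Spearman's ρ depends only on ranks, it rewards models that order sentence pairs correctly by similarity, regardless of the absolute scale of the cosine scores.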

Implementation details
Mean pooling is applied to obtain sentence embeddings, and the maximum sequence length is set to 128. We use the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 2e-5 and a warm-up ratio of 0.1. In stages 1, 2, and 3, the models are trained for 20 epochs with a batch size of 64, while in stage 4, the student model is trained for 60 epochs. The mBERT and XLM-R used in this work are base-size models obtained from Huggingface's transformers package (Wolf et al., 2020), and MiniLM refers to MiniLM-L12-H384. Our implementation is based on UER (Zhao et al., 2019).

Performance comparison
We compare the model obtained from our multi-stage distillation with previous state-of-the-art models; results are shown in Table 1 and Table 2. Among the pre-trained models, mBERT(mean) and XLM-R(mean) are mean-pooled mBERT and XLM-R models. mBERT-nli-stsb and XLM-R-nli-stsb are mBERT and XLM-R fine-tuned on the NLI and STS training sets. LASER and LaBSE are obtained from Artetxe and Schwenk (2019) and Feng et al. (2020). For knowledge distillation, we use the notation Student←Teacher to represent the student model distilled from the teacher model. There are two teacher models, SBERT-nli-stsb and SBERT-paraphrases, both released by UKPLab (https://github.com/UKPLab/sentence-transformers). The former is fine-tuned on the English NLI and STS training sets, and the latter is trained on more than 50 million English paraphrase pairs. The student models include mBERT, XLM-R, DistilmBERT (Sanh et al., 2019), and MiniLM. Table 1 and Table 2 show the evaluation results on the monolingual and multilingual STS tasks, respectively. For XLM-R, our method compresses the embedding size by 83.2% with 0.3% worse monolingual performance and 0.9% worse cross-lingual performance, and compresses the encoder size by 75% with slightly higher (0.4%) monolingual performance and 0.5% worse cross-lingual performance. When compressing the embedding layer and the encoder simultaneously, the model size is reduced by 80.6%; its monolingual performance drops by 2% and its cross-lingual performance by 4%, but it still outperforms the pre-trained models.

For comparison with other distillation methods, MiniLM←SBERT-paraphrases is taken as a strong baseline. Our framework further compresses its embedding size by 66.7% with 0.6% worse monolingual and 1.1% worse cross-lingual performance. Its encoder size is further compressed by 75% with slightly higher monolingual (0.1%) and cross-lingual (0.4%) performance. In addition, our compressed XLM-R (b=True, bs=128, |RU|=12) achieves higher monolingual (0.8%) and cross-lingual (1.9%) performance at the same model size.

Table 4: Results of ablation studies on the STS2017-extend cross-lingual task

Ablation study
Among the three key strategies, multilingual contrastive learning (MCL) and parameter recurrent (Rec.) are the two crucial mechanisms for improving model performance; the bottleneck is used to compress the model. In this section, ablation studies are performed to investigate the effects of MCL and Rec. (the effects of the bottleneck are discussed in section 4.7). XLM-R (b=True, bs=128, |RU|=3) is selected as the base model. We consider three settings: 1) training without the MCL task; 2) training without parameter recurrent; 3) training without both. The monolingual and multilingual results are presented in Table 3 and Table 4.
It can be observed that: 1) without the MCL task, the model performs worse on the cross-lingual tasks; 2) without parameter recurrent, the model performs worse on all datasets; 3) the MCL task significantly improves cross-lingual performance on EN-AR, EN-ES, EN-FR, and EN-NL. We conclude that both the MCL task and parameter recurrent play a key role in our method.

Effect of contrastive learning
To investigate the effects of contrastive learning in stage 4, we select XLM-R (b=True, bs=128, |RU|=3) and modify the original objective in (5) into three different settings, namely Bool, CE, and w/o CL.
In the CE setting, the objective in (5) is replaced with a temperature-scaled cross-entropy, as in (9), where φ_T = cos(h_T^{s_i}, h_T^{s_j}), φ_S = cos(h_S^{s_i}, h_S^{t_j}), and τ = 0.05 is a hyperparameter called the temperature.
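A minimal sketch of a temperature-scaled cross-entropy over in-batch translation pairs follows; the paper's exact equation (9) may differ, e.g. in how teacher similarities enter, and all names here are illustrative.

```python
import math

def ce_contrastive_loss(sim_matrix, tau=0.05):
    """Temperature-scaled cross-entropy over in-batch translations.

    sim_matrix[i][j] is the student cosine similarity between s_i and t_j;
    the diagonal holds the positive (parallel) pairs.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [sim_matrix[i][j] / tau for j in range(n)]
        m = max(logits)  # subtract the max to stabilize the softmax
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / n
```

The small temperature sharpens the softmax, so even modest cosine gaps between the positive and the negatives drive the loss toward zero.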
In the w/o CL setting, contrastive learning is removed from Stage 4. Table 5 presents the model performance on the cross-lingual semantic similarity task under the different settings. All of the above training objectives improve performance on the cross-lingual task compared with the w/o CL setting. Models trained with (8) and (9) underperform the model trained with (5), especially on the EN-AR, EN-ES, EN-FR, and EN-NL tasks.
We plot the convergence process of the different settings in Figure 3. On the EN-AR, EN-ES, and EN-FR tasks, our setting outperforms the others. It is worth mentioning that on the EN-TR task, our setting underperforms the CE setting according to Table 5; however, it reaches the same level as the CE setting between epochs 30 and 40.

Effect of multi-stages
To verify the effectiveness of the multiple stages, we show a performance comparison of different stage settings in Table 6. In the Single-stage setting, we first initialize the shrunken student model in two ways: (1) Random Initialize: adding untrained embedding bottleneck layers to the student model; (2) Pre-Distillation: the student model with the bottleneck layer is initialized by distillation using XLM-R and the same corpus as in section 4.1. We then follow Reimers and Gurevych (2020) to align the vector spaces of different languages. In the Multi-stage setting, the performance of the student model is reported after each stage.

As shown in Table 6, the Multi-stage setting outperforms the Single-stage one, indicating that our multi-stage framework with an assistant model is effective. Adding stage 3 and stage 4 further improves the student model, suggesting that multi-stage training is necessary.

Effect of bottleneck and recurrent unit
In this section, we study the impact of embedding bottleneck and recurrent unit strategies on multilingual semantic learning. We consider three settings for each strategy, as shown in Table 7 and Table 8.
First, we find that both XLM-R and MiniLM perform better as the bottleneck hidden size bs increases. Performance is best when the entire embedding layer is retained: MiniLM (b=False) can even outperform its original model in Table 1 and Table 2. However, the benefit of increasing bs is not obvious unless the entire embedding layer is retained.
Second, increasing the number of recurrent unit layers |RU| steadily improves XLM-R and MiniLM on both tasks, and the resulting increase in model size is smaller than that caused by increasing bs. For example, the performance of MiniLM on cross-lingual tasks increases by 8%, while its size grows by only 15.9M.
Finally, when the bottleneck layer is used (b=True), model performance increases steadily as |RU| grows, and the smaller the encoder hidden size, the more significant the effect of increasing |RU| (∆MiniLM > ∆XLM-R). In contrast, increasing bs does not improve performance significantly but makes the embedding layer larger. Therefore, an effective way to compress a multilingual model is to reduce bs while increasing |RU|. In this way, we shrink XLM-R by 58% and MiniLM by 55%, with less than 1.1% performance degradation.

Conclusion
In this work, we show that the cross-lingual similarity matching task requires a large model size. To obtain a small model with cross-lingual matching ability, we propose a multi-stage distillation framework in which knowledge distillation and contrastive learning are combined to compress the model with less semantic performance loss.
Our experiments demonstrate promising STS results on three monolingual and six cross-lingual pairs, covering eight languages. The empirical results show that our framework can shrink XLM-R or MiniLM by more than 50%, while performance is reduced by only 0.6% on monolingual and 1.1% on cross-lingual tasks. If we relax the tolerated performance loss to 4%, the size of XLM-R can be reduced by 80%.