Lightweight Cross-Lingual Sentence Representation Learning

Large-scale models for learning fixed-dimensional cross-lingual sentence representations, such as LASER (Artetxe and Schwenk, 2019b), lead to significant improvements in performance on downstream tasks. However, further scaling up or modifying such large-scale models is usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations. We explore different training tasks and observe that current cross-lingual training tasks leave a lot to be desired for this shallow architecture. To ameliorate this, we propose a novel cross-lingual language model, which combines the existing single-word masked language model with a newly proposed cross-lingual token-level reconstruction task. We further augment training with two computationally-lite sentence-level contrastive learning tasks that enhance the alignment of the cross-lingual sentence representation space, compensating for the learning bottleneck of the lightweight transformer on generative tasks. Our comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the newly proposed training tasks for a shallow model.

The above-mentioned models can be categorized into two classes. On one hand, global fine-tuning methods like mBERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) must be fine-tuned globally, which incurs a significant overhead of its own. On the other hand, fixed-dimensional methods like LASER (Artetxe and Schwenk, 2019b) fix the sentence representations during the pre-training phase, so the subsequent fine-tuning for specific downstream tasks, which does not back-propagate into the pre-trained model, is extremely computationally-lite. Lightweight models have been sufficiently explored for the former group, either by shrinking the model (Lan et al., 2020) or by training a student model (Sanh et al., 2019; Jiao et al., 2020; Reimers and Gurevych, 2020; Sun et al., 2020). However, lightweight models for the latter group, which may hold more promise for deploying task-specific fine-tuning onto edge devices, have not been explored before.
In this work, we propose a variety of training tasks for a lightweight cross-lingual sentence model while retaining robustness. To improve computational efficiency, we utilize a lightweight dual-transformer architecture with just 2 layers, significantly decreasing memory consumption and accelerating training. Our model uses significantly fewer parameters than both global fine-tuning methods like mBERT and fixed-dimensional representation methods like LASER,

and T-LASER (Li and Mak, 2020) (see Table 1).

Table 1: Comparison of model architectures (d_h: hidden size; d_fc: feed-forward filter size; attn_h: number of attention heads; Enc./Dec.: number of encoder/decoder layers).

Method | Architecture | d_h | d_fc | attn_h | Enc. | Dec. | Params.
mBERT (Devlin et al., 2019) | Transformer | 768 | 3,072 | 12 | 12 | N/A | 110M
LASER (Artetxe and Schwenk, 2019b) | Bi-LSTM | 512×2 | N/A | N/A | 5 | 5 | 154M
T-LASER (Li and Mak, 2020) | Transformer | 1,024 | 4,096 | 16 | 6 | 1 | 246M
Ours | Transformer | 512 | 1,024 | 8 | 2 | N/A | 30M

Given a fixed training set and model architecture, the robustness of the sentence representation depends on the training task. It is much more difficult for a lightweight model to learn robust representations merely with existing generative tasks (see Section 2 and Section 4.5), which can be attributed to its smaller size. To ameliorate this problem, we redesign a cross-lingual language model by combining the single-word masked language model (SMLM) with cross-lingual token-level reconstruction (XTR). Furthermore, we introduce two contrastive learning methods as auxiliary tasks to compensate for the learning bottleneck of the lightweight transformer on generative tasks. Following the state-of-the-art fixed-dimensional model LASER, we learn cross-lingual sentence representations from parallel sentences, employing 2-layer dual-transformer encoders to shrink the model architecture. By introducing the above-stated training tasks, we establish a computationally-lite framework for training cross-lingual sentence models.
We evaluate the learned sentence representations on cross-lingual tasks including multilingual document classification (MLDoc) (Schwenk and Li, 2018) and cross-lingual sentence retrieval (XSR). Our results confirm the ability of our lightweight model to yield robust sentence representations. We also systematically study the performance of our model in an ablative manner. The contributions of this work can be summarized as follows:

• We implement fixed-dimensional cross-lingual sentence representation learning in a lightweight model, achieving improved training efficiency and competitive performance of the learned sentence representations.
• Our proposed novel generative and contrastive tasks make cross-lingual sentence representations efficiently trainable by the lightweight model. The contribution of each task is empirically analyzed.

Related Work
A majority of training tasks for learning fixed-dimensional cross-lingual sentence representations can be ascribed to one of the following 2 categories: generative or contrastive. In this section, we revisit the previous work in these 2 categories, which is crucial for designing a cross-lingual representation model.

Generative Tasks. Generative tasks measure a generative probability between predicted tokens and real tokens by training a language model. BERT-style MLM (Devlin et al., 2019) masks and predicts contextualized tokens within a given sentence. For the cross-lingual scenario, cross-lingual supervision is implemented via shared cognates and joint training (Devlin et al., 2019), concatenating source sentences in multiple languages (Conneau and Lample, 2019; Conneau et al., 2020a), or explicitly predicting the translated token (Ren et al., 2019). The [CLS] embedding or the pooled embedding of all the tokens is introduced as the classifier embedding, which can be used as a sentence embedding for sentence-level tasks (Reimers and Gurevych, 2019). Sequence-to-sequence methods (Schwenk and Douze, 2017; España-Bonet et al., 2017; Artetxe and Schwenk, 2019b; Li and Mak, 2020) autoregressively reconstruct the translation of the source sentence. The intermediate state between the encoder and the decoder is extracted as the sentence representation. In particular, the cross-lingual sentence representation quality of LASER (Artetxe and Schwenk, 2019b) benefits from a massively multilingual machine translation task covering 93 languages. In our work, we revisit the BERT-style training tasks and introduce a novel generative loss enhanced by KL-Divergence based token distribution prediction. Our proposed generative task performs effectively for the lightweight dual-transformer framework, whereas other generative tasks must be implemented with a large-capacity model.

Contrastive Tasks. Contrastive tasks measure (contrast) the similarities of sample pairs in the representation space.
Negative sampling, a typical feature of contrastive methods, was first introduced in the work on word representation learning (Mikolov et al., 2013). Subsequently, contrastive tasks gradually emerged in many NLP tasks in various forms: negative sampling in knowledge graph embedding learning (Bordes et al., 2013; Wang et al., 2014), next sentence prediction in BERT (Devlin et al., 2019), token-level discrimination in ELECTRA (Clark et al., 2020), sentence-level discrimination in DeCLUTR (Giorgi et al., 2020), and hierarchical contrastive learning in HICTL. For cross-lingual sentence representation training, typical approaches include using correct and wrong translation pairs (Guo et al., 2018; Feng et al., 2020) or utilizing similarities between sentence pairs via a regularization term (Yu et al., 2018). As another advantage, contrastive methods have proven to be more efficient than generative methods (Clark et al., 2020). Inspired by previous work, for our lightweight model, we propose a robust sentence-level contrastive task by leveraging the similarity relationships arising from translation pairs.

Methodology
We perform cross-lingual sentence representation learning with a lightweight dual-transformer framework. Concerning the training tasks, we propose a novel cross-lingual language model, which combines SMLM and XTR. Moreover, we introduce two sentence-level self-supervised learning tasks (sentence alignment and sentence similarity losses) that leverage robust parallel-level supervision to better align the cross-lingual sentence representation space.

Architecture
We employ a parameter-sharing dual transformer without any decoder as the basic unit to encode the two sides of a parallel sentence pair, avoiding the efficiency loss caused by the presence of a decoder.
Unlike XLM (Conneau and Lample, 2019), we utilize a dual model architecture rather than a single transformer to encode sentence pairs, because it can force the encoder to capture more cross-lingual characteristics (Reimers and Gurevych, 2019; Feng et al., 2020). Moreover, we decrease the number of layers and the embedding dimension to accelerate the training phase, as shown in Table 1.
The architecture of the proposed method is illustrated in Figure 1 (left). We build sentence representations on top of 2-layer transformer (Vaswani et al., 2017) encoders by a mean-pooling operation over the final states of all the positions within a sentence. Pre-trained sentence representations for downstream tasks are denoted by u and v, which are used to compute the loss for the sentence-level contrastive task. Moreover, we add a fully-connected layer before computing the loss of the cross-lingual language model, inspired by previous work. This linear layer enhances our lightweight model by a nontrivial margin, because the hidden state used for computing the generative-task loss is far different from the sentence representation we aim to train. The two transformer encoders and linear layers share parameters, which has been shown effective and necessary for cross-lingual representation learning (Conneau et al., 2020b).
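The mean-pooling step that produces u and v can be sketched as follows. This is a minimal pure-Python illustration; the explicit padding mask and the function name are our assumptions, not the authors' exact implementation:

```python
from typing import List

def mean_pool(hidden_states: List[List[float]], mask: List[int]) -> List[float]:
    """Mean-pool the final encoder states over all non-padding positions.

    hidden_states: seq_len x hidden_dim final-layer states for one sentence.
    mask: 1 for real tokens, 0 for padding positions.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    n = 0
    for state, m in zip(hidden_states, mask):
        if m:
            n += 1
            for i, x in enumerate(state):
                total[i] += x
    return [x / n for x in total]

# Example: 3 positions (the last one is padding), hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
u = mean_pool(states, [1, 1, 0])  # -> [2.0, 3.0]
```

In the full model, the same pooling is applied to both encoders' outputs, yielding the 512-dimensional u and v used by the contrastive losses below.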

Generative Task
Figure 1: Overview of our architecture (left) and training tasks (right). SMLM follows Sabet et al. (2019); XTR and UGT are our proposed methods. q_1 and q_2 respectively denote the 2 distributions at the top of the left sub-figure, the token distributions that we introduce as labels for the model to learn. In the bottom right sub-figure, n denotes the size of a mini-batch, and 1 and 2 represent languages l_1 and l_2, respectively; u_{i1} indicates the sentence representation of the i-th l_1 sentence in the mini-batch, and similarly v_{j2} for l_2.

SMLM. SMLM, proposed by Sabet et al. (2019), is a variant of the standard MLM in BERT (Devlin et al., 2019). SMLM can reinforce monolingual performance, because predicting a number of masked tokens, as in MLM, is too complicated for the shallow transformer encoder to learn. 2 Inspired by this, we implement SMLM with a dual-transformer architecture. The transformer encoder for language l_1 predicts a masked token in an l_1 sentence as the monolingual loss. The l_2 encoder, which shares all parameters with the l_1 encoder, predicts the same masked token from the corresponding sentence (the translation in l_2) as the cross-lingual loss, as shown in Figure 1 (top right). Specifically, for a parallel corpus C and languages l_1 and l_2, the loss of SMLM computed from the l_1 encoder E_{l_1} and
the l_2 encoder E_{l_2} is formulated as:

$\mathcal{L}_{SMLM} = -\sum_{S \in C} \big[ \log P(w_t \mid S_{l_1 \setminus \{w_t\}}; \theta) + \log P(w_t \mid S_{l_2}; \theta) \big]$,  (1)

where w_t is the word to be predicted, S_{l_1 \setminus \{w_t\}} is the sentence in which w_t is masked, S = (S_{l_1}, S_{l_2}) denotes a parallel sentence pair, θ represents the parameters to be trained in E_{l_1} and E_{l_2}, and the classification probability P is computed by a Softmax on top of the embedding layer.

XTR. Inspired by LASER, we also use a reconstruction loss. However, introducing a decoder to implement a translation loss as in LASER would increase the computational overhead of our model, which contradicts our objective of designing a computationally-lite model architecture.
To implement the reconstruction loss with only the encoder, we propose an XTR loss that jointly enforces the encoder to reconstruct the word distribution of the corresponding target sentence, shown as q in Figure 1 (top right). Specifically, we utilize the following KL-Divergence based formulation as the training loss:

$\mathcal{L}_{XTR} = \sum_{S \in C} \big[ D_{KL}\big( q(w_{S_{l_2}}) \,\|\, p(h_{S_{l_1}}; \theta) \big) + D_{KL}\big( q(w_{S_{l_1}}) \,\|\, p(h_{S_{l_2}}; \theta) \big) \big]$,  (2)

where D_KL denotes the KL-Divergence based loss, p(h_{S_l}; θ) represents the hidden state on top of encoder E_l (as shown in Figure 1, left) under the input S_l, and w_{S_l} indicates the set containing all the tokens in S_l. We utilize a discrete uniform distribution over the tokens of the target language to define q for w_{S_l}. Specifically, q(w_{S_l}) is defined as:

$q(w_i) = \frac{N_{w_i}}{|S_l|}, \quad w_i \in w_{S_l}$,  (3)

where N_{w_i} indicates the number of occurrences of word w_i in sentence S_l and |S_l| indicates the length of S_l. 3

Unified Generative Task (UGT). Finally, we unify SMLM (Eq. (1)) and XTR (Eq. (2)) by redefining the label distribution q(w_{S_l}) for the KL-Divergence based loss. As shown in Figure 1 (top right), the model is forced to learn under the supervision of a biased cross-lingual probability distribution of tokens. It is formulated the same as Eq. (3) if the token w_t is masked from the parallel sentence of S_l; otherwise, if w_t is masked within S_l:

$q(w_i) = \frac{N_{w_i} + |S_l| \cdot \mathbb{1}[w_i = w_t]}{2\,|S_l|}, \quad w_i \in w_{S_l}$,  (4)

which places half of the probability mass on the reconstruction distribution of Eq. (3) and half on the masked token w_t.
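Under our reading of the XTR and UGT objectives, the label distributions and the KL-based loss can be sketched in pure Python. The helper names and the even split of probability mass in the unified case are our assumptions, not the authors' implementation:

```python
import math
from collections import Counter
from typing import Dict, List, Optional

def label_distribution(target_tokens: List[str],
                       masked_token: Optional[str] = None) -> Dict[str, float]:
    """q over tokens: the token-frequency distribution for plain XTR;
    a biased distribution when the masked token w_t comes from this sentence
    (half mass on reconstruction, half on w_t)."""
    counts = Counter(target_tokens)
    n = len(target_tokens)
    if masked_token is None:  # XTR: normalized token frequencies, Eq. (3)-style
        return {w: c / n for w, c in counts.items()}
    q = {w: c / (2 * n) for w, c in counts.items()}   # half mass: reconstruction
    q[masked_token] = q.get(masked_token, 0.0) + 0.5  # half mass: SMLM target
    return q

def kl_loss(q: Dict[str, float], p: Dict[str, float]) -> float:
    """D_KL(q || p) over the tokens with non-zero label probability.
    p must assign non-zero probability to every token in q's support."""
    return sum(qi * math.log(qi / p[w]) for w, qi in q.items())

# Unified label for a 3-token target sentence where "cat" is the masked token:
q = label_distribution(["the", "cat", "sat"], masked_token="cat")
# q sums to 1: 1/6 ("the") + (1/6 + 1/2) ("cat") + 1/6 ("sat")
```

The loss is minimized when the model's predicted distribution p matches q, at which point kl_loss returns 0.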

Sentence-Level Contrastive Task
Meanwhile, as shown in Figure 1 (bottom right), we introduce two auxiliary similarity-based training tasks to strengthen sentence-level supervision. We construct these two assisting tasks on the basis of mean-pooled sentence representations, aiming to capture sentence similarity information across languages. Inspired by Guo et al. (2018), Yang et al. (2019), and Feng et al. (2020), we propose a sentence alignment loss, which forces the transformer model to recognize the sentence pairs in which one sentence is the translation of the other. One positive sample and the remaining negative samples contribute to the gradient update in a single batch, providing contrastive training signals for the model. For contrastively discriminating positive and negative samples, we use (batchsize − 1) × 2 negative samples, 4 i.e., all the sentences within a batch except the positive one serve as negative samples.
More precisely, let the mean-pooled sentence representations of S_{l_1} and S_{l_2} be u(S_{l_1}) and v(S_{l_2}). Assume that B_i is a batch of paired sentences, and let u_{ij} and v_{ij} respectively denote the representations of the j-th sentence pair (S_{l_1}^{(j)}, S_{l_2}^{(j)}) in languages l_1 and l_2 within batch B_i. Note that the masked token w_t is omitted in the following equations. The above-proposed in-batch sentence alignment loss to align sentence pairs is defined as:

$\mathcal{L}_{align} = -\sum_{j=1}^{n} \log \frac{ e^{u_{ij} \cdot v_{ij}} }{ e^{u_{ij} \cdot v_{ij}} + \sum_{k \neq j} \big( e^{u_{ij} \cdot v_{ik}} + e^{u_{ij} \cdot u_{ik}} \big) }$.  (5)

4: For each language, there are batchsize − 1 negative samples. Note that this contrastive task differs from those in prior work such as Feng et al. (2020), which utilize cosine similarity; we directly use the inner product to accelerate the model.
We further introduce a sentence similarity loss to better align the similarities of all sentence pairs throughout a batch. By constructing these similarity-based sentence-level contrastive tasks, we aim to force the sentence representations to be competent for sentence-level alignment downstream tasks. Specifically, the in-batch sentence similarity loss L_sim is formulated as:

$\mathcal{L}_{sim} = -\sum_{j=1}^{n} \sum_{k=1}^{n} \log \cos \big( P_{jk} - Q_{jk} \big), \quad P = \mathrm{Softmax}(U V^{\top}), \; Q = \mathrm{Softmax}(V U^{\top})$,  (6)

where the Softmax is applied row-wise, U and V stack the in-batch representations u_{ij} and v_{ij}, and S^{(k)}, S^{(j)} ∈ B_i. 5 In summary, Eq. (5) optimizes a loss for the contrastive task by discriminating the correct translation from the others for a given sentence, as shown in Figure 1 (L_align in bottom right). Eq. (6) aligns the cross-similarities between all sentence pairs within a batch, as shown in Figure 1 (L_sim in bottom right). The similarity score matrix generated by the inner products between sentence pairs in a batch is thus trained toward a symmetric matrix whose diagonal elements approximate 1 after the Softmax operation.
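The in-batch alignment objective can be sketched as follows. This is a simplified illustration: for brevity it draws negatives only from the other language, whereas the text uses (batchsize − 1) × 2 negatives from both languages, and inner products (not cosine) score the pairs:

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def alignment_loss(u: List[List[float]], v: List[List[float]]) -> float:
    """In-batch contrastive loss: for each l1 sentence u[j], the matching
    translation v[j] is the positive; the other in-batch l2 sentences
    act as negatives."""
    n = len(u)
    loss = 0.0
    for j in range(n):
        scores = [dot(u[j], v[k]) for k in range(n)]
        probs = softmax(scores)
        loss += -math.log(probs[j])  # pull u[j] toward v[j], away from v[k]
    return loss / n
```

With well-aligned representations (each u[j] closest to its own v[j]) the loss approaches zero; shuffling the translations inside the batch makes it grow, which is exactly the signal the encoder is trained on.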

Weighted Loss for Generative and Contrastive Tasks
We jointly minimize the loss of the generative task and the two auxiliary contrastive tasks with the weight combination (ω_0, ω_1, ω_2) = (1, 2, 2): 6

$\mathcal{L}(\omega_0, \omega_1, \omega_2) = \mathcal{L}_{XMLM} + 2\,\mathcal{L}_{align} + 2\,\mathcal{L}_{sim}$,

where L_XMLM denotes the loss of Eq. (2) with the label distribution for the KL-Divergence based loss set to the unified distribution formulated by Eq. (4), and L_align and L_sim represent the losses in Eq. (5) and Eq. (6), respectively.

5: With regard to Eq. (6), log cos is employed to implement a regression loss because we focus on the hidden states after the Softmax, which indicate probabilities. We will consider using an MSE loss on the states before the Softmax in future exploration.

Experiments
We evaluate our cross-lingual sentence representation models on cross-lingual document classification and bitext mining, two main downstream tasks that are respectively unrelated and related to the training task. For the former, we select MLDoc (Schwenk and Li, 2018) to evaluate the classifier transfer ability of the cross-lingual model, while for the latter we conduct sentence retrieval on another parallel dataset, Europarl, 7 to evaluate the performance of our models. We build our PyTorch implementation on top of HuggingFace's Transformers library (Wolf et al., 2020). The training data is composed of the ParaCrawl 8 (Bañón et al., 2020) v5.0 datasets for each language pair. We experiment on English-French, English-German, English-Spanish, and English-Italian. We filter the parallel corpus for each language pair by removing sentences that contain tokens outside the 2 languages. The raw and filtered numbers of parallel sentences for each pair are shown in Table 2. 10,000 sentences are selected for validation on each language pair. We tokenize sentences with SentencePiece 9 (Kudo, 2018) and build a shared vocabulary of size 50k for each language pair.

Configuration Details
For each encoder, we use the transformer architecture with 2 hidden layers, 8 attention heads, a hidden size of 512, and a filter size of 1,024; the parameters of the two encoders are shared with each other. The generated sentence representations are 512-dimensional. During training, we minimize the weighted losses of our proposed cross-lingual language model jointly with the 2 auxiliary tasks. We train 12 epochs for each language pair (30 epochs for English-Italian because it has nearly half the number of parallel sentences) with the Adam optimizer, a learning rate of 0.001 with a warm-up strategy for 3 epochs (6 epochs for English-Italian), and a dropout probability of 0.1, on a single TITAN X Pascal GPU with a batch size of 128 paired sentences. The training loss for each language pair converges within 10 GPU (12GB)×days, which is far more efficient than most cross-lingual sentence representation learning methods. 10
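As a rough sanity check on the 30M parameter figure in Table 1, the count implied by this configuration can be estimated. The positional-embedding length and the layer-norm accounting below are approximations on our part:

```python
def transformer_encoder_params(vocab: int, d_h: int, d_fc: int,
                               layers: int, max_pos: int = 512) -> int:
    """Rough parameter count for a BERT-style encoder. The shared dual
    encoders add nothing extra, since all parameters are tied."""
    emb = vocab * d_h + max_pos * d_h            # token + positional embeddings
    attn = 4 * (d_h * d_h + d_h)                 # Q, K, V, output projections
    ffn = d_h * d_fc + d_fc + d_fc * d_h + d_h   # two feed-forward projections
    norms = 2 * 2 * d_h                          # two layer-norms per layer
    return emb + layers * (attn + ffn + norms)

total = transformer_encoder_params(vocab=50_000, d_h=512, d_fc=1024, layers=2)
print(f"{total / 1e6:.1f}M parameters")  # roughly 30M, consistent with Table 1
```

Most of the budget goes to the 50k × 512 embedding table (about 25.6M); the two transformer layers contribute only about 2.1M parameters each, which is why shrinking the depth keeps the model so small.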

Baselines
For evaluation on the MLDoc benchmark, we use the state-of-the-art fixed-dimensional word representation methods MultiCCA+CNN (Schwenk and Li, 2018) and Bi-Sent2Vec (Sanh et al., 2019), the representative fixed-dimensional sentence representation method of Yu et al. (2018), LASER (Artetxe and Schwenk, 2019b), and T-LASER (Li and Mak, 2020) as baselines. In addition, for reference only, we present the results of the global fine-tuning methods mBERT (Devlin et al., 2019) and the state-of-the-art BERT-based variant MultiFiT (Eisenschlos et al., 2019).
Note that T-LASER and LASER are trained on 223M parallel sentences covering 93 languages, which is significantly more training data than ours.
In Appendix A, we also compare with a recent work that uses global fine-tuning methods to generate multilingual sentence representations.

MLDoc: Zero-shot Cross-lingual Document Classification
Table 3: MLDoc benchmark results (zero-shot scenario). We compare our models primarily with fixed-dimensional models, among which Bi-Sent2Vec and LASER are the state-of-the-art bag-of-words based and contextual sentence representation models, respectively. We also compare with global fine-tuning style methods for reference. Each result is the mean value of 5 runs.

Table 4: Cross-lingual sentence retrieval results. We report P@1 scores for 2,000 source queries searching among 200k sentences in the target language. Global fine-tuning style methods are not considered here because they require training data for fine-tuning. The best performances among bilingual representation methods are in bold.

The MLDoc task, which consists of news documents in 8 different languages, is a benchmark to evaluate cross-lingual sentence representations. We conduct our evaluations in a zero-shot scenario: we train and validate a new linear
classifier on top of the pre-trained sentence representations in the source language, and then evaluate the classifier on the test set of the target language. We implement the evaluation with Facebook's MLDoc library. 11 As shown in Table 3, our lightweight transformer model obtains the best results for most language pairs compared with previous fixed-dimensional word and sentence representation learning methods. Our methods yield only slightly worse performance even when compared with the state-of-the-art global fine-tuning style method, MultiFiT (Eisenschlos et al., 2019), on this task. This is because the entire model is updated in the fine-tuning phase, which means more parameters become task-specific after fine-tuning. For fixed-dimensional methods, just an additional dense layer is trained, which leads to their higher efficiency.

11: https://github.com/facebookresearch/MLDoc
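The zero-shot protocol can be illustrated with a toy sketch: a linear classifier is fit on frozen source-language embeddings and then applied unchanged to target-language embeddings. The toy data and the simple perceptron update are illustrative stand-ins on our part, not the actual MLDoc setup:

```python
from typing import List

def train_linear_probe(X: List[List[float]], y: List[int],
                       n_classes: int, epochs: int = 20,
                       lr: float = 0.1) -> List[List[float]]:
    """Fit a linear classifier on frozen sentence embeddings with a simple
    perceptron update; the encoder itself is never touched."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
            pred = max(range(n_classes), key=scores.__getitem__)
            if pred != label:  # perceptron update on misclassification
                W[label] = [wi + lr * xi for wi, xi in zip(W[label], x)]
                W[pred] = [wi - lr * xi for wi, xi in zip(W[pred], x)]
    return W

def predict(W: List[List[float]], x: List[float]) -> int:
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return max(range(len(W)), key=scores.__getitem__)

# Toy "source-language" embeddings for two classes...
X_src = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.1], [0.1, 0.9]]
y_src = [0, 0, 1, 1]
W = train_linear_probe(X_src, y_src, n_classes=2)
# ...applied zero-shot to nearby "target-language" embeddings of the
# same classes, with no update to the classifier or the encoder.
predict(W, [0.95, 0.05])  # class 0
```

Because only W is trained, the cross-lingual transfer hinges entirely on how well the frozen representation space aligns the two languages, which is what MLDoc measures.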

XSR: Cross-lingual Sentence Retrieval
We also conduct an evaluation to gauge the quality of our cross-lingual sentence representations on the bitext mining task, which is identical to some components of the training task. Specifically, given 2,000 sentences in the source language, we retrieve the corresponding sentences from 200K sentences in the target language. We report P@1 scores of our lightweight models and of previous bilingual representation methods as calculated by Artetxe and Schwenk (2019a). As shown in Table 4, we observe that our lightweight models outperform the bilingual pooling-based representation learning methods by a significant margin, which reflects the basic ability of the contextualized representations generated by our lightweight models. However, our lightweight models underperform LASER, which can be attributed to our lightweight capacities and bilingual settings. Note that LASER uses significantly larger multilingual training data (see Section 4.2).
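P@1 on this task can be computed as follows. This sketch assumes retrieval scores each query against every candidate by inner product, consistent with our training objective; the toy data is illustrative:

```python
from typing import List

def p_at_1(queries: List[List[float]], candidates: List[List[float]],
           gold: List[int]) -> float:
    """Fraction of queries whose top-scoring candidate is the gold translation.

    gold[i] is the index into `candidates` of query i's true translation.
    """
    hits = 0
    for q, g in zip(queries, gold):
        scores = [sum(qi * ci for qi, ci in zip(q, c)) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == g)
    return hits / len(queries)

# Toy example: 2 query embeddings against 3 candidates; gold indices 0 and 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
p_at_1(queries, candidates, gold=[0, 2])  # -> 1.0
```

In the actual evaluation the candidate pool holds 200K target-language sentences per query set, so P@1 directly measures how sharply the representation space separates the true translation from a very large distractor set.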

Analyses
We perform ablation experiments to confirm the efficiency and the effectiveness of each training task for our models. Analyses of other hyperparameter configurations (batch size, sentence representation dimension, and training corpus size) are presented in Appendix A.

Relation among Number of Layers, Efficiency, and Performance. We report the efficiency statistics and performance of our proposed methods trained with different numbers of layers. As shown in Table 5, we observe a linear increase in memory occupation and training time per 10,000 training steps as the number of transformer encoder layers increases. Specifically, a 6-layer transformer encoder occupies nearly 2.5 times the memory and costs 1.8 times the training time compared to our 2-layer model. Therefore, given the same memory occupation (obtained by adjusting the batch size), our lightweight model can theoretically be trained over 4 times (≈ 2.5 × 1.8 = 4.5) faster than the 6-layer model. Concerning the respective performances on MLDoc and XSR, we see that the lightweight model with 2 transformer layers obtains the peak performance on MLDoc, and the performance decreases when we add more layers. This indicates that the 2-layer transformer encoder is an ideal structure for our proposed training tasks on the document classification task. On the other hand, performance on XSR keeps increasing gradually with more layers, where even the 1-layer model can yield decent performance on this task.
Our proposed training tasks perform well starting from the 2-layer model, while 6 layers are required for the standard MLM and 5 LSTM layers are required for LASER. This is why we use 2 layers as the basic unit of our model.

Effectiveness of Different Generative Tasks. We report the results with different generative tasks in Table 6. We observe that XTR outperforms the other generative tasks by a significant margin on both the MLDoc and XSR downstream tasks. XTR yields further improvements when unified with SMLM, which is the generative task introduced in our model. This demonstrates the necessity of a well-designed generative task for the lightweight dual-transformer architecture.
Effectiveness of the Contrastive Tasks. In Table 7, we study the contribution of the sentence-level contrastive tasks. We observe that higher performance on MLDoc is yielded by the vanilla model, while adding sentence-level contrastive tasks improves the performance on XSR. This can be attributed to the similar nature of the supervision provided by the sentence-level contrastive tasks and the XSR process. In other words, contrastive-style tasks have a detrimental effect on the document classification downstream task. In future work, we will explore how to train a balanced sentence representation model with contrastive tasks.

Conclusion
In this paper, we presented a lightweight dual-transformer based cross-lingual sentence representation learning method. For the fixed 2-layer dual-transformer framework, we explored several generative and contrastive tasks to ensure the sentence representation quality and to improve the training efficiency. In spite of the lightweight model capacity, we reported substantial improvements on MLDoc compared to fixed-dimensional representation methods and obtained comparable results on XSR. In the future, we plan to verify whether our proposed methods can be combined with knowledge distillation.

Appendix A

[...] representations yield good performance on bitext mining but perform poorly on classification tasks. This demonstrates the importance of exploring task-agnostic multilingual sentence representations like LASER and ours.

Batch Size. We investigate the effect of the batch size on the contrastive tasks, where different batch sizes imply different numbers of negative samples. As shown in Table 9, a larger batch harms lightweight-model-based sentence representation learning, and 128 is the best batch size setting for our lightweight model. Furthermore, a batch size of 128 allows training on a 12GB GPU card, while a larger batch size requires more GPU memory.

Corpus Size. We show the impact of the size of the parallel corpus on English-French in Table 10. For MLDoc, we observe higher accuracy with a larger corpus, while for XSR, a small fraction of the large corpus suffices to yield effective results. This indicates that more parallel data improves the performance on MLDoc.

Sentence Representation Dimension. In Figure 2, we present the effect of the sentence representation dimension. 512-dimensional sentence representations significantly outperform 256-dimensional ones in our lightweight model. Moreover, a representation size of 512 yields better performance without increasing the training time.