Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast

In this paper, we propose to align sentence representations from different languages into a unified embedding space, where semantic similarities (both cross-lingual and monolingual) can be computed with a simple dot product. Pre-trained language models are fine-tuned with the translation ranking task. Existing work (Feng et al., 2020) uses sentences within the same batch as negatives, which can suffer from the issue of easy negatives. We adapt MoCo (He et al., 2020) to further improve the quality of alignment. As the experimental results show, the sentence representations produced by our model achieve new state-of-the-art results on several tasks, including Tatoeba en-zh similarity search (Artetxe and Schwenk, 2019b), BUCC en-zh bitext mining, and semantic textual similarity on 7 datasets.


Introduction
Pre-trained language models like BERT (Devlin et al., 2019) and GPT (Radford and Narasimhan, 2018) have achieved phenomenal success on a wide range of NLP tasks. However, sentence representations for different languages are not well aligned, even for pre-trained multilingual models such as mBERT (Pires et al., 2019). This issue is more prominent for language pairs from different families (e.g., English versus Chinese). Also, previous work has shown that out-of-the-box BERT embeddings perform poorly on monolingual semantic textual similarity (STS) tasks.
There are two general goals for sentence representation learning. First, cross-lingual representations should be aligned, which is a crucial step for tasks like bitext mining (Artetxe and Schwenk, 2019a), unsupervised machine translation (Lample et al., 2018b), and zero-shot cross-lingual transfer (Hu et al., 2020). Second, the representations should induce a metric space, where semantic similarities can be computed with simple functions (e.g., dot product on L2-normalized representations).
Translation ranking (Feng et al., 2020) can serve as a surrogate task to align sentence representations. Intuitively speaking, parallel sentences should have similar representations and are therefore ranked higher, while non-parallel sentences should have dissimilar representations. Models are typically trained with in-batch negatives, which requires a large batch size to alleviate the easy negatives issue (Chen et al., 2020a). Feng et al. (2020) use cross-accelerator negative sampling to enlarge the batch size to 2048 with 32 TPU cores. Such a solution is hardware-intensive and still struggles to scale.
Momentum Contrast (MoCo) (He et al., 2020) decouples the batch size from the number of negatives by maintaining a large memory queue and a momentum encoder. MoCo requires that queries and keys lie in a shared input space. In self-supervised vision representation learning, both queries and keys are transformed image patches. However, for the translation ranking task, queries and keys come from different input spaces. In this paper, we present dual momentum contrast to solve this issue. Dual momentum contrast maintains one memory queue and one momentum encoder for each language, and combines two contrastive losses by performing bidirectional matching.
We conduct experiments on the English-Chinese language pair. Language models that are separately pre-trained for English and Chinese are fine-tuned on the translation ranking task with dual momentum contrast. To demonstrate the improved quality of the aligned sentence representations, we report state-of-the-art results on both cross-lingual and monolingual evaluation datasets: the Tatoeba similarity search dataset (accuracy 95.9% → 97.4%), the BUCC 2018 bitext mining dataset (F1 score 92.27% → 93.66%), and 7 English STS datasets (average Spearman's correlation 77.07% → 78.95%). We also carry out several ablation studies to help understand the learning dynamics of our proposed model.

Figure 1: Illustration of dual momentum contrast. sg denotes "stop gradient". x and y are sentences from two different languages.
Dual Momentum Contrast

Dual Momentum Contrast is a variant of MoCo (He et al., 2020). Our method fits into the bigger picture of contrastive learning for self-supervised representation learning (Le-Khac et al., 2020). Given a collection of parallel sentences {(x_i, y_i)}_{i=1}^{n}, as illustrated in Figure 1, we first encode each sentence using language-specific BERT models (base encoders), then apply mean pooling on the last-layer outputs followed by L2 normalization to get the representation vector h. Each BERT encoder has a momentum encoder, whose parameters θ are updated by an exponential moving average of the base encoder parameters θ̄:

θ_t = m · θ_{t−1} + (1 − m) · θ̄_t,    (1)

where t is the iteration step. One memory queue is maintained for each language to store the K vectors encoded by the corresponding momentum encoder from the most recent batches. The oldest vectors are replaced with the vectors from the current batch upon each optimization step. The momentum coefficient m ∈ [0, 1] is usually very close to 1 (e.g., 0.999) to make sure the vectors in the memory queue are consistent across batches. K can be very large (> 10^5) to provide enough negative samples for learning robust representations.
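The momentum update and queue maintenance can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming (base_encoder, momentum_encoder, queue, queue_ptr), not the released implementation:

import torch

@torch.no_grad()
def momentum_update(base_encoder, momentum_encoder, m=0.999):
    # Exponential moving average (Equation 1): the momentum encoder
    # slowly tracks the base encoder, keeping queued keys consistent.
    for p_base, p_mom in zip(base_encoder.parameters(),
                             momentum_encoder.parameters()):
        p_mom.data.mul_(m).add_(p_base.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # Ring-buffer update of the (K, d) memory queue: the oldest vectors
    # are overwritten by the current batch of momentum-encoder outputs.
    batch_size = keys.shape[0]
    K = queue.shape[0]
    ptr = int(queue_ptr[0])
    queue[ptr:ptr + batch_size] = keys  # assumes K is divisible by batch size
    queue_ptr[0] = (ptr + batch_size) % K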
To train the encoders, we use the InfoNCE loss (Oord et al., 2018):

L(x, y) = −log [ exp(h_x · h_{y_0} / τ) / Σ_{i=0}^{K} exp(h_x · h_{y_i} / τ) ],    (2)

where τ is a temperature hyperparameter. Intuitively, Equation 2 is a (K+1)-way softmax classification, where the translation y_0 of x is the positive, and the negatives are the vectors in the memory queue {y_i}_{i=1}^{K}. Note that gradients do not backpropagate through the momentum encoders or the memory queues.
Symmetrically, we can get L(y, x). The final loss function is the sum of the two:

L = L(x, y) + L(y, x).    (3)

After training is done, we throw away the momentum encoders and the memory queues, and only keep the base encoders to compute sentence representations. In the following, our model is referred to as MoCo-BERT.
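A minimal sketch of Equations 2 and 3 in PyTorch, again with our own names (info_nce, queue) rather than the authors' code:

import torch
import torch.nn.functional as F

def info_nce(h_query, h_pos, queue, tau=0.04):
    # Equation 2: a (K+1)-way softmax classification where the positive
    # (the translation, encoded by the momentum encoder) is class 0 and
    # the K queued vectors are negatives. All inputs are L2-normalized.
    l_pos = (h_query * h_pos).sum(dim=1, keepdim=True)  # (B, 1)
    l_neg = h_query @ queue.t()                         # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(h_query.size(0), dtype=torch.long,
                         device=h_query.device)
    return F.cross_entropy(logits, labels)

# Equation 3: bidirectional matching; h_pos comes from the momentum
# encoder of the other language and carries no gradient.
# loss = info_nce(h_x, h_y_mom, queue_y) + info_nce(h_y, h_x_mom, queue_x)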
Application Given a sentence pair (x_i, y_j) from two different languages, we can compute the cross-lingual semantic similarity by taking the dot product of the L2-normalized representations: h_{x_i} · h_{y_j}. This is equivalent to cosine similarity, and closely related to the Euclidean distance.
Our model can also be used to compute monolingual semantic similarity. Given a sentence pair (x_i, x_j) from the same language, assume y_j is the translation of x_j. If the model is well trained, the representations of x_j and y_j should be close to each other: h_{x_j} ≈ h_{y_j}. Therefore, we have h_{x_i} · h_{x_j} ≈ h_{x_i} · h_{y_j}, where the latter is the cross-lingual similarity that our model explicitly optimizes for.
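In practice, the similarity computation amounts to mean pooling, L2 normalization, and a dot product. The sketch below assumes HuggingFace-style encoders; the encode helper and the model/tokenizer names are our own, not part of any released API:

import torch
import torch.nn.functional as F

def encode(model, tokenizer, sentences):
    # Mean-pool the last layer over non-padding tokens, then L2-normalize.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean
    return F.normalize(pooled, dim=-1)

# Cross-lingual (or monolingual) similarity is then a plain dot product:
# sims = encode(en_model, en_tok, en_sents) @ encode(zh_model, zh_tok, zh_sents).t()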

Setup
Data Our training data consists of English-Chinese corpora from UNCorpus, Tatoeba, News Commentary, and the corpora provided by CWMT 2018. All parallel sentences that appear in the evaluation datasets are excluded. We sample 5M sentences to make the training cost manageable.
Hyperparameters The encoders are initialized with bert-base-uncased (English, for fair comparison) and RoBERTa-wwm-ext (Chinese). Using better pre-trained language models is orthogonal to our contribution. Following Reimers and Gurevych (2019), the sentence representation is computed by mean pooling over the final layer's outputs. The memory queue size is 409,600, the temperature τ is 0.04, and the momentum coefficient is 0.999. We use the AdamW optimizer with a maximum learning rate of 4 × 10^−5 and cosine decay. Models are trained with batch size 1024 for 15 epochs on 4 V100 GPUs. Please see Appendix A for more details about the data and hyperparameters.
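As a rough sketch of this optimization setup (the warmup length below is our assumption; the paper states only the peak learning rate and cosine decay):

import torch
from transformers import BertModel, get_cosine_schedule_with_warmup

model = BertModel.from_pretrained("bert-base-uncased")  # English base encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)

# 5M sentence pairs, batch size 1024, 15 epochs, as stated above.
total_steps = (5_000_000 // 1024) * 15
# Warmup length is our assumption; the paper specifies only cosine decay.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps)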

Cross-lingual Evaluation

Tatoeba cross-lingual similarity search Introduced by Artetxe and Schwenk (2019b), the Tatoeba corpus consists of 1000 English-aligned sentence pairs. We find the nearest neighbor for each sentence in the other language using cosine similarity. Results for both forward and backward directions are listed in Table 1. MoCo-BERT achieves an accuracy of 97.4%.

Table 1: Accuracy on Tatoeba en-zh similarity search.
Model                                  Accuracy
mBERT base (Hu et al., 2020)           71.6%
LASER (Artetxe and Schwenk, 2019b)     95.9%
VECO (Luo et al., 2020)                82
MoCo-BERT base                         97.4%

BUCC 2018 bitext mining aims to identify parallel sentences from a collection of sentences in two languages (Zweigenbaum et al., 2018). Following Artetxe and Schwenk (2019a), we adopt margin-based scoring, which considers the average cosine similarity of the k nearest neighbors (k = 3 in our experiments):

sim(x, y) = margin(cos(x, y), Σ_{z∈NN_k(x)} cos(x, z)/(2k) + Σ_{z∈NN_k(y)} cos(y, z)/(2k)),    (4)

where NN_k(x) denotes the k nearest neighbors of x in the other language. We use the distance margin function margin(a, b) = a − b, which performs slightly better than the ratio margin function (Artetxe and Schwenk, 2019a). All sentence pairs with scores larger than a threshold λ are identified as parallel; λ is tuned on the validation set. The F1 score of our system is 93.66%, as shown in Table 2.

Table 2: F1 scores on BUCC 2018 en-zh bitext mining.
Model                                  F1
mBERT base (Hu et al., 2020)           50.0%
LASER (Artetxe and Schwenk, 2019b)     92.27%
VECO (Luo et al., 2020)                78.5%
SBERT base-p†                          87.8%
LaBSE (Feng et al., 2020)              89.0%
MoCo-BERT base                         93.66%
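A small NumPy sketch of the distance-margin scoring in Equation 4 (our own code and names; the k-NN sets are simplified to be computed within the given candidate matrices):

import numpy as np

def margin_scores(x_emb, y_emb, k=3):
    # Distance-margin scoring (Equation 4) over L2-normalized embeddings:
    # score(x, y) = cos(x, y) minus the average cosine similarity of each
    # side to its k nearest neighbors in the other language.
    cos = x_emb @ y_emb.T                              # (n, m) cosine matrix
    knn_x = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # (n,) avg over k-NN of x
    knn_y = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # (m,) avg over k-NN of y
    return cos - (knn_x[:, None] + knn_y[None, :]) / 2  # margin(a, b) = a - b

# Pairs scoring above the threshold lambda (tuned on the validation set)
# are predicted as parallel.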

Monolingual STS Evaluation
We evaluate the performance of MoCo-BERT on STS without training on any labeled STS data, following the procedure of Reimers and Gurevych (2019). All results are based on BERT base. Given a pair of English sentences, the semantic similarity is computed with a simple dot product. We also report results using labeled natural language inference (NLI) data: a two-layer MLP with 256 hidden units and a 3-way classification head is added on top of the sentence representations, and the training sets of SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) are used for multi-task training. See Appendix B for the detailed setup.
As pointed out by Gao et al. (2021), existing works follow inconsistent evaluation protocols, which may cause unfair comparisons. We report results for both the "weighted mean" (wmean) and "all" settings (Gao et al., 2021) in Tables 3 and 8, respectively.

Model Analysis
We conduct a series of experiments to better understand the behavior of MoCo-BERT. Unless explicitly mentioned, we use a memory queue size of 204,800 for efficiency.
Memory queue size One primary motivation of MoCo is to introduce more negatives to improve the quality of the learned representations. In Figure 2, as expected, the performance consistently increases as the memory queue becomes larger. For visual representation learning, the performance usually saturates at a queue size of ∼65536, but the ceiling is much higher in our case. Also notice that the model can still reach 72.03 with a small batch size of 256, which might be because the encoders have already been pre-trained with MLM.

Temperature A lower temperature τ in the InfoNCE loss makes the model focus more on the hard negative examples, but it also risks over-fitting label noise. Table 4 shows that τ can dramatically affect downstream performance, with τ = 0.04 getting the best results on both the STS and BUCC bitext mining tasks. The optimal τ is likely task-specific.

Momentum Update
We also empirically verify whether the momentum update mechanism is really necessary. Momentum update provides a more consistent matching target but also complicates the training procedure. The results are shown in Table 5. We also compare different pooling methods in Table 6. Consistent with Reimers and Gurevych (2019), mean pooling has a slight but largely negligible advantage over the other methods.
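For reference, common pooling variants for this comparison can be sketched as below; the exact set of methods in Table 6 is not shown here, so [CLS] and max pooling stand in as the usual alternatives:

import torch

def pool(hidden, mask, method="mean"):
    # hidden: (B, T, d) last-layer outputs; mask: (B, T) attention mask.
    mask = mask.unsqueeze(-1).float()
    if method == "cls":
        return hidden[:, 0]  # representation of the [CLS] token
    if method == "max":
        return hidden.masked_fill(mask == 0, -1e9).max(dim=1).values
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling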
In Appendix C, we also showcase some visualization and sentence retrieval results.

Related Work
Multilingual representation learning aims to jointly model multiple languages. Such representations are crucial for multilingual neural machine translation (Aharoni et al., 2019), zero-shot cross-lingual transfer (Artetxe and Schwenk, 2019b), and cross-lingual semantic retrieval. Multilingual BERT (Pires et al., 2019) simply pre-trains on the concatenation of monolingual corpora and shows good generalization on tasks like cross-lingual text classification (Hu et al., 2020). Another line of work explicitly aligns representations from language-specific models, either unsupervised (Lample et al., 2018a) or supervised (Reimers and Gurevych, 2020; Feng et al., 2020). A similar variant of MoCo has also been adopted for open-domain question answering.

Semantic textual similarity is a long-standing NLP task. Early approaches (Seco et al., 2004; Budanitsky and Hirst, 2001) use lexical resources such as WordNet to measure the similarity of texts. A series of SemEval shared tasks (Agirre et al., 2012, 2014) provides a suite of benchmark datasets that are now widely used for evaluation. Since obtaining large amounts of high-quality STS training data is non-trivial, most STS models rely on weak supervision, including conversations (Yang et al., 2018), NLI (Conneau et al., 2017; Reimers and Gurevych, 2019), and QA pairs (Ni et al., 2021).

Conclusion
This paper proposes a novel method that aims to solve the easy negatives issue to better align cross-lingual sentence representations. Extensive experiments on multiple cross-lingual and monolingual evaluation datasets show the superiority of the resulting representations. For future work, we would like to explore other contrastive learning methods (Grill et al., 2020; Xiong et al., 2020), and experiment with more downstream tasks, including paraphrase mining, text clustering, and bilingual lexicon induction.

B Multi-task with NLI
Given a premise x_p and a hypothesis x_h, the sentence representations are computed as stated in the paper. Then, a two-layer MLP with 256 hidden units, ReLU activations, and a 3-way classification head is added on top of the sentence representations. Dropout 0.1 is applied to the hidden units. The loss function L_nli(x_p, x_h) is simply the cross-entropy between the gold label and the softmax outputs. The model is jointly optimized with the following objective:

L = L(x, y) + L(y, x) + α · L_nli(x_p, x_h),

where α is used to balance the different training objectives; we set α = 0.1 empirically. The batch size for the NLI loss is 128. The training set is the union of the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets (∼1M sentence pairs).
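A minimal sketch of this classification head in PyTorch. The input-feature construction is our assumption: concatenating (h_p, h_h, |h_p − h_h|) follows the common practice of Conneau et al. (2017); the paper does not specify it here:

import torch
import torch.nn as nn

class NLIHead(nn.Module):
    # Two-layer MLP with 256 hidden units, ReLU, dropout 0.1, and a
    # 3-way classifier on top of the sentence representations.
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, 3),  # entailment / neutral / contradiction
        )

    def forward(self, h_p, h_h):
        # Feature construction is an assumption, see the lead-in above.
        return self.mlp(torch.cat([h_p, h_h, (h_p - h_h).abs()], dim=-1))

# Joint objective: L(x, y) + L(y, x) + alpha * L_nli, with alpha = 0.1.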

C Visualization of Sentence Representations
To visualize the learned sentence representations, we use t-SNE (Maaten and Hinton, 2008) for dimensionality reduction. In Figure 3, we can see that the representations of parallel sentences are very close, indicating that our proposed model is successful at aligning cross-lingual representations.
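A minimal sketch of the projection step, assuming a matrix of sentence representations (the placeholder data and the perplexity value are our choices):

import numpy as np
from sklearn.manifold import TSNE

# embeddings: (N, 768) sentence vectors from both languages; random
# placeholder here, standing in for the actual model outputs.
embeddings = np.random.randn(2000, 768).astype(np.float32)
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)  # (N, 2)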
In Table 10, we illustrate the results of monolingual sentence retrieval. Most top-ranked sentences indeed share similar semantics with the given query; this paves the way for potential applications like paraphrase mining.

Table 10: Examples of sentence retrieval using learned representations. Given a query, we use cosine similarity to retrieve the 3 nearest neighbors (excluding exact matches). The first column is the cosine similarity score between the query and the retrieved sentence. The corpus consists of 1M random English sentences from the training data.
0.718  They have the right to have their case heard by a jury.
0.647  Every defendant charged with a felony has a right to be charged by the Grand Jury.
0.580  Everyone has the right to be educated.