Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation

We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we obtain an embedding that fully represents the sentence's meaning. The proposed method relies solely on parallel corpora, without any human annotations. Our meaning embedding enables efficient cross-lingual sentence similarity estimation through a simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms strong baselines using the original multilingual embeddings. Our method consistently improves the performance of any pre-trained multilingual sentence encoder, even for low-resource language pairs in which only tens of thousands of parallel sentence pairs are available.

In the latest QE competitions at the Conference on Machine Translation (WMT) (Specia et al., 2020), all top-ranked systems (Ranasinghe et al., 2020; Fomicheva et al., 2020a; Nakamachi et al., 2020) employed pre-trained multilingual sentence encoders, such as multilingual BERT (mBERT) (Devlin et al., 2019) and XLM-RoBERTa (XLM-R) (Conneau and Lample, 2019; Conneau et al., 2020). These multilingual sentence encoders are single self-attention networks pre-trained on monolingual corpora in over 100 languages with the masked language modelling objective. Fine-tuning with a human-annotated corpus is mandatory for these models to estimate the semantic similarity between sentences across languages; otherwise, they are not sensitive to semantic similarity.
A sentence encoder that can estimate semantic similarity across languages without fine-tuning for the target task is desirable because bilingual corpora with human annotations are unavailable in most language pairs. Figure 1 plots embeddings of parallel sentences in three languages extracted from mBERT without fine-tuning. This visualisation implies that the mBERT embeddings form clusters by language rather than by meaning.
We propose a method for distilling language-agnostic meaning embeddings by removing language-specific information from sentence embeddings generated by off-the-shelf multilingual sentence encoders. Our embeddings allow efficient cross-lingual sentence similarity estimation using simple cosine similarity. Our method does not require human annotations specific to the target task and relies solely on bilingual corpora. Experimental results on both the WMT20 QE task (Specia et al., 2020) and the SemEval-2017 cross-lingual STS task (Cer et al., 2017) in unsupervised settings revealed that our method consistently outperformed strong baselines built on existing pre-trained multilingual sentence encoders.
2 Related Work

Multilingual Sentence Encoders
Early multilingual sentence encoders, such as LASER (Artetxe and Schwenk, 2019a,b), were encoder-decoder models based on recurrent neural networks. Similar to the evolution of monolingual sentence encoders (Kiros et al., 2015; Logeswaran and Lee, 2018; Cer et al., 2018; Reimers and Gurevych, 2019), multilingual sentence encoders have now been replaced by encoder-only models based on self-attention networks (SANs) (Vaswani et al., 2017) for computational efficiency and improved performance in downstream tasks. Recent multilingual sentence encoders, such as mBERT (Devlin et al., 2019) and XLM-R (Conneau and Lample, 2019; Conneau et al., 2020), are single self-attention networks pre-trained on monolingual corpora in over 100 languages with the objective of masked language modelling (Devlin et al., 2019; Liu et al., 2019). LaBSE (Feng et al., 2020) is a state-of-the-art multilingual sentence encoder for parallel text retrieval, trained with both masked language modelling and translation language modelling (Conneau and Lample, 2019). LaBSE is trained using a maximum of 100 million sentence pairs in each language, with a total of 6 billion sentence pairs of bilingual corpora. We extended these SAN-based multilingual sentence encoders for unsupervised cross-lingual similarity estimation. The multilingual version of Sentence-BERT (SBERT) (Reimers and Gurevych, 2020) was obtained by knowledge distillation from the English version of SBERT (Reimers and Gurevych, 2019). Although this model achieves the best performance in cross-lingual STS tasks, it is not fully unsupervised because SBERT is fine-tuned for STS tasks.

Unsupervised Methods for Cross-lingual Sentence Similarity Estimation

Libovický et al. (2020) extract language-neutral embeddings from pre-trained multilingual sentence encoders using two methods, centering and projection. The centering method subtracts the mean embedding of each language from the sentence embedding. The projection method learns bilingual projections from a parallel corpus and maps embeddings in other languages into the space of English. BERTScore (Zhang et al., 2020) estimates the semantic similarity between sentences by matching token embeddings from BERT (Devlin et al., 2019). Although BERTScore in its original form is a reference-based automatic evaluation method, it can be applied to unsupervised cross-lingual similarity estimation by using multilingual sentence encoders instead of BERT.
D-TP and D-Lex-Sim (Fomicheva et al., 2020b) are unsupervised QE methods; however, they use neural machine translation (NMT) systems that are the targets of QE. D-TP uses a sequence-level translation probability normalised by sentence length. D-Lex-Sim calculates the METEOR score (Banerjee and Lavie, 2005) based on the lexical variation between the translation hypotheses. These methods are useful for white-box machine translation systems; however, in general, users can access only the output sentences.
Prism (Thompson and Post, 2020) and BGT (Wieting et al., 2020) are state-of-the-art unsupervised methods for QE and STS, respectively. These are NMT models that train encoder-decoder SANs on bilingual corpora. Prism uses the generation probability of force-decoding a target sentence as the QE score. BGT disentangles language-specific and language-agnostic embeddings of input sentences based on an auto-encoding mechanism. By calculating the cosine similarity between such language-agnostic embeddings, BGT estimates cross-lingual sentence similarity. The need for large-scale bilingual corpora to train NMT models limits the language pairs that these models can support: while multilingual sentence encoders cover over 100 languages, Prism covers only 39. Although we also extract both language-specific and language-agnostic embeddings, the decoder-free architecture of our model allows it to support low-resource language pairs. In other words, our method is efficient enough to support the massively multilingual scenario.

3 Proposed Method

Although multilingual sentence encoders are useful for cross-lingual NLU, their embeddings are strongly biased by language-specific information, which separates sentence embeddings by language, as shown in Figure 1. We distil language-agnostic meaning embeddings from multilingual sentence embeddings to estimate cross-lingual sentence similarity in an unsupervised manner. By training with bilingual corpora, we bring together the embeddings of semantically similar sentences from pre-trained multilingual sentence encoders. Our model is an autoencoder comprising two multi-layer perceptrons (MLPs), $\mathrm{MLP}_M$ and $\mathrm{MLP}_L$, as shown in Figure 3. The former is responsible for extracting meaning, while the latter extracts language-specific information; their outputs are summed to reconstruct the input sentence embedding.
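As a concrete illustration, the decomposition above can be sketched as follows. The tiny dimensionality, random initialisation, and placeholder embedding are our own stand-ins for illustration, not the paper's actual configuration.

```python
import random

random.seed(0)
D = 4  # sentence-embedding dimension (tiny for illustration; real encoders use 768+)

def make_mlp(dim):
    """A single-layer feedforward network as a (weights, bias) pair (hypothetical init)."""
    W = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    b = [0.0] * dim
    return W, b

def mlp_forward(params, e):
    """Apply y = W e + b."""
    W, b = params
    return [sum(W[i][j] * e[j] for j in range(len(e))) + b[i] for i in range(len(W))]

mlp_m = make_mlp(D)  # meaning encoder MLP_M
mlp_l = make_mlp(D)  # language encoder MLP_L

e = [0.3, -1.2, 0.5, 0.9]                  # frozen multilingual sentence embedding (stand-in)
e_m = mlp_forward(mlp_m, e)                # meaning embedding ê_M
e_l = mlp_forward(mlp_l, e)                # language embedding ê_L
e_rec = [m + l for m, l in zip(e_m, e_l)]  # the model is trained so that ê_M + ê_L ≈ e
```

Both encoders read the same frozen input embedding; only the training losses below push them toward extracting different information.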
We train these MLPs with multilingual and multitask learning using three loss functions:

$$L = L_R + L_M + L_L \quad (1)$$

where $L_R$ is used for reconstruction (Section 3.1), $L_M$ for extracting the meaning (Section 3.2), and $L_L$ for extracting language information (Section 3.3). Figure 2 presents an overview of our multitask training, for which we input a pair of bilingual sentences, (a) and (b), as well as a randomly selected sentence in each language, (c) and (d). The constraints of $L_M$ make the meaning embeddings derived from the parallel sentences (a) and (b) come closer, while the meaning embeddings derived from (a) and (c) (likewise (b) and (d)) become distant. In contrast, the constraints of $L_L$ make the language embeddings derived from (a) and (c), as well as those from (b) and (d), come closer. In addition, $L_L$ further constrains the language embeddings to retain language-specific information through language identification.
We perform multitask learning in a multilingual manner, that is, by mixing all languages, to support the target task. All parameters of $\mathrm{MLP}_M$ are shared across languages (the same holds for $\mathrm{MLP}_L$). Note that our model is trained using only parallel corpora without any human annotations, such as QE labels.

Reconstruction Loss
The reconstruction loss $L_R$ in Equation (1) is the basis of the autoencoder training, which ensures that the meaning and language embeddings, $\hat{e}_M \in \mathbb{R}^d$ and $\hat{e}_L \in \mathbb{R}^d$, respectively, can reconstruct the input sentence embedding $e \in \mathbb{R}^d$ ($d$ is the dimension of the sentence embedding). We define the reconstruction loss as:

$$L_R = \lVert e - (\hat{e}_M + \hat{e}_L) \rVert_2^2$$

The embeddings $\hat{e}_M$ and $\hat{e}_L$ are derived from $e$ using the meaning encoder $\mathrm{MLP}_M(\cdot)$ and the language encoder $\mathrm{MLP}_L(\cdot)$ as follows:

$$\hat{e}_M = \mathrm{MLP}_M(e), \qquad \hat{e}_L = \mathrm{MLP}_L(e)$$
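The reconstruction step can be sketched in code. The squared-error form is our assumption; the text only requires that the two embeddings sum back to the input.

```python
def reconstruction_loss(e, e_m, e_l):
    """L_R: squared L2 distance between the input embedding e and ê_M + ê_L.
    (Squared error is an assumption; any distance that vanishes when
    ê_M + ê_L == e would satisfy the reconstruction constraint.)"""
    return sum((ei - (mi + li)) ** 2 for ei, mi, li in zip(e, e_m, e_l))

# A perfect decomposition yields zero loss
print(reconstruction_loss([1.0, 2.0], [0.4, 1.5], [0.6, 0.5]))
```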

Meaning Embedding Loss
The constraint of $L_M$ in Equation (1) is such that $\mathrm{MLP}_M(\cdot)$ extracts a language-agnostic meaning representation as $\hat{e}_M$. To achieve this, $L_M$ considers a pair of parallel sentences ((a) and (b) in Figure 2) and random sentences of each language ((c) and (d) in Figure 2). The meaning embeddings of the former should be close, while those of the latter should be distant, which is achieved by the losses $L^x_M$ and $L^m_M$, respectively:

$$L_M = L^x_M + L^m_M$$

$L^x_M$ takes the meaning embeddings of parallel sentences, that is, an embedding of a source sentence $\hat{s}_M \in \mathbb{R}^d$ and an embedding of a target sentence $\hat{t}_M \in \mathbb{R}^d$, and computes the cosine distance:

$$L^x_M = 1 - \cos(\hat{s}_M, \hat{t}_M)$$
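The cross-lingual attraction term can be sketched directly from the cosine-distance definition:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def meaning_loss_parallel(s_m, t_m):
    """L^x_M: cosine distance between the meaning embeddings of a parallel
    source/target pair; minimising it pulls translations together."""
    return 1.0 - cosine(s_m, t_m)
```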
In contrast, $L^m_M$ takes the meaning embeddings of the same language, $\hat{s}_M$ and $\hat{s}'_M \in \mathbb{R}^d$. Because these sentences are randomly paired, their meaning embeddings should be distant. The same constraint applies to the meaning embeddings of the other language, $\hat{t}_M$ and $\hat{t}'_M \in \mathbb{R}^d$. We define $L^m_M$ as:

$$L^m_M = \cos(\hat{s}_M, \hat{s}'_M) + \cos(\hat{t}_M, \hat{t}'_M)$$

Table 1: Number of sentence pairs per language pair.
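The corresponding repulsion term can be sketched as follows; the plain similarity sum is our reading of the loss, and a real implementation might clip or margin it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def meaning_loss_random(s_m, s_rand_m, t_m, t_rand_m):
    """L^m_M: summed cosine similarity between the meaning embeddings of
    randomly paired same-language sentences; minimising it pushes
    semantically unrelated meaning embeddings apart."""
    return cosine(s_m, s_rand_m) + cosine(t_m, t_rand_m)
```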

Language Embedding Loss
The constraint of $L_L$ in Equation (1) is such that $\mathrm{MLP}_L(\cdot)$ extracts language-specific information as $\hat{e}_L$. To achieve this, $L_L$ consists of two sub-loss functions, the language embedding loss $L^m_L$ and the language identification loss $L^i_L$:

$$L_L = L^m_L + L^i_L$$

The constraint of $L^m_L$ is such that language embeddings of the same language come closer. In addition, the constraint of $L^i_L$ is such that language embeddings become useful for language identification, which prevents them from collapsing into random noise.
$L^m_L$ takes language embeddings of the same language: $\hat{s}_L$ and $\hat{s}'_L$ for one language, and $\hat{t}_L$ and $\hat{t}'_L$ for the other. Then, $L^m_L$ computes the cosine distances of each pair of language embeddings:

$$L^m_L = \left(1 - \cos(\hat{s}_L, \hat{s}'_L)\right) + \left(1 - \cos(\hat{t}_L, \hat{t}'_L)\right)$$

By minimising the distance between the language embeddings of non-parallel sentences, $L^m_L$ indirectly ensures that meaning and language-specific information are clearly separated. Such non-parallel sentences are written in the same language, but their meanings differ. The constraint of our meaning embedding loss makes these non-parallel sentences distant, while the constraint of our language embedding loss makes their language embeddings come closer. In other words, the meaning and language embedding losses operate in opposite directions for non-parallel sentences. We expect this training to help clearly separate the meaning and language embeddings. In contrast, the language-specific embeddings in BGT (Wieting et al., 2020) are trained with only parallel sentences, which may allow meaning information to leak into the language-specific embeddings.
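The same-language attraction term mirrors the cross-lingual term but acts on language embeddings; a minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def language_loss_same(s_l, s_rand_l, t_l, t_rand_l):
    """L^m_L: summed cosine distances between language embeddings of
    (non-parallel) sentences in the same language; minimising it pulls
    each language's embeddings together."""
    return (1.0 - cosine(s_l, s_rand_l)) + (1.0 - cosine(t_l, t_rand_l))
```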
$L^i_L$ computes the loss for language identification. We conduct language identification using an MLP:

$$\hat{p} = \mathrm{softmax}(\mathrm{MLP}_I(\hat{e}_L))$$

where $\hat{e}_L$ is either $\hat{s}_L$ or $\hat{t}_L$ and $\mathrm{softmax}(\cdot)$ is the softmax function. $L^i_L$ computes the multiclass cross-entropy loss as:

$$L^i_L = -\sum_{c} y_c \log \hat{p}_c$$

where $y$ is the one-hot label of the input sentence's language.
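The language-identification loss can be sketched as softmax plus cross-entropy; here `logits` stands in for the output of $\mathrm{MLP}_I$, which is not implemented.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [x / s for x in exps]

def language_id_loss(logits, gold_language):
    """L^i_L: multiclass cross-entropy over the language-identification
    probabilities; `gold_language` is the index of the true language."""
    return -math.log(softmax(logits)[gold_language])
```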

Training Details
All the MLPs in our model, $\mathrm{MLP}_M$, $\mathrm{MLP}_L$, and $\mathrm{MLP}_I$, are single-layer feedforward networks. We used mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and LaBSE (Feng et al., 2020), which are state-of-the-art pre-trained multilingual sentence encoders (Wolf et al., 2020). We froze the parameters of these multilingual sentence encoders and trained only the MLPs using parallel corpora. We used the output embedding of the [CLS] token as the sentence embedding. We trained our model with a batch size of 512. As an optimiser, we used Adam (Kingma and Ba, 2015) with a learning rate of 1e-4 for all models. We employed early stopping with a patience of 15 based on validation loss. The validation set was created by randomly sampling 10% of the training set.
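The patience-based early-stopping criterion described above can be sketched as a simple check over per-epoch validation losses (checking at epoch granularity is our assumption):

```python
def should_stop(val_losses, patience=15):
    """Return True once the validation loss has not improved for `patience`
    consecutive epochs, mirroring the patience-15 criterion in the text."""
    if not val_losses:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```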

Evaluation
We evaluated the effectiveness of the proposed method on two regression tasks: the WMT20 QE task (Specia et al., 2020) and the SemEval-2017 cross-lingual STS task (Cer et al., 2017). As shown in Figure 4, the meaning embedding of each input sentence is extracted using our meaning encoder. In this experiment, we evaluated the correlation between the cosine similarity of meaning embeddings and human labels. Following the official evaluation metrics, we used the Pearson correlation for both tasks, as implemented in the SciPy package.
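The evaluation pipeline (cosine similarity between meaning embeddings, then Pearson correlation against human labels) can be sketched as follows. We inline a Pearson implementation rather than calling SciPy, and the embeddings and labels are toy values, not data from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(x, y):
    """Pearson correlation coefficient (the official metric for both tasks)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Toy meaning embeddings for three source/translation pairs and toy human labels
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0]), ([0.5, 0.5], [0.4, 0.6])]
human = [4.8, 0.5, 4.0]
sims = [cosine(s, t) for s, t in pairs]
print(pearson(sims, human))  # strongly positive for this toy data
```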

Setting
In this task, we trained our model on the publicly available bilingual corpora that were used to train the target machine translation systems for QE. The dataset contains sentence pairs for English-German (en-de), English-Chinese (en-zh), Romanian-English (ro-en), Estonian-English (et-en), Nepalese-English (ne-en), and Sinhala-English (si-en). To train our model, we randomly sampled 5% of the parallel sentence pairs for each language pair. Table 1 lists the numbers of parallel sentence pairs used in this experiment.
We compared against previous unsupervised QE methods based on pre-trained multilingual sentence encoders. The method proposed by Libovický et al. (2020) obtains language-neutral embeddings from mBERT, denoted as mBERT (centered) and mBERT (projection) (https://github.com/jlibovicky/assess-multilingual-bert). Owing to the lack of a development set to determine which layer to use in these methods, we used the 8th layer, which was reported to perform consistently well in the original paper. LASER (Artetxe and Schwenk, 2019a,b) (https://github.com/facebookresearch/LASER) is a multilingual encoder-decoder model. BERTScore (Zhang et al., 2020) (https://github.com/Tiiiger/bert_score) is a method for estimating sentence similarity by matching token embeddings from a sentence encoder. In this experiment, we used BERTScore with xlm-roberta-large, which has been reported to achieve the highest performance.
Table 2 shows the Pearson correlation coefficients of the compared models. The first set of rows shows the scores of the original mBERT, XLM-R, and LaBSE, and of their meaning embeddings obtained with our method. Our method consistently improved QE performance for all of the multilingual sentence encoders. Among them, the meaning embeddings of LaBSE achieved the best performance. While the meaning embeddings of mBERT and XLM-R are inferior to those of LaBSE, the improvements over the original models are noticeable.

Result
Our method outperformed the original LaBSE, even though the bilingual corpora we used are three orders of magnitude smaller than those used to train LaBSE. This implies that the benefits of our method are not simply due to the use of bilingual corpora, but also due to the effectiveness of the distillation method.
The second set of rows shows the performance of previous methods. The meaning embeddings of LaBSE outperformed these methods for both high- and low-resource language pairs.
The last set of rows shows the performance of other QE models that do not use pre-trained multilingual sentence encoders. These methods achieve higher performance than ours in high-resource language pairs, but not in low-resource language pairs. D-TP and D-Lex-Sim (Fomicheva et al., 2020b) are unsupervised QE methods that use the NMT models that are the targets of QE. In practice, users are unlikely to always have access to these NMT parameters, whereas our method can conduct QE for black-box NMT systems. Prism (Thompson and Post, 2020) is the current state-of-the-art unsupervised QE method, which is based on an encoder-decoder model trained on large-scale bilingual corpora. In contrast, our meaning embeddings of LaBSE efficiently support low-resource language pairs. The last row shows the performance of the Predictor-Estimator (Kim et al., 2017), a supervised QE model that is regarded as a strong baseline for supervised QE tasks. Notably, the meaning embeddings of LaBSE outperformed the supervised Predictor-Estimator in both medium- and low-resource language pairs. Note that, as mBERT does not support Sinhala, mBERT-based models were trained only on language pairs other than si-en.

Result
The first and second sets of rows in Tables 4 and 5 show the Pearson correlation coefficients on the cross-lingual STS task of the original multilingual sentence encoders, of the meaning embeddings produced by our model, and of the baselines, respectively. Similar to the evaluations on the QE task, our method consistently improved the performance of mBERT, XLM-R, and LaBSE, except for Spanish with LaBSE. Our method substantially improved STS performance not only for language pairs with large-scale training data available in Table 3, but also for language pairs with fewer data, such as Arabic (en-ar) and Dutch (en-nl). Table 5 implies that our method improves performance not only in cross-lingual but also in monolingual tasks. The last sets of rows in Tables 4 and 5 show the performance of a state-of-the-art model: the multilingual version of SBERT (Reimers and Gurevych, 2020). It performs knowledge distillation, setting SBERT trained with AllNLI (SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018)) and STSB (Cer et al., 2017) as the teacher and XLM-R as the student. As the teacher model is exposed to training on the STS task, this model is expected to achieve higher performance. Nonetheless, our meaning embeddings of LaBSE showed competitive performance without any STS supervision. The last row shows the performance of BGT (Wieting et al., 2020), which disentangles language-agnostic and language-specific representations using an encoder-decoder model. In contrast to this model, which trains a decoder using a large-scale bilingual corpus, our method achieves higher performance in low-resource language pairs.

Analysis
We further analyse our method through an ablation study and a visualisation of sentence embeddings. Table 6 shows the performance on the QE task when each meaning loss (Section 3.2) and language loss (Section 3.3) is removed from our method. We observe that the model's performance tends to worsen without either constraint. In particular, removing the meaning loss has a serious impact on QE performance. We conjecture that this is because the meaning loss enables learning of semantic equivalence and inequivalence, which is useful for conducting QE. Figure 5 shows the sentence embeddings from mBERT for 1,000 randomly sampled parallel sentences in English and Romanian, with dimensions reduced by principal component analysis (Maćkiewicz and Ratajczak, 1993). Although these parallel sentence pairs represent the same meaning, their embeddings from the original mBERT (left) form clusters by language rather than by meaning, as in Figure 1. By applying our method, the meaning embeddings (centre) become language-agnostic, while the language embeddings (right) are divided more clearly by language. Similar analyses for other languages are shown in Figure 6; the same tendency can be observed regardless of the language pair.
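The dimensionality reduction behind such plots can be sketched with a minimal power-iteration PCA; this is a stand-in for the library routine a real analysis would use, and the data here are arbitrary toy points.

```python
import math
import random

def pca_2d(X):
    """Project rows of X onto their top-2 principal components via power
    iteration with deflation (a minimal PCA sketch for 2-D embedding plots)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[x - m for x, m in zip(row, means)] for row in X]  # centred data

    def cov_mul(v):
        # Compute (Cᵀ C / n) · v without forming the covariance matrix
        Cv = [sum(ci * vi for ci, vi in zip(row, v)) for row in C]
        return [sum(C[i][j] * Cv[i] for i in range(n)) / n for j in range(d)]

    def top_component(ortho=None):
        rng = random.Random(0)
        v = [rng.random() for _ in range(d)]
        for _ in range(200):
            if ortho is not None:  # deflation: stay orthogonal to the first PC
                dot = sum(a * b for a, b in zip(v, ortho))
                v = [a - dot * b for a, b in zip(v, ortho)]
            w = cov_mul(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:       # degenerate direction: no variance left
                return v
            v = [x / norm for x in w]
        return v

    pc1 = top_component()
    pc2 = top_component(ortho=pc1)
    proj = lambda row, pc: sum(a * b for a, b in zip(row, pc))
    return [(proj(row, pc1), proj(row, pc2)) for row in C]
```

Scattering the returned 2-D coordinates, coloured by language, reproduces the kind of view described for Figure 5.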

Conclusion
To achieve unsupervised language-agnostic sentence similarity estimation, we distilled meaning embeddings from pre-trained multilingual sentence encoders. We trained an autoencoder consisting of two MLPs, that is, a meaning encoder and a language encoder, in a multitask and multilingual manner. Our method successfully distils language-agnostic information (i.e., the meaning embedding) by removing language-specific information (i.e., the language embedding) from the original sentence embedding.
Our method has the following advantages: (1) it can be trained using only parallel corpora without any human annotations; (2) based on pre-trained multilingual sentence encoders, our single model can cover more than 100 languages. Experimental results on both the QE and cross-lingual STS tasks revealed that our method consistently improves the performance of the original multilingual sentence encoders, such as mBERT, XLM-R, and LaBSE. Substantial improvements were obtained even from tens of thousands of parallel sentence pairs, achieving the highest performance in QE for low-resource language pairs.