Language-agnostic BERT Sentence Embedding

While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance, by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.


Introduction
In this paper, we systematically explore using pre-trained language models in combination with the best of existing methods for learning cross-lingual sentence embeddings. Such embeddings are useful for clustering, retrieval, and modular use of text representations in downstream tasks. While existing cross-lingual sentence embedding models incorporate large transformer models, using large pre-trained language models is not well explored. Rather, in prior work, encoders are trained directly on translation pairs (Artetxe and Schwenk, 2019b; Guo et al., 2018; Yang et al., 2019a), or on translation pairs combined with monolingual input-response prediction (Chidambaram et al., 2019; Yang et al., 2019b).
In our exploration, as illustrated in figure 1, we make use of dual-encoder models, which have been demonstrated to be an effective approach for learning bilingual sentence embeddings (Guo et al., 2018; Yang et al., 2019a). However, diverging from prior work, rather than training encoders from scratch, we investigate using pre-trained encoders based on large language models. We contrast models with and without additive margin softmax (Yang et al., 2019a). Figure 2 illustrates where our work stands (shaded) in the field of LM pre-training and sentence embedding learning.
Our massively multilingual models outperform the previous state-of-the-art on large bi-text retrieval tasks including the United Nations (UN) corpus (Ziemski et al., 2016) and BUCC (Zweigenbaum et al., 2018). Table 1 compares our best model with other recent multilingual work.
Both the UN corpus and BUCC cover resource-rich languages (fr, de, es, ru, and zh). We further evaluate our models on the Tatoeba retrieval task (Artetxe and Schwenk, 2019b), which covers 112 languages. Compared to LASER (Artetxe and Schwenk, 2019b), our models perform significantly better on low-resource languages, boosting the overall accuracy on 112 languages to 83.7% from the 65.5% achieved by the previous state-of-the-art. Surprisingly, we observe that our models perform well on 30+ Tatoeba languages for which we have no explicit monolingual or bilingual training data. Finally, our embeddings perform competitively on the SentEval sentence embedding transfer learning benchmark (Conneau and Kiela, 2018).
The contributions of this paper are: • A novel combination of pre-training and dual-encoder fine-tuning to boost translation ranking performance, achieving a new state-of-the-art on bi-text mining.
• A publicly released multilingual sentence embedding model spanning 109+ languages.
• Thorough experiments and ablation studies to understand the impact of pre-training, negative sampling strategies, vocabulary choice, data quality, and data quantity.

Cross-lingual Sentence Embeddings
Dual encoder models are an effective approach for learning cross-lingual embeddings (Guo et al., 2018; Yang et al., 2019a). Such models consist of paired encoding models that feed a scoring function. The source and target sentences are encoded separately, and sentence embeddings are extracted from each encoder. Cross-lingual embeddings are trained using a translation ranking task with in-batch negative sampling:

L = −(1/N) Σ_{i=1}^{N} log( e^{φ(x_i, y_i)} / Σ_{n=1}^{N} e^{φ(x_i, y_n)} )   (1)

The embedding space similarity of x and y is given by φ(x, y), typically φ(x, y) = xy^T. The loss attempts to rank y_i, the true translation of x_i, over all N−1 alternatives in the same batch. Notice that L is asymmetric and depends on whether the softmax is over the source or the target sentences. For bidirectional symmetry, the final loss sums the source-to-target loss, L, and the target-to-source loss, L′ (Yang et al., 2019a):

L̄_s = L + L′

Dual encoder models trained using a translation ranking loss directly maximize the similarity of translation pairs in a shared embedding space.
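As a minimal sketch, the bidirectional in-batch ranking loss above can be written in a few lines of numpy (an illustration only; in the actual model, φ is computed over learned BERT embeddings):

```python
import numpy as np

def bidirectional_ranking_loss(x_emb, y_emb):
    """In-batch translation ranking loss (illustrative sketch).

    x_emb, y_emb: (N, d) arrays of source/target sentence embeddings,
    where row i of y_emb is the true translation of row i of x_emb.
    """
    scores = x_emb @ y_emb.T  # phi(x_i, y_j) = x y^T, shape (N, N)

    def softmax_ce(s):
        # mean over rows of -log softmax at the diagonal (true) entries
        s = s - s.max(axis=1, keepdims=True)  # numerical stability
        log_z = np.log(np.exp(s).sum(axis=1))
        return float((log_z - np.diag(s)).mean())

    # L (softmax over targets) plus L' (softmax over sources) for symmetry
    return softmax_ce(scores) + softmax_ce(scores.T)
```

The loss is near zero when each translation pair is far more similar to itself than to the other in-batch candidates, and grows as non-translations are ranked above true translations.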

Additive Margin Softmax
Additive margin softmax extends the scoring function φ by introducing a margin m around positive pairs (Yang et al., 2019a):

φ′(x_i, y_j) = φ(x_i, y_j) − m, if i = j; φ(x_i, y_j), otherwise

The margin, m, improves the separation between translations and nearby non-translations. Using φ′(x_i, y_j) with the bidirectional loss L̄_s, we obtain the additive margin loss:

L_AMS = −(1/N) Σ_{i=1}^{N} log( e^{φ(x_i, y_i) − m} / ( e^{φ(x_i, y_i) − m} + Σ_{n≠i} e^{φ(x_i, y_n)} ) )
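The margin adjustment amounts to subtracting m from the diagonal (positive) entries of the in-batch score matrix, which can be sketched as:

```python
import numpy as np

def additive_margin(scores, m=0.3):
    """Additive margin softmax scoring: subtract the margin m from the
    positive (diagonal) entries only; off-diagonal scores are unchanged."""
    return scores - m * np.eye(scores.shape[0])
```

Feeding the adjusted scores into the bidirectional ranking loss yields the additive margin loss; m = 0.3 is the default margin used in the experiments later in the paper.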

MLM and TLM Pre-training
Only limited prior work has combined dual encoders trained with a translation ranking loss with encoders initialized from large pre-trained language models (Yang et al., 2021). We contrast using a randomly initialized transformer, as was done in prior work (Guo et al., 2018; Yang et al., 2019a), with using a large pre-trained language model. For pre-training, we combine masked language modeling (MLM) (Devlin et al., 2019) and translation language modeling (TLM) (Conneau and Lample, 2019). Multilingual pre-trained models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2019) have led to exceptional gains across a variety of cross-lingual natural language processing tasks (Hu et al., 2020). However, without a sentence-level objective, they do not directly produce good sentence embeddings. As shown in Hu et al. (2020), the performance of such models on bitext retrieval tasks is very weak; e.g., XLM-R Large gets 57.3% accuracy on a selected 37 languages from the Tatoeba dataset, compared to 84.4% using LASER (see performance of more models in table 5). We contribute a detailed exploration that uses pre-trained language models to produce useful multilingual sentence embeddings.

Corpus
We use bilingual translation pairs and monolingual data in our experiments.
Monolingual Data We collect monolingual data from CommonCrawl and Wikipedia. We use the 2019-35 version of CommonCrawl with heuristics from Raffel et al. (2019) to remove noisy text. Additionally, we remove short lines of fewer than 10 characters and those longer than 5,000 characters. The wiki data is extracted from the 05-21-2020 dump using WikiExtractor. An in-house tool splits the text into sentences. The sentences are filtered using a sentence quality classifier. After filtering, we obtain 17B monolingual sentences, about 50% of the unfiltered version. The monolingual data is only used in customized pre-training.
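The character-length heuristic amounts to a one-line predicate, sketched below (the sentence splitter and quality classifier are in-house tools and are not reproduced here):

```python
def keep_line(line: str) -> bool:
    """Keep a line only if it is between 10 and 5,000 characters long,
    mirroring the CommonCrawl length filter described above."""
    return 10 <= len(line) <= 5000
```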

Bilingual Translation Pairs
The translation corpus is constructed from web pages using a bitext mining system similar to the approach described in Uszkoreit et al. (2010). The extracted sentence pairs are filtered by a pre-trained contrastive data-selection (CDS) scoring model (Wang et al., 2018). Human annotators manually evaluate sentence pairs from a small subset of the harvested pairs and mark the pairs as either GOOD or BAD translations. The data-selection scoring model threshold is chosen such that 80% of the retained pairs from the manual evaluation are rated as GOOD. We further limit the maximum number of sentence pairs to 100 million for each language to balance the data distribution; many languages still have far fewer than 100M sentences. The final corpus contains 6B translation pairs. The translation corpus is used for both dual encoder training and customized pre-training.

Configurations
In this section, we describe the training details for the dual encoder model. A transformer encoder is used in all experiments (Vaswani et al., 2017). We train two versions of the model: one uses the public BERT multilingual cased vocab with vocab size 119,547, and a second incorporates a customized vocab extracted over our training data. For the customized vocab, we employ a wordpiece tokenizer (Sennrich et al., 2016), with a cased vocabulary extracted from the training set using TF Text. The language smoothing exponent for the vocab generation tool is set to 0.3 to counter imbalances in the amount of data available per language. The final vocabulary size is 501,153.
The encoder architecture follows the BERT Base model, with 12 transformer blocks, 12 attention heads, and 768 per-position hidden units. The encoder parameters are shared for all languages. Sentence embeddings are extracted as the l2-normalized [CLS] token representations from the last transformer block. Our models are trained on Cloud TPU V3 with 32 cores, using a global batch size of 4096 with a max sequence length of 128, the AdamW optimizer (Loshchilov and Hutter, 2019) with initial learning rate 1e-3, and linear weight decay. We train for 50k steps for models with pre-training, and 500k steps for models without pre-training. We observe that even further training did not change the performance significantly. The default margin value for additive margin softmax is set to 0.3. Hyperparameters are tuned on a held-out development set.
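The embedding extraction and training-time scaling can be sketched as follows (the transformer itself is elided; `cls_hidden` stands in for the last-block [CLS] activations, and the scaling factor of 10 is described in the negative sampling section):

```python
import numpy as np

def sentence_embeddings(cls_hidden):
    """l2-normalize the [CLS] representations from the last transformer
    block to obtain sentence embeddings. cls_hidden: (batch, hidden)."""
    return cls_hidden / np.linalg.norm(cls_hidden, axis=1, keepdims=True)

SCALE = 10.0  # scaling factor applied to normalized embeddings during training

def training_scores(x_cls, y_cls):
    """Scaled pairwise dot-product scores fed to the ranking loss."""
    return SCALE * (sentence_embeddings(x_cls) @ sentence_embeddings(y_cls).T)
```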

Cross-Accelerator Negative Sampling
Cross-lingual embedding models trained with in-batch negative samples benefit from large training batch sizes (Guo et al., 2018). Resource-intensive models like BERT are limited to small batch sizes due to memory constraints. While data parallelism does allow us to increase the global batch size by using multiple accelerators, the batch size on an individual core remains small. For example, a batch of 4096 run across 32 cores results in a local batch size of 128, with each example then only receiving 127 negatives.
We introduce cross-accelerator negative sampling, which is illustrated in figure 3. Under this strategy each core encodes its assigned sentences, and then the encoded sentence representations from all cores are broadcast as negatives to the other cores. This allows us to fully realize the benefits of larger batch sizes while still distributing the computationally intensive encoding work across multiple cores. (During training, the sentence embeddings after normalization are multiplied by a scaling factor; following Chidambaram et al. (2018), we set the scaling factor to 10. We observe that the scaling factor is important for training a dual encoder model with normalized embeddings.)
Note that the dot-product scoring function makes it efficient to compute all pairwise scores within a batch with a single matrix multiplication. In figure 3, the values in the grid indicate the ground truth labels, with all positive labels located on the diagonal. A softmax function is applied to each row.
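The strategy can be sketched as follows; here the cross-accelerator broadcast is simulated by concatenating per-core embedding arrays (on real hardware this would be an all-gather collective across TPU cores):

```python
import numpy as np

def cross_accelerator_scores(per_core_x, per_core_y):
    """Simulate cross-accelerator negative sampling.

    per_core_x, per_core_y: lists of (b, d) embedding arrays, one per core.
    Every source sentence is scored against the full global batch of
    targets, so each example sees N - 1 negatives instead of b - 1."""
    x = np.concatenate(per_core_x)   # gather of all source embeddings
    y = np.concatenate(per_core_y)   # gather of all target embeddings
    scores = x @ y.T                 # all pairwise scores in one matmul
    labels = np.eye(len(x))          # ground truth: positives on the diagonal
    return scores, labels
```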

Pre-training
The encoder is pre-trained with Masked Language Model (MLM) (Devlin et al., 2019) and Translation Language Model (TLM) (Conneau and Lample, 2019) training on the monolingual data and bilingual translation pairs, respectively. For an L-layer transformer encoder, we train using a 3-stage progressive stacking algorithm (Gong et al., 2019), where we first learn an L/4-layer model, then an L/2-layer model, and finally the full L-layer model. The parameters of the models learned in the earlier stages are copied to the models in the subsequent stages. Pre-training uses TPUv3 with 512 cores and a batch size of 8192. The max sequence length is set to 512, and 20% of tokens (or 80 tokens at most) per sequence are masked for MLM and TLM predictions. For the three stages of progressive stacking, we respectively train for 400k, 800k, and 1.8M steps using all monolingual and bilingual data.
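The stacking step can be sketched as below, assuming layer parameters are represented as a list; duplicating the trained shallow layers to initialize the deeper model follows Gong et al. (2019), and the helper names here are illustrative, not from the paper:

```python
import copy

def stack(shallow_layers):
    """Progressive stacking (Gong et al., 2019): initialize a model of
    twice the depth by duplicating the trained shallow layers on top."""
    return shallow_layers + copy.deepcopy(shallow_layers)

def stacking_schedule(L):
    """Three-stage schedule for an L-layer encoder: L/4 -> L/2 -> L."""
    assert L % 4 == 0
    return [L // 4, L // 2, L]
```

For the 12-layer BERT Base encoder used here, the schedule is 3 layers, then 6, then 12.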

Bitext Retrieval
We evaluate models on three bitext retrieval tasks: United Nations (UN), Tatoeba, and BUCC. All tasks are to retrieve the correct English translation for each non-English sentence.
United Nations (UN) contains 86,000 sentence-aligned bilingual documents over five language pairs: en-fr, en-es, en-ru, en-ar, and en-zh (Ziemski et al., 2016). A total of 11.3 million aligned sentence pairs can be extracted from the document pairs (about 9.5 million after de-duping). The large pool of translation candidates makes this data set particularly challenging.
Tatoeba evaluates translation retrieval over 112 languages (Artetxe and Schwenk, 2019b). The dataset contains up to 1,000 sentences per language along with their English translations. We evaluate performance on the original version covering all 112 languages, and also on the 36-language version from the XTREME benchmark (Hu et al., 2020).
BUCC is a parallel sentence mining shared task (Zweigenbaum et al., 2018). We use the 2018 shared task data, containing four language pairs: fr-en, de-en, ru-en, and zh-en. For each pair, the task provides monolingual corpora and gold true translation pairs. The task is to extract translation pairs from the monolingual data, which are evaluated against the ground truth using F1. Since the ground truth for the BUCC test data is not released, we follow prior work in using the BUCC training set for evaluation rather than training (Yang et al., 2019b; Hu et al., 2020). Sentence embedding cosine similarity is used to identify the translation pairs. (Reranking models can further improve performance, e.g., the margin-based scorer of Artetxe and Schwenk (2019a) and the BERT-based classifier of Yang et al. (2019a); however, this is tangential to assessing the raw embedding retrieval performance.)
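The cosine-similarity retrieval step can be sketched as a nearest-neighbor search (a simplified illustration: embeddings are assumed l2-normalized so the dot product equals cosine similarity, and the threshold value here is hypothetical):

```python
import numpy as np

def mine_translation_pairs(src_emb, tgt_emb, threshold=0.7):
    """For each source sentence, find the most cosine-similar target and
    keep the pair if its similarity clears the threshold.
    src_emb, tgt_emb: l2-normalized embedding matrices."""
    sims = src_emb @ tgt_emb.T
    best = sims.argmax(axis=1)
    keep = sims[np.arange(len(best)), best] >= threshold
    return [(int(i), int(best[i])) for i in np.where(keep)[0]]
```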

Downstream Classification
We also evaluate the transfer performance of multilingual sentence embeddings on downstream classification tasks from the SentEval benchmark (Conneau and Kiela, 2018). We evaluate on select tasks from SentEval, including (MR) movie reviews (Pang and Lee, 2005) and (SST) sentiment classification.

Results
Table 2 shows the performance on the UN and Tatoeba bitext retrieval tasks and compares against the prior state-of-the-art bilingual models of Yang et al. (2019a), LASER (Artetxe and Schwenk, 2019b), and the multilingual universal sentence encoder (m-USE) (Yang et al., 2019b). Rows 1-3 show the performance of baseline models, as reported in the original papers.
Rows 4-7 show the performance of models that use the public mBERT vocabulary. The baseline model shows reasonable performance on UN, ranging from 57%-71% P@1. It also performs well on Tatoeba, with 92.8% and 79.1% accuracy for the 36-language group and all languages, respectively. Adding pre-training both helps models converge faster (see details in section 6.2) and improves performance on UN retrieval using both vocabularies. Pre-training also helps on Tatoeba, but only using the customized vocabulary. Additive margin softmax significantly improves the performance on all model variations.
The last two rows contain two models using the customized vocab. Both of them are trained with additive margin softmax, given the strong evidence from the experiments above. Both models outperform the mBERT-vocabulary-based models, and the model with pre-training performs best of all. The top model (Base w/ Customized Vocab + AMS + PT) achieves a new state-of-the-art on 3 of the 4 languages, with P@1 of 91.1, 88.3, and 90.8 for en-es, en-fr, and en-ru, respectively. It reaches 87.7 on zh-en, only 0.2 lower than the best bilingual en-zh model and nearly 9 points better than the previous best multilingual model. On Tatoeba, the best model also outperforms the baseline model by a large margin, with +10.6 accuracy on the 36-language group from XTREME and +18.2 on all languages.
It is worth noting that all our models perform similarly on Tatoeba but not on UN. This suggests it is necessary to evaluate on large-scale bitext retrieval tasks to better discern differences between competing models. In the rest of the paper, we use LaBSE to refer to the best-performing model here, Base w/ Customized Vocab + AMS + PT, unless otherwise specified.
Table 3 provides LaBSE's retrieval performance on BUCC, comparing against strong baselines from Artetxe and Schwenk (2019a) and Yang et al. (2019a). Following prior work, we perform both forward and backward retrieval. Forward retrieval treats en as the target and the other language as the source; backward retrieval is vice versa. LaBSE not only systematically outperforms prior work but also covers all languages within a single model. The previous state-of-the-art required four separate bilingual models (Yang et al., 2019a).

Results on Downstream Classification Tasks
Table 4 gives the transfer performance achieved by LaBSE on the SentEval benchmark (Conneau and Kiela, 2018), comparing against other state-of-the-art sentence embedding models. Despite its massive language coverage in a single model, LaBSE still obtains competitive transfer performance with monolingual English embedding models and the 16-language m-USE model.

Additive Margin Softmax
The above experiments show that additive margin softmax is a critical factor in learning good cross-lingual embeddings, which is aligned with the findings of Yang et al. (2019a). We further investigate the effect of margin size on our three model variations, as shown in figure 4. The model with an additive margin value of 0 performs poorly on the UN task, with ∼60 average P@1 across all three model variations. With a small margin value of 0.1, the models improve significantly compared to no margin, reaching average P@1 in the 70s to 80s. Increasing the margin value keeps improving performance until it reaches 0.3. The trend is consistent across all models.

Low Resource Languages and Languages without Explicit Training Data
We evaluate performance through further experiments on Tatoeba, for comparison to prior work and to identify broader trends. Besides the 36-language group and the all-languages group, two more groups are evaluated: 14 languages (selected from the languages covered by m-USE) and 82 languages (covered by the LASER training data). Table 5 provides the macro-average accuracy achieved by LaBSE for the four language groupings drawn from Tatoeba, comparing against LASER and m-USE. All three models perform well on the 14 major languages supported by m-USE, with each model achieving an average accuracy >93%. Both LaBSE and LASER perform moderately better than m-USE, with an accuracy of 95.3%. As more languages are included, the average accuracy for both LaBSE and LASER decreases, but with a notably more rapid decline for LASER. LaBSE systematically outperforms LASER on the groups of 36 languages (+10.6%), 82 languages (+11.4%), and 112 languages (+18.2%). (We note that it is relatively easy to get 200M parallel examples for many languages from public sources like Paracrawl and TED58, while obtaining 1B examples is generally much more challenging.)
Figure 6 lists the Tatoeba accuracy for languages where we don't have any explicit training data. There are a total of 30+ such languages. The performance is surprisingly good for most of these languages, with an average accuracy around 60%. Nearly one third of them have accuracy greater than 75%, and only 7 of them have accuracy lower than 25%. One possible reason is that language mapping is done manually, and some languages are close to languages with training data but may be treated differently according to ISO-639 standards and other information. Additionally, since automatic language detection is used, some limited amount of data for the missing languages might be included during training. We also suspect that the well-performing languages are close to some language for which we have training data. For example, yue and wuu are related to zh (Chinese), and fo has similarities to is (Icelandic). Multilingual generalization across so many languages is only possible due to the massively multilingual nature of LaBSE.

Semantic Similarity
The Semantic Textual Similarity (STS) benchmark (Cer et al., 2017) measures the ability of models to replicate fine-grained human judgements on pairwise English sentence similarity. Models are scored according to their Pearson correlation, r, on gold labels ranging from 0, unrelated meaning, to 5, semantically equivalent, with intermediate values capturing carefully defined degrees of meaning overlap. STS is used to evaluate the quality of sentence-level embeddings by assessing the degree to which similarity between pairs of sentence embeddings aligns with human perception of sentence meaning similarity. Table 6 reports performance on the STS benchmark for LaBSE versus existing sentence embedding models. Following prior work, the semantic similarity of a sentence pair according to LaBSE is computed as the arc cosine distance between the pair's sentence embeddings. For comparison, we include numbers for SentenceBERT when it is fine-tuned for the STS task, as well as ConvEmbed when an additional affine transform is trained to fit the embeddings to STS. We observe that LaBSE performs worse on pairwise English semantic similarity than other sentence embedding models. We suspect training LaBSE on translation pairs biases the model to excel at detecting meaning equivalence, but not at distinguishing between fine-grained degrees of meaning overlap.

Model                                           dev    test
SentenceBERT (Reimers and Gurevych, 2019)       -      79.2
m-USE (Yang et al., 2019b)                      83.7   82.5
USE (Cer et al., 2018)                          80.2   76.6
ConvEmbed (Yang et al., 2018)                   81.4   78.2
InferSent (Conneau et al., 2017)                80.1   75.6
LaBSE                                           74.3   72.8
STS Benchmark Tuned
SentenceBERT-STS (Reimers and Gurevych, 2019)   -      86.1
ConvEmbed (Yang et al., 2018)                   83.5   80.8

Table 6: Semantic Textual Similarity (STS) benchmark (Cer et al., 2017) performance as measured by Pearson's r.
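One common way to turn the arc cosine distance into a similarity score is the sketch below (an assumption for illustration; the paper does not spell out the exact normalization it applies):

```python
import numpy as np

def arccos_similarity(u, v):
    """Map the angle between two embeddings to [0, 1], where 1 means
    identical direction and 0 means opposite direction. The cosine is
    clipped to guard against floating-point values outside [-1, 1]."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos = float(np.clip(u @ v, -1.0, 1.0))
    return 1.0 - np.arccos(cos) / np.pi
```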
Recently, Reimers and Gurevych (2020) showed that one can distill an English sentence representation model into a student multilingual model using a language alignment loss. The distilled model performs well on (multilingual) STS benchmarks but underperforms on bitext retrieval tasks when compared to state-of-the-art models. Our approach is complementary and can be combined with their method to distill better student models.

Mining Parallel Text from CommonCrawl
We use the LaBSE model to mine parallel text from CommonCrawl, a large-scale multilingual web corpus, and then train NMT models on the mined data. We experiment with two language pairs: English-to-Chinese (en-zh) and English-to-German (en-de). We mine translations from monolingual CommonCrawl data processed as described above for self-supervised MLM pre-training. After processing, there are 1.17B, 0.6B, and 7.73B sentences for Chinese (zh), German (de), and English (en), respectively. LaBSE embeddings are used to pair each non-English sentence with its nearest English neighbor, dropping pairs with a similarity score < 0.6. (The threshold 0.6 is selected by manually inspecting a data sample, where pairs at or above this threshold are likely to be translations or partial translations of each other. This results in 715M and 302M sentence pairs for en-zh and en-de, respectively. Note that the pairs may still be noisy, so we resort to data selection to pick higher-quality sentence pairs for training NMT models.) For en-de and en-zh, we train a model with Transformer-Big (Vaswani et al., 2017) in the following way: first we train the model on the mined data as-is for 120k steps with batch size 10k; then we select the best 20% using Wang et al. (2018)'s data selection method and train for another 80k steps.

Results in table 7 show the effectiveness of the mined training data. By referencing previous results (Edunov et al., 2018), we see the mined data yields performance that is only 2.8 BLEU away from that of the best system that made use of the WMT17 en-de parallel data. Compared to prior en-zh results, we see that the model is as good as a WMT17 NMT model (Sennrich et al., 2017) trained on the WMT en-zh parallel data. The table also gives BLEU performance on the TED test set (Qi et al., 2018), with performance being comparable to models trained using CCMatrix (Schwenk et al., 2019).

Conclusion
This paper presents a language-agnostic BERT sentence embedding (LaBSE) model supporting 109 languages. The model achieves state-of-the-art performance on various bi-text retrieval/mining tasks compared to the previous state-of-the-art, while also providing increased language coverage. We show the model performs strongly even on languages for which LaBSE has no explicit training data, likely due to language similarity and the massively multilingual nature of the model. Extensive experiments show that additive margin softmax is a key factor for training the model, that parallel data quality matters, and that the effect of increased amounts of parallel data diminishes when a pre-trained language model is used. The pre-trained model is released at https://tfhub.dev/google/LaBSE.

A LaBSE Large
Motivated by the recent progress of giant models, we also train a model with increased capacity. Following BERT Large, we develop LaBSE Large using a 24-layer transformer with 16 attention heads and a hidden size of 1024. Constrained by computation resources, we train for 1M steps with one-stage pre-training instead of the progressive multi-stage pre-training used when training the LaBSE model. Fine-tuning configs are exactly the same as for the base LaBSE model. We suspect that the translation matching training objective is too easy, so the model cannot learn more information from the current in-batch negative sampling approach. An improved negative contrast could help the larger model learn better representations. We experimented with one type of hard negatives in the section below, but more types of hard negatives could be explored, as described in Lu et al. (2020). We leave this as future work.

B Hard Negative Mining
Since their introduction into models that use dual encoders to learn cross-lingual embeddings, hard negatives (Guo et al., 2018) have become the de facto data augmentation method for learning cross-lingual sentence embeddings (Chidambaram et al., 2019; Yang et al., 2019a). To get the hard negatives, a weaker dual encoder model is trained using a similar architecture but with fewer parameters and less training data. For each training example, those incorrect translations that are semantically similar to the correct translation are retrieved as "hard negatives" from a candidate pool. Semantic similarity is determined using the cosine similarity of the embeddings generated by the weaker model. It is challenging to apply hard negatives to large datasets, as mining them is very time consuming and computationally costly.
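The retrieval of hard negatives can be sketched as follows (an illustration using the weak model's target-side embeddings only; the actual pipeline of Guo et al. (2018) is more involved):

```python
import numpy as np

def hard_negative_indices(weak_tgt_emb, k=3):
    """For each training pair i, return indices of the k candidate targets
    whose weak-model embeddings are most cosine-similar to the true target
    weak_tgt_emb[i], excluding the true target itself.
    weak_tgt_emb: l2-normalized (N, d) target embeddings."""
    sims = weak_tgt_emb @ weak_tgt_emb.T
    np.fill_diagonal(sims, -np.inf)        # never pick the correct translation
    return np.argsort(-sims, axis=1)[:, :k]
```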
We investigate hard negative mining closely following Guo et al. (2018). By contacting the original authors, we obtained their negative mining pipeline, which employs a weaker dual encoder that uses a deep averaging network trained to identify translation pairs. Similar to the cross-accelerator negatives, the mined negatives are appended to each example.
We only experiment using hard negatives for Spanish (es), as it is very costly to get hard negatives for all languages. Due to memory constraints, we only append 3 mined hard negatives in es for each en source sentence. Since the number of examples increases 4x per en sentence in es batches, we also decrease the batch size from 128 to 32 in the hard negative experiment. For languages other than es, the training data is the same as in the other experiments, but with the batch size likewise decreased to 32. Table 9 shows the results of these models on UN. The accuracy on all four languages went down, even for en-es, where we have the hard negatives. We suspect the worse performance is caused by the decrease in batch size required by the memory constraints of including more hard negatives per example.

C Supported Languages
The supported languages are listed in table 10. The distribution for each supported language is shown in figure 7.

Figure 1: Dual encoder model with BERT based encoding modules.

Figure 3: Negative sampling example in a dual encoder framework. [Left]: In-batch negative sampling on a single core; [Right]: Synchronized multi-accelerator negative sampling using n TPU cores and a batch size of 8 per core, with examples from all other cores treated as negatives.

Figure 4: Average P@1 (%) on the UN retrieval task for models trained with different margin values.
Figure 5: Average P@1 (%) on the UN retrieval task for models trained for different numbers of training steps.

Table 3: F-score on the BUCC training set using cosine similarity scores. The thresholds are chosen for the best F-scores on the training set. Following the naming of the BUCC task (Zweigenbaum et al., 2018), we treat en as the target and the other language as the source in forward search; backward is vice versa.

Table 4: Performance on English transfer tasks from SentEval.

Table 7: The number of source/target sentences and the amount of mined parallel text from CommonCrawl.

Table 8: P@1 on UN (en→xx). Table 8 shows the UN performance of the LaBSE Large model compared to the base LaBSE model. The results are mixed, and the average performances are very close. We also evaluate the model on Tatoeba, where the average performances across all languages are likewise very close: 83.7 (LaBSE) vs. 83.8 (LaBSE Large).

Table 9: P@1 on UN (en→xx) with hard negative examples for en-es.