Universal Sentence Representation Learning with Conditional Masked Language Model

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g. 10% improvement upon baseline models on cross-lingual semantic search. We explore the same language bias of the learned representations, and propose a simple, post-training and model agnostic approach to remove the language identifying information from the representation while still retaining sentence semantics.


Introduction
Sentence embeddings map sentences into a vector space. The vectors capture rich semantic information that can be used to measure semantic textual similarity (STS) between sentences or to train classifiers for a broad range of downstream tasks (Conneau et al., 2017; Subramanian et al., 2018; Logeswaran and Lee, 2018; Cer et al., 2018; Reimers and Gurevych, 2019; Yang et al., 2019a,e). State-of-the-art models are usually trained on supervised tasks such as natural language inference (Conneau et al., 2017), or with semi-structured data like question-answer pairs (Cer et al., 2018) and translation pairs (Subramanian et al., 2018;, 2019a). However, labeled and semi-structured data are difficult and expensive to obtain, making it hard to cover many domains and languages. Conversely, recent efforts to improve language models include the development of masked language model (MLM) pre-training from large scale unlabeled corpora (Devlin et al., 2019; Lan et al., 2020). While internal MLM model representations are helpful when fine-tuning on downstream tasks, they do not directly produce good sentence representations without further supervised (Reimers and Gurevych, 2019) or semi-structured (Feng et al., 2020) fine-tuning.
* Work done during internship at Google Research.
In this paper, we explore an unsupervised approach, called Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations from large scale unlabeled corpora. The CMLM model architecture, illustrated in Fig. 1, integrates sentence representation learning into MLM training by conditioning on sentence level representations produced by adjacent sentences. The model therefore needs to learn effective sentence representations in order to perform MLM well. Since CMLM is fully unsupervised, it can be easily extended to new languages. We explore CMLM for both English and multilingual sentence embeddings for 100+ languages. Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau and Kiela, 2018), even outperforming models learned using (semi-)supervised signals. Moreover, models trained on the English Amazon review data using our multilingual vectors exhibit strong multilingual transfer performance on translations of the Amazon review evaluation data to French, German and Japanese, outperforming existing multilingual sentence embedding models by more than 5% for non-English languages and by more than 2% on English.
We further extend the multilingual CMLM to co-train with a parallel text (bitext) retrieval task, and finetune with cross-lingual natural language inference (NLI) data, inspired by the success of prior work on multitask sentence representation learning (Subramanian et al., 2018; Yang et al., 2019a; Reimers and Gurevych, 2020) and NLI learning (Conneau et al., 2017; Reimers and Gurevych, 2019). We achieve performance 3.6% better than the previous state-of-the-art multilingual sentence representation model (see details in Section 4.2). On the cross-lingual semantic search task, our model outperforms baseline models by 10% on average over 36 languages. Language agnostic representations require semantically similar cross-lingual pairs to be closer in representation space than unrelated same-language pairs (Roy et al., 2020). While we find our original sentence embeddings do have a bias for same-language sentences, we discover that removing the first few principal components of the embeddings eliminates the self language bias.
The rest of the paper is organized as follows. Section 2 describes the architecture for CMLM unsupervised learning. In Section 3 we present CMLM trained on English data and evaluation results on SentEval. In Section 4 we apply CMLM to learn multilingual sentence representations and explore multitask training strategies for effectively combining CMLM, bitext retrieval and cross-lingual NLI finetuning. In Section 5, we investigate self language bias in multilingual representations and propose a simple but effective approach to eliminate it. The pre-trained models are released at https://tfhub.dev/s?q=universal-sentence-encoder-cmlm.

Conditional Masked Language Modeling
We introduce Conditional Masked Language Modeling (CMLM) as a novel architecture for combining next sentence prediction with MLM training. By "conditional", we mean the MLM task for one sentence depends on the encoded sentence level representation of the adjacent sentence. This builds on prior work on next sentence prediction that has been widely used for learning sentence level representations (Kiros et al., 2015; Logeswaran and Lee, 2018; Cer et al., 2018; Yang et al., 2019a), but has thus far produced poor quality sentence embeddings within BERT based models using MLM loss (Reimers and Gurevych, 2019).
While existing MLMs like BERT include next sentence prediction tasks, they do so without any inductive bias to try to encode the meaning of a sentence within a single embedding vector. We introduce a strong inductive bias for learning sentence embeddings by structuring the task as follows. Given a pair of ordered sentences, the first sentence is fed to an encoder that produces a sentence level embedding. The embedding is then provided to an encoder that conditions on the sentence embedding in order to better perform MLM prediction over the second sentence. This is notably similar to Skip-Thought (Kiros et al., 2015), but replaces the generation of the complete second sentence with the MLM denoising objective. It is also similar to T5's MLM-inspired unsupervised encoder-decoder objective (Raffel et al., 2019), with the second encoder acting as a sort of decoder given the representation produced for the first sentence. Our method critically differs from T5's in that a sentence embedding bottleneck is used to pass information between two model components, and in that the task involves denoising a second sentence when conditioning on the first rather than denoising a single text stream.
The first sentence $s_1$ is tokenized and input to a transformer encoder, and a sentence vector $v \in \mathbb{R}^{d}$ is computed from the sequence outputs by average pooling.¹ The sentence vector $v$ is then projected into $N$ spaces, with one of the projections being the identity mapping, i.e. $v_p = P(v) \in \mathbb{R}^{d \times N}$. Here we use a three-layer MLP as the projection $P(\cdot)$. Details of $P(\cdot)$ are available in the supplementary material. One motivation for the projections of $s_1$ is that the MLM of $s_2$ can then attend to various representations of $s_1$ instead of only one. In Section 5.1, we explore different configurations of CMLM, including the number of projection spaces $N$.
The second sentence $s_2$ is then masked following the procedure described in the original BERT paper, including random replacement and the use of unchanged tokens. The second encoder shares the same weights with the encoder used to embed $s_1$.² Tokens in the masked $s_2$ are first converted into token vectors. The masked language modeling of $s_2$ depends on $s_1$, in that the process involves cross-attention between the $s_2$ token vectors and $v_p$. In practice, this is implemented by concatenating the token embeddings of $s_2$ with $v_p$.³ We also experiment with other implementations (see Section 5.1) and empirically find that concatenation works best. The concatenated representations are then provided to the transformer encoder to predict the masked tokens in $s_2$.
At inference time, we keep the first encoding module and discard the subsequent MLM prediction. Similar to Skip-Thought, CMLM trains the encoder to produce sentence embeddings useful for predicting material in the adjacent sentences. CMLM adapts this existing idea to MLM training. Appending multiple projections performs well due to fine-grained attention between tokens and the different views of the sentence embeddings. Note that CMLM differs from Skip-Thought in the following aspects: (a) Skip-Thought relies on an extra decoder network while CMLM only has the encoder. (b) Skip-Thought predicts the entire sentence while CMLM predicts masked tokens only, so the predictions can be done in parallel. These two differences make CMLM more efficient to train than Skip-Thought.
¹ One can equivalently choose other pooling methods, such as max pooling, or use the vector output corresponding to a special token position such as the [CLS] token.
² The dual encoder sharing encoder weights for different inputs is also referred to as a "siamese encoder".
³ Representation concatenation has been used in previous work for enabling cross attention between global vectors and local token embeddings to help the representation learning of long/structured inputs (Ainslie et al., 2020; Manzil Zaheer, 2020).
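The data flow described in this section (average pooling of the first sentence, $N$ projected views with one identity mapping, concatenation with the token embeddings of the second sentence) can be sketched in NumPy as follows. This is a minimal illustration only, not the released implementation: the function names, the stand-in linear projections, and the choice to place the projected vectors before the tokens are all assumptions.

```python
import numpy as np

def mean_pool(token_outputs, mask):
    """Average the encoder outputs over non-padding positions."""
    mask = mask[:, None].astype(token_outputs.dtype)          # (T1, 1)
    return (token_outputs * mask).sum(axis=0) / np.maximum(mask.sum(), 1.0)

def build_cmlm_input(s1_outputs, s1_mask, s2_token_embeddings, projections):
    """Build the sequence the shared encoder sees when predicting masked s2 tokens.

    s1_outputs:          (T1, d) encoder outputs for the first sentence
    s1_mask:             (T1,)   1 for real tokens, 0 for padding
    s2_token_embeddings: (T2, d) embeddings of the masked second sentence
    projections:         list of N-1 callables mapping (d,) -> (d,) (hypothetical)
    """
    v = mean_pool(s1_outputs, s1_mask)                         # sentence vector, (d,)
    # One view is the identity mapping; the others are learned projections.
    v_p = np.stack([v] + [p(v) for p in projections])          # (N, d)
    # Assumed ordering: projected sentence vectors first, then s2 tokens.
    return np.concatenate([v_p, s2_token_embeddings], axis=0)  # (N + T2, d)

# Toy usage with random stand-ins for encoder outputs and projections.
rng = np.random.default_rng(0)
d, T1, T2, N = 8, 5, 6, 3
s1_outputs = rng.normal(size=(T1, d))
s1_mask = np.array([1, 1, 1, 0, 0])
s2_emb = rng.normal(size=(T2, d))
weights = [rng.normal(size=(d, d)) for _ in range(N - 1)]
projections = [lambda x, w=w: x @ w for w in weights]
seq = build_cmlm_input(s1_outputs, s1_mask, s2_emb, projections)
print(seq.shape)  # (9, 8)
```

Because the sentence vector is the only channel through which the first sentence reaches the MLM prediction, it acts as the bottleneck the text describes.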

Learning English Sentence Representations with CMLM
For training English sentence encoders with CMLM, we use three Common Crawl dumps. The data are filtered by a classifier which detects whether a sentence belongs to the main content of the web page or not. We use WordPiece tokenization and the vocabulary is the same as public English uncased BERT. In order to enable the model to learn bidirectional information, for two consecutive sequences $s_1$ and $s_2$, we swap their order 50% of the time. This order-swapping process echoes the prediction of preceding and succeeding sentences in Skip-Thought (Kiros et al., 2015). The lengths of $s_1$ and $s_2$ are set to 256 tokens (the maximum length). The number of masked tokens in $s_2$ is 80 (31.3%), moderately higher than in classical BERT. This change in the ratio of masked tokens makes the task more challenging, since in CMLM language modeling has access to extra information from adjacent sentences. We train with a batch size of 2048 for 1 million steps. The optimizer is LAMB (You et al., 2020) with a learning rate of $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, warm-up in the first 10,000 steps and linear decay afterwards. We explore the same two transformer configurations as in the original BERT paper, i.e., base and large. The number of projections $N$ is set to 15 after experimenting with multiple choices.
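The masking of the second sequence and the 50% order swap can be illustrated with the short NumPy sketch below. It is an assumption-laden example, not the training pipeline: the 80/10/10 split among selected positions is the standard BERT recipe, the token ids (`mask_id=103`, `vocab_size=30522`) are illustrative values, and special tokens are not excluded from selection here.

```python
import numpy as np

def mask_tokens(token_ids, num_to_mask, mask_id, vocab_size, rng):
    """BERT-style corruption of a fixed number of positions.

    Among the selected positions, about 80% become [MASK], about 10% become
    a random token, and the rest stay unchanged. All selected positions are
    prediction targets. Returns (corrupted ids, sorted target positions).
    """
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    positions = rng.choice(len(token_ids), size=num_to_mask, replace=False)
    draw = rng.random(num_to_mask)
    to_mask = positions[draw < 0.8]
    to_random = positions[(draw >= 0.8) & (draw < 0.9)]
    corrupted[to_mask] = mask_id
    corrupted[to_random] = rng.integers(0, vocab_size, size=len(to_random))
    return corrupted, np.sort(positions)

def make_pair(seq_a, seq_b, rng):
    """Swap the order of two consecutive sequences half of the time."""
    return (seq_b, seq_a) if rng.random() < 0.5 else (seq_a, seq_b)

# Toy usage: 80 of 256 positions are selected, as in the English setup.
rng = np.random.default_rng(0)
first = rng.integers(1000, 2000, size=256)
second = rng.integers(1000, 2000, size=256)
s1, s2 = make_pair(first, second, rng)
corrupted_s2, targets = mask_tokens(s2, num_to_mask=80, mask_id=103,
                                    vocab_size=30522, rng=rng)
print(len(targets) / len(s2))  # 0.3125
```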
To evaluate the possible improvements coming from training data and processes, we train standard BERT models (English BERT base/large (CC)) on the same Common Crawl corpora that CMLM is trained on. Similarly, we also train QuickThought, a competitive unsupervised sentence representation learning model, on the same Common Crawl corpora (denoted as "QuickThought (CC)"). To further control for the possible advantage of using a Transformer encoder, we use a Transformer encoder as the sentence encoder in QuickThought (CC). The representations for BERT are computed by averaging the sequence outputs (we also explore options including the [CLS] vector and max pooling; the results are available in the appendix).

Results
Evaluation results are presented in Table 1. The numbers are averaged over 5 runs and the performance variances are provided in the appendix. CMLM outperforms existing models overall, besting MLM (both English BERT and English BERT (CC)) using both base and large configurations. The closest competing model is SBERT, which uses supervised NLI data rather than a purely unsupervised approach. Interestingly, CMLM outperforms SBERT on the SICK-E NLI task even though the latter model is trained on an NLI task. We also evaluate on Semantic Textual Similarity (STS) datasets. As shown in Table 2, CMLM exhibits competitive performance compared with BERT and GloVe. One interesting observation is that CMLM base significantly outperforms other baselines (including CMLM large) on the STS Benchmark dataset.

Learning Multilingual Sentence Representations with CMLM
As a fully unsupervised method, CMLM can be conveniently extended to multilingual modeling even for less well resourced languages. Learning good multilingual sentence representations is more challenging than monolingual ones, especially when attempting to capture the semantic alignment between different languages. As CMLM does not explicitly address cross-lingual alignment, we explore several modeling approaches besides CMLM: (1) Co-training CMLM with a bitext retrieval task; (2) Fine-tuning with cross-lingual NLI data.

Multilingual CMLM
We follow the same configuration used to learn English sentence representations with CMLM, but extend the training data to include more languages. Results below will show that CMLM again exhibits competitive performance as a general technique to learn from large scale unlabeled corpora.

Multitask Training with CMLM and Bitext Retrieval
Besides the monolingual pretraining data, we collect a dataset of bilingual translation pairs $\{(s_i, t_i)\}$ using a bitext mining system (Feng et al., 2020). The source sentences $\{s_i\}$ are in English and the target sentences $\{t_i\}$ cover over 100 languages. We build a retrieval task with the parallel translation data: identifying the corresponding translation of the input sentence from candidates in the same batch. Concretely, incorporating Additive Margin Softmax (Yang et al., 2019b), we compute the bitext retrieval loss $L^{s}_{br}$ for the source sentences as:
$$L^{s}_{br} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{\phi(s_i, t_i) - m}}{e^{\phi(s_i, t_i) - m} + \sum_{j=1, j \neq i}^{B} e^{\phi(s_i, t_j)}}$$
Above, $\phi(s_i, t_j)$ denotes the inner product of the sentence vectors of $s_i$ and $t_j$ (embedded by the transformer encoder); $m$ and $B$ denote the additive margin and the batch size respectively. Note that the way to generate sentence embeddings is the same as in CMLM. We can compute the bitext retrieval loss for the target sentences $L^{t}_{br}$ by normalizing over source sentences, rather than target sentences, in the denominator. The final bitext retrieval loss $L_{br}$ is given as $L_{br} = L^{s}_{br} + L^{t}_{br}$.
There are several ways to incorporate the monolingual CMLM task and bitext retrieval (BR). We explore the following multistage and multitask pretraining strategies:
S1. CMLM+BR: Train with CMLM and BR from the start;
S2. CMLM → BR: Train with CMLM in the first stage and then train on BR;
S3. CMLM → CMLM+BR: Train with only CMLM in the first stage and then with both tasks.
When training with both CMLM and BR, the optimization loss is a weighted sum of the language modeling loss and the retrieval loss $L_{br}$, i.e. $L = L_{CMLM} + \alpha L_{br}$. We empirically find that $\alpha = 0.2$ works well. As shown in Table 4, S3 is found to be the most effective. Unless otherwise denoted, our models trained with CMLM and BR follow S3. We also discover that given a pre-trained transformer encoder, e.g. mBERT, we can improve the quality of sentence representations by finetuning the transformer encoder with CMLM and BR. As shown in Table 4, the improvements of f-mBERT (finetuned mBERT) upon mBERT are significant.
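The in-batch bitext retrieval loss can be sketched in NumPy as below. The sketch assumes the usual additive margin softmax form, in which the margin $m$ is subtracted from the score of each true translation pair only, and it sums the source-side and target-side losses as described in the text. The function names and the toy data are illustrative, not taken from the released code.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def bitext_retrieval_loss(src, tgt, margin=0.3):
    """Bidirectional additive margin softmax over in-batch candidates.

    src, tgt: (B, d) sentence vectors; row i of tgt translates row i of src.
    """
    scores = src @ tgt.T                          # phi(s_i, t_j), shape (B, B)
    scores = scores - margin * np.eye(len(src))   # margin on true pairs only
    pos = np.diag(scores)                         # phi(s_i, t_i) - m
    loss_src = np.mean(logsumexp(scores, axis=1) - pos)  # normalize over targets
    loss_tgt = np.mean(logsumexp(scores, axis=0) - pos)  # normalize over sources
    return loss_src + loss_tgt

def total_loss(cmlm_loss, br_loss, alpha=0.2):
    """Weighted multitask objective L = L_CMLM + alpha * L_br."""
    return cmlm_loss + alpha * br_loss

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 3))
T = rng.normal(size=(4, 3))
l_br = bitext_retrieval_loss(S, T, margin=0.3)
```

Subtracting the margin only on the diagonal makes the true pair harder to score highest, so the encoder has to separate translations from in-batch negatives by at least $m$.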

Finetuning with Cross-lingual Natural Language Inference
Finetuning with NLI data has proved to be an effective method to improve the quality of embeddings for English models. We propose to leverage cross-lingual NLI finetuning for multilingual representations. Given a premise sentence $u$ in language $l_1$ and a hypothesis sentence $v$ in language $l_2$, we train a 3-way classifier on the concatenation $[u, v, |u - v|, u * v]$. The weights of the transformer encoder are also updated in the finetuning process. Unlike previous work that also uses multilingual NLI data (Yang et al., 2019a), the premise $u$ and hypothesis $v$ are in different languages. The cross-lingual NLI data are generated by translating the Multi-Genre NLI Corpus (Williams et al., 2018) into 14 languages using the Google Translate API.
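As a small illustration of the classifier input, the sketch below builds the feature vector $[u, v, |u - v|, u * v]$ and applies a 3-way softmax layer. The single linear layer and its random weights are assumptions made for the example; the text does not specify the classifier head beyond the feature concatenation.

```python
import numpy as np

def nli_features(u, v):
    """Concatenate [u, v, |u - v|, u * v] along the last axis."""
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)

def classify(u, v, W, b):
    """Softmax over 3 classes from a single linear layer (assumed head)."""
    logits = nli_features(u, v) @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy usage: premise and hypothesis embeddings of dimension 2.
u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
feats = nli_features(u, v)      # [1, 2, 3, -1, 2, 3, 3, -2]
rng = np.random.default_rng(0)
W = rng.normal(size=(feats.shape[-1], 3))
b = np.zeros(3)
probs = classify(u, v, W, b)
```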

Configurations
Monolingual training data for CMLM are generated from 3 versions of Common Crawl data in 113 languages. The data cleaning and filtering are the same as for the English-only data. A new cased vocabulary is built from all data sources using the WordPiece vocabulary generation library from Tensorflow Text. The language smoothing exponent of the vocabulary generation tool is set to 0.3, as the distribution of data size across languages is imbalanced. The final vocabulary size is 501,153. The number of projections $N$ is set to 15, the batch size $B$ is 2048 and the additive margin $m$ is 0.3. For CMLM-only pretraining, the number of steps is 2 million. In multitask learning, for S1 and S3, the first stage is of 1.5 million and the second stage is of 1 million steps; for S2, the number of training steps is 2 million. The transformer encoder uses the BERT base configuration. The initial learning rate and optimizer are the same as for the English models. Motivations for choosing these configurations, training details and potential limitations of CMLM are discussed in the appendix.

XEVAL: Multilingual Benchmarks for Sentence Representations Evaluation
Evaluations in previous multilingual literature focused on the cross-lingual transfer learning ability from English to other languages. However, this evaluation protocol, which treats English as the "anchor", does not assess the quality of non-English sentence representations on an equal footing with English ones.
To address the issue, we prepare a new benchmark for multilingual sentence vectors, XEVAL, by translating SentEval (English) into 14 other languages with the Google Translate API. The reliability of XEVAL is discussed in the appendix. Results of models trained with monolingual data are shown in Table 3. Baseline models include mBERT (Devlin et al., 2019), XLM-R (Ruder et al., 2019) and a transformer encoder trained with MLM on the same Common Crawl data (MLM(CC); again, this is to control for the effects of training data).
The method to produce sentence representations for mBERT and XLM-R is chosen to be average pooling after exploring options including [CLS] representations and max pooling. The multilingual CMLM model trained on monolingual data outperforms all baselines in all 15 languages.
Results of models trained with cross-lingual data are presented in Table 4. Baseline models for comparison include LASER (Artetxe and Schwenk, 2019; trained with parallel data) and multilingual USE (Yang et al., 2019a; trained with cross-lingual NLI; note that it only supports 16 languages). Our model (S3) outperforms LASER in all 15 languages. Notably, finetuning with NLI in the cross-lingual way produces significant improvement (S3 + NLI vs. S3). Multitask learning with CMLM and BR can also be used to increase the performance of pretrained encoders, e.g. mBERT. mBERT trained with CMLM and BR (f-mBERT) shows a significant improvement upon mBERT.

Amazon Reviews
We conduct a zero-shot transfer learning evaluation on the Amazon reviews dataset (Prettenhofer and Stein, 2010). Following Chidambaram et al. (2019), the original dataset is converted to a classification benchmark by treating reviews with strictly more than 3 stars as positive and all others as negative. We split 6000 English reviews in the original training set into 90% for training and 10% for development. The two-way classifier, built upon the concatenation of $[u, v, |u - v|, u * v]$ (following works such as Reimers and Gurevych (2019)), is trained on the English training set and then evaluated on English, French, German and Japanese test sets (each has 6000 examples). The same multilingual encoder and classifier are used for all the evaluations. We also experiment with both freezing and not freezing the encoder weights during training. As presented in Table 6, CMLM alone already outperforms baseline models, including Multi-task Dual-Encoder (MTDE, Chidambaram et al. (2019)), mBERT and XLM-R. Training with BR and cross-lingual NLI finetuning further boosts the performance.

Tatoeba: Semantic Search
We test on the Tatoeba dataset proposed in Artetxe and Schwenk (2019) to assess the ability of our models to capture cross-lingual semantics. The task is to find the nearest neighbor for the query sentence in the other language. The experiments are conducted on the 36 languages used in XTREME (Hu et al., 2020). The evaluation metric is retrieval accuracy. Results are presented in Table 5. Our model CMLM+BR outperforms all baseline models in 30 out of 36 languages and has the highest average performance. One interesting observation is that finetuning with NLI actually undermines the model performance on semantic search, in contrast to the significant improvements from CMLM+BR to CMLM+BR+NLI on XEVAL (Table 4). We speculate this is because, unlike semantic search, NLI is not a linear process, so finetuning with NLI is not expected to help linear retrieval by nearest neighbor search.
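The retrieval accuracy metric can be sketched as follows. The example assumes cosine similarity as the nearest neighbor criterion and assumes that candidate $i$ is the gold translation of query $i$; both are conventions chosen for the illustration, and the function name is hypothetical.

```python
import numpy as np

def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate (by cosine) is the gold one.

    queries, candidates: (n, d) arrays; candidates[i] is assumed to be the
    translation of queries[i].
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    nearest = (q @ c.T).argmax(axis=1)
    return float(np.mean(nearest == np.arange(len(queries))))

# Toy usage: aligned embeddings retrieve perfectly; shuffling two candidates
# breaks two of the three matches.
aligned = retrieval_accuracy(np.eye(3), 2.0 * np.eye(3))
shuffled = retrieval_accuracy(np.eye(3), np.eye(3)[[0, 2, 1]])
```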

Ablation Study
We explore different configurations of CMLM, including the number of projection spaces $N$ (Table 7). Projecting the sentence vector into $N = 15$ spaces produces the highest overall performance. We also try a different CMLM architecture. Besides the concatenation with the token embeddings of $s_2$ before input to the transformer encoder, the projected vectors are also concatenated with the sequence outputs of $s_2$ for the masked token prediction. This version of the architecture is denoted as "skip", and its performance is actually worse. Note that the projected vectors can also be used to produce the sentence representation $v_s$, e.g. using the average of the projected vectors $v_s = \frac{1}{N} \sum_{i=1}^{N} v_p^{(i)}$, where $v_p^{(i)}$ is the $i$-th projection. This version is denoted as "proj" in Table 7. Sentence representations produced in this way still yield competitive performance, which further confirms the effectiveness of the projection.

Language Agnostic Properties
Language agnosticism has been a property of great interest for multilingual representations. However, there has not been a qualitative measurement or rigid definition for this property. We propose that "language agnostic" refers to the property that sentence representations are neutral w.r.t. their language information. For example, two sentences with similar semantics should be close in embedding space whether they are in the same language or not. To capture this intuition, we convert the PAWS-X dataset (Yang et al., 2019c) to a retrieval task to measure the language agnostic property. Specifically, PAWS-X consists of English sentences and their translations in six other languages. Given a query, we inspect the language distribution of the retrieved sentences. The similarity between a query $v_{l_1}$ in language $l_1$ and a candidate $v_{l_2}$ in language $l_2$ is computed as the cosine similarity $\frac{v_{l_1}^{T} v_{l_2}}{\|v_{l_1}\|_2 \|v_{l_2}\|_2}$. In Fig. 2, representations of mBERT have a strong self language bias, i.e. sentences in the language matching the query are dominant. In contrast, the bias is much weaker in our model, probably due to the cross-lingual retrieval pretraining.
We also discover that removing the first principal component of each monolingual space from sentence representations effectively eliminates the self language bias. Let $M_{l_1} \in \mathbb{R}^{N \times d}$ denote a monolingual space, where each row of $M_{l_1}$ is an embedding in language $l_1$. For example, in the evaluation on the Tatoeba dataset, the monolingual space matrix $M_{l_1}$ is computed with the texts in language $l_1$ in Tatoeba. The principal component $c_{l_1}$ is the first right singular vector of $M_{l_1}$, which has unit norm. Given a representation $v_{l_1}$ in language $l_1$, the projection of $v_{l_1}$ onto $c_{l_1}$ is removed: $\tilde{v}_{l_1} = v_{l_1} - (v_{l_1}^{T} c_{l_1})\, c_{l_1}$. The similarity score between $v_{l_1}$ and $v_{l_2}$ for cross-lingual retrieval is then computed as $\frac{\tilde{v}_{l_1}^{T} \tilde{v}_{l_2}}{\|\tilde{v}_{l_1}\|_2 \|\tilde{v}_{l_2}\|_2}$. As shown in the second and the fourth columns of Fig. 2, with principal component removal (PCR), the language distribution of retrieved texts is much more uniform.
We also explore PCR on the Tatoeba dataset. Table 8 shows the retrieval accuracy of the multilingual models with and without PCR. PCR increases the overall retrieval performance for both models. This suggests that the first principal component in each monolingual space primarily encodes language identification information.
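Principal component removal can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the monolingual matrix is not mean-centered (the text takes the first right singular vector of the matrix directly), the removed direction is treated as unit-norm, and the toy "languages" are synthetic vectors with a shared semantic part plus a language-specific offset.

```python
import numpy as np

def first_right_singular_vector(M):
    """First right singular vector of M; rows of M are embeddings of one language."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]                                   # unit norm, shape (d,)

def remove_principal_component(M):
    """Subtract from every row its projection onto the first principal direction."""
    c = first_right_singular_vector(M)
    return M - np.outer(M @ c, c)

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Toy usage: two synthetic "languages" sharing semantics but offset along
# different axes (a stand-in for language-identifying information).
rng = np.random.default_rng(0)
semantics = rng.normal(size=(50, 16))
lang_a = semantics + 4.0 * np.eye(16)[0]
lang_b = semantics + 4.0 * np.eye(16)[1]
clean_a = remove_principal_component(lang_a)   # PCR applied per language
clean_b = remove_principal_component(lang_b)
sims = cosine_similarity_matrix(clean_a, clean_b)  # cross-lingual scores after PCR
```

The removed direction is computed separately for each language, so each monolingual space loses only its own dominant component.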
We also visualize sentence embeddings on the Tatoeba dataset in Fig. 3. Our model shows both weak and strong semantic alignment (Roy et al., 2020). Representations are close to others with similar semantics regardless of their languages (strong alignment), especially for French and Russian, where representations form several distinct clusters. Representations from the same language also tend to cluster (weak alignment), whereas representations from mBERT generally exhibit weak alignment.

Conclusion
We present a novel sentence representation learning method, Conditional Masked Language Modeling (CMLM), for training on large scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representations, we discover that co-training CMLM with bitext retrieval and cross-lingual NLI finetuning achieves state-of-the-art performance. We also find that multilingual representations have a same-language bias, and that principal component removal can eliminate the bias by separating language identity information from semantics.

A Methods for Representations
We evaluate different representation methods in Transformer-base models, including CMLM and BERT base (using the model on the official Tensorflow Hub). The experiments are conducted on SentEval. Results in Table 9 show that MEAN representations exhibit better performance than CLS and MAX representations.

B Experiments with different Masking ratios
We test different masking ratios in the CMLM training data. Specifically, we tried masking 40, 60, 80 and 100 of the 256 tokens in the CMLM data. The performance of the resulting models on SentEval is presented in Appendix B.

C Training Configurations and Implementation Details
Projection $P$ in the CMLM modeling. Let $h$ denote the dimension of the input sentence vector (e.g. $h = 768$ in BERT base; $h = 1024$ in BERT large). Let $FC(h_1, h_2, n)$ denote a fully connected layer with input dimension $h_1$, output dimension $h_2$ and nonlinearity function $n$. The three layers are $FC(h, 2h, \mathrm{ReLU})$, $FC(2h, 2h, \mathrm{ReLU})$ and $FC(2h, h, \mathrm{None})$. We tried projections without intermediate layers and observed a drop in training LM accuracy. Adding more layers does not improve the MLM accuracy or downstream task performance. The hidden size $2h$ is chosen empirically based on preliminary experiments in which other hidden sizes were also explored.
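The three-layer projection can be written out as a short NumPy sketch. The layer shapes follow the $FC(h_1, h_2, n)$ description above; the bias terms, the initialization scale and the function names are assumptions made for the example.

```python
import numpy as np

def init_projection(h, rng, scale=0.02):
    """Parameters for FC(h, 2h, ReLU) -> FC(2h, 2h, ReLU) -> FC(2h, h, None)."""
    shapes = [(h, 2 * h), (2 * h, 2 * h), (2 * h, h)]
    # Each layer is a (weight, bias) pair; zero-initialized biases are an assumption.
    return [(rng.normal(scale=scale, size=s), np.zeros(s[1])) for s in shapes]

def project(v, params):
    """Apply the three-layer MLP to a sentence vector (or a batch of them)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    x = np.maximum(v @ w1 + b1, 0.0)   # FC(h, 2h, ReLU)
    x = np.maximum(x @ w2 + b2, 0.0)   # FC(2h, 2h, ReLU)
    return x @ w3 + b3                 # FC(2h, h, None)

def num_parameters(params):
    """Total number of scalars in all weights and biases."""
    return sum(w.size + b.size for w, b in params)

# Toy usage with a small hidden size; BERT base would use h = 768.
rng = np.random.default_rng(0)
h = 16
params = init_projection(h, rng)
v = rng.normal(size=(h,))
out = project(v, params)
```

With biases included, one such projection has $8h^2 + 5h$ parameters, so its cost grows quadratically with the encoder width.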
Configurations for multilingual representations learning. In general, larger batch sizes improve performance until we reach 2048, since each example will see more "mismatched" examples. After 2048, we do not see obvious improvements in performance from increasing batch size. We'll add detailed results on this in the final version. The training steps for different stages are decided on a validation set.
Training Data and infrastructure. English pretraining takes 5 days on 64 TPUs using 1TB of data from Common Crawl dumps 2020-1, 2020-05, 2020-10. More data could be beneficial, but would increase training time.

D Reliability of XEVAL
This section discusses the reliability of XEVAL. XEVAL contains sentence-level data, and we expect its translation not to be too challenging. Inspection by in-house bilingual speakers also confirms the high quality of the translations. Human translation is always preferred, but we are limited by budget and annotator resources (especially for low-resource languages).

E CMLM's Comparison with Next Sentence Prediction (NSP) and Potential Limitations
We tried MLM (CC) with and without NSP, and it does not make much difference on SentEval. The NSP training accuracy quickly converges to 95%, indicating that NSP is not a challenging task. Sentence embedding methods like CMLM can be less effective for sequence labeling (e.g., NER), natural language generation (NLG) and question answering (Q&A).

F Performance Variances
We provide the performance variances of CMLM base and CMLM large on SentEval dataset in Table 11.

Model
MR CR SUBJ MPQA SST TREC MRPC SICK-E SICK-R Avg.