Multi-Granularity Contrasting for Cross-Lingual Pre-Training

Cross-lingual pre-training aims at providing effective prior representations for inputs from multiple languages. With the modeling of bidirectional contexts, recently prevalent language modeling approaches such as XLM achieve better performance than traditional methods based on embedding alignment, which strive to assign similar vector representations to semantic-equivalent units. However, approaches like XLM capture cross-lingual information based solely on a shared BPE vocabulary, resulting in the absence of the fine-grained supervision induced by embedding alignment. Inheriting the advantages of the above two paradigms, this work presents a multi-granularity contrasting framework, namely MGC, to learn language-universal representations. While predicting the masked words based on bidirectional contexts, the proposal also encodes semantic equivalents from different languages into similar representations to introduce more fine-grained and explicit cross-lingual information. Two effective contrasting strategies are further proposed, which can be built upon semantic units of multiple granularities covering words, spans, and sentences. Extensive experiments demonstrate that our approach achieves significant performance gains in various downstream tasks, including machine translation and cross-lingual language understanding.


Introduction
Cross-lingual pre-training (Lample and Conneau, 2019) has achieved striking success in the field of natural language processing. By providing effective prior representations for inputs from different languages, it has boosted performance on various downstream tasks such as machine translation and cross-lingual language understanding. Early efforts regarding cross-lingual pre-training mainly focus on embedding alignment (Mikolov et al., 2013b), which is targeted at assigning similar vector representations to semantic-equivalent units (e.g., parallel bilingual word or sentence pairs). For instance, Mikolov et al. (2013b) attempt to project pre-trained monolingual word embeddings from two languages into a common semantic space with a simple linear transformation, so that parallel bilingual words share the same representation. This allows the introduction of explicit fine-grained supervision to guarantee the representational similarity of semantic equivalents, but neglects the modeling of bidirectional contexts. Going a step further, recently prevalent language modeling approaches such as XLM (Lample and Conneau, 2019) remedy this by predicting the masked tokens based on bidirectional contexts (Devlin et al., 2019), and also benefit from larger model capacity (Vaswani et al., 2017). However, the cross-lingual information captured by these language modeling approaches comes solely from the shared BPE vocabulary (Sennrich et al., 2016), resulting in the absence of the more fine-grained explicit supervision induced by embedding alignment.
In light of the pros and cons of the above two paradigms, we propose a multi-granularity contrasting (MGC) framework for cross-lingual pre-training. In addition to modeling context bidirectionality with the widely used masked language modeling (MLM) (Devlin et al., 2019), our approach draws upon contrastive learning (Gutmann and Hyvärinen, 2010) to introduce more fine-grained cross-lingual alignment information. The core idea is to enhance the consistency between representations of semantic equivalents (e.g., aligned word pairs such as "cat" in English and "chat" in French). To this end, we propose two effective contrasting strategies: hard contrasting, which constructs pseudo-parallel bilingual word pairs via an external word aligner (Dyer et al., 2013), and soft contrasting, which employs multi-head attention (Vaswani et al., 2017) to provide a posterior approximation for the representations of the desired semantic equivalents. Considering the inherent multi-granularity of natural language expressions, we build the proposed contrasting framework upon semantic units of various granularities (including word, span, and sentence) to further enrich cross-lingual information and enhance the model's capability of encoding multi-granularity representations.
We conduct experiments on a variety of downstream scenarios, including multiple machine translation and cross-lingual language understanding tasks. Comprehensive experimental results demonstrate that our proposed approach achieves significant performance gains over baselines. To be more specific, our MGC raises the average accuracy of our implemented XLM-R from 74.4 to 76.0 on XNLI under the cross-lingual transfer setting, and also surpasses various baselines on representative translation tasks such as WMT14 EN-DE and EN-FR.

Methodology
In order to introduce more fine-grained and explicit cross-lingual supervision, we propose a multi-granularity contrasting (MGC) framework to learn language-universal representations. We first elaborate on the proposed approach based on word-level contrasting, and then extend it to the span and sentence levels to further enrich cross-lingual information.

Overview
We denote a pair of parallel bilingual instances as (x, y), where x = (x_1, · · · , x_m) and y = (y_1, · · · , y_n) refer to the source and target sentence, respectively. The transformer (Vaswani et al., 2017) then encodes x to obtain its hidden representations h_x = (h_{x_1}, · · · , h_{x_m}). The hidden representations h_y = (h_{y_1}, · · · , h_{y_n}) of y can be obtained in the same way. In order to introduce more fine-grained and explicit cross-lingual supervision similar to embedding alignment, we expect semantic-equivalent units (e.g., "cat" in English and "chat" in French) from different languages to exhibit similar vector representations. Meanwhile, the representations of units with different semantics (e.g., "cat" in English and "car" in English or "voiture" in French) should be distinguished from each other to capture their discriminative, specific information.
Motivated by this, we employ contrastive learning (Gutmann and Hyvärinen, 2010) to model such training objectives. Without loss of generality, we elaborate on our proposed approach with the units in the source language as anchors. Formally, we use u to denote the representation of one unit (e.g., "cat" in English) in x. The representation of its corresponding semantic equivalent (e.g., "chat" in French) in y is denoted as v^+. The set of negative representations exhibiting different semantics from u is denoted as v^- = {v^-_1, · · · , v^-_k}, where k is the number of negative representations. Then, the contrastive loss for the representation tuple (u, v^+, v^-) can be defined as:

L(u, v^+, v^-) = -log[ exp(u · v^+ / τ) / Z ],  Z = exp(u · v^+ / τ) + Σ_{i=1}^{k} exp(u · v^-_i / τ),   (1)

where Z is the normalization factor and τ is the temperature controlling the concentration level of the sample distribution. Eq. (1) corresponds to the negative log-likelihood loss of a softmax-based classifier that measures semantic similarity by the dot product. The classifier treats each unit as a distinct class, and aims at classifying u into the class of its semantic equivalent v^+ and vice versa. By maximizing the consistency between the representations of semantic equivalents with such a training objective, the pre-trained model is encouraged to absorb more fine-grained explicit alignment supervision, thereby enhancing its capability of learning language-invariant representations. Meanwhile, the representations of units exhibiting different semantics are penalized to remain distinguishable from each other, so that the model is equipped with the ability to capture specific features of the source inputs.
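For concreteness, the loss in Eq. (1) can be sketched as follows. This is a minimal NumPy illustration of the InfoNCE-style objective; the function name and array conventions are our own, not part of any released implementation.

```python
import numpy as np

def contrastive_loss(u, v_pos, v_negs, tau=0.1):
    """InfoNCE-style loss for one anchor u, its semantic equivalent v_pos,
    and a (k x d) matrix of negative representations v_negs.
    Similarity is the dot product scaled by the temperature tau."""
    pos = np.exp(np.dot(u, v_pos) / tau)
    neg = np.exp(v_negs @ u / tau).sum()
    z = pos + neg                      # normalization factor Z in Eq. (1)
    return -np.log(pos / z)
```

When the positive is close to the anchor and the negatives are dissimilar, the loss approaches zero; a mismatched positive yields a large loss.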

Word-Level Contrasting
The word-level contrasting strives to integrate the word-alignment information contained in parallel bilingual instance (x, y). However, an intractable challenge is that ideal semantic-equivalent word pairs tend to be unavailable in practice. To remedy this, here we propose two effective solutions: hard contrasting and soft contrasting, detailed as follows.
Hard contrasting The hard contrasting aims at constructing pseudo-parallel bilingual word pairs via an external word aligner. Specifically, for each word x in the source sentence x, its aligned word y in y is defined as:

y = argmax_{y' ∈ y} aligner(y' | x),   (2)

where aligner(·|·) denotes the alignment probability, which can be computed by word aligners such as fast_align (Dyer et al., 2013). Considering that some words (e.g., "the" in English) have no semantic equivalent, we construct the semantic-equivalent word set N_word(x, y) as the mutually aligned word pairs in (x, y). For each aligned word pair (x, y) ∈ N_word(x, y), the representations (u, v^+, v^-) in Eq. (1) can be computed as:

u = ℓ2(h_x),  v^+ = ℓ2(h_y),  v^- = { ℓ2(h_{y'}) | y' ∈ y\y },   (3)

where ℓ2(·) represents ℓ2-normalization and y\y denotes the words in y other than the word y. Finally, the word-level hard contrasting loss for the source sentence x is formalized as:

L_word(x) = Σ_{(x, y) ∈ N_word(x, y)} L(u, v^+, v^-).   (4)

The loss L_word(y) for the target sentence y can be computed in a similar way by swapping (x, y) to (y, x); due to space limitations, we omit the related details.
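The mutual-alignment filter can be sketched as follows. This assumes the word aligner exposes alignment-probability matrices in both directions; the function name and matrix convention are illustrative, not fast_align's actual interface.

```python
import numpy as np

def mutual_alignments(p_src2tgt, p_tgt2src):
    """Keep only mutually aligned word pairs: (i, j) is retained when j is
    the most probable target for source word i AND i is the most probable
    source for target word j, given the two alignment-probability matrices."""
    fwd = p_src2tgt.argmax(axis=1)   # best target index for each source word
    bwd = p_tgt2src.argmax(axis=1)   # best source index for each target word
    return [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]
```

Words without a mutual partner (e.g., function words with no equivalent) are simply dropped from the contrasting set.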
Soft contrasting Due to the strict requirements on the quality of constructed pseudo-parallel bilingual word pairs, hard contrasting is prone to suffer from potential error propagation induced by external word aligners. In addition, some source words may correspond to multiple target words, which conflicts with the strict one-to-one alignment of hard contrasting. To tackle the above issues, we propose soft contrasting, which learns word alignment implicitly and jointly approximates semantic equivalents via the attention mechanism (Vaswani et al., 2017). Specifically, for each word x in the source sentence x, the aggregated representation MHA(h_x, h_y) can be obtained by performing multi-head attention with h_x serving as the query and h_y serving as the keys/values. Since multi-head attention naturally assigns larger weights to the words in y that are aligned to x, MHA(h_x, h_y) can be regarded as an approximation of the representation of the semantic equivalent of x. Therefore, the representations (u, v^+, v^-) in Eq. (1) can be defined as:

u = ℓ2(h_x),  v^+ = ℓ2(MHA(h_x, h_y)),  v^- = { ℓ2(MHA(h_{x'}, h_y)) | x' ∈ x\x },   (5)

where x\x refers to the remaining words in x except x. Soft contrasting not only alleviates the dependence on external word aligners, but also frees the model from the limitations of one-to-one alignment. Additionally, by maximizing the semantic consistency between h_x and MHA(h_x, h_y), the model is encouraged to learn word alignment in an implicit manner, introducing richer cross-lingual information.
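For intuition, a stripped-down, single-head NumPy sketch of this approximation is shown below; the actual model uses multi-head attention with learned projection matrices, which we omit here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_equivalent(h_x, h_y):
    """Each source representation (row of h_x) attends over all target
    representations (rows of h_y); the attention-weighted sum serves as
    the soft approximation of its semantic equivalent."""
    d = h_x.shape[-1]
    weights = softmax(h_x @ h_y.T / np.sqrt(d))   # (m, n) soft alignment
    return weights @ h_y                          # (m, d) approximations
```

When a source representation is close to one target representation, the attention weights concentrate on it and the output approaches that target vector, mimicking a one-to-one alignment; otherwise the output blends several targets, which is exactly the one-to-many flexibility hard contrasting lacks.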

Span-Level Contrasting
Previous work (Joshi et al., 2019) has demonstrated the superiority of span-level representations over word-level (Devlin et al., 2019) representations due to their strength in language understanding and reasoning. Therefore, we also perform contrasting based on semantic-equivalent spans. Since spans get rid of the limitation that the semantic equivalents of the two languages must share the same number of words, here we focus on the application of hard contrasting. To be specific, given the bilingual instance (x, y), we induce the phrase table via statistical machine translation tools to obtain span-level semantic equivalents N_span(x, y). For each aligned span pair (x̄, ȳ) ∈ N_span(x, y), where x̄ ⊂ x is a span of x and ȳ ⊂ y is a span of y, the representations (u, v^+, v^-) in Eq. (1) can be formulated as:

u = ℓ2(MP(h_x̄)),  v^+ = ℓ2(MP(h_ȳ)),  v^- = { ℓ2(MP(h_ȳ')) | (x̄', ȳ') ∈ N_span(x, y), ȳ' ≠ ȳ },   (6)

where MP(·) represents the mean-pooling layer employed to aggregate the hidden representations of the multiple words in a span, h_x̄ = (h_{x̄_1}, · · · , h_{x̄_l}) are the hidden representations of the span x̄ = (x̄_1, · · · , x̄_l), and similarly for h_ȳ. The span-level contrastive loss of a given bilingual sentence pair (x, y) is defined as the sum of the losses corresponding to all span pairs (x̄, ȳ) in N_span(x, y), whose calculation is similar to Eq. (4).
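The MP(·) operator combined with ℓ2-normalization can be sketched as below; the helper names are hypothetical, and span boundaries are assumed to be token indices into the hidden-state matrix.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def span_representation(hidden, start, end):
    """Mean-pool the hidden states of the tokens in [start, end), then
    l2-normalize, mirroring MP(.) followed by normalization."""
    return l2_normalize(hidden[start:end].mean(axis=0))
```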

Sentence-Level Contrasting
In order to improve the quality of learned sentence embeddings, we also perform sentence-level contrastive learning to obtain global supervision signals aggregating all token representations of the entire source input. Our proposed approach strives to pull the sentence representation of x towards that of its corresponding translation y, and push it away from the sentence representations of other instances. However, the direct application of artificially constructed parallel bilingual sentence pairs tends to result in a significant boundary between positive and negative samples, which may lead to vanishing contrasting signals. To remedy this problem, we make use of back-translation to infuse noise into the original positive samples and obtain more competitive cross-lingual information. To be more specific, we define the sentence-level semantic equivalents N_sent(x, y) as:

N_sent(x, y) = { (x', y') | x' ∈ {x, x̂}, y' ∈ {y, ŷ} },   (7)

where x̂ is the noisy version of x obtained by means of back-translation, and so is ŷ. By sampling from the original x and the back-translated x̂, both sentences from the two languages for contrasting contain a certain amount of noise. This blurs the boundary between the positive and negative representations to some extent, thereby effectively alleviating the vanishing of contrasting signals.
As with span-level contrasting, we adopt the mean-pooling layer to aggregate all token representations of a given sentence into its corresponding sentence representation. For each sentence pair (x, y) ∈ N_sent(x, y), we define the representations (u, v^+, v^-) in Eq. (1) for sentence-level contrasting as:

u = ℓ2(MP(h_x)),  v^+ = ℓ2(MP(h_y)),  v^- = { ℓ2(MP(h_z)) | z ≠ y },   (8)

where z ≠ y means that the negative representations used for contrasting are derived from the other instances in the same mini-batch.
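With in-batch negatives, the sentence-level loss over a mini-batch can be sketched as follows. This is a simplified NumPy version that assumes the sentence representations have already been mean-pooled and ℓ2-normalized; the function name is ours.

```python
import numpy as np

def sentence_contrastive_loss(src_reps, tgt_reps, tau=0.1):
    """In-batch sentence contrast: for the i-th source sentence the i-th
    target is the positive, and the other targets in the mini-batch serve
    as negatives. Inputs are (batch x d) matrices of pooled representations."""
    logits = src_reps @ tgt_reps.T / tau              # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()             # positives on the diagonal
```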
Following XLM (Lample and Conneau, 2019) and XLM-R, we also adopt masked language modeling (MLM) as one of the pre-training tasks to learn from bidirectional contexts. The MLM task aims at predicting the masked words based on the corrupted input. We concatenate a parallel bilingual sentence pair into a single sequence and randomly select 15% of the tokens as candidates for corruption. Of these selected tokens, 80% are replaced with the special token [MASK], 10% are kept unchanged, and the remaining 10% are replaced by randomly selected vocabulary tokens. The final training objective is defined as the sum of the above-mentioned contrastive losses as well as the cross-entropy of MLM.
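The 80/10/10 corruption procedure can be sketched as below. This is a hedged illustration over string tokens; the actual implementation operates on subword ids from the sentence-piece vocabulary.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style corruption: sample mask_rate of the positions as
    prediction targets; of those, 80% become [MASK], 10% are kept
    unchanged, and 10% are replaced by a random vocabulary token.
    Returns the corrupted sequence and the target positions."""
    rng = rng or random.Random(0)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                pass                      # keep the original token
            else:
                out[i] = rng.choice(vocab)
    return out, targets
```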

Experiments
We conduct experiments on a variety of downstream tasks, which can be divided into two categories: machine translation and cross-lingual language understanding tasks.

Pre-Training
Datasets We pre-train our model on large-scale datasets involving the 15 languages of XNLI (Conneau et al., 2018): English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu. Following prior work, we reconstruct the CommonCrawl corpus to obtain monolingual training datasets, while the bilingual data is obtained from the OPUS website. We also up/down-sample all pre-training data with a smoothing parameter. The sentence-piece model (SPM) (Kudo and Richardson, 2018) is employed to tokenize all training data.
Model architecture We implement the proposed approach based on the Transformer (Vaswani et al., 2017) encoder with 12 identical layers, each of which consists of a multi-head attention module and a feed-forward network. The model dimension and the number of heads are set to 768 and 12, respectively, with the inner size of the feed-forward network being 3072. We choose GeLU (Hendrycks and Gimpel, 2016) as our activation function, and use a publicly released sentence-piece vocabulary of size 250K.
Training parameters We apply the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 5 × 10^-4 and a linear decay schedule to pre-train our models. We employ dropout with a probability of 0.1 for both the hidden states and the attention distribution. The temperature τ in Eq. (1) is set to 0.1 and the coefficients of all contrastive losses are set to 1. We take advantage of gradient accumulation to simulate a batch size of 512. Our model is initialized with a publicly released pre-trained checkpoint and then pre-trained on 8 × 32GB NVIDIA V100 GPUs with mixed-precision floating-point arithmetic. Overall, it took about 3 weeks to converge.

Machine Translation
Datasets We conduct experiments on three widely-used machine translation datasets of various training data sizes: IWSLT14 De-En (160K), WMT14 En-De (4.5M), and WMT14 En-Fr (36M). The sentence-piece tokenization with the same vocabulary as pre-training is used to tokenize all translation samples. The BLEU score computed by the multi-bleu.perl script is used as the evaluation metric for all translation tasks.
Model architecture We use the pre-trained model to initialize a 12-layer encoder. The decoder is implemented as a standard 6-layer Transformer (Vaswani et al., 2017) decoder, each layer consisting of a multi-head self-attention module, a multi-head cross-attention module, and a feed-forward network. The entire decoder is initialized randomly and jointly trained with the pre-trained encoder. Other model hyper-parameters, including the hidden size, the number of heads, the inner size of the feed-forward network, and the choice of activation function, are identical to those of the encoder.
Training parameters We also adopt mixed-precision floating-point arithmetic to train our models on the machine translation tasks. The experiments are conducted on 8 × 32GB NVIDIA V100 GPUs. We use an Adam optimizer with β1 = 0.9 and β2 = 0.98 to optimize our model during training. The learning rate is warmed up linearly to 1 × 10^-4 in the first 4000 updates and then decays at a rate proportional to the inverse square root of the update number. We average the last 10 checkpoints as the final model and perform beam search with a beam size of 5 during inference. The dropout probability is set to 0.1 to avoid over-fitting, and the length penalty is set to 1.0.
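The warm-up plus inverse-square-root schedule described above can be sketched as a simple step-to-rate function (the function name is ours):

```python
def lr_at(step, peak_lr=1e-4, warmup=4000):
    """Linear warm-up to peak_lr over the first `warmup` updates, then
    decay proportional to the inverse square root of the update number."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5
```

For example, the rate reaches 5 × 10^-5 halfway through warm-up, peaks at 1 × 10^-4 at update 4000, and falls back to 5 × 10^-5 by update 16000.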

Cross-Lingual Language Understanding
Dataset In order to verify the effectiveness of our approach on cross-lingual language understanding, we conduct evaluation on the XNLI dataset (Conneau et al., 2018). It is an extension of the English natural language inference dataset MultiNLI, where the development and test sets come in 15 different languages. The training set contains ∼392K English samples and the test set for each language contains 5,010 samples.

Training parameters During fine-tuning, the base transformer model is optimized along with the extra linear classifier using Adam with β1 = 0.9, β2 = 0.999, and a learning rate of 2.5 × 10^-5. The dropout rate is set to 0.1. We fine-tune our model on 4 NVIDIA GeForce RTX 2080Ti GPUs with a batch size of 8 sequences per GPU.

Results and Analysis
This section presents the detailed experiment results of different systems. We perform the evaluation on a comprehensive suite of benchmark tasks, covering cross-lingual classification and machine translation.

Cross-Lingual Classification
Following Lample and Conneau (2019), we perform evaluation on the cross-lingual natural language inference (XNLI) benchmark, where the model needs to determine the relation (entailment, contradiction, or neutral) between the given premise and hypothesis sentences. We compare different systems under two settings. (1) Cross-Lingual Transfer: we fine-tune the model on the English training set and then directly evaluate on the test sets of the 15 languages.

Table 1: The performance of different systems on the XNLI task. "#M=N" indicates that each language is assigned a separate model based on the performance of the respective dev set, while "#M=1" means only one model is used for all languages. Results with "*" and "†" are taken from prior work and Chi et al. (2020), respectively. "(reimpl)" means our own implementation using the same training strategy. The best performance is bolded.
(2) Translate-Train-All: we fine-tune the model on the combined data consisting of the English training data and pseudo data translated from English into the other languages. As indicated by the results in Table 1, our method maintains a consistent improvement over our implemented XLM-R under both settings, raising the average accuracy from 74.4% to 76.0% for cross-lingual transfer. Similar conclusions can be drawn from the translate-train-all setting, where our approach boosts XLM-R by an increment of 1.8 average accuracy and also outperforms all other baselines. By means of multi-granularity contrasting, our approach succeeds in introducing more fine-grained and explicit alignment supervision, which enhances the capability of the model to learn language-universal representations.
In addition, we note that there is no definite conclusion about the superiority of either of the two contrasting strategies for the word-level contrastive loss. Hard contrasting attempts to integrate explicit cross-lingual supervision from external word aligners, while soft contrasting aims at enabling the pre-trained model to learn word alignment implicitly via semantic attention. Both strategies contribute to the introduction of more fine-grained and explicit cross-lingual information, thereby lifting the performance of the pre-trained model in downstream scenarios.

Table 2 presents the comparison between our approach and several representative systems on machine translation. The results once again confirm that large-scale pre-training can effectively accomplish model transfer and advance the performance of machine translation, as all pre-trained models outperform the non-pre-trained Transformer. In addition, we observe significant performance gains for our approach compared to the baselines. For instance, it achieves an improvement of 1.6 BLEU over the base architecture XLM (Lample and Conneau, 2019) on the IWSLT14 DE-EN task. It also surpasses other competitive baselines such as mBERT (Devlin et al., 2019) and MASS (Song et al., 2019) on all three translation benchmarks by a wide margin. The results effectively demonstrate the ability of our approach to learn better representations for semantic equivalents across languages, as well as its versatility, since it applies to both language understanding and generation tasks.

Ablation Study
We conduct an ablation study on several major components of our approach to explore their influence on cross-lingual pre-training, including the multi-granularity contrastive losses and sentencelevel semantic equivalent augmentation with backtranslation.

Effect of multi-granularity contrastive losses
To understand how much different levels of contrasting account for the overall performance improvement, we train the same model with different contrastive losses. First, as shown in Table 3, all three levels of contrasting contribute to the superiority of our model. This demonstrates that the incorporation of contrastive learning can truly introduce training signals that are beneficial for cross-lingual pre-training at multiple granularities. Among them, sentence-level contrasting has the largest impact, the removal of which results in a drop of 1.3 BLEU on WMT14 EN-DE. The reason behind this phenomenon may be that this loss makes up for the relative lack of explicit sentence-level training signals in XLM pre-training.
Effect of sentence-level semantic equivalent augmentation To investigate whether augmenting the semantic equivalents by means of back-translation improves sentence-level contrasting, we compare our model (BTSET SENT-CONTRAST) against a variant where only the original source and target sentences are used to compute the sentence-level contrastive loss (BIPAIR SENT-CONTRAST). The results are presented in Table 4. As can be seen, back-translation leads to an improvement of 0.7 BLEU on WMT14 DE-EN, illustrating its efficacy in alleviating vanishing contrasting signals and boosting cross-lingual pre-training.

Related Work
The existing efforts on cross-lingual pre-training mainly consist of two typical paradigms: traditional embedding alignment and the recently prevalent language modeling.

Embedding Alignment
Early endeavors regarding cross-lingual pre-training mainly focus on embedding alignment, which aims at learning similar vector representations for semantic-equivalent units. Representative approaches can be categorized into four research lines: regression models, hinge loss, canonical analysis, and linear projection. Based on the observation that monolingual word embeddings share similar geometric properties across languages, simple but effective linear projection approaches have become mainstream, which aim at aligning two disjoint monolingual vector spaces through a linear transformation. For instance, Mikolov et al. (2013a) propose to learn the desired linear projection by minimizing the mean squared error between the projected source embeddings and the target embeddings. Xing et al. (2015) refine this method by imposing an orthogonality constraint and maximizing the cosine similarity. Artetxe et al. (2017) explore bilingual induction in extremely low-resource scenarios via an effective self-learning framework. Furthermore, unsupervised embedding alignment (Yang et al., 2019) completely eliminates the dependence on parallel data, aiming to learn cross-lingual word embeddings in the absence of any aligned word pairs. The related approaches can be summarized as GAN-based distribution matching (Zhang et al., 2017), non-adversarial distribution matching, heuristic alignment, generalized Procrustes analysis, and so on. However, traditional embedding alignment can only learn non-contextualized word representations, which suffer from the intractable polysemy problem. Compared with language modeling, which captures bidirectional contexts and employs large-capacity transformers, it tends to result in suboptimal performance in downstream tasks.

Language Modeling
This research line focuses on masked language modeling (MLM), which aims to predict the masked words based on the corrupted input. In terms of model architecture, one paradigm attempts to capture language-universal representations via a single encoder. For instance, Multilingual BERT (Devlin et al., 2019) applies byte-pair encoding (BPE) to merge tokens from 104 different languages into a shared vocabulary and performs MLM on monolingual sentences. XLM (Lample and Conneau, 2019) extends it to translation language modeling (TLM), which strives to predict the masked words by attending to both the source and target sentences. With the mutual attention of bilingual contexts, the model is expected to align representations from the two languages in an implicit manner. Unicoder introduces several more pre-training tasks such as cross-lingual word recovery, illustrating that these tasks can boost model performance by learning inter-lingual mappings from more perspectives. ALM (Yang et al.) constructs large-scale instances for masked language modeling by alternately selecting words from the source and target languages. Ren et al. (2019) task the model with predicting the translation of masked n-grams, with a phrase table inferred from monolingual corpora in advance as ground truth. Follow-up work pre-trains models on more than two terabytes of filtered Common-Crawl data, demonstrating that large-scale datasets can lead to significant performance gains. The other research line draws on the idea of the encoder-decoder framework and aims to mimic autoregressive generation by generating the target texts based on the given source input. For instance, MASS (Song et al., 2019) jointly trains the encoder and decoder by reconstructing the desired sentence fragment based on the remaining part of the sentence, which enhances the capabilities of the model in feature extraction and language modeling.
XNLG (Chi et al., 2019) strives to learn language-universal representations by extending monolingual masked language modeling and denoising autoencoding to cross-lingual settings. mBART (Liu et al., 2020) pre-trains the encoder-decoder by reconstructing the original text from input corrupted with an arbitrary noising function, and can be used directly to initialize text generation models or as a denoising strategy for language understanding. However, both lines of language modeling mentioned above focus on projecting the inputs from different languages into the same semantic space through a shared vocabulary and representation model. Compared with traditional embedding alignment, they lack the introduction of cross-lingual information with more explicit and fine-grained (e.g., word-level) alignment.
Our proposed approach effectively inherits the advantages of both embedding alignment and language modeling, while avoiding their limitations. It not only captures bidirectional contexts with a large-capacity transformer model and the MLM task, but also introduces more fine-grained cross-lingual supervision by applying contrastive learning to semantic units of multiple granularities, thereby obtaining significant performance gains.

Conclusion
This paper presents a multi-granularity contrastive cross-lingual pre-training framework, which combines traditional embedding alignment and the recently prevalent language modeling to learn language-universal prior representations. Different from previous work focusing on masked language modeling to capture bidirectional contexts, the proposed approach introduces more fine-grained and explicit cross-lingual supervision by maximizing the representational consistency of semantic equivalents from different languages. Two effective contrasting strategies are proposed, which can be built upon semantic units of different granularities covering words, spans, and sentences. Comprehensive empirical evidence illustrates that our approach achieves consistent improvements on a variety of downstream tasks, including machine translation and cross-lingual language understanding.