Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

Cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on token-level tasks such as question answering and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves reasonably low error rates on the alignment benchmarks. The code and pretrained parameters are available at github.com/CZWin32768/XLM-Align.


Introduction
Despite the current advances in NLP, most applications and resources are still English-centric, making them hard for non-English users to access. Therefore, it is essential to build cross-lingual transferable models that can learn from training data in high-resource languages and generalize to low-resource languages. Recently, pretrained cross-lingual language models have shown their effectiveness for cross-lingual transfer. By pre-training on monolingual text and parallel sentences, the models provide significant improvements on a wide range of cross-lingual end tasks (Conneau and Lample, 2019; Conneau et al., 2020; Chi et al., 2021b).
Cross-lingual language model pre-training is typically achieved by learning various pretext tasks on monolingual and parallel corpora. By simply learning masked language modeling (MLM; Devlin et al. 2019) on monolingual text of multiple languages, the models surprisingly achieve competitive results on cross-lingual tasks (Wu and Dredze, 2019; K et al., 2020). Besides, several pretext tasks are proposed to utilize parallel corpora to learn better sentence-level cross-lingual representations (Conneau and Lample, 2019; Chi et al., 2021b; Hu et al., 2020a). For example, the translation language modeling (TLM; Conneau and Lample 2019) task performs MLM on concatenated parallel sentences, which implicitly enhances cross-lingual transferability. However, most pretext tasks either learn alignment at the sentence level or only implicitly encourage cross-lingual alignment, leaving explicit fine-grained alignment tasks not fully explored.
In this paper, we introduce a new cross-lingual pre-training task, named denoising word alignment. Rather than relying on external word aligners trained on parallel corpora (Cao et al., 2020; Zhao et al., 2020; Wu and Dredze, 2020), we utilize self-labeled alignments in our task. During pre-training, we alternately self-label word alignments and conduct the denoising word alignment task in an expectation-maximization manner. Specifically, the model first self-labels word alignments for a translation pair. Then we randomly mask tokens in the bitext sentence, which is used as the perturbed input for denoising word alignment. For each masked token, the model learns a pointer network to predict the self-labeled alignments in the other language. We repeat the above two steps to iteratively boost the bitext alignment knowledge for cross-lingual pre-training.
We conduct extensive experiments on a wide range of cross-lingual understanding tasks. Experimental results show that our model outperforms the baseline models on various datasets, particularly on token-level tasks such as question answering and structured prediction. Moreover, our model can also serve as a multilingual word aligner, which achieves reasonably low error rates on the bitext alignment benchmarks.
Our contributions are summarized as follows: • We present a cross-lingual pre-training paradigm that alternately self-labels and predicts word alignments.
• We introduce a pre-training task, denoising word alignment, which predicts word alignments from perturbed translation pairs.
• We propose a word alignment algorithm that formulates the word alignment problem as optimal transport.
• We demonstrate that our explicit alignment objective is effective for cross-lingual transfer.

Related Work

Cross-lingual LM pre-training By learning MLM on monolingual text of multiple languages, multilingual pretrained models achieve competitive results on cross-lingual benchmarks (Hu et al., 2020b). mT5 (Xue et al., 2020) learns a multilingual version of T5 (Raffel et al., 2020) with text-to-text tasks. In addition to monolingual text, several methods utilize parallel corpora to improve cross-lingual transferability. XLM (Conneau and Lample, 2019) presents the translation language modeling (TLM) task that performs MLM on concatenated translation pairs. ALM (Yang et al., 2020) introduces code-switched sequences into cross-lingual LM pre-training. Unicoder (Huang et al., 2019) employs three cross-lingual tasks to learn mappings among languages. From an information-theoretic perspective, InfoXLM (Chi et al., 2021b) proposes a cross-lingual contrastive learning task to align sentence-level representations. Additionally, AMBER (Hu et al., 2020a) introduces an alignment objective that minimizes the distance between the forward and backward attention matrices. More recently, Ernie-M (Ouyang et al., 2020) presents the back-translation masked language modeling task that generates pseudo parallel sentence pairs for learning TLM, which provides better utilization of monolingual corpora. VECO (Luo et al., 2020) pretrains a unified cross-lingual language model for both NLU and NLG. mT6 (Chi et al., 2021a) improves the multilingual text-to-text transformer with translation pairs. Notably, word-aligned BERT models (Cao et al., 2020; Zhao et al., 2020) finetune mBERT with an explicit alignment objective that minimizes the distance between aligned tokens. Wu and Dredze (2020) exploit contrastive learning to improve the explicit alignment objectives. However, Wu and Dredze (2020) also show that these explicit alignment objectives do not improve cross-lingual representations under a more extensive evaluation. Moreover, these models are restricted to stay close to their original pretrained values, which is not applicable for large-scale pre-training.
In contrast, we demonstrate that employing our explicit alignment objective in large-scale pre-training provides consistent improvements over baseline models.

Word alignment The IBM models (Brown et al., 1993) are statistical models of the translation process that can extract word alignments between sentence pairs. A large number of word alignment models are based on the IBM models (Och and Ney, 2003; Mermer and Saraçlar, 2011; Dyer et al., 2013; Östling and Tiedemann, 2016). Recent studies have shown that word alignments can be extracted from neural machine translation models (Ghader and Monz, 2017; Koehn and Knowles, 2017) or from pretrained cross-lingual LMs (Jalili Sabet et al., 2020; Nagata et al., 2020).

Methods

Figure 1 illustrates an overview of our method for pre-training our cross-lingual LM, which is called XLM-ALIGN. XLM-ALIGN is pretrained in an expectation-maximization manner with two alternating steps: word alignment self-labeling and denoising word alignment. We first formulate word alignment as an optimal transport problem, and self-label word alignments of the input translation pair on-the-fly. Then, we update the model parameters with the denoising word alignment task, where the model uses a pointer network (Vinyals et al., 2015) to predict the aligned tokens from the perturbed translation pair.

Word Alignment Self-Labeling
The goal of word alignment self-labeling is to estimate the word alignments of the input translation pair on-the-fly, given the current XLM-ALIGN model.

Figure 1: An overview of our method. XLM-ALIGN is pretrained in an expectation-maximization manner with two alternating steps. (a) Word alignment self-labeling: we formulate word alignment as an optimal transport problem, and self-label word alignments of the input translation pair on-the-fly. (b) Denoising word alignment: we update the model parameters with the denoising word alignment task, where the model uses a pointer network to predict the aligned tokens from the perturbed translation pair.

Given a source sentence $S = s_1 \dots s_i \dots s_n$ and a target sentence $T = t_1 \dots t_j \dots t_m$, we model the word alignment between $S$ and $T$ as a doubly stochastic matrix $A \in \mathbb{R}_{+}^{n \times m}$ such that the rows and the columns all sum to one, where $A_{ij}$ stands for the probability of the alignment between $s_i$ and $t_j$. The rows and the columns of $A$ represent probability distributions of the forward alignment and the backward alignment, respectively. To measure the similarity between two tokens from $S$ and $T$, we define a metric function $f_{\text{sim}}$ using the cross-lingual representations produced by XLM-ALIGN:

$$ f_{\text{sim}}(s_i, t_j) = \log\big( c + \cos(h_i, h_{n+j}) \big) \quad (1) $$

where $c$ is a constant to avoid negative values in the log function, and $h_i$ is the hidden vector of the $i$-th token obtained by encoding the concatenated sequence of $S$ and $T$ with XLM-ALIGN. Empirically, the metric function produces a high similarity score if the two input tokens are semantically similar. The word alignment problem is formulated as finding the $A$ that maximizes the sentence similarity between $S$ and $T$:

$$ \max_{A} \; \sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij} \, f_{\text{sim}}(s_i, t_j) \quad (2) $$

We can find that Eq. (2) is identical to the regularized optimal transport problem (Peyré et al., 2019) if we add an entropic regularization term $H(A)$ to the objective:

$$ \max_{A} \; \sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij} \, f_{\text{sim}}(s_i, t_j) + \epsilon H(A) \quad (3) $$

Eq. (3) has a unique solution $A^{*}$ such that

$$ A^{*} = \operatorname{diag}(u) \, K \, \operatorname{diag}(v), \qquad K_{ij} = e^{f_{\text{sim}}(s_i, t_j)/\epsilon} \quad (4) $$

According to Sinkhorn's algorithm (Peyré et al., 2019), the variables $u$ and $v$ can be calculated by the following iterations:

$$ u^{(t+1)} = \mathbf{1}_n \oslash \big( K v^{(t)} \big) \quad (5) $$

$$ v^{(t+1)} = \mathbf{1}_m \oslash \big( K^{\top} u^{(t+1)} \big) \quad (6) $$

where $\oslash$ denotes element-wise division and $v$ is initialized by $v^{(0)} = \mathbf{1}_m$. With the solved stochastic matrix $A^{*}$, we can produce the forward word alignments $\overrightarrow{A}$ by applying argmax over rows:

$$ \overrightarrow{A} = \big\{ (i, \arg\max_{j} A^{*}_{ij}) \;\big|\; 1 \le i \le n \big\} \quad (7) $$

Similarly, the backward word alignments $\overleftarrow{A}$ can be computed by applying argmax over columns. To obtain high-precision alignment labels, we adopt an iterative alignment filtering operation. We initialize the alignment labels $\mathcal{A}$ as $\emptyset$. In each iteration, we follow the procedure of Itermax (Jalili Sabet et al., 2020) that first computes $\overrightarrow{A}$ and $\overleftarrow{A}$ by Eq. (7). Then, the alignment labels are updated with the intersection of the forward and backward alignments:

$$ \mathcal{A} \leftarrow \mathcal{A} \cup \big( \overrightarrow{A} \cap \overleftarrow{A} \big) \quad (8) $$

Finally, $A^{*}$ is updated by discounting the entries whose row or column is already aligned:

$$ A^{*}_{ij} \leftarrow \alpha A^{*}_{ij}, \quad \text{if } i \text{ or } j \text{ is aligned in } \mathcal{A} \quad (9) $$

where $\alpha$ is a discount factor. After several iterations, we obtain the final self-labeled word alignments $\mathcal{A}$.
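The self-labeling step can be sketched in NumPy as follows. This is an illustrative simplification: the values of eps, the iteration counts, and the discount schedule are assumptions, and the real procedure runs on XLM-ALIGN hidden vectors rather than a toy similarity matrix.

```python
import numpy as np

def sinkhorn(F, eps=0.1, n_iters=100):
    """Solve the entropy-regularized OT problem for a similarity matrix F (n x m)."""
    n, m = F.shape
    K = np.exp(F / eps)             # kernel matrix K_ij = exp(f_sim / eps)
    v = np.ones(m)                  # v initialized to all-ones
    for _ in range(n_iters):
        u = np.ones(n) / (K @ v)    # scale rows toward sum 1
        v = np.ones(m) / (K.T @ u)  # scale columns toward sum 1
    return np.diag(u) @ K @ np.diag(v)

def argmax_alignments(A):
    # Forward: argmax over rows; backward: argmax over columns.
    fwd = {(i, int(np.argmax(A[i]))) for i in range(A.shape[0])}
    bwd = {(int(np.argmax(A[:, j])), j) for j in range(A.shape[1])}
    return fwd, bwd

def self_label(F, n_rounds=2, alpha=0.9, eps=0.1):
    """Itermax-style filtering: keep the forward/backward intersection,
    then discount already-aligned rows and columns."""
    A = sinkhorn(F, eps)
    labels = set()
    for _ in range(n_rounds):
        fwd, bwd = argmax_alignments(A)
        new = (fwd & bwd) - labels  # high-precision intersection
        labels |= new
        for i, j in new:            # discount aligned rows/columns
            A[i, :] *= alpha
            A[:, j] *= alpha
    return labels
```

Applied to a diagonal-dominant similarity matrix, the procedure recovers the diagonal alignment, and the Sinkhorn output is (approximately) doubly stochastic.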

Denoising Word Alignment
After self-labeling word alignments, we update the model parameters with the denoising word alignment (DWA) task. The goal of DWA is to predict the word alignments from a perturbed version of the input translation pair. Consider the perturbed translation pair $(S^{*}, T^{*})$ constructed by randomly replacing tokens with masks. We first encode the pair into hidden vectors $h^{*}$ with the XLM-ALIGN encoder:

$$ h^{*} = \operatorname{encoder}\big( [S^{*}, T^{*}] \big) $$

where $[S^{*}, T^{*}]$ is the concatenated sequence of $S^{*}$ and $T^{*}$ with the length of $n + m$. Then, we build a pointer network upon the XLM-ALIGN encoder that predicts the word alignments. Specifically, for the $i$-th source token, we use $h^{*}_i$ as the query vector and $h^{*}_{n+1}, \dots, h^{*}_{n+m}$ as the key vectors. Given the query and key vectors, the forward alignment probability $a_i$ is computed by the scaled dot-product attention (Vaswani et al., 2017):

$$ a_i = \operatorname{softmax}\left( \frac{[h^{*}_{n+1}, \dots, h^{*}_{n+m}]^{\top} h^{*}_i}{\sqrt{d_h}} \right) $$

where $d_h$ is the dimension of the hidden vectors. Similarly, the backward alignment probability can be computed by the above equation if we use the target tokens as query vectors and $h^{*}_1, \dots, h^{*}_n$ as key vectors. Notice that we only consider positions that are both self-labeled and masked as queries. Formally, the set of query positions used in the pointer network is

$$ Q = \{ \, i \mid i \in \mathcal{M}, \; i \text{ is aligned in } \mathcal{A} \, \} $$

where $\mathcal{M}$ is the set of masked positions. The training objective is to minimize the cross-entropy between the alignment probabilities and the self-labeled word alignments:

$$ \mathcal{L}_{\text{DWA}} = \sum_{i \in Q} \operatorname{CE}\big( a_i, \mathcal{A}(i) \big) $$

where $\operatorname{CE}(\cdot, \cdot)$ stands for the cross-entropy loss, and $\mathcal{A}(i)$ is the self-labeled aligned position of the $i$-th token.
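A minimal NumPy sketch of the forward DWA objective follows. It is illustrative only: the function name, the flat loop over queries, and the plain-array interface are assumptions, and the real pointer network operates on transformer hidden states in batches.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dwa_loss(h, n, m, labels, masked):
    """Forward denoising word alignment loss (sketch).

    h: (n + m, d) hidden states of the masked concatenated pair [S*, T*].
    labels: self-labeled alignments, source position -> target position.
    masked: set of masked source positions.
    """
    d = h.shape[-1]
    src, tgt = h[:n], h[n:n + m]
    # Queries: positions that are both masked and self-labeled.
    queries = [i for i in masked if i in labels]
    total = 0.0
    for i in queries:
        # Scaled dot-product attention over the m target tokens (the keys).
        a = softmax(src[i] @ tgt.T / np.sqrt(d))
        # Cross-entropy against the self-labeled aligned position.
        total += -np.log(a[labels[i]] + 1e-12)
    return total / max(len(queries), 1)
```

When a masked source vector matches its labeled target vector, the loss is near zero; mismatched labels give a large loss.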

Pre-training XLM-ALIGN
We illustrate the pre-training procedure of XLM-ALIGN in Algorithm 1. In addition to DWA, we also include MLM and TLM for pre-training XLM-ALIGN, which implicitly encourage cross-lingual alignment. The overall loss function is defined as:

$$ \mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{TLM}} + \mathcal{L}_{\text{DWA}} $$

In each iteration, we first sample monolingual text $X$ and parallel text $(S, T)$. Then, we self-label word alignments and update the model parameters by learning the pretext tasks. Notice that the model parameters are initialized by a cold-start pre-training to avoid producing low-quality alignment labels. The cold-start pre-training can be accomplished by using a pretrained LM as the model initialization.
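The alternating procedure can be illustrated with a toy example that keeps only the control flow. The nearest-neighbour E-step and the vector-nudging M-step below are stand-ins for the real self-labeling and gradient update; they are not the actual model.

```python
import numpy as np

def em_align(S, T, steps=20, lr=0.1):
    """Toy illustration of the alternating (EM-like) procedure.

    E-step: self-label alignments by nearest neighbour under the
    current representations.
    M-step: nudge each source vector toward its labeled target,
    standing in for the gradient update on the DWA loss.
    """
    S = S.copy()
    labels = {}
    for _ in range(steps):
        # E-step: self-label word alignments with the current model.
        labels = {i: int(np.argmax(S[i] @ T.T)) for i in range(len(S))}
        # M-step: update parameters to better predict the labels.
        for i, j in labels.items():
            S[i] += lr * (T[j] - S[i])
    return S, labels
```

Starting from slightly perturbed source vectors, the loop converges to the correct one-to-one alignment while pulling the representations together, mirroring how self-labeling and DWA reinforce each other.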

Pre-training
Following previous cross-lingual pretrained models (Conneau and

XTREME Benchmark
XTREME is a multilingual benchmark for evaluating cross-lingual generalization. We evaluate our model on 7 cross-lingual downstream tasks included in XTREME, which can be grouped into three categories: structured prediction, question answering, and sentence classification. Besides, we only use one model for evaluation on all target languages, rather than selecting different models for each language. The detailed fine-tuning hyperparameters can be found in the supplementary document.

Results
In Table 1, we present the evaluation results on the XTREME structured prediction, question answering, and sentence classification tasks. It can be observed that our XLM-ALIGN obtains the best average score over all the baseline models, improving the previous score from 66.4 to 68.9. This demonstrates that our model learns more transferable representations for cross-lingual tasks, which is beneficial for building more accessible multilingual NLP applications. It is worth mentioning that our method brings noticeable improvements on the question answering and structured prediction tasks. Compared with XLM-R base, XLM-ALIGN provides 6.7% and 1.9% F1 improvements on TyDiQA and NER, respectively. The improvements show that the pretrained XLM-ALIGN benefits from the explicit word alignment objective, particularly on the structured prediction and question answering tasks that require token-level cross-lingual transfer. In terms of sentence classification tasks, XLM-ALIGN also consistently outperforms XLM-R base.

Table 2: Evaluation results for word alignment on four English-centric language pairs. We report the alignment error rate scores (lower is better). For both SimAlign (Jalili Sabet et al., 2020) and our optimal-transport alignment method, we use the hidden vectors from the 8-th layer produced by XLM-R base or XLM-ALIGN. "(reimplementation)" is our reimplementation of SimAlign-Itermax.

Word Alignment

We evaluate word alignment on four English-centric language pairs, using the alignment error rate (AER; Och and Ney, 2003) as the evaluation metric.
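For reference, AER can be computed from the predicted and gold alignment sets as follows. This is the standard formulation; when a benchmark provides no sure/possible distinction, the possible set defaults to the sure set.

```python
def aer(pred, sure, possible=None):
    """Alignment error rate (Och and Ney, 2003). Lower is better.

    pred, sure, possible: sets of (source, target) index pairs;
    `possible` is a superset of `sure` and defaults to `sure`.
    """
    possible = sure if possible is None else possible
    hits_sure = len(pred & sure)
    hits_poss = len(pred & possible)
    return 1.0 - (hits_sure + hits_poss) / (len(pred) + len(sure))
```

A perfect prediction gives an AER of 0, and a fully wrong prediction gives 1.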
Results We first explore whether our word alignment self-labeling method is effective for generating high-quality alignment labels. Thus, we compare our method with (1) fast_align (Dyer et al., 2013), a widely-used implementation of IBM Model 2 (Och and Ney, 2003); and (2) SimAlign (Jalili Sabet et al., 2020), the state-of-the-art unsupervised word alignment method. For a fair comparison, we use the same pretrained LM and hidden layer as in SimAlign to produce sentence representations. Specifically, we take the hidden vectors from the 8-th layer of XLM-R base or XLM-ALIGN, and obtain the alignments following the procedure described in Section 3.1. Since the produced alignments are at the subword level, we convert them to word level with the following rule: if two subwords are aligned, the words they belong to are also aligned. As shown in Table 2, we report the AER scores on the four language pairs. It can be observed that our optimal-transport method outperforms fast_align and SimAlign, demonstrating that our method can produce high-quality alignment labels, which is helpful for the DWA task. Moreover, our method consistently outperforms SimAlign when using hidden vectors from either XLM-R base or XLM-ALIGN.
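The subword-to-word conversion rule can be sketched as follows. The helper and its interface are hypothetical; the lists mapping each subword index to its word index are assumed to come from the tokenizer.

```python
def subword_to_word(sub_align, src_word_of, tgt_word_of):
    """Convert subword-level alignments to word-level alignments.

    Rule: if two subwords are aligned, the words they belong to are
    also aligned. src_word_of / tgt_word_of map each subword index
    to its word index.
    """
    return {(src_word_of[i], tgt_word_of[j]) for i, j in sub_align}
```

For example, if a source word is split into two subwords that both align into the same target word, the two subword links collapse into a single word-level link.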
Then, we compare our XLM-ALIGN with XLM-R base on the word alignment task. Empirically, a lower AER indicates that the model learns better cross-lingual representations. From Table 2, XLM-ALIGN obtains the best AER results on all four language pairs, reducing the average AER from 22.64 to 21.05. Besides, under both SimAlign and our optimal-transport method, XLM-ALIGN provides a consistent reduction of AER, demonstrating the effectiveness of our method for learning fine-grained cross-lingual representations.
We also compare XLM-ALIGN with XLM-R base using the hidden vectors from the 3-rd layer to the 12-th layer. We illustrate the averaged AER scores in Figure 2. Notice that the results of the first two layers are not presented in the figure because of their high AER. It can be observed that XLM-ALIGN consistently improves the results over XLM-R base across these layers. Moreover, the AER shows a parabolic trend across the layers of XLM-R base, which is consistent with the results of Jalili Sabet et al. (2020). In contrast to XLM-R base, XLM-ALIGN alleviates this trend and greatly reduces AER in the last few layers. We believe this property of XLM-ALIGN brings better cross-lingual transferability on the end tasks.

Analysis
In this section, we conduct comprehensive ablation studies for a better understanding of our XLM-ALIGN. To reduce the computational cost, we decrease the batch size to 256 and pretrain the models for 50K steps in the following experiments.

Ablation Studies
We perform ablation studies to understand the components of XLM-ALIGN by removing the denoising word alignment loss (−DWA), the TLM loss (−TLM), or both (XLM-R*), the last of which is identical to continue-training XLM-R base with MLM. We evaluate the models on XNLI, POS, NER, and MLQA, and present the results in Table 3. Comparing −TLM with −DWA, we find that DWA is more effective for POS and MLQA, while TLM performs better on XNLI and NER. Comparing −TLM with XLM-R*, it shows that directly learning DWA alone slightly harms the performance. However, jointly learning DWA with TLM provides remarkable improvements over −DWA, especially on the question answering and structured prediction tasks that require token-level cross-lingual transfer. This indicates that TLM potentially improves the quality of the self-labeled word alignments, making DWA more effective for cross-lingual transfer.

Word Alignment Self-Labeling Layer
It has been shown that the word alignment performance has a parabolic trend across the layers of mBERT and XLM-R (Jalili Sabet et al., 2020), indicating that the middle layers produce higher-quality word alignments than the bottom and top layers. To explore which layer produces better alignment labels for pre-training, we pretrain three variants of XLM-ALIGN that use the hidden vectors from the 8-th, 10-th, and 12-th layers, respectively, for word alignment self-labeling. We present the evaluation results in Table 4. Surprisingly, although Layer-8 produces higher-quality alignment labels at the beginning of pre-training, using the alignment labels from the 12-th layer learns a more transferable XLM-ALIGN model for cross-lingual end tasks.

Denoising Word Alignment Layer
Beyond the self-labeling layer, we also investigate which layer is better for learning the denoising word alignment task. Recent studies have shown that it is beneficial to learn sentence-level cross-lingual alignment at a middle layer (Chi et al., 2021b). Therefore, we pretrain XLM-ALIGN models using three different layers for DWA, that is, using the hidden vectors of these layers as the input of the pointer network. We compare the evaluation results of the three models in Table 5. It can be found that learning DWA at Layer-8 improves XNLI, while learning DWA at higher layers produces better performance on the other three tasks. This suggests that, in contrast to sentence-level pretext tasks that prefer middle layers, the DWA task should be applied at the top layers.

Effects of Alignment Filtering
Although our self-labeling method produces high-quality alignment labels, the alignment filtering operation can leave some tokens unaligned, which reduces example efficiency. Thus, we explore whether alignment filtering is beneficial for pre-training XLM-ALIGN. To this end, we pretrain an XLM-ALIGN model without alignment filtering. Specifically, we use the union of the forward and backward alignments as the self-labeled alignments, so that all tokens are aligned at least once. The forward and backward alignments are obtained by applying the argmax function over the rows and columns of A*, respectively. Empirically, the alignment filtering operation generates high-precision yet fewer labels, while removing the filtering yields more labels but introduces low-confidence ones. In Table 6, we compare the results of the models with and without alignment filtering. It can be observed that the alignment filtering operation improves the performance on the end tasks. This demonstrates that it is necessary to use high-precision labels for learning the denoising word alignment task; on the contrary, using noisy alignment labels in pre-training harms the performance on the end tasks.
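The difference between the filtered (intersection) and unfiltered (union) label sets can be seen on a toy alignment matrix (illustrative values only):

```python
import numpy as np

def forward_backward(A):
    # Forward: argmax over rows; backward: argmax over columns.
    fwd = {(i, int(np.argmax(A[i]))) for i in range(A.shape[0])}
    bwd = {(int(np.argmax(A[:, j])), j) for j in range(A.shape[1])}
    return fwd, bwd

A = np.array([[0.8, 0.2],
              [0.6, 0.4]])
fwd, bwd = forward_backward(A)
union = fwd | bwd         # every token aligned at least once, lower precision
intersection = fwd & bwd  # high-precision, but token 1 is left unaligned
```

Here the union keeps the ambiguous links for the second source token, while the intersection keeps only the confident (0, 0) link and leaves the rest unlabeled, matching the precision/coverage trade-off described above.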

Effects of DWA Query Positions
In the denoising word alignment task, we always use the hidden vectors at the masked positions as the query vectors in the pointer network. To explore the impact of the DWA query positions, we compare three different choices of query positions in Table 7: (1) masked: only using the masked tokens as queries; (2) unmasked: randomly using 15% of the unmasked tokens as queries; (3) all-aligned: for each self-labeled aligned pair, randomly using one of the two tokens as a query. We also include a no-query baseline that does not use any queries, which is identical to removing DWA. It can be observed that all three query-position choices improve the performance over the no-query baseline. Moreover, using the masked positions as queries achieves better results than the other two choices, demonstrating the effectiveness of masked query positions.

Discussion
In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. By alternately self-labeling and predicting word alignments, our XLM-ALIGN model learns transferable cross-lingual representations. Experimental results show that our method improves cross-lingual transferability on a wide range of tasks, particularly on token-level tasks such as question answering and structured prediction. Despite its effectiveness for learning cross-lingual transferable representations, our method has the limitation of requiring a cold-start pre-training to prevent the model from producing low-quality alignment labels. In our experiments, we also tried to pretrain XLM-ALIGN from scratch, i.e., without cold-start pre-training. However, the DWA task does not work well in this setting due to the low quality of the self-labeled alignments. Thus, we recommend continue-training XLM-ALIGN on the basis of other pretrained cross-lingual language models. For future work, we would like to explore removing this restriction so that the model can learn word alignments from scratch.

Ethical Considerations
Despite the current advances in NLP, most NLP research and applications are English-centric, making it hard for non-English users to access NLP-related services. Our method aims to pretrain cross-lingual language models that transfer supervision signals from high-resource languages to low-resource languages, which makes NLP services and applications more accessible for low-resource-language speakers. Furthermore, our method can build multilingual models that serve different languages at the same time, reducing the computational resources required for building a separate model for each language.

A Pre-Training Data
We use raw sentences from the Wikipedia dump and CCNet as monolingual corpora. The CCNet corpus we use is reconstructed following Conneau et al. (2020) to reproduce the CC-100 corpus. The resulting corpus contains 94 languages. Table 8 and Table 9 report the language codes and data sizes of CCNet and the Wikipedia dump. Notice that several languages share the same ISO language code, e.g., zh represents both Simplified Chinese and Traditional Chinese. Besides, Table 10 shows the statistics of our parallel corpora.

B Hyperparameters for Pre-Training
As shown in Table 11, we present the hyperparameters for pre-training XLM-ALIGN. We use the same vocabulary with XLM-R (Conneau et al., 2020).

C Hyperparameters for Fine-Tuning
In Table 12, we present the hyperparameters for fine-tuning XLM-R base and XLM-ALIGN on the XTREME end tasks. For each task, the hyperparameters are searched on the joint validation set of all languages.

D Detailed Results on XTREME
We present the detailed results of XLM-ALIGN on XTREME in the following tables.

Table 19: Results on PAWS-X cross-lingual paraphrase adversaries.