Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only

Introduction
Recent studies show that multilingual pre-trained language models (mPLMs) significantly improve the performance of cross-lingual natural language processing tasks. Conventional mPLMs (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020; Xue et al., 2021) typically adopt multiple monolingual corpora (mono-mPLMs) to perform masked language modeling during pre-training, obtaining impressive and stable multilingual capabilities, which may intuitively result from semantic alignments but remains underexplored. Another line of research improves multilingual pre-training by incorporating cross-lingual parallel corpora (para-mPLMs) into pre-training (Conneau and Lample, 2019; Cao et al., 2020; Chi et al., 2021a,b,c; Luo et al., 2021; Wei et al., 2021; Ouyang et al., 2021). However, parallel sentences are not always available, especially for low-resource languages (Tran et al., 2020), and collecting such data often entails substantial costs. Therefore, exploring approaches that improve multilingual pre-training without using parallel corpora is important and worthy of study.
To achieve this, we first conduct analyses to investigate the token alignment properties of XLM-R, a strong mono-mPLM that uses only multiple monolingual corpora for pre-training. Our empirical study demonstrates that cross-lingual token alignments occur at the embedding layer (alignment location) with surprisingly high alignment accuracy (alignment degree), but they become weaker at the higher layers (Figure 1 (a)). We also find that the alignments are geometrically aligned rather than absolutely aligned (alignment format), as shown in Figure 1 (b). This phenomenon shows that token embeddings of different languages are separately distributed but geometrically similar.
We further compare the geometric similarities of representations from the bottom layer to the top layer between mono-mPLMs and para-mPLMs. We find that representations become geometrically dissimilar at the higher layers of mono-mPLMs, whereas para-mPLMs alleviate this problem by using parallel sentences and obtain better cross-lingual transfer capability. This shows the necessity of explicit cross-lingual interactions.
Based on the above observations, we propose self-improving methods that encourage cross-lingual interactions using self-induced token alignments. Intuitively, masked tokens can be predicted from semantically equivalent but slightly language-mixed contexts. Therefore, we first utilize self-induced alignments to perform token-level code-switched masked language modeling (TCS-MLM), which requires the model to predict the original masked tokens from semantically equivalent but code-switched surrounding text. Considering that vanilla replacements usually lack diversity, we further propose a novel semantic-level code-switched masked language modeling (SCS-MLM), which replaces context tokens with a weighted combination of multiple semantically similar tokens in other languages. SCS-MLM performs on-the-fly semantic replacements during training, further enhancing the diversity of code-switched examples and the cross-lingual interactions.
We evaluate our methods on various cross-lingual transfer tasks. Specifically, we conduct experiments on natural language understanding tasks, including XNLI (Conneau et al., 2018) and PAWS-X (Hu et al., 2020) for sentence-pair classification, Wikiann (Pan et al., 2017) and UDPOS (Nivre et al., 2018) for structural prediction, MLQA (Lewis et al., 2020) for question answering, and Tatoeba (Artetxe and Schwenk, 2019) for sentence retrieval. We also perform experiments on unsupervised machine translation to evaluate performance on a generation task. Experimental results demonstrate that our methods significantly improve performance compared with strong baselines, even surpassing some mPLMs pre-trained with parallel corpora. Further analysis demonstrates that our methods improve the geometric similarity of representations across different languages, thus promoting cross-lingual transfer capability.
Our contributions are summarized as follows:
• Our empirical study shows the existence of cross-lingual token alignments in mono-mPLMs. We further measure their accuracy, identify their location, and verify their format.
• Comparing mono-mPLMs with para-mPLMs, we find that mono-mPLMs tend to disturb the geometric similarities between representations at higher layers while para-mPLMs remain unaffected, showing the necessity of cross-lingual interactions during pre-training.
• We propose token-level/semantic-level code-switched masked language modeling to encourage cross-lingual interactions during pre-training, improving the cross-lingual transfer capability without relying on parallel corpora.

A Closer Look at Multilinguality
In this section, we take the commonly used XLM-R, a strong mono-mPLM, as an example to present our observations. Specifically, we first investigate the properties of cross-lingual token alignments in mono-mPLMs, showing their relation to geometric similarity. Then, we explore the variation of the geometric similarity of representations across different layers and demonstrate that the geometric similarity at higher layers is disturbed due to the lack of cross-lingual interactions, hindering cross-lingual transfer capability.

Language-Specific Vocabulary
Generally, mPLMs adopt a huge vocabulary shared across 100+ languages. Different languages have both shared and independent tokens. Previous studies (Conneau and Lample, 2019; Pires et al., 2019; Wu and Dredze, 2019) regard the shared tokens as the source of cross-lingual capability. However, the latent relevance between language-specific tokens is not fully exploited. Suppose that each language l in the language set has a corresponding corpus $C_l$. We first record the tokens whose frequencies are larger than 100 in $C_l$, obtaining the vocabulary $V_l$ for that specific language. Then, we remove the shared tokens when processing two languages $l_a$ and $l_b$ to avoid the impact of overlapping. Finally, we obtain language-specific vocabularies that are independent of each other:
$V_{l_a}' = V_{l_a} \setminus V_{l_b}, \qquad V_{l_b}' = V_{l_b} \setminus V_{l_a}.$
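As a concrete illustration, a minimal sketch of this vocabulary construction could look as follows (the frequency threshold of 100 comes from the text; tokenization and corpus iteration details are assumptions):

```python
from collections import Counter
from typing import Iterable, List, Set

def language_vocab(corpus: Iterable[List[str]], min_freq: int = 100) -> Set[str]:
    """Collect subword tokens whose corpus frequency exceeds min_freq."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return {tok for tok, c in counts.items() if c > min_freq}

def language_specific_vocabs(corpus_a, corpus_b):
    """Remove tokens shared by both languages, keeping only language-specific ones."""
    va, vb = language_vocab(corpus_a), language_vocab(corpus_b)
    shared = va & vb
    return va - shared, vb - shared

# Toy usage with already-tokenized sentences (illustrative only):
# vocab_en, vocab_de = language_specific_vocabs(tokenized_en, tokenized_de)
```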

Token-Alignments in Mono-mPLMs
After obtaining the language-specific vocabularies, we align tokens from other languages to English: we calculate the similarity among token embeddings of XLM-R and directly export high-quality cross-lingual token alignments as dictionaries. Specifically, we adopt cross-domain similarity local scaling (CSLS) (Lample et al., 2018) to compute the token similarity from language X to language Y. For token embeddings x and y in the two languages, the CSLS score is computed as:
$\mathrm{CSLS}(x, y) = 2\cos(x, y) - r_K(x) - r_K(y), \qquad r_K(x) = \frac{1}{K}\sum_{y' \in \mathcal{N}_K(x)} \cos(x, y'),$
where $r_K(x)$ is the average similarity from x to its K nearest target neighbourhoods $\mathcal{N}_K(x)$, and $r_K(y)$ is defined analogously in the reverse direction.
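The following sketch shows how the CSLS scores above could be computed over two embedding matrices; the matrix form and the choice of K = 10 neighbours (the default in Lample et al., 2018) are our assumptions:

```python
import numpy as np

def csls_scores(X: np.ndarray, Y: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS(x, y) = 2*cos(x, y) - r_K(x) - r_K(y) for all token pairs.

    X: (n_x, d) source-language token embeddings.
    Y: (n_y, d) target-language token embeddings.
    Returns an (n_x, n_y) matrix of CSLS scores.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = X @ Y.T                                    # pairwise cosine similarities
    # r_K: mean similarity to the K nearest neighbours in the other language.
    r_x = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # (n_x,)
    r_y = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # (n_y,)
    return 2 * cos - r_x[:, None] - r_y[None, :]

# A dictionary entry for each source token is its argmax target under CSLS:
# dictionary = csls_scores(emb_x, emb_y).argmax(axis=1)
```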
Accuracy of Token-Alignments To measure the quality of the exported dictionaries, we collect gold-standard dictionaries from wiki-dict (https://github.com/onny/wikidict) and MUSE (Lample et al., 2018). The accuracy scores are shown in Table 1. We find that the exported cross-lingual dictionaries have good quality, demonstrating that mono-mPLMs learn alignments between different languages using monolingual corpora only. In particular, distant language pairs such as En-Ja (63.97%) and En-Ko (62.52%) have higher accuracy scores, even though they usually have few overlapping tokens to serve as anchor points for alignment. This phenomenon directly shows that cross-lingual ability does not depend only on overlapping tokens across languages. Another potential factor could be the occurrence of words with similar meanings at comparable frequencies across languages (K et al., 2019). Using language modeling as the objective may unearth such regularities and stimulate cross-lingual transfer capability.
Format of Token-Alignments The second question is whether the alignments are absolute or geometric (Figure 1 (b)). Absolute alignment requires the token embeddings to be language-agnostic (Artetxe et al., 2017; Lample et al., 2018), while geometric alignment focuses on the correspondence between tokens in different languages (Vulić et al., 2020). The latter only requires a similar geometric spatial structure while keeping language-specific characteristics (Roy et al., 2020).
Thus, we visualize the embeddings of the tokens in the exported dictionaries. The results across five diverse languages are shown in Figure 2 (a). We find that token embeddings are separately distributed in the space according to language similarity instead of aggregating together, showing that the alignments are geometric. We also use RSIM to measure the geometric similarity in Table 1. By contrast, token representations at the top layer aggregate together (Figure 2 (b)).
Location of Token-Alignments Since the top-layer hidden states aggregate, a natural question is whether the accuracy of token alignments improves from the bottom to the top layers. To answer this question, we compute the token alignment accuracy, average cosine similarity, and RSIM scores using the hidden states of different layers. Figure 3 (a) shows that as the layers become higher, token alignment accuracy decreases but the average cosine similarity between translation pairs increases, demonstrating that cross-lingual token alignments mainly exist in the embedding layer while higher layers focus on aggregating language-specific token representations. Moreover, Figure 3 (b) shows that the geometric similarities between language-specific token representations at the top layers become weaker.
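RSIM is not defined explicitly in this excerpt; the sketch below computes one common variant of relational similarity, the Pearson correlation between the intra-language cosine-similarity matrices of aligned translation pairs, and should be read as an assumption about the exact metric:

```python
import numpy as np

def rsim(X: np.ndarray, Y: np.ndarray) -> float:
    """Geometric (relational) similarity between two aligned embedding sets.

    X, Y: (n, d) embeddings of n translation pairs (row i of X aligns with row i of Y).
    Returns the Pearson correlation between the upper triangles of the two
    intra-language cosine-similarity matrices.
    """
    def cosine_matrix(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        return Z @ Z.T

    iu = np.triu_indices(X.shape[0], k=1)        # off-diagonal upper triangle
    sx, sy = cosine_matrix(X)[iu], cosine_matrix(Y)[iu]
    return float(np.corrcoef(sx, sy)[0, 1])
```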

Cross-Lingual Interactions Matter for Geometric Similarity Maintenance
§2.2 shows that language-specific token representations of XLM-R at higher layers aggregate together (Figure 2 (b)) but the alignment and geometric similarity are disturbed (Figure 3). Since para-mPLMs usually obtain better performance on cross-lingual transfer tasks, we compare para-mPLMs with XLM-R (a mono-mPLM) in the above aspects. Specifically, we choose VECO (Luo et al., 2021) and INFOXLM (Chi et al., 2021b) as representatives of para-mPLMs, which are pre-trained with monolingual and parallel corpora and obtain improvements over XLM-R. Figure 4 (a) shows the token alignment accuracy and cosine similarity across layers. We find that different mPLMs exhibit similar behavior: both mono-mPLMs and para-mPLMs tend to aggregate token representations while ignoring alignments at the higher layers. The reason may be that higher layers of PLMs prioritize complex semantic combinations rather than token features (Jawahar et al., 2019).
Figure 4 (b) compares the average RSIM scores of different mPLMs. VECO and INFOXLM have higher RSIM scores than XLM-R across layers, showing that parallel corpora improve the geometric similarity between languages. Furthermore, the RSIM scores of VECO/INFOXLM across layers are more balanced than those of XLM-R. This demonstrates that explicit cross-lingual interactions (parallel corpora) help maintain geometric similarity in mPLMs, which could be one of the factors contributing to their better cross-lingual transfer capability compared with mono-mPLMs.
3 Our Method
§2.2 demonstrates that mono-mPLMs learn cross-lingual token alignments, which can be exported as high-quality dictionaries. §2.3 shows that explicit cross-lingual interactions may enhance cross-lingual transfer capability. These observations motivate us to explore self-improving methods that increase cross-lingual interactions without relying on parallel corpora. Next, we introduce our proposed token-level/semantic-level code-switched masked language modeling for multilingual pre-training.

Token-Level Code-Switch MLM (TCS)
Previous code-switch methods either rely on existing bilingual dictionaries (Lin et al., 2020; Chaudhary et al., 2020) or on alignment tools that build alignment pairs from parallel sentences (Ren et al., 2019; Yang et al., 2020). Our work shows that the dictionaries self-induced from mono-mPLMs are accurate enough, which allows mono-mPLMs to improve themselves without requiring prior bilingual knowledge.
Therefore, we replace 10∼15% of the tokens using the self-induced dictionary to construct multilingual code-switched contexts but keep the masked positions unchanged, forcing the model to predict the masked tokens from different but semantically equivalent contexts. For example, an original English token sequence (after subword segmentation) is converted into a code-switched one using the self-induced En-De dictionary:

_A _cat _sit _on _the [mask] . ⇒ $\mathcal{L}_{\text{MLM}}$
⇓
_A _Katze _sit _on _the [mask] . ⇒ $\mathcal{L}_{\text{TCS-MLM}}$

Both the original and code-switched sentences are fed into the mono-mPLM to perform masked language modeling. The training loss is the sum of the two objectives:
$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{TCS-MLM}}.$
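A minimal sketch of the token-level replacement step might look as follows; the function name, the dictionary format (one best translation per source token), and the exact sampling of positions are assumptions, while the 10∼15% replacement ratio and the untouched [MASK] positions follow the description above:

```python
import random
from typing import Dict, List

def token_code_switch(tokens: List[str],
                      masked: List[bool],
                      dictionary: Dict[str, str],
                      ratio: float = 0.15) -> List[str]:
    """Replace a fraction of unmasked context tokens with their self-induced
    translations, leaving masked positions (and hence the MLM labels) untouched."""
    switched = list(tokens)
    candidates = [i for i, (t, m) in enumerate(zip(tokens, masked))
                  if not m and t in dictionary]
    n_replace = max(1, int(ratio * len(candidates))) if candidates else 0
    for i in random.sample(candidates, n_replace):
        switched[i] = dictionary[switched[i]]
    return switched

# Example with an En-De dictionary induced from the embedding layer:
# tokens   = ["_A", "_cat", "_sit", "_on", "_the", "[MASK]", "."]
# masked   = [False, False, False, False, False, True, False]
# switched = token_code_switch(tokens, masked, {"_cat": "_Katze"})
# Both `tokens` and `switched` are then fed to the model with the same MLM targets.
```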
Semantic-Level Code-Switch MLM (SCS)
Considering that token replacements often lack diversity, we propose a novel semantic-level code-switch method, which replaces 10∼15% of the tokens with the average weighting of their neighbors in another language, as shown in Figure 5.
Because mPLMs provide contextual output distributions over the vocabulary and thereby mitigate polysemy problems (Tversky and Gati, 1982), we first utilize the mono-mPLM to obtain the output probability distribution over the vocabulary for each selected token. Then, we choose the top-k tokens according to their probabilities and average-weight their embeddings to obtain the contextual token representation x:
$\mathbf{x} = \sum_{i=1}^{k} p_i \, \mathbf{e}_i,$
where $p_i$ is the normalized probability of the i-th token among the top-k tokens and $\mathbf{e}_i$ is its embedding.
After obtaining the contextual representation, we search the embedding table for corresponding translations on-the-fly instead of directly using the discrete dictionaries, which improves the diversity of training examples while keeping the semantics. Similarly, we keep the top-k translation candidates and average-weight their embeddings to obtain the replacement ŷ:
$\hat{\mathbf{y}} = \sum_{j=1}^{k} q_j \, \mathbf{e}_j,$
where $q_j$ is the normalized on-the-fly CSLS score of the j-th candidate among the top-k tokens in the corresponding language-specific vocabulary V.
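The following sketch illustrates the semantic-level replacement for a single context position; variable names are hypothetical, and plain cosine similarity with a softmax normalization stands in for the full CSLS scoring described above:

```python
import torch
import torch.nn.functional as F

def semantic_replacement(logits: torch.Tensor,
                         emb_table: torch.Tensor,
                         tgt_vocab_ids: torch.Tensor,
                         k: int = 8) -> torch.Tensor:
    """Build the soft code-switched embedding for one context position.

    logits:        (V,) output distribution of the mono-mPLM at this position.
    emb_table:     (V, d) token embedding table.
    tgt_vocab_ids: ids of the language-specific vocabulary of the other language.
    """
    # Contextual representation x: probability-weighted top-k source embeddings.
    top_p, top_ids = torch.topk(F.softmax(logits, dim=-1), k)
    p = top_p / top_p.sum()
    x = (p.unsqueeze(-1) * emb_table[top_ids]).sum(dim=0)             # (d,)

    # On-the-fly translation search in the other language (cosine here; the
    # paper uses CSLS, which additionally subtracts neighbourhood terms).
    tgt_emb = emb_table[tgt_vocab_ids]                                 # (V_tgt, d)
    sims = F.cosine_similarity(x.unsqueeze(0), tgt_emb, dim=-1)
    top_s, top_j = torch.topk(sims, k)
    q = F.softmax(top_s, dim=-1)                                       # normalization is an assumption
    y_hat = (q.unsqueeze(-1) * tgt_emb[top_j]).sum(dim=0)              # (d,)
    return y_hat  # used in place of the original token embedding
```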
Same as §3.1, we request the mono-mPLM to perform masked language modeling on both the original examples and the semantically code-switched ones. The final training loss is:
$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{SCS-MLM}}.$

4 Pre-Training Details

Pre-Training Data Following common practice, we balance the languages with exponentially smoothed sampling: given $n_i$ training instances for the i-th language, the sampling probability for the i-th language can be calculated as
$p_i = \frac{q_i^{\alpha}}{\sum_{j} q_j^{\alpha}}, \qquad q_i = \frac{n_i}{\sum_{k} n_k},$
where α is set as 0.7.
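Assuming the reconstructed sampling formula above, the per-language sampling probabilities could be computed as in the following sketch:

```python
import numpy as np

def language_sampling_probs(n_instances, alpha: float = 0.7) -> np.ndarray:
    """Exponentially smoothed sampling probabilities over languages.

    n_instances: per-language training-instance counts n_i.
    Returns p_i proportional to (n_i / sum_k n_k) ** alpha, which up-weights
    low-resource languages relative to their raw share of the data.
    """
    q = np.asarray(n_instances, dtype=np.float64)
    q = q / q.sum()
    p = q ** alpha
    return p / p.sum()

# e.g. language_sampling_probs([1_000_000, 50_000, 5_000]) gives the
# low-resource languages a larger share than raw proportional sampling.
```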
Model Configuration Due to resource restrictions, we conduct experiments on Transformer encoder models to verify the effectiveness of our method. For fair comparison with previous studies on natural language understanding tasks, we pre-train a 12-layer Transformer encoder as the BASE model (768 hidden dimensions and 12 attention heads) and a 24-layer Transformer encoder as the LARGE model (1024 hidden dimensions and 16 attention heads) using the fairseq toolkit. The activation function is GeLU. Following Chi et al. (2021b); Luo et al. (2021); Ouyang et al. (2021), we initialize the parameters with XLM-R.
We also pre-train a 6-layer Transformer encoder (1024 hidden dimensions and 8 attention heads), which is used to evaluate performance on unsupervised machine translation.
Optimization Settings We use the Adam optimizer to train our models, with a learning rate schedule of linear decay and 4000 warm-up steps. The peak learning rates are set to 2e-4 and 1e-4 for the BASE and LARGE models, respectively. Pre-training is conducted on 8 Nvidia A100-80GB GPUs with a batch size of 2048. The BASE model takes about one month and the LARGE model about two months to pre-train. Appendix A provides more details about the pre-training settings.
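A minimal sketch of the linear warm-up and linear decay schedule described above; the total number of training steps is a placeholder, not a value reported in the paper:

```python
def learning_rate(step: int, peak_lr: float = 2e-4,
                  warmup_steps: int = 4000, total_steps: int = 500_000) -> float:
    """Linear warm-up to peak_lr followed by linear decay to zero.
    `total_steps` is a placeholder; the actual training length is given in Appendix A."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```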
5 Experiments on Downstream Tasks

5.1 Natural Language Understanding

Experimental Settings
We consider four kinds of cross-lingual NLU tasks.
Sentence-Pair Classification We choose XNLI (Conneau et al., 2018) for cross-lingual language inference and PAWS-X (Hu et al., 2020) for cross-lingual paraphrase identification.
Question Answering We choose MLQA (Lewis et al., 2020) for cross-lingual question answering.
We conduct experiments on the above cross-lingual NLU tasks to evaluate the cross-lingual transfer capability of our method: we fine-tune the model on the English training set and evaluate on the foreign-language test sets. We describe the results for each task as follows.
Sentence-Pair Classification The cross-lingual natural language inference task (XNLI) aims to determine the relationship between two input sentences: entailment, neutral, or contradiction. PAWS-X aims to judge whether two sentences are paraphrases of each other.
As shown in Table 2, our SCS-MLM BASE surpasses the baseline models, including mBERT, XLM, XLM-R BASE, and Unicoder. Moreover, our SCS-MLM LARGE obtains performance equivalent to some models pre-trained with parallel sentences, including VECO LARGE and HICTL LARGE. In contrast, although TCS-MLM also obtains improvements, it is not as good as SCS-MLM. We suppose that the limited dictionaries lead to insufficient cross-lingual interactions.

Structural Prediction
Table 3 shows the results of our models. Compared with previous studies, our proposed SCS-MLM LARGE obtains the best results on UDPOS, achieving a 76.0 F1 score. On Wikiann, our TCS-MLM LARGE and SCS-MLM LARGE also obtain significant improvements over the strong baseline XLM-R. We suppose that the induced dictionaries capture the relations of entities and POS tags across different languages, which promotes improvements on the structural prediction tasks.

Cross-lingual Question Answering
MLQA aims to answer questions based on a given paragraph and covers 7 languages.
The F1/EM scores are shown in Table 4. Our proposed TCS-MLM and SCS-MLM are significantly better than the strong baseline XLM-R and even surpass some models pre-trained with parallel sentences, such as VECO, VECO 2.0, and HICTL. Although our methods cannot surpass ERNIE-M LARGE, they narrow the gap between mPLMs trained with and without parallel sentences, demonstrating the effectiveness of our methods.

Cross-lingual Retrieval
To evaluate the cross-lingual sentence retrieval capability of our models, we choose a subset of the Tatoeba dataset (36 language pairs), where the goal is to identify the parallel sentence among 1,000 candidates. Following previous studies, we use the averaged representation from the middle layer of each model (XLM-R BASE, + TCS-MLM BASE, and + SCS-MLM BASE) for the retrieval task.
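As an illustration of this retrieval protocol, the sketch below mean-pools middle-layer hidden states over non-padding tokens and matches each source sentence to its most similar target candidate by cosine similarity; the pooling and masking details are assumptions:

```python
import torch

@torch.no_grad()
def retrieve(src_hidden: torch.Tensor, tgt_hidden: torch.Tensor,
             src_mask: torch.Tensor, tgt_mask: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour sentence retrieval from mean-pooled middle-layer states.

    *_hidden: (batch, seq, d) hidden states from a middle encoder layer.
    *_mask:   (batch, seq) attention masks (1 for real tokens, 0 for padding).
    Returns, for each source sentence, the index of its best target candidate.
    """
    def mean_pool(h, m):
        m = m.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)

    src = torch.nn.functional.normalize(mean_pool(src_hidden, src_mask), dim=-1)
    tgt = torch.nn.functional.normalize(mean_pool(tgt_hidden, tgt_mask), dim=-1)
    return (src @ tgt.T).argmax(dim=-1)  # cosine similarity == dot product of unit vectors
```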
The results are shown in Figure 6. Our proposed SCS-MLM BASE obtains better retrieval accuracy (average acc. 60.73) than TCS-MLM BASE (+0.49 acc.) and XLM-R BASE (+7.46 acc.) across language directions, demonstrating the effectiveness of our method.

Natural Language Generation -UNMT
As our proposed pre-training methods do not rely on parallel sentences, we choose the harder task, unsupervised neural machine translation (UNMT), to evaluate performance on a generation task.

Experimental Results
Table 5 shows the translation performance on the WMT14 En-Fr, WMT16 En-De, and WMT16 En-Ro test sets. Our proposed SCS-MLM improves translation quality compared with the strong baselines XLM and MASS. For example, SCS-MLM outperforms XLM and MASS by 1.1 and 1.2 BLEU, respectively, on WMT16 En→Ro. SCS-MLM also surpasses previous studies on average, verifying its effectiveness. Moreover, the results show that our method is also suitable for the seq2seq model MASS (Figure 8 in Appendix A.3), demonstrating that our method is independent of the model architecture.

SCS-MLM Improves Geometric Similarity at Higher Layers
§2.3 shows that para-mPLMs maintain the geometric similarity of language-specific token representations across layers. As our method incorporates explicit cross-lingual interactions into pre-training, a similar phenomenon should occur.
Therefore, we plot the RSIM scores across the 24-layer LARGE models for measurement. As shown in Figure 7 (a), compared with the baseline model, our proposed SCS-MLM increases the geometric similarity of different languages across layers. Figure 7 (b) shows that the RSIM improvements concentrate on the higher layers, thereby achieving balanced geometric similarities akin to the observations for para-mPLMs in Figure 4 (b). This could explain why our method is effective on various cross-lingual transfer tasks.

Related Work
Multilingual pre-trained language models begin with mBERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), which learn a shared feature space among languages using multiple monolingual corpora. XLM-R (Conneau et al., 2020) shows the effect of training models on a large-scale corpus, establishing strong baselines for subsequent studies.
Based on the observation in Conneau and Lample (2019) that parallel corpora help cross-lingual alignment, many studies pay attention to the usage of parallel corpora. Unicoder (Huang et al., 2019) employs a multi-task learning framework to learn cross-lingual semantic representations. ALM (Yang et al., 2020) and PARADISE (Reid and Artetxe, 2022) use parallel sentences to construct code-switched sentences. INFOXLM (Chi et al., 2021b) and HICTL (Wei et al., 2021) respectively employ sentence-level and token-level contrastive learning for cross-lingual semantic alignment. VECO (Luo et al., 2021) proposes a variable framework that enables the model to handle both understanding and generation tasks. ERNIE-M (Ouyang et al., 2021) generates pseudo-training examples, further improving overall performance on downstream tasks. Ai and Fang (2023) explore a prototype-word MASK strategy to improve cross-lingual pre-training.
Different from previous studies, our work first investigates the properties of token alignments behind multilinguality and then proposes self-improving methods for multilingual pre-training with only monolingual corpora, alleviating the need for parallel sentences.

Conclusion
In this work, we first investigate the properties of cross-lingual token alignments in mono-mPLMs and then compare mono-mPLMs with para-mPLMs, demonstrating that the geometric similarities of higher-layer representations are damaged without explicit cross-lingual interactions, hindering multilinguality. Therefore, we propose token-level and semantic-level code-switched masked language modeling to improve cross-lingual interactions without relying on parallel corpora. Empirical results on language understanding and generation tasks demonstrate the effectiveness of our methods. Future work will adapt our methods to much larger language models.

A.1 Pre-Training Data
We use the open-source CC-100 corpora for pre-training the BASE and LARGE models. Due to resource and memory restrictions, we select 50 languages that cover the downstream tasks and conduct random sampling following Luo et al. (2021). Table 6 shows the statistics of the monolingual data in each language.
For a fair comparison with previous studies, we pre-train the cross-lingual language models with the same model architecture as XLM and MASS. The pre-training data for UNMT is shown in Table 8. We compare SCS-MLM with other UNMT pre-training methods (Ren et al., 2019, 2021; Song et al., 2019; Ai and Fang, 2022), which have an equivalent number of parameters. Following common practice on the UNMT benchmarks, we separately adopt newsdev/test 2014 En-Fr, newsdev/test 2016 En-De, and newsdev/test 2016 En-Ro as the development and test sets.
During inference, we use a beam size of 1 and a length penalty of 1.0. To be consistent with previous work, we use multi-bleu.perl to measure translation quality.
The illustration of our method on MASS is shown in Figure 8, which is similar to Figure 5 but needs to predict multiple adjacent tokens using the sequence-to-sequence model.

C Cross-Lingual Alignments on Different mPLMs
We also evaluate the accuracy of token alignments across different language models. The results are shown in Table 9, and some examples are included in Figure 11. We find that different kinds of language models form cross-lingual alignments. This demonstrates that pre-trained language models automatically learn cross-lingual mappings based on language modeling, regardless of the specific modeling method (i.e., masked language modeling, text span prediction, or causal language modeling). For different kinds of pre-trained models, the alignment accuracy increases with the number of parameters. Moreover, all the models show a similar pattern in which distant language pairs have higher alignment accuracy. Furthermore, we also plot the RSIM scores across layers of different mono-mPLMs in Figure 10. We find that different models share a similar phenomenon: RSIM scores are higher at the lower layers but lower at the higher layers. None of them keeps the geometric similarity balanced like the para-mPLMs VECO or InfoXLM, as shown in Figure 4. Therefore, we argue that explicit cross-lingual interactions still matter regardless of the architecture.

D Ablation Study -Effect of k
In the proposed method SCS-MLM, k plays an important role. Considering that the pre-training of the BASE and LARGE models is time-consuming, we pre-train SMALL models with different k and evaluate the performance on the UNMT task (WMT16 En↔Ro). As shown in Figure 9, SCS-MLM obtains the best performance when k is set to 8. Therefore, we set k=8 for all the experiments in our paper.

E Experimental Results Details
Due to space limitations, we only report the average cross-lingual transfer metric scores in the main paper. The details for each language test set are listed in Tables 10-14.
Translate-train-all is another evaluation method for multilingual pre-trained language models. It means fine-tuning a multilingual model on the concatenation of all data (the English training corpus and the translated training corpora in other languages). Although our method mainly focuses on improving cross-lingual transfer capability, it also brings improvements under Translate-train-all settings.
These results are also provided in

Figure 1 :
Figure 1: Illustration of properties of token alignments.

Figure 2 :
Figure 2: The visualization of embedding and states.

Figure 3 :
Figure 3: The token alignment accuracy/cosine similarity and RSIM across different layers of XLM-R LARGE .

Figure 5 :
Figure 5: The illustration of SCS-MLM. First, we use the mPLM to predict the distribution of each context token, obtaining the contextual token representations and searching for corresponding translations. Then, we use the weighted representations to replace the corresponding tokens and request the model to predict the same masked tokens.

Figure 6 :
Figure 6: Tatoeba results for languages, which are sorted according to the performance of XLM-R BASE .

Figure 8 :
Figure 8: The illustration of our method on MASS language modeling, which masks multiple adjacent tokens for prediction.

Table 1 :
Alignment accuracy/RSIM of translation pairs derived from different sizes of XLM-RoBERTa models, across different languages to English. The number below each language is the size of the exported cross-lingual dictionary.

Table 3 :
Evaluation results on the structural prediction tasks, UDPOS and Wikiann. Given a sentence, UDPOS aims to label the POS tags of tokens and Wikiann aims to identify named entities. We report the average F1 score for each dataset.

Table 4 :
Evaluation results on MLQA cross-lingual question answering. We report the F1 / exact match (EM) scores. The results of TCS-MLM and SCS-MLM are averaged over five runs.

Table 5 :
Unsupervised translation performance on WMT14 En-Fr, WMT16 En-De, and WMT16 En-Ro. The results of previous studies are taken from the corresponding papers.

Table 6 :
Statistics of the pre-training data in our experiments.

Table 7 :
Hyperparameters used for pre-training.

Table 8 :
Data statistics for unsupervised machine translation training.

Table 9 :
Prediction accuracy of translation pairs derived from different mPLMs across different languages to English. Because XLM-R/mT5/X-GLM support different numbers of languages and have different vocabularies, comparisons between different types of mPLMs may not be meaningful.