Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training

Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. In particular, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to obtain for low-resource languages. An alternative is to make the multilingual encoders more robust: when fine-tuning the encoder on a downstream task, we train it to tolerate noise in the contextual embedding space, so that even if the representations of different languages are not well aligned, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the two input sentences belong to different languages.


Introduction
Zero-shot cross-lingual transfer learning aims to learn models with data available in one or more source languages and use them in other target languages for which no (zero-resource) data is available. Zero-shot cross-lingual transfer has great practical value for low-resource languages, since it reduces the amount of labeled data required to learn models for downstream tasks, e.g., text classification (Conneau et al., 2018; Yang et al., 2019) and question answering (Lewis et al., 2020).
Recently, pre-trained multilingual language encoders, such as multilingual BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020a), have demonstrated promising performance on zero-shot cross-lingual transfer learning for a wide range of downstream tasks (Hu et al., 2020; Liang et al., 2020). These encoders learn a shared multilingual contextual embedding space in which word pairs from parallel sentences have similar contextual representations. However, the multilingual encoders fail to capture this similarity when the source and target languages are less similar at the levels of morphology, syntax, and semantics (Ahmad et al., 2019a,b).
Prior studies (Cao et al., 2020; Pan et al., 2021; Dou and Neubig, 2021) have shown that aligning the representations of different languages in the multilingual embedding space plays an important role in zero-shot cross-lingual transfer learning. As illustrated in Figure 1a, words with similar meanings (e.g., this, ceci, and 这) have similar representations in the contextual multilingual embedding space, even though they are in different languages. This alignment helps models transfer learned knowledge from source languages to target languages. Therefore, several works focus on improving the quality of alignments in the multilingual embedding space (Cao et al., 2020; Chi et al., 2020; Pan et al., 2021; Dou and Neubig, 2021). Nevertheless, learning such alignments usually requires sentence-level or word-level parallel corpora, which are expensive to obtain for low-resource languages. In addition, because the meanings of words in different languages usually do not match exactly, learning a perfect alignment may be impossible.
In this work, we improve zero-shot cross-lingual transfer performance from another point of view. We aim to make the multilingual encoders robust such that they can tolerate a certain amount of noise in the input embeddings. More specifically, as shown in Figure 1b, we aim to construct robust regions (orange circles) for embeddings in the multilingual embedding space. During training, the robust model is expected to output similar predictions for embeddings in the same robust region. Therefore, as long as similar words in different languages fall into the same robust region, even if they are not perfectly aligned, the model can still make similar predictions for them.

Figure 1: (a) Contextual representations of different words. (b) Robust regions try to cover neighboring embeddings; we aim to learn a robust classifier whose robust regions (orange circles) cover as many neighboring words as possible.
To learn the robust model, we first draw connections between adversarial examples (Li et al., 2020; Garg and Ramakrishnan, 2020; Jin et al., 2020) and the failure cases of zero-shot cross-lingual transfer, and then study two widely used robust training methods: (1) adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and (2) randomized smoothing (Cohen et al., 2019; Ye et al., 2020). Both make the model robust against perturbations in the input embeddings by modifying the training objective when fine-tuning the model on the downstream task. For randomized smoothing, we also adopt a data augmentation approach (Ye et al., 2020) to learn the robust model.
We perform experiments on two cross-lingual text classification tasks, paraphrase identification and natural language inference. The experimental results demonstrate that robust training indeed improves the performance of zero-shot cross-lingual transfer on the classification benchmarks PAWS-X (Yang et al., 2019) and XNLI (Conneau et al., 2018). On average, the cross-lingual transfer performance improves by 2.1 and 1.6 points on PAWS-X and XNLI, respectively. In addition, we show that robust training remarkably improves generalized cross-lingual transfer (Lewis et al., 2020), the setting in which the two input sentences of a text classification task belong to different languages, e.g., paraphrase prediction for a pair of sentences in English and Korean.

Related Work
Embedding misalignment handling. Instead of directly aligning the representations, one line of research makes the model aware of embedding misalignment by incorporating additional syntactic features, such as part-of-speech tags (Kozhevnikov and Titov, 2013), dependency parse trees (Ahmad et al., 2019b; Subburathinam et al., 2019; Liu et al., 2019; Ahmad et al., 2021a,b), and other syntactic features (Meng et al., 2019). However, these syntactic features require large human effort to obtain.
Robust training. Recently, various adversarial attacks have been proposed to probe the robustness of NLP models, such as character manipulation (Ebrahimi et al., 2018; Gil et al., 2019), word replacement (Alzantot et al., 2018; Li et al., 2020; Garg and Ramakrishnan, 2020; Jin et al., 2020), and syntactic rearrangement (Iyyer et al., 2018). To defend against these attacks, various robust training methods have been proposed. For example, Alzantot et al. (2018) train a robust model by data augmentation with generated adversarial examples. Other works (Ebrahimi et al., 2018; Dong et al., 2021; Zhou et al., 2021) consider adversarial training, which incorporates an adversarial loss term into the training objective. A few studies propose transformations on inputs before feeding them to models (Edizel et al., 2019; Jones et al., 2020). Randomized smoothing (Cohen et al., 2019; Ye et al., 2020) makes models robust against noise in the input representations. Another line of research aims at providing theoretical guarantees of robustness, including interval bound propagation methods (Jia et al., 2019; Huang et al., 2019) and verification methods (Shi et al., 2020). Most of these robust training methods focus on defending against adversarial attacks, whereas we apply robust training to improve zero-shot cross-lingual transfer performance.

Zero-Shot Cross-Lingual Transfer with Robust Training
In this work, we focus on zero-shot cross-lingual transfer for text classification tasks. Our goal is to learn a classifier $f$ from a set of training examples $X_{src}$ in the source languages. At test time, we directly use the classifier $f$ to conduct inference on a set of test examples $X_{tgt}$ in the target languages. We expect the classifier $f$ to transfer the knowledge learned from the source languages to the target languages.

Connection with Adversarial Examples
The aligned representations of different languages have been shown to be a crucial factor (Cao et al., 2020; Chi et al., 2020; Pan et al., 2021) for multilingual embeddings to be effective for zero-shot cross-lingual transfer. For example, assume the source language is English and the target language is French, and consider a pair of parallel sentences "this is a cat" (in English) and "Ceci est un chat" (in French). We can get the contextual representations of the source English sentence $E_{src} = (v_1, v_2, v_3, v_4)$ and the target French sentence $E_{tgt} = (u_1, u_2, u_3, u_4)$. Let $\delta = (\delta_1, \delta_2, \delta_3, \delta_4)$ with $\delta_i = u_i - v_i$ denote the difference between the source and the target contextual representations, so that $E_{tgt} = E_{src} + \delta$. Since words with similar meanings have similar representations, the norm of each difference $\delta_i$ is supposed to be small. Therefore, if $f(E_{src}) = c$, there is a high probability that $f(E_{tgt}) = c$ as well, which means that the classifier $f$ is able to transfer the learned knowledge from the source language to the target language. If, unfortunately, the transfer fails, we have

$$f(E_{src} + \delta) = f(E_{tgt}) \neq f(E_{src}) = c, \quad \text{where each } \|\delta_i\| \text{ is small.} \tag{1}$$

We observe that Eq. (1) is very similar to the definition of adversarial examples (Alzantot et al., 2018; Li et al., 2020; Garg and Ramakrishnan, 2020; Jin et al., 2020). The goal of adversarial examples is to find a small perturbation $\Delta$ for an instance $x$ such that a classifier $h$ changes its prediction on $x$, as illustrated by the following equation:

$$h(x + \Delta) \neq h(x), \quad \text{where } \Delta \text{ is small.}$$
In the cases where cross-lingual transfer fails, the difference $\delta$ between the source and target representations behaves like an adversarial perturbation. This inspires us to apply robust training methods, originally designed for defending against adversarial examples, to improve zero-shot cross-lingual transfer performance. More specifically, our goal is to train a robust classifier that can tolerate small perturbations of the input embeddings. As shown in Figure 1b, we aim to train a robust classifier $f$ with robust regions (orange circles) such that $f$ outputs similar values for input embeddings that fall in the same robust region.

Adversarial Training
The main idea of adversarial training is to consider the most effective adversarial perturbation in each optimization iteration. More precisely, in normal training, we learn a classifier $f$ by solving the following optimization problem:

$$\min_f \; \mathbb{E}_{(x, y) \in X_{src}} \; L(f(\mathrm{Enc}(x)), y),$$

where $\mathrm{Enc}(\cdot)$ is the multilingual encoder and $L$ is the cross-entropy loss. When considering adversarial training, we instead solve the following min-max optimization problem:

$$\min_f \; \mathbb{E}_{(x, y) \in X_{src}} \; \max_{\|\delta_i\| \leq \varepsilon} L(f(\mathrm{Enc}(x) + \delta), y),$$

where $\varepsilon$ is a hyper-parameter that controls the size of the robust regions, which are norm balls $\|\delta_i\| \leq \varepsilon$ around the token embeddings. The inner maximization finds the most effective perturbation for changing the prediction, while the outer minimization tries to ensure a correct prediction under that perturbation. With this min-max optimization, the classifier $f$ becomes aware of perturbations within the robust regions and thus more robust.
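To make the min-max procedure concrete, the following is a minimal PyTorch-style sketch of one adversarial training step, approximating the inner maximization with a few steps of projected gradient ascent. This is our illustration rather than the paper's released code; `encoder` and `classifier` are placeholder modules, and the step sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(encoder, classifier, optimizer, x, y,
                              eps=0.01, alpha=0.005, pgd_steps=3):
    """One fine-tuning step: inner maximization over per-token L2 balls
    of radius eps via projected gradient ascent, then an outer update."""
    with torch.no_grad():
        emb = encoder(x)                       # contextual embeddings (B, T, H)
    delta = torch.zeros_like(emb, requires_grad=True)

    for _ in range(pgd_steps):
        loss = F.cross_entropy(classifier(emb + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Ascent step along the normalized gradient, then projection
            # back onto the eps-ball around each token embedding.
            delta += alpha * grad / grad.norm(dim=-1, keepdim=True).clamp(min=1e-12)
            norm = delta.norm(dim=-1, keepdim=True).clamp(min=1e-12)
            delta *= (eps / norm).clamp(max=1.0)

    optimizer.zero_grad()
    adv_loss = F.cross_entropy(classifier(encoder(x) + delta.detach()), y)
    adv_loss.backward()                        # outer minimization step
    optimizer.step()
    return adv_loss.item()
```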

Randomized Smoothing
Unlike adversarial training, which always considers the most effective perturbation, randomized smoothing focuses on the average case and aims to guarantee the local smoothness of the classifier at the same time. Following previous work (Cohen et al., 2019; Ye et al., 2020), we let $f$ be the classifier learned by solving the normal optimization problem and learn a smoothed classifier $g$ such that

$$g(\mathrm{Enc}(x)) = \arg\max_{c \in Y} \; \mathbb{P}_{\delta \sim P_\delta}\left(f(\mathrm{Enc}(x) + \delta) = c\right),$$

where $P_\delta$ is a prior distribution over the perturbation $\delta$ and $Y$ is the label space. In other words, we want $g(\mathrm{Enc}(x))$ to produce a similar output (label prediction) to $f(\mathrm{Enc}(x))$. The random perturbation $\delta$ is introduced to ensure the local smoothness of $g$; that is, $g(\mathrm{Enc}(x) + \delta)$, the output for a perturbed input, is similar to $g(\mathrm{Enc}(x))$. Compared to the original classifier $f$, the smoothed classifier $g$ is more robust against local perturbations. We consider two ways to learn the smoothed classifier $g$: (1) random perturbation and (2) data augmentation.
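At inference time, $g$ can be approximated by Monte Carlo sampling: perturb the embeddings several times and take a majority vote over the predictions. A minimal sketch, under the assumption of a Gaussian prior $P_\delta$ and the same placeholder modules as above:

```python
import torch

@torch.no_grad()
def smoothed_predict(encoder, classifier, x, sigma=0.1, num_samples=32):
    """Monte Carlo approximation of the smoothed classifier
    g(Enc(x)) = argmax_c P_{delta ~ P_delta}[f(Enc(x) + delta) = c],
    assuming P_delta = N(0, sigma^2 I)."""
    emb = encoder(x)                                    # (B, T, H)
    votes = []
    for _ in range(num_samples):
        noisy = emb + sigma * torch.randn_like(emb)     # sample delta
        votes.append(classifier(noisy).argmax(dim=-1))  # labels, shape (B,)
    return torch.stack(votes).mode(dim=0).values        # majority vote
```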
Random perturbation (RP). In each optimization step, we randomly sample a perturbation $\delta$ from $P_\delta$ and add it to $\mathrm{Enc}(x)$. Then, we use the perturbed representation as the input to compute the loss and update the classifier $g$.
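A sketch of one such training step, again assuming a Gaussian prior; only the sampled noise distinguishes it from standard fine-tuning:

```python
import torch
import torch.nn.functional as F

def rs_rp_training_step(encoder, classifier, optimizer, x, y, sigma=0.1):
    """Randomized smoothing via random perturbation: sample delta from
    the prior and train on the perturbed embeddings."""
    emb = encoder(x)
    delta = sigma * torch.randn_like(emb)           # delta ~ P_delta
    loss = F.cross_entropy(classifier(emb + delta), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```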
Data augmentation (DA). Another common way to approximate the smoothed classifier $g$ is data augmentation (Ye et al., 2020). Instead of randomly sampling the perturbation $\delta$, we consider a predefined synonym set (Alzantot et al., 2018). For every example $x = (w_1, w_2, ..., w_n)$ in $X_{src}$, we generate $m$ augmented examples by replacing each word $w_i$ in $x$ with one of its synonyms (including $w_i$ itself); multiple replacements are allowed in one example. Then, we use the augmented data to train a smoothed classifier $g$.
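A minimal sketch of this augmentation; the synonym table below is a hypothetical stand-in for the counter-fitted synonym set used in the experiments:

```python
import random

def augment_with_synonyms(tokens, synonyms, m=10):
    """Generate m augmented copies of a tokenized example. Each word is
    replaced by one of its synonyms; the word itself is always among the
    candidates, so some positions stay unchanged."""
    return [
        [random.choice(synonyms.get(w, []) + [w]) for w in tokens]
        for _ in range(m)
    ]

# Example with a toy synonym table:
# augment_with_synonyms("this is a cat".split(), {"cat": ["kitten", "feline"]}, m=3)
```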
It is worth noting that the predefined synonym set is required only for the source languages. Unlike previous work (Qin et al., 2020; Liu et al., 2020), which uses bilingual dictionaries covering both source and target languages, the proposed method does not need any additional annotations for the target languages.

Experiments
We conduct experiments to verify that robust training indeed improves the performance of zero-shot cross-lingual transfer.

Setup
We consider two cross-lingual text classification datasets: Cross-lingual Paraphrase Adversaries from Word Scrambling (PAWS-X) (Yang et al., 2019) and Cross-lingual Natural Language Inference (XNLI) (Conneau et al., 2018). The goal of PAWS-X is to determine whether two sentences are paraphrases of each other. XNLI is designed for natural language inference: given a premise and a hypothesis, the classifier predicts the relation between the two sentences from {entailment, neutral, contradiction}.
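Our experiments use the XTREME splits of these datasets (see below); for reference, both are also commonly loaded through the HuggingFace datasets hub. The dataset IDs and column names in this snippet are our assumption, not part of the paper:

```python
from datasets import load_dataset

# Language configs follow each dataset's own codes; "en" is the source language.
pawsx_en = load_dataset("paws-x", "en")  # columns: sentence1, sentence2, label
xnli_de = load_dataset("xnli", "de")     # columns: premise, hypothesis, label
```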
For both datasets, we consider English as the source language and treat the other languages as target languages. We use the train, validation, and test splits provided by the XTREME framework (Hu et al., 2020). Specifically, we conduct 10 runs of experiments with 10 different random seeds. In each run, we train the classifier on the English training set, use the English validation set to select the best parameters, and record the results on the test sets. Finally, we report the results averaged over the 10 runs.
Compared models. We consider the following four models: (1) mBERT, the standard multilingual BERT fine-tuned without robust training; (2) mBERT-ADV, fine-tuned with adversarial training; (3) mBERT-RS-RP, fine-tuned with randomized smoothing via random perturbation; and (4) mBERT-RS-DA, fine-tuned with randomized smoothing via data augmentation. For randomized smoothing via data augmentation, we use the synonym set provided by previous work (Alzantot et al., 2018), which is constructed by searching for the nearest neighbors of words in the GloVe embedding space (Pennington et al., 2014) post-processed with the counter-fitting method (Mrksic et al., 2016). The number of augmented examples $m$ is set to 10 and 3 for PAWS-X and XNLI, respectively; more discussion on $m$ appears in Section 4.2. For other parameters, such as the learning rate and the batch size, we follow the training scripts provided by the XTREME framework (Hu et al., 2020).

Zero-Shot Cross-Lingual Transfer
Table 1 shows the averaged results on PAWS-X with 10 different random seeds. We first notice that mBERT-ADV, mBERT-RS-RP, and mBERT-RS-DA all perform better than the standard mBERT on average. In particular, robust training leads to up to 4.0% improvement on Japanese, up to 4.1% improvement on Korean, and up to 2.9% improvement on Chinese. The results suggest that robust training helps improve the performance of zero-shot cross-lingual transfer learning. We also observe that randomized smoothing is usually better than adversarial training. The reason is that adversarial training always considers the most effective adversarial perturbation during optimization. Adversarial perturbations are suitable for defending against adversarial examples, since those are specifically designed to attack the classifier. In the zero-shot cross-lingual transfer case, however, the perturbations are not adversarially designed but reflect the natural differences between languages. Therefore, randomized smoothing, which considers the average case, is the better choice.

Table 2: Averaged results of zero-shot cross-lingual transfer on XNLI with 10 different random seeds. Highest scores are in bold. Underlines denote that the improvement is significant with p ≤ 0.05 under the bootstrapped paired t-test. *We report the numbers from the previous paper (Hu et al., 2020).

We reach a similar conclusion on the XNLI dataset. As shown in Table 2, robust training indeed leads to improvements in zero-shot cross-lingual transfer. Again, randomized smoothing performs better than the adversarial training approach.
Finally, we compare the two ways (random perturbation and data augmentation) of learning the smoothed classifier. They have competitive performance on PAWS-X; however, data augmentation performs better than random perturbation on XNLI. We hypothesize that the ideal robust regions in practice may not be perfect norm balls; rather, they are more like convex hulls formed by neighboring words (Dong et al., 2021). By considering a predefined synonym set, mBERT-RS-DA can better capture the shapes of the robust regions, leading to more stable performance.
Which languages benefit most from robust training? We notice that cross-lingual transfer to some languages is significantly improved by robust training, especially to languages that are quite different from the source language (English). To verify this conjecture, we use lang2vec (Littell et al., 2017), a tool that extracts features of different languages by querying the URIEL typological database, to calculate the distance between English and the other languages (a hypothetical sketch of the distance computation follows this analysis). We then plot the performance gaps between mBERT-RS-DA and mBERT over all languages, along with the least-squares regression line, in Figure 2. Note that the languages are sorted from left to right by their distance to English.
From Figure 2a, we observe a clear trend on PAWS-X: languages with larger distances to English gain more from robust training. We posit that this is because languages at a larger distance have representations that differ more from English in the multilingual embedding space; the norm of the perturbation $\delta$ defined in Section 3 is larger, so failure cases occur more often. Robust training reduces these failure cases, leading to a larger improvement. A similar trend can be observed for XNLI (Figure 2b): performance on languages with larger distances to English improves more with robust training.
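The paper does not state exactly which lang2vec features and distance it uses; the following hypothetical sketch, computing a Euclidean distance over imputed URIEL syntax features, is one plausible instantiation:

```python
# pip install lang2vec
import lang2vec.lang2vec as l2v

# ISO 639-3 codes for the PAWS-X languages ("cmn" = Mandarin Chinese).
langs = ["eng", "deu", "spa", "fra", "jpn", "kor", "cmn"]
feats = l2v.get_features(" ".join(langs), "syntax_knn")  # lang -> feature list

def syntactic_distance(a, b):
    # Euclidean distance between the two URIEL syntax feature vectors.
    return sum((x - y) ** 2 for x, y in zip(feats[a], feats[b])) ** 0.5

for lang in sorted(langs[1:], key=lambda l: syntactic_distance("eng", l)):
    print(lang, round(syntactic_distance("eng", lang), 3))
```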
How much augmented data is needed for randomized smoothing? Since mBERT-RS-DA appears to be the most effective model on both PAWS-X and XNLI, we further ablate the number of augmented examples per training example, $m$. Figure 3 shows the average performance of mBERT-RS-DA on PAWS-X over different choices of $m$. We observe that a larger $m$ generally leads to better performance, because more augmented examples help the model better approximate local smoothness, resulting in more accurate robust regions. Interestingly, when $m \leq 10$, increasing $m$ significantly improves performance, while for $m > 10$ the improvement is only slight. This result suggests setting $m$ to 10 for PAWS-X. We also find that setting $m$ to 3 is good enough for XNLI. This ablation study indicates that randomized smoothing with data augmentation needs only a few augmented instances per example to learn good robust regions.

Zero-Shot Generalized Cross-Lingual Transfer Results
Next, we study zero-shot cross-lingual transfer in a generalized setting. Lewis et al. (2020) proposed the generalized setting for question answering, where the question and the context may belong to two different languages. We consider the generalized setting for cross-lingual text classification, since the inputs of the PAWS-X and XNLI tasks are pairs of sentences. For example, consider XNLI on English-Arabic sentence pairs: the premises are in English, and the hypotheses are in Arabic. Due to the parallel nature of the PAWS-X and XNLI datasets, we can pair up sentences from two different languages (a pairing sketch is given at the end of this subsection). Note that we directly use the models trained in Section 4.2 to conduct inference in the generalized setting; in other words, all classifiers are trained on English-English sentence pairs, without any consideration of the target languages.

The results of mBERT-RS-RP and mBERT-RS-DA on PAWS-X and XNLI over all combinations of languages are shown in Figure 4 and Figure 5, respectively. The diagonal entries correspond to the standard cross-lingual transfer setting, while the off-diagonal entries show the generalized transfer performance. Note that we report the performance difference between each compared model and mBERT (exact numbers can be found in Appendix A), and the languages are sorted by their distance to English. We observe that the off-diagonal numbers are much larger than the diagonal numbers, which suggests that robust training yields larger improvements in the generalized cross-lingual transfer setting. Since the two sentences of every training example are in the same language (English), mBERT makes more mistakes at inference time when the contextual representations of the two input sentences are not accurately aligned. In contrast, mBERT-RS-RP and mBERT-RS-DA can tolerate a certain amount of noise in the input embeddings; they are therefore more stable when the input sentences come from different languages, leading to a significant improvement.
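A minimal sketch of how such generalized test pairs can be built from the parallel test sets; the data format is assumed for illustration:

```python
def generalized_pairs(dataset_by_lang, src="en", tgt="ko"):
    """Build generalized cross-lingual test pairs: the first sentence comes
    from the source language, the second from the target language. Assumes
    dataset_by_lang[lang] is a list of (sent1, sent2, label) tuples that is
    aligned across languages, as in PAWS-X and XNLI."""
    return [
        (s1_src, s2_tgt, label)
        for (s1_src, _, label), (_, s2_tgt, _) in zip(
            dataset_by_lang[src], dataset_by_lang[tgt]
        )
    ]
```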

Study on Syntactic Perturbations
As mentioned in Section 3, our primary focus is on perturbations in the multilingual embedding space, and we do not consider the influence of language syntax on cross-lingual transfer. Different languages have linguistic differences, such as word order, and differences in word order across languages affect the contextual embedding space, which in turn impacts cross-lingual transfer (Ahmad et al., 2019b). Therefore, we conduct a preliminary experiment to study the influence of syntax in robust training.
mBERT-RS-DA uses a predefined synonym set to generate perturbed examples for data augmentation. Following a similar strategy, we construct syntactically perturbed examples for data augmentation. More specifically, for every example $x = (w_1, w_2, ..., w_n)$ in $X_{src}$, we generate $m$ syntactically perturbed examples by randomly swapping adjacent words with probability $p = 0.1$. This random swapping produces examples with different word orders, which simulates syntactic perturbations. We then use these syntactically perturbed examples to train the smoothed classifier $g$; we call the resulting model mBERT-RS-syntax.
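A minimal sketch of this perturbation; whether a token may take part in two consecutive swaps is not specified in the text, so this version skips ahead after each swap:

```python
import random

def swap_perturb(tokens, p=0.1, m=10):
    """Generate m syntactically perturbed copies of a tokenized example by
    randomly swapping adjacent words with probability p."""
    perturbed = []
    for _ in range(m):
        t = list(tokens)
        i = 0
        while i < len(t) - 1:
            if random.random() < p:
                t[i], t[i + 1] = t[i + 1], t[i]
                i += 2  # skip so a token is swapped at most once per pass
            else:
                i += 1
        perturbed.append(t)
    return perturbed
```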
Table 3: Results of syntactic perturbations on PAWS-X. Highest scores are in bold. Underlines denote that the improvement is significant with p ≤ 0.05 under the bootstrapped paired t-test. *We report the numbers from the previous paper (Hu et al., 2020).

Table 3 presents the preliminary results. The average performance of mBERT-RS-syntax is similar to that of standard mBERT. Interestingly, the zero-shot cross-lingual transfer performance drops when the target language is more similar to the source language English (German, Spanish, and French), while it increases when the target language is more different from English (Japanese, Korean, and Chinese). This preliminary result suggests that it is possible to improve zero-shot cross-lingual transfer by considering syntactic perturbations. One potential extension is to adopt paraphrase generation models (Iyyer et al., 2018) to construct more sophisticated syntactic perturbations; we leave this direction for future work.

Conclusion
In this work, we propose to learn a robust model by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. In addition, the improvement is more significant in the generalized cross-lingual transfer setting.

Appendix A: Detailed Results of Zero-Shot Generalized Cross-Lingual Transfer