Consistency Regularization for Cross-Lingual Fine-Tuning

Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.


Introduction
Pre-trained cross-lingual language models (Conneau and Lample, 2019; Conneau et al., 2020a; Chi et al., 2020) have shown great transferability across languages. By fine-tuning on labeled data in a source language, the models can generalize to other target languages, even without any additional training. Such generalization ability reduces the required annotation effort, which is prohibitively expensive for low-resource languages.
Recent work has demonstrated that data augmentation is helpful for cross-lingual transfer, e.g., translating source language training data into target languages (Singh et al., 2019), and generating code-switch data by randomly replacing input words in the source language with translated words in target languages (Qin et al., 2020). Although these methods enlarge the dataset, fine-tuning still treats training instances independently, without considering the inherent correlation between an original input and its augmented example. In contrast, we propose to utilize consistency regularization to better leverage data augmentation for cross-lingual fine-tuning. Intuitively, for a semantics-preserving augmentation strategy, the predicted result of the original input should be similar to that of its augmented counterpart. For example, the classification predictions of an English sentence and its translation tend to remain consistent.
In this work, we introduce a cross-lingual fine-tuning method XTUNE that is enhanced by consistency regularization and data augmentation. First, example consistency regularization enforces the model predictions to be more consistent for semantics-preserving augmentations. The regularizer penalizes the model's sensitivity to different surface forms of the same example (e.g., texts written in different languages), which implicitly encourages cross-lingual transferability. Second, we introduce model consistency to regularize the models trained with various augmentation strategies. Specifically, given two augmented versions of the same training set, we encourage the models trained on these two datasets to make consistent predictions for the same example. The method enforces corpus-level consistency between the distributions learned by the two models.
Under the proposed fine-tuning framework, we study four strategies of data augmentation, i.e., subword sampling (Kudo, 2018), code-switch substitution (Qin et al., 2020), Gaussian noise (Aghajanyan et al., 2020), and machine translation. We evaluate XTUNE on the XTREME benchmark (Hu et al., 2020), including three different tasks on seven datasets. Experimental results show that our method outperforms conventional fine-tuning with data augmentation. We also demonstrate that XTUNE is flexible to be plugged in various tasks, such as classification, span extraction, and sequence labeling.
We summarize our contributions as follows:
• We propose XTUNE, a cross-lingual fine-tuning method to better utilize data augmentations based on consistency regularization.
• We study four types of data augmentations that can be easily plugged into cross-lingual fine-tuning.
• We give instructions on how to apply XTUNE to various downstream tasks, such as classification, span extraction, and sequence labeling.
• We conduct extensive experiments to show that XTUNE consistently improves the performance of cross-lingual fine-tuning.
Related Work

Cross-Lingual Data Augmentation Machine translation has been successfully applied to the cross-lingual scenario as data augmentation. A common way to use machine translation is to fine-tune models on both source language training data and translated data in all target languages. Furthermore, Singh et al. (2019) proposed to replace a segment of source language input text with its translation in another language. However, for token-level tasks it is usually impossible to map the labels in source language data onto target language translations. Qin et al. (2020) fine-tuned models on multilingual code-switch data, which achieves considerable improvements.
Consistency Regularization One strand of work on consistency regularization focuses on regularizing model predictions to be invariant to small perturbations of image data. The small perturbations can be random noise (Zheng et al., 2016), adversarial noise (Miyato et al., 2019; Carmon et al., 2019), or various data augmentation approaches (Hu et al., 2017; Ye et al., 2019; Xie et al., 2020). Similar ideas have been used in natural language processing. Both adversarial noise (Zhu et al., 2020; Jiang et al., 2020) and sampled Gaussian noise (Aghajanyan et al., 2020) have been adopted to augment input word embeddings. Another strand of work focuses on consistency under different model parameters (Tarvainen and Valpola, 2017; Athiwaratkun et al., 2019), which is complementary to the first strand. We focus on the cross-lingual setting, where consistency regularization has not been fully explored.

Methods
Conventional cross-lingual fine-tuning trains a pre-trained language model on the source language and directly evaluates it on other languages, a setting also known as zero-shot cross-lingual fine-tuning. Specifically, given a training corpus D in the source language (typically English) and a model f (·; θ) that predicts task-specific probability distributions, we define the loss of cross-lingual fine-tuning as:

L_task(D, θ) = Σ_{x∈D} ℓ(f(x; θ), G(x))

where G(x) denotes the ground-truth label of example x, and ℓ(·, ·) is the loss function of the downstream task. Apart from vanilla cross-lingual fine-tuning on the source language, recent work shows that data augmentation helps improve performance on the target languages. For example, Conneau and Lample (2019) add translated examples to the training set for better cross-lingual transfer. Let A(·) be a cross-lingual data augmentation strategy (such as code-switch substitution) and D_A = D ∪ {A(x) | x ∈ D} the augmented training corpus; the fine-tuning loss is then L_task(D_A, θ). Notice that it is non-trivial to apply some augmentations to token-level tasks directly. For instance, in part-of-speech tagging, the token-level labels cannot be directly projected from a source sentence onto its translation.

XTUNE: Cross-Lingual Fine-Tuning with Consistency Regularization
We propose to improve cross-lingual fine-tuning with two consistency regularization methods, so that we can effectively leverage cross-lingual data augmentations.

Example Consistency Regularization
In order to encourage consistent predictions for an example and its semantically equivalent augmentation, we introduce example consistency regularization, defined as follows:

R_1(D, θ) = Σ_{x∈D} KL_S(f(x; θ), f(A(x); θ))
KL_S(P, Q) = KL(stopgrad(P) || Q) + KL(stopgrad(Q) || P)

where KL_S(·) is the symmetrical Kullback-Leibler divergence. The regularizer encourages the predicted distributions f(x; θ) and f(A(x); θ) to agree with each other. The stopgrad(·) operation 2 is used to stop back-propagating gradients, and is also employed by Jiang et al. (2020). The ablation studies in Section 4.2 empirically show that the operation improves fine-tuning performance. 2 Implemented by .detach() in PyTorch.
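The symmetric-KL regularizer above can be sketched in a few lines. This is a minimal NumPy illustration (not the paper's implementation); the `p_orig`/`p_aug` distributions are made-up examples, and stopgrad is a no-op here because NumPy has no autograd — the comment marks where `.detach()` would apply in PyTorch.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for discrete probability distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def symmetric_kl_with_stopgrad(p, q):
    # KL_S(P, Q) = KL(stopgrad(P) || Q) + KL(stopgrad(Q) || P).
    # stopgrad only affects which argument receives gradients during
    # back-propagation; the value itself is the plain symmetric KL.
    return kl(p, q) + kl(q, p)

p_orig = np.array([0.7, 0.2, 0.1])   # prediction on the original input x
p_aug = np.array([0.6, 0.3, 0.1])    # prediction on the augmented input A(x)
r1 = symmetric_kl_with_stopgrad(p_orig, p_aug)
```

The regularizer is zero exactly when the two predictions agree, and grows as they diverge.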

Model Consistency Regularization
While example consistency regularization operates at the example level, we propose model consistency to further regularize model training at the corpus level. The regularization proceeds in two stages. First, we obtain a fine-tuned model θ* on the training corpus D:

θ* = argmin_θ L_task(D, θ)

In the second stage, we keep the parameters θ* fixed. The regularization term is defined as:

R_2(D_A, θ) = Σ_{x∈D_A} KL(f(x; θ*) || f(x; θ))

where D_A is the augmented training corpus and KL(·) is the Kullback-Leibler divergence. For each example x of the augmented training corpus D_A, model consistency regularization encourages the prediction f(x; θ) to be consistent with f(x; θ*).
The regularizer enforces the corpus-level consistency between the distributions learned by two models.
A less obvious advantage of model consistency regularization is its flexibility with respect to data augmentation strategies. For the example of part-of-speech tagging, even though the labels cannot be directly projected from an English sentence to its translation, we are still able to employ the regularizer. Because the term R 2 is applied to the same example x ∈ D A , we can always align the token-level predictions of the models θ and θ*.

Full XTUNE Fine-Tuning
As shown in Figure 1, we combine example consistency regularization R 1 and model consistency regularization R 2 into a two-stage fine-tuning process. Formally, we fine-tune a model with R 1 in the first stage:

θ* = argmin_θ L_task(D, θ) + λ_1 R_1(D, θ; A*)

where the parameters θ* are kept fixed for R 2 in the second stage. The final loss is then computed as:

L(D_A, θ) = L_task(D_A, θ) + λ_1 R_1(D_A, θ; A′) + λ_2 R_2(D_A, θ; θ*)

where λ 1 and λ 2 are the corresponding weights of the two regularization methods, and D_A is built with augmentation A. Notice that the data augmentation strategies A, A′, and A* can be either different or the same; they are tuned as hyper-parameters.
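The stage-two objective described above (task loss plus the two regularizers) can be sketched per-example as follows. This is a hedged NumPy illustration, not the paper's code: `logits_star` stands for the frozen stage-one model θ*, the `lam1`/`lam2` weights are hypothetical, and the stopgrad inside R 1 is omitted because NumPy has no autograd.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL(p || q) for discrete probability distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def xtune_stage2_loss(logits, logits_aug, logits_star, label, lam1=1.0, lam2=1.0):
    # Task loss + lam1 * R1 (symmetric KL between predictions on the
    # original and augmented inputs) + lam2 * R2 (KL from the frozen
    # stage-one model theta* to the current model).
    p, p_aug, p_star = softmax(logits), softmax(logits_aug), softmax(logits_star)
    task = -float(np.log(p[label] + 1e-12))   # cross-entropy task loss
    r1 = kl(p, p_aug) + kl(p_aug, p)          # example consistency
    r2 = kl(p_star, p)                        # model consistency
    return task + lam1 * r1 + lam2 * r2

loss = xtune_stage2_loss(np.array([2.0, 0.0, 0.0]),
                         np.array([1.5, 0.5, 0.0]),
                         np.array([2.0, 0.0, 0.0]), label=0)
```

When the three predictions coincide, both regularizers vanish and the loss reduces to the plain task loss.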

Data Augmentation
We consider four types of data augmentation strategies in this work, which are shown in Figure 2. We aim to study the impact of different data augmentation strategies on cross-lingual transferability.

Subword Sampling
Representing a sentence with different subword sequences can be viewed as a data augmentation strategy (Kudo, 2018; Provilkov et al., 2020). We utilize XLM-R (Conneau et al., 2020a) as our pre-trained cross-lingual language model, which applies subword tokenization directly on raw text data using SentencePiece (Kudo and Richardson, 2018) with a unigram language model (Kudo, 2018). As one of our data augmentation strategies, we apply the on-the-fly subword sampling algorithm of the unigram language model to generate multiple subword sequences.
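The idea behind unigram-LM subword sampling can be illustrated on a toy scale. The vocabulary and its probabilities below are made up for the example; SentencePiece does the same over its full learned vocabulary with dynamic programming, with a sharpening exponent (here `alpha`) controlling how peaked the sampling distribution is.

```python
import math
import random

def segmentations(word, vocab):
    # Enumerate every way to split `word` into pieces from `vocab`.
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                out.append([piece] + rest)
    return out

def sample_tokenization(word, vocab, alpha=0.2, rng=random):
    # Unigram-LM subword sampling: score each segmentation by the product
    # of its piece probabilities, sharpen with `alpha`, then sample one
    # segmentation with probability proportional to its score.
    segs = segmentations(word, vocab)
    scores = [math.prod(vocab[p] for p in seg) ** alpha for seg in segs]
    r = rng.random() * sum(scores)
    for seg, score in zip(segs, scores):
        r -= score
        if r <= 0:
            return seg
    return segs[-1]

# Toy vocabulary with made-up unigram probabilities.
vocab = {"un": 0.10, "related": 0.05, "rel": 0.02, "ated": 0.02,
         "u": 0.01, "n": 0.01, "r": 0.01, "e": 0.01, "l": 0.01,
         "a": 0.01, "t": 0.01, "d": 0.01}
seg = sample_tokenization("unrelated", vocab)
```

Repeated calls yield different subword sequences for the same surface string, which is exactly the augmentation exploited here.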

Gaussian Noise
Most data augmentation strategies in NLP change the input text discretely; in contrast, we directly add random perturbation noise sampled from a Gaussian distribution to the input embedding layer as data augmentation. When combined with example consistency R 1 , the method is similar to stability training (Zheng et al., 2016), random perturbation training (Miyato et al., 2019), and the R3F method (Aghajanyan et al., 2020). We also explore the capability of Gaussian noise to generate new examples in continuous input space for conventional fine-tuning.
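The continuous-space augmentation is simple to state: perturb the embedding matrix of an input with zero-mean Gaussian noise. A minimal sketch (the shapes and the noise scale `sigma` are illustrative, not the paper's tuned values):

```python
import numpy as np

def add_gaussian_noise(embeddings, sigma=1e-5, rng=None):
    # Perturb input word embeddings with noise drawn from N(0, sigma^2),
    # producing a continuous-space augmented view of the same input.
    rng = np.random.default_rng(0) if rng is None else rng
    return embeddings + rng.normal(0.0, sigma, size=embeddings.shape)

emb = np.zeros((4, 8))   # 4 subwords, hypothetical embedding dim 8
noisy = add_gaussian_noise(emb, sigma=0.01)
```

Because the subword sequence itself is unchanged, predictions on the clean and noisy views align position by position.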

Code-Switch Substitution
Anchor points have been shown to be useful for improving cross-lingual transferability. Conneau et al. (2020b) analyzed the impact of anchor points in pre-training cross-lingual language models. Following Qin et al. (2020), we generate code-switch data in multiple languages as data augmentation: we randomly select words in the original source-language text and replace them with target-language words from bilingual dictionaries. Intuitively, this type of data augmentation explicitly helps pre-trained cross-lingual models align the multilingual vector space via the replaced anchor points.
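Dictionary-based code-switching amounts to swapping a sampled fraction of words for their translations. A hedged sketch — the `en_de` entries are hypothetical MUSE-style dictionary pairs, and the replacement ratio is a made-up hyper-parameter:

```python
import random

def code_switch(tokens, bilingual_dict, ratio=0.3, rng=random):
    # Replace a random subset of source-language words with translations
    # from a bilingual dictionary, yielding a code-switched sentence.
    out = list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in bilingual_dict]
    rng.shuffle(candidates)
    n_replace = max(1, int(len(tokens) * ratio))
    for i in candidates[:n_replace]:
        out[i] = rng.choice(bilingual_dict[out[i].lower()])
    return out

# Hypothetical English-German dictionary entries (MUSE-style word pairs).
en_de = {"cat": ["Katze"], "sat": ["saß"], "mat": ["Matte"]}
switched = code_switch("the cat sat on the mat".split(), en_de, rng=random.Random(0))
```

Words absent from the dictionary are simply left in the source language, which is why missing language pairs (e.g., the Swahili and Urdu cases discussed later) only reduce coverage rather than break the method.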

Machine Translation
Machine translation has been shown to be an effective data augmentation strategy (Singh et al., 2019) in the cross-lingual scenario. However, the ground-truth labels of translated data can be unavailable for token-level tasks (see Section 3), which rules out conventional fine-tuning on the augmented data. Meanwhile, our proposed model consistency R 2 can not only serve as consistency regularization but can also be viewed as a self-training objective that enables semi-supervised training on unlabeled target language translations.

Task Adaptation
We give instructions on how to apply XTUNE to various downstream tasks, i.e., classification, span extraction, and sequence labeling. By default, we use model consistency R 2 in full XTUNE. We describe the usage of example consistency R 1 as follows.

Classification
For the classification task, the model is expected to predict one distribution per example over n_label types, i.e., the model f (·; θ) should predict a probability distribution p_cls ∈ R^{n_label}. Thus we can directly use example consistency R 1 to regularize the consistency of the two distributions for all four of our data augmentation strategies.

Span Extraction
For the span extraction task, the model is expected to predict two distributions per example, p_start, p_end ∈ R^{n_subword}, indicating the probability of where the answer span starts and ends; n_subword denotes the length of the tokenized input text. For Gaussian noise, the subword sequence remains unchanged, so example consistency R 1 can be directly applied to the two distributions. Since subword sampling and code-switch substitution change n_subword, we control the ratio of words to be modified and apply example consistency R 1 on unchanged positions only. We do not use example consistency R 1 with machine translation because it is impossible to explicitly align the two distributions.
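One way the "unchanged positions only" restriction could look in code is below. This is our own sketch under an explicit assumption: we renormalize both span distributions over the shared unchanged positions before taking the symmetric KL — the paper states only that R 1 is applied on unchanged positions, not how the mass elsewhere is handled.

```python
import numpy as np

def masked_symmetric_kl(p, q, unchanged, eps=1e-12):
    # Restrict both span distributions to positions left unchanged by the
    # augmentation, renormalize over that subset, then take symmetric KL.
    # (Renormalizing over the shared positions is our assumption here.)
    p_u = p[unchanged] / (p[unchanged].sum() + eps)
    q_u = q[unchanged] / (q[unchanged].sum() + eps)
    fwd = np.sum(p_u * (np.log(p_u + eps) - np.log(q_u + eps)))
    bwd = np.sum(q_u * (np.log(q_u + eps) - np.log(p_u + eps)))
    return float(fwd + bwd)

p_start = np.array([0.5, 0.3, 0.1, 0.1])   # start distribution, original input
q_start = np.array([0.4, 0.4, 0.1, 0.1])   # start distribution, augmented input
unchanged = np.array([0, 2, 3])            # positions not touched by A(x)
r1_start = masked_symmetric_kl(p_start, q_start, unchanged)
```

The same masked term would be computed for the end distribution.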

Sequence Labeling
Recent pre-trained language models generate subword-level representations. For sequence labeling tasks, these models predict label distributions on each word's first subword. Therefore, the model is expected to predict n_word probability distributions per example over n_label types. Unlike span extraction, subword sampling, code-switch substitution, and Gaussian noise do not change n_word, so these three data augmentation strategies do not affect the usage of example consistency R 1 . Although word alignment is a possible way to map predicted label distributions between translation pairs, the alignment process would introduce additional noise. Therefore, we do not employ machine translation as data augmentation for example consistency R 1 .
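Selecting each word's first subword is a small bookkeeping step worth making concrete. The sketch below assumes a per-subword `word_ids` mapping (the word-index list that common subword tokenizers expose, with `None` for special tokens) — a hypothetical helper, not the paper's code:

```python
def first_subword_indices(word_ids):
    # `word_ids` gives, for each subword position, the index of the word
    # it belongs to (None for special tokens). Return the position of each
    # word's first subword, where the label distribution is predicted.
    seen, indices = set(), []
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            indices.append(pos)
    return indices

# "unrelated words" -> [CLS] un related words [SEP]
positions = first_subword_indices([None, 0, 0, 1, None])
```

Because the augmentations above preserve the number of words, the lists of first-subword positions for the original and augmented inputs have equal length and can be paired directly in R 1 .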

Experiment Setup
Datasets For our experiments, we select three types of cross-lingual understanding tasks from the XTREME benchmark (Hu et al., 2020).

Fine-Tuning Settings We consider two typical fine-tuning settings from Conneau et al. (2020a) and Hu et al. (2020) in our experiments: (1) cross-lingual transfer: the models are fine-tuned on English training data without translations available, and directly evaluated on different target languages; (2) translate-train-all: translation-based augmentation is available, and the models are fine-tuned on the concatenation of the English training data and its translations in all target languages. Since the official XTREME repository 3 does not provide translated target language data for POS and NER, we use Google Translate to obtain translations for these two datasets.

Implementation Details
We utilize XLM-R (Conneau et al., 2020a) as our pre-trained cross-lingual language model. The bilingual dictionaries we use for code-switch substitution are from MUSE (Lample et al., 2018). 4 We ignore languages that cannot be found in MUSE, since other bilingual dictionaries might be of poorer quality. For the POS dataset, we use an average-pooling strategy over subwords to obtain word representations, since part-of-speech depends on different parts of a word, depending on the language. We tune the hyper-parameters and select the model with the best average results over the development sets of all languages. Two datasets do not provide multilingual development sets. For XQuAD, we tune the hyper-parameters on the development set of MLQA, since the two share the same training set and have a higher degree of language overlap. For TyDiQA-GoldP, we use the English test set as the development set. To make a fair comparison, the ratio of data augmentation in D A is set to 1.0 throughout. The detailed hyper-parameters are shown in the supplementary document.

Table 2: Ablation studies on the XTREME benchmark. All numbers are averaged over five random seeds.

are unavailable in these two datasets. To make a fair comparison in the translate-train-all setting, we augment the English training corpus with target language translations when fine-tuning with only example consistency R 1 . Otherwise, we only use the English training corpus in the first stage, as shown in Figure 1(a). Compared to XTUNE, the performance drop on the two classification datasets under this setting is relatively small, since R 1 can be directly applied between translation pairs in any languages. However, performance is significantly degraded on the three question answering datasets, where we cannot align the predicted distributions between translation pairs in R 1 ; we use subword sampling as the data augmentation strategy in R 1 in this situation. Fine-tuning with only model consistency R 2 degrades the overall performance by 1.1 points. These results demonstrate that the two consistency regularization methods complement each other. Besides, we observe that removing stopgrad degrades the overall performance by 0.5 points.

Table 3 provides results for each language on the XNLI dataset. For the cross-lingual transfer setting, we utilize code-switch substitution as data augmentation for both example consistency R 1 and model consistency R 2 . We utilize all the bilingual dictionaries except English to Swahili and English to Urdu, which MUSE does not provide. Results show that our method outperforms all baselines on every language, even on Swahili (+2.2 points) and Urdu (+5.4 points), indicating that our method generalizes to low-resource languages even without corresponding machine translation systems or bilingual dictionaries.
For the translate-train-all setting, we utilize machine translation as data augmentation for both example consistency R 1 and model consistency R 2 . We improve the XLM-R large baseline by +2.2 points on average, and still gain +0.9 points on average over FILTER. It is worth mentioning that we do not need the corresponding English translations during inference. Complete results on other datasets are provided in the supplementary document.

Analysis
It is better to employ data augmentation for consistency regularization than for conventional fine-tuning. As shown in Table 4, compared to employing data augmentation for conventional fine-tuning (Data Aug.), our regularization methods (XTUNE R 1 , XTUNE R 2 ) consistently improve model performance under all four data augmentation strategies. Since there is no labeled data for translations in POS, and example consistency R 1 suffers from the distribution alignment issue, the results for Data Aug. and XTUNE R 1 on POS, as well as XTUNE R 1 on MLQA, are unavailable when machine translation is utilized as data augmentation. We observe that Data Aug. can enhance the overall performance on coarse-grained tasks like XNLI, while our methods further improve the results. However, Data Aug. even degrades performance on fine-grained tasks like MLQA and POS. In contrast, our two consistency regularization methods improve performance by a large margin (e.g., for MLQA under code-switch data augmentation, Data Aug. decreases the baseline by 1.2 points, while XTUNE R 1 increases it by 2.6 points). We give detailed instructions on how to choose data augmentation strategies for XTUNE in the supplementary document.
XTUNE improves cross-lingual retrieval. We fine-tune the models on XNLI with different settings and compare their performance on two cross-lingual retrieval datasets. Following Chi et al. (2020) and Hu et al. (2020), we utilize representations averaged over the hidden states of layer 8 of XLM-R base . As shown in Table 5, we observe a significant improvement from the translate-train-all baseline to fine-tuning with only example consistency R 1 ; this suggests that regularizing the task-specific outputs of translation pairs to be consistent also encourages the model to generate language-invariant representations. XTUNE only slightly improves upon this setting, indicating that R 1 between translation pairs is the most important factor for improving the cross-lingual retrieval task.
XTUNE improves decision boundaries as well as the ability to generate language-invariant representations. As shown in Figure 3, we present t-SNE visualization of examples from the XNLI development set under three different settings. We observe the model fine-tuned with XTUNE significantly improves the decision boundaries of different labels. Besides, for an English example and its translations in other languages, the model fine-tuned with XTUNE generates more similar representations compared to the two baseline models. This observation is also consistent with the cross-lingual retrieval results in Table 5.

Conclusion
In this work, we present a cross-lingual fine-tuning framework XTUNE to make better use of data augmentation. We propose two consistency regularization methods that encourage the model to make consistent predictions for an example and its semantically equivalent data augmentation. We explore four types of cross-lingual data augmentation strategies. We show that both example and model consistency regularization considerably boost the performance compared to directly fine-tuning on data augmentations. Meanwhile, model consistency regularization enables semi-supervised training on the unlabeled target language translations. XTUNE combines the two regularization methods, and the experiments show that it can improve the performance by a large margin on the XTREME benchmark.

C Results for Each Dataset and Language
We provide detailed results for each dataset and language below. We compare our method against XLM-R large for the cross-lingual transfer setting and FILTER (Fang et al., 2020) for the translate-train-all setting.

D How to Select Data Augmentation Strategies in XTUNE
We give instructions on selecting a proper data augmentation strategy depending on the corresponding task.

D.1 Classification
The two distributions in example consistency R 1 can always be aligned. Therefore, we recommend using machine translation as data augmentation if machine translation systems are available. Otherwise, the priority of our data augmentation strategies is code-switch substitution, then subword sampling, then Gaussian noise.

D.2 Span Extraction
The two distributions in example consistency R 1 cannot be aligned between translation pairs. Therefore, it is impossible to use machine translation as data augmentation in example consistency R 1 . We prefer code-switch substitution when applying example consistency R 1 individually. However, when the training corpus is augmented with translations, since bilingual dictionaries between arbitrary language pairs may not be available, we recommend using subword sampling in example consistency R 1 .

D.3 Sequence Labeling
Similar to span extraction, the two distributions in example consistency R 1 cannot be aligned between translation pairs. Therefore, we do not use machine translation in example consistency R 1 . Unlike classification and span extraction, sequence labeling requires finer-grained information and is more sensitive to noise. We found code-switch substitution to be worse than subword sampling as data augmentation in both example consistency R 1 and model consistency R 2 ; it can even degrade performance for certain hyper-parameters. Thus we recommend using subword sampling in example consistency R 1 , and using machine translation to augment the English training corpus if machine translation systems are available, otherwise subword sampling.

E Cross-Lingual Transfer Gap
As shown in Table 9, the cross-lingual transfer gap is reduced under all four data augmentation strategies. Meanwhile, we observe that machine translation and code-switch substitution achieve a smaller cross-lingual transfer gap than the other two data augmentation methods. This suggests that data augmentation methods incorporating cross-lingual knowledge yield a greater improvement in cross-lingual transferability. Although code-switch substitution significantly reduces the transfer gap on XNLI, the improvement is relatively small on POS and MLQA under the cross-lingual transfer setting, indicating that noisy code-switch substitution can harm cross-lingual transferability on finer-grained tasks.