UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on three diverse zero-resource cross-lingual transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to its success.


Introduction
Self-supervised learning in the form of pretrained language models (LM) has been the driving force in developing state-of-the-art NLP systems in recent years. These methods typically follow two basic steps, where a supervised task-specific finetuning follows a large-scale LM pretraining (Radford et al., 2019). However, getting labeled data for every target task in every target language is difficult, especially for low-resource languages.
Recently, the pretrain-finetune paradigm has also been extended to multi-lingual setups to train effective multi-lingual models that can be used for zero-shot cross-lingual transfer. Jointly trained deep multi-lingual LMs like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) coupled with supervised fine-tuning in the source language have been quite successful in transferring linguistic and task knowledge from one language to another without using any task label in the target language. The joint pretraining with multiple languages allows these models to generalize across languages.
Despite their effectiveness, recent studies (Pires et al., 2019; K et al., 2020) have also highlighted one crucial limiting factor for successful cross-lingual transfer. They all agree that the cross-lingual generalization ability of the model is limited by the (lack of) structural similarity between the source and target languages. For example, for transferring mBERT from English, K et al. (2020) report about a 23.6% accuracy drop in Hindi (structurally dissimilar) compared to a 9% drop in Spanish (structurally similar) in cross-lingual natural language inference (XNLI). The difficulty of transfer is further exacerbated if the (dissimilar) target language is low-resourced, as the joint pretraining step may not have seen many instances from this language in the first place. In our experiments (§3.2) on cross-lingual NER (XNER), we report F1 reductions of 28.3% in Urdu and 30.4% in Burmese for XLM-R, which is trained on a much larger multilingual dataset than mBERT.
One attractive way to improve cross-lingual generalization is to perform data augmentation (Simard et al., 1998), and train the model on examples that are similar but different from the labeled data in the source language. Formalized by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001), such data augmentation methods have shown impressive results in vision (Zhang et al., 2018; Berthelot et al., 2019). These methods enlarge the support of the training distribution by generating new data points from a vicinity distribution around each training example. For images, the vicinity of a training image can be defined by a set of operations like rotation and scaling, or by linear mixtures of features and labels (Zhang et al., 2018). However, when it comes to text, such unsupervised augmentation methods have rarely been successful. The main reason is that unlike images, linguistic units are discrete, and a smooth change in their embeddings may not result in a plausible linguistic unit with a similar meaning.
In NLP, to the best of our knowledge, the most successful augmentation method has so far been back-translation (Sennrich et al., 2016) which paraphrases an input sentence through round-trip translation. However, it requires parallel data to train effective machine translation systems, acquiring which can be more expensive for low-resource languages than annotating the target language data. Furthermore, back-translation is only applicable in a supervised setup and to tasks where it is possible to find the alignments between the original labeled entities and the back-translated entities, such as in question answering (Yu et al., 2018). Other related work includes contextual augmentation (Kobayashi, 2018), conditional BERT (Wu et al., 2018) and AUG-BERT (Shi et al., 2019). These methods use a constrained augmentation that alters a pretrained LM to a label-conditional LM for a specific task. Since they rely on labels, their application is limited by the availability of enough task labels.
In this work, we propose UXLA, a robust unsupervised cross-lingual augmentation framework for improving cross-lingual generalization of multilingual LMs. UXLA augments data from the unlabeled training examples in the target language as well as from the virtual input samples generated from the vicinity distribution of the source and target language sentences. With the augmented data, it performs simultaneous self-learning with an effective distillation strategy to learn a strongly adapted cross-lingual model from noisy (pseudo) labels for the target language task. We propose novel ways to generate virtual sentences using a multilingual masked LM (Conneau et al., 2020), and get reliable task labels by simultaneous multilingual co-training. This co-training employs a two-stage co-distillation process to ensure robust transfer to dissimilar and/or low-resource languages.
We validate the effectiveness and robustness of UXLA by performing extensive experiments on three diverse zero-resource cross-lingual transfer tasks: XNER, XNLI, and PAWS-X, which pose different sets of challenges, and across many (14 in total) language pairs comprising languages that are similar, dissimilar, and/or low-resourced. UXLA yields impressive results on XNER, setting SoTA in all tested languages and outperforming the baselines by a good margin. The relative gains for UXLA are particularly higher for structurally dissimilar and/or low-resource languages: 28.54%, 16.05%, and 9.25% absolute improvements for Urdu, Burmese, and Arabic, respectively. For XNLI, with only 5% labeled data in the source, it gets comparable results to the baseline that uses all the labeled data, and surpasses the standard baseline by 2.55% on average when it uses all the labeled data in the source. We also have similar findings in PAWS-X. We provide a comprehensive analysis of the factors that contribute to UXLA's performance. We open-source our framework at https://ntunlpsg.github.io/project/uxla/ .

UXLA Framework
While recent cross-lingual transfer learning efforts have relied almost exclusively on multi-lingual pretraining and zero-shot transfer of a fine-tuned source model, we believe there is a great potential for more elaborate methods that can leverage the unlabeled data better. Motivated by this, we present UXLA, our unsupervised data augmentation framework for zero-resource cross-lingual task adaptation. Figure 1 gives an overview of UXLA.
Let D_s = (X_s, Y_s) and D_t = (X_t) denote the training data for a source language s and a target language t, respectively. UXLA augments data from various origins at different stages of training. In the initial stage (epoch 1), it uses the augmented training samples from the target language (D_t) along with the original source (D_s). In later stages (epochs 2-3), it uses vicinal sentences generated from the vicinity distribution of source and target examples: ϑ(x̃_n^s | x_n^s) and ϑ(x̃_n^t | x_n^t), where x_n^s ∼ X_s and x_n^t ∼ X_t. It performs self-training on the augmented data to acquire the corresponding pseudo labels. To avoid the confirmation bias of self-training, where the model accumulates its own errors, it simultaneously trains three task models to generate virtual training data through data augmentation and filtering of potential label noise via multi-epoch co-teaching (Zhou and Li, 2005).
In each epoch, the co-teaching process first performs co-distillation, where two peer task models are used to select "reliable" training examples to train the third model. The selected samples with pseudo labels are then added to the third task model's training data by taking the agreement of the other two models, a process we refer to as co-guessing. The co-distillation and co-guessing mechanisms ensure robustness of UXLA to out-of-domain distributions that can occur in a multilingual setup, e.g., due to a structurally dissimilar and/or low-resource target language. Algorithm 1 gives a pseudocode of the overall training method. Each of the task models in UXLA is an instance of XLM-R fine-tuned on the source language task (e.g., English NER), whereas the pretrained masked LM parameterized by θ_mlm (i.e., before fine-tuning) is used to define the vicinity distribution ϑ(x̃_n | x_n, θ_mlm) around each selected example x_n. In the following, we describe the steps in Algorithm 1.

Figure 1: Training flow of UXLA. After training the base task models θ(1), θ(2), and θ(3) on source labeled data D_s (WarmUp), we use two of them (θ(j), θ(k)) to pseudo-label and co-distill the unlabeled target language data (D_t). A pretrained LM (Gen-LM) is used to generate new vicinal samples for both source and target languages, which are also pseudo-labeled and co-distilled using the two task models (θ(j), θ(k)) to generate D̃_s and D̃_t. The third model θ(l) is then progressively trained on these datasets: {D_s, D_t} in epoch 1, D̃_t in epoch 2, and all in epoch 3.

Warm-up: Training Task Models
We first train three instances of the XLM-R model (θ (1) , θ (2) , θ (3) ) with an additional task-specific linear layer on the source language (English) labeled data. Each model has the same architecture (XLM-R large) but is initialized with different random seeds. For token-level prediction tasks (e.g., NER), the token-level representations are fed into the classification layer, whereas for sentence-level tasks (e.g., XNLI), the [CLS] representation is used as input to the classification layer.
Training with confidence penalty Our goal is to train the task models so that they can be used reliably for self-training on a target language that is potentially dissimilar and low-resourced. In such situations, an overly confident (overfitted) model may produce more noisy pseudo labels, and the noise will then accumulate as the training progresses. Overly confident predictions may also impose difficulties on our distillation methods (§2.3) in isolating good samples from noisy ones. However, training with the standard cross-entropy (CE) loss may result in overfitted models that produce overly confident predictions (low entropy), especially when the class distribution is not balanced. We address this by adding a negative entropy term −H to the CE loss as follows.
L(θ) = − Σ_{c=1}^{C} y_c log p_θ^c(x) − β H(p_θ(x)),  with  H(p_θ(x)) = − Σ_{c=1}^{C} p_θ^c(x) log p_θ^c(x)   (1)

where x is the representation that goes to the output layer, β is the penalty weight, and y_c and p_θ^c(x) are respectively the ground-truth label and model prediction with respect to class c. Such regularization of the output distribution has been shown to be effective for training large models (Pereyra et al., 2017). We also report significant gains with the confidence penalty in §3. Appendix B shows visualizations of why the confidence penalty is helpful for distillation.
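As an illustration, the CE loss with the confidence penalty can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the weight `beta` on the entropy term is a hypothetical value (in practice it would be tuned on dev data):

```python
import numpy as np

def confidence_penalty_loss(logits, labels, beta=0.1):
    """Cross-entropy minus beta times the prediction entropy.

    logits: (N, C) raw class scores; labels: (N,) integer class ids.
    Subtracting beta * H penalizes low-entropy (over-confident)
    output distributions, discouraging overfitting.
    """
    # numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # standard cross-entropy on the gold labels
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    # mean per-example entropy of the predicted distribution
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return ce - beta * entropy
```

Since the entropy term is subtracted, any non-degenerate prediction lowers the loss relative to plain CE, which is exactly the pressure away from over-confidence that the warm-up step needs.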

Sentence Augmentation
Our augmented sentences come from two different sources: the original target language samples X_t, and the virtual samples generated from the vicinity distribution of the source and target samples: ϑ(x̃_n^s | x_n^s, θ_mlm) and ϑ(x̃_n^t | x_n^t, θ_mlm) with x_n^s ∼ X_s and x_n^t ∼ X_t. It has been shown that contextual LMs pretrained on large-scale datasets capture useful linguistic features and can be used to generate fluent grammatical texts (Hewitt and Manning, 2019). We use the XLM-R masked LM (Conneau et al., 2020) as our vicinity model θ_mlm, which is trained on massive multilingual corpora (2.5 TB of CommonCrawl data in 100 languages).

Algorithm 1 UXLA: a robust unsupervised data augmentation framework for cross-lingual NLP
Input: source (s) and target (t) language datasets: D_s = (X_s, Y_s), D_t = (X_t); task models: θ(1), θ(2), θ(3); pretrained masked LM θ_mlm; mask ratio P; diversification factor δ; sampling factor α; and distillation factor η
Output: models trained on augmented data
1: θ(1), θ(2), θ(3) = WARMUP(D_s, θ(1), θ(2), θ(3))  ▷ warm up with confidence penalty
2: for e ∈ [1 : 3] do  ▷ e denotes epoch
In order to generate samples around each selected example, we first randomly choose P % of the input tokens. Then we successively (one at a time) mask one of the chosen tokens and ask XLM-R masked LM to predict a token in that masked position, i.e., compute ϑ(x m |x, θ mlm ) with m being the index of the masked token. For a specific mask, we sample S candidate words from the output distribution, and generate novel sentences by following one of the two alternative approaches.
(i) Successive max In this approach, we take the most probable output token (S = 1) at each prediction step. A new sentence is constructed from the P% newly generated tokens. We generate δ (diversification factor) virtual samples for each original example x by randomly masking P% of the tokens each time.
(ii) Successive cross In this approach, we divide each original (multi-sentence) sample x into two parts and use successive max to create two sets of augmented samples of size δ1 and δ2, respectively. We then take the cross product of these two sets to generate δ1 × δ2 augmented samples.
Augmentation of sentences through successive max or cross is carried out within the GEN-LM (generate via LM) module in Algorithm 1. For tasks involving a single sequence (e.g., XNER), we directly use successive max. Pairwise tasks like XNLI and PAWS-X have pairwise dependencies: dependencies between a premise and a hypothesis in XNLI or dependencies between a sentence and its possible paraphrase in PAWS-X. To model such dependencies, we use successive cross, which uses cross-product of two successive max applied independently to each component.
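The successive-max procedure above can be sketched schematically as follows. This is a sketch under assumptions, not the paper's code: `predict_masked` is a hypothetical stand-in for the XLM-R masked LM (with the real model one would insert `<mask>` at position i and take the argmax prediction). Because positions are re-predicted one at a time, each new token conditions on the replacements made before it:

```python
import random

def successive_max(tokens, predict_masked, mask_ratio=0.15):
    """Replace P% of the tokens, one at a time (successive max).

    predict_masked(tokens, i) returns the most probable token for
    position i when that position is masked.
    """
    tokens = list(tokens)
    k = max(1, int(len(tokens) * mask_ratio))
    positions = random.sample(range(len(tokens)), k)
    for i in positions:  # mask and predict successively, one position at a time
        tokens[i] = predict_masked(tokens, i)
    return tokens

def augment(tokens, predict_masked, delta=3, mask_ratio=0.15):
    """Generate delta vicinal samples, re-sampling the masked positions each time."""
    return [successive_max(tokens, predict_masked, mask_ratio)
            for _ in range(delta)]
```

Successive cross would then split a two-part input, run `augment` on each half independently, and pair every output of the first half with every output of the second.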

Co-labeling through Co-distillation
Due to the discrete nature of text, VRM-based augmentation methods that are successful for images, such as MixMatch (Berthelot et al., 2019), which generates new samples and their labels by simple linear interpolation, have not been successful in NLP. The meaning of a sentence can change entirely even with minor variations of the original sentence. For example, consider the following example generated by our vicinity model.
Original: EU rejects German call to boycott British lamb.
Masked: <mask> rejects German call to boycott British lamb.
XLM-R: Trump rejects German call to boycott British lamb.
Here, EU is an Organization whereas the newly predicted word Trump is a Person (a different entity type). Therefore, we need to relabel the augmented sentences regardless of whether the original sentence has labels (source) or not (target). However, the relabeling process can induce noise, especially for dissimilar/low-resource languages, since the base task model may not be fully adapted in the early training stages. We propose a two-stage sample distillation process to filter out noisy augmented data.
Stage 1: Distillation by single-model The first stage of distillation involves predictions from a single model, for which we propose two alternatives: (i) Distillation by model confidence: In this approach, we select samples based on the model's prediction confidence. This method is similar in spirit to the selection method proposed by Ruder and Plank (2018a). For sentence-level tasks (e.g., XNLI), the model produces a single class distribution for each training example. In this case, the model's confidence is computed by p* = max_{c∈{1...C}} p_θ^c(x). For token-level sequence labeling tasks (e.g., NER), the model's confidence is computed by p* = (1/T) Σ_{t=1}^{T} max_{c∈{1...C}} p_θ^c(x_t), where T is the length of the sequence. The distillation is then done by selecting the top η% of samples with the highest confidence scores.
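The token-level confidence selection can be sketched as follows. This is a minimal sketch, assuming the sequence confidence is the mean over tokens of the max class probability (one plausible reading of the formula above):

```python
import numpy as np

def distill_by_confidence(token_probs, eta=0.6):
    """Keep the top eta fraction of samples by model confidence.

    token_probs: list of (T_i, C) arrays of per-token class
    distributions, one array per sequence.
    Returns the indices of the retained samples.
    """
    # per-sequence confidence: mean of per-token max probabilities
    conf = np.array([p.max(axis=1).mean() for p in token_probs])
    k = max(1, int(round(eta * len(conf))))
    # indices sorted by descending confidence, truncated to the top k
    return np.argsort(-conf)[:k]
```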
(ii) Sample distillation by clustering: We propose this method based on the finding that large neural models tend to learn good samples faster than noisy ones, leading to a lower loss for good samples and a higher loss for noisy ones (Han et al., 2018; Arazo et al., 2019). We use a 1-D two-component Gaussian Mixture Model (GMM) to model the per-sample loss distribution and cluster the samples based on their goodness. GMMs provide flexibility in modeling the sharpness of a distribution and can easily be fit using Expectation-Maximization (EM) (see Appendix C). The loss is computed based on the pseudo labels predicted by the model. For each sample x, its goodness probability is the posterior probability p(z = g | x, θ_GMM), where g is the component with the smaller mean loss. Here, the distillation hyperparameter η is the posterior probability threshold based on which samples are selected.
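The clustering alternative can be sketched with a tiny hand-rolled EM fit for a 1-D two-component GMM. This is an illustrative sketch, not the paper's implementation (a library such as scikit-learn's `GaussianMixture` would serve the same purpose):

```python
import numpy as np

def gmm_goodness(losses, eta=0.5, iters=50):
    """Boolean mask of 'good' samples via a 1-D two-component GMM.

    Fits two Gaussians to the per-sample losses with EM and keeps
    samples whose posterior under the lower-mean (clean) component
    is at least eta.
    """
    x = np.asarray(losses, float)
    mu = np.array([x.min(), x.max()])        # initialize means at the extremes
    var = np.array([x.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        d = (x[:, None] - mu) ** 2
        dens = pi * np.exp(-d / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means, variances, and mixing weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * d).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    g = int(np.argmin(mu))                   # lower-mean component = clean cluster
    return r[:, g] >= eta
```

With a clearly bimodal loss distribution, the low-loss samples end up with posterior goodness near 1 and pass the threshold, while high-loss (noisy) samples are filtered out.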
Stage 2: Distillation by model agreement In the second stage of distillation, we select samples by taking the agreement (co-guess) of two different peer models θ(j) and θ(k) to train the third model θ(l). Formally, a sample x survives this stage only if the two peers agree on its label: AGREEMENT(x) = 1[ŷ^(j)(x) = ŷ^(k)(x)], where ŷ^(j)(x) denotes the label (sequence) predicted by θ(j).
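The agreement rule reduces to a simple filter. A minimal sketch (label sequences here are plain lists; with the real models these would be the argmax predictions of θ(j) and θ(k)):

```python
def co_guess(preds_j, preds_k):
    """Second-stage distillation: keep indices where two peer models agree.

    preds_j, preds_k: parallel lists of predicted labels, where each
    element is a label sequence (token-level tasks) or a single label.
    A sample survives only if the peers assign identical labels.
    """
    return [i for i, (a, b) in enumerate(zip(preds_j, preds_k)) if a == b]
```

The surviving samples, with the agreed-upon pseudo labels, are what the third model θ(l) is trained on.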

Data Samples Manipulation
UXLA uses multi-epoch co-teaching. It uses D_s and D_t in the first epoch. In epoch 2, it uses D̃_t (target virtual), and finally it uses all four datasets: D_s, D_t, D̃_t, and D̃_s (line 22 in Algorithm 1). The datasets used at different stages can be of different sizes. For example, the number of augmented samples in D̃_s and D̃_t grows polynomially with the successive cross masking method. Also, the co-distillation produces sample sets of variable sizes. To ensure that our model does not overfit on one particular dataset, we employ a balanced sampling strategy. For N datasets, we define the following multinomial distribution to sample from:

q_i = p_i^α / Σ_{j=1}^{N} p_j^α,  with  p_i = n_i / Σ_{k=1}^{N} n_k

where α is the sampling factor and n_i is the total number of samples in the i-th dataset. By tweaking α, we can control how many samples a dataset contributes to the mix.
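The balanced sampling distribution is a few lines of code. A minimal sketch (the same exponent-smoothing trick XLM-R uses for language sampling; α here is illustrative):

```python
import numpy as np

def dataset_sampling_probs(sizes, alpha=0.5):
    """Multinomial over datasets: q_i proportional to p_i**alpha,
    where p_i = n_i / sum_k n_k is the raw share of dataset i.

    alpha < 1 flattens the distribution, up-sampling small datasets
    relative to their raw share; alpha = 1 recovers proportional sampling.
    """
    p = np.asarray(sizes, float)
    p = p / p.sum()          # raw shares p_i
    q = p ** alpha           # smooth with the sampling factor
    return q / q.sum()       # renormalize to a distribution
```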

Experiments
We consider three tasks in the zero-resource cross-lingual transfer setting. We assume labeled training data only in English, and transfer the trained model to a target language. For all experiments, we report the mean score of the three models that use different seeds.

Tasks & Settings
XNER: We use the standard CoNLL datasets (Sang, 2002; Sang and Meulder, 2003) for English (en), German (de), Spanish (es), and Dutch (nl). We also evaluate on Finnish (fi) and Arabic (ar) datasets collected from Bari et al. (2020). Note that Arabic is structurally different from English, and Finnish is from a different language family. To show how the models perform on extremely low-resource languages, we experiment with three structurally different languages from WikiANN (Pan et al., 2017) with different (unlabeled) training data sizes: Urdu (ur, 20K training samples), Bengali (bn, 10K samples), and Burmese (my, 100 samples).
XNLI We use the standard dataset (Conneau et al., 2018). For a given pair of sentences, the task is to predict the entailment relationship between the two sentences, i.e., whether the second sentence (hypothesis) is an Entailment, Contradiction, or Neutral with respect to the first one (premise). We experiment with Spanish, German, Arabic, Swahili (sw), Hindi (hi), and Urdu.
PAWS-X The Paraphrase Adversaries from Word Scrambling Cross-lingual task (Yang et al., 2019) requires the models to determine whether two sentences are paraphrases. We evaluate on all the six (typologically distinct) languages: fr, es, de, Chinese (zh), Japanese (ja), and Korean (ko).
Evaluation setup Our goal is to adapt a task model from a source language distribution to an unknown target language distribution assuming no labeled data in the target. In this scenario, there might be two different distributional gaps: (i) the generalization gap for the source distribution, and (ii) the gap between the source and target language distribution. We wish to investigate our method in tasks that exhibit such properties. We use the standard task setting for XNER, where we take 100% samples from the datasets as they come from various domains and sizes without any specific bias. However, both XNLI and PAWS-X training data come with machine-translated texts in target languages. Thus, the data is parallel and lacks enough diversity (source and target come from the same domain). Cross-lingual models trained in this setup may pick up distributional bias (in the label space) from the source. Artetxe et al. (2020) also argue that the translation process can induce subtle artifacts that may have a notable impact on models.
Therefore, for XNLI and PAWS-X, we experiment with two different setups. First, to ensure distributional differences and non-parallelism, we use 5% of the training data from the source language and augment a different (non-parallel) 5% of the data for the target language. We used a different seed each time to retrieve this 5% data. Second, to compare with previous methods, we also evaluate on the standard 100% setup. The evaluation is done on the entire test set in both setups. We will refer to these two settings as 5% and 100%. More details about model settings are in Appendix D.

Results
XNER Table 1 reports the XNER results on the datasets from CoNLL and (Bari et al., 2020), where we also evaluate an ensemble by averaging the probabilities from the three models. We observe that after performing warm-up with conf-penalty ( §2.1), XLM-R performs better than mBERT on average by ∼3.8% for all the languages. UXLA gives absolute improvements of 3.76%, 4.34%, 6.94%, 8.31%, and 4.18% for es, nl, de, ar, and fi, respectively. Interestingly, it surpasses supervised LSTM-CRF for nl and de without using any target language labeled data. It also produces comparable results for es.
In Table 2, we report the results on the three low-resource languages from WikiANN. From these results and the results for ar and fi in Table 1, we see that UXLA is particularly effective for languages that are structurally dissimilar and/or low-resourced, especially when the base model is weak: 28.54%, 16.05%, and 9.25% absolute improvements for ur, my, and ar, respectively.
XNLI-5% From Table 3, we see that the performance of XLM-R trained on 5% data is surprisingly good compared to the model trained on full data (see XLM-R (our imp.)), lagging by only 5.6% on average. In our single GPU implementation of XNLI, we could not reproduce the reported results of Conneau et al. (2020). However, our results resemble the reported XLM-R results of XTREME (Hu et al., 2020). We consider XTREME as our standard baseline for XNLI-100%. We observe that with only 5% labeled data in the source, UXLA gets comparable results to the XTREME baseline that uses 100% labeled data (lagging behind by only ∼0.7% on avg.); even for ar and sw, we get 0.22% and 1.11% improvements, respectively. It surpasses the standard 5% baseline by 4.2% on average. Specifically, UXLA gets absolute improvements of 3.05%, 3.34%, 5.38%, 5.01%, 4.29%, and 4.12% for es, de, ar, sw, hi, and ur, respectively. Again, the gains are relatively higher for low-resource and/or dissimilar languages despite the base model being weak in such cases.
XNLI-100% Now, considering UXLA's performance on the full (100%) labeled source data in Table 3, we see that it achieves SoTA results for all of the languages with an absolute improvement of 2.55% on average from the XTREME baseline. Specifically, UXLA gets absolute improvements of 1.95%, 1.68%, 4.30%, 3.50%, 3.24%, and 1.65% for es, de, ar, sw, hi, and ur, respectively.

Analysis
In this section, we analyze UXLA by dissecting it and measuring the contribution of each of its components. For this, we use the XNER task and analyze the model based on the results in Table 1.

Analysis of distillation methods
Model confidence vs. clustering We first analyze the performance of our single-model distillation methods (§2.3) to see which of the two alternatives works better. From Table 5, we see that both perform similarly, with model confidence being slightly better. In our main experiments (Tables 1-4) and subsequent analysis, we use model confidence for distillation. However, we should not rule out the clustering method, as it gives a more general solution that can consider other distillation features (e.g., sequence length, language) beyond model prediction scores, which we did not explore in this paper.
Distillation factor η We next show the results for different values of the distillation factor η in Table 5. Here, 100% refers to the case where no single-model distillation based on model confidence is done. We notice that the best results for each of the languages are obtained for values other than 100%, which indicates that distillation is indeed an effective step in UXLA. See Appendix B for more analysis on η.
Two-stage distillation We now validate whether the second-stage distillation (distillation by model agreement) is needed. In Table 5, we also compare the results with model agreement (shown as ∩) to the results without using any agreement (φ). We observe better performance with model agreement in all cases on top of the single-model distillation, which validates its utility. Results with η = 100, Agreement = ∩ can be considered as the tri-training (Ruder and Plank, 2018b) baseline.

Figure 2 presents the effect of the different types of augmented data used in different epochs of our multi-epoch co-teaching framework. We observe that in every epoch, there is a significant boost in F1 scores for each of the languages. Arabic, being structurally dissimilar to English, has a lower base score, but the relative improvements brought by UXLA are higher for Arabic, especially in epoch 2, when it gets exposed to the target language virtual data (D̃_t) generated by the vicinity distribution.

Effect of Confidence Penalty & Ensemble
For all three tasks, we get reasonable improvements over the baselines by training with the confidence penalty (§2.1). Specifically, we get 0.56%, 0.74%, 1.89%, and 1.18% improvements in XNER, XNLI-5%, PAWS-X-5%, and PAWS-X-100%, respectively (Tables 1, 3, 4). The improvements in XNLI-100% are marginal and inconsistent, which we suspect is due to the balanced class distribution. From the results of ensemble models, we see that the ensemble boosts the baseline XLM-R. However, our regular UXLA still outperforms the ensemble baselines by a sizeable margin. Moreover, ensembling the trained models from UXLA further improves the performance. These comparisons confirm that the capability of UXLA through co-teaching and co-distillation goes beyond the ensemble effect.

Table 6 shows the robustness of the fine-tuned UXLA model on the XNER task. After fine-tuning on a specific target language, the F1 scores in English remain almost unchanged (see the first row). For some languages, UXLA adaptation to a different language also improves performance. For example, Arabic gets improvements from all UXLA-adapted models (compare 50.88 with the others in row 5). This indicates that the augmentation in UXLA does not overfit on a target language. More baselines, analyses, and visualizations are provided in the Appendix.

Related Work
Recent years have witnessed significant progress in learning multilingual pretrained models. Notably, mBERT (Devlin et al., 2019) extends (English) BERT by jointly training on 102 languages. XLM (Lample and Conneau, 2019) extends mBERT with causal LM and translation LM (using parallel data) objectives. Conneau et al. (2020) train the largest multilingual language model, XLM-R, with RoBERTa (Liu et al., 2019). Wu and Dredze (2019), Keung et al. (2019), and Pires et al. (2019) evaluate the zero-shot cross-lingual transferability of mBERT on several tasks and attribute its generalization capability to shared subword units. Pires et al. (2019) also found structural similarity (e.g., word order) to be another important factor for successful cross-lingual transfer. K et al. (2020), however, show that shared subwords have a minimal contribution; instead, the structural similarity between languages is more crucial for effective transfer.
Older data augmentation approaches relied on distributional clusters (Täckström et al., 2012). A number of recent methods have been proposed using contextualized LMs (Kobayashi, 2018; Wu et al., 2018; Shi et al., 2019; Ding et al., 2020; Liu et al., 2021). These methods rely on labels to perform label-constrained augmentation, and are thus not directly comparable with ours. Also, there are fundamental differences in the way we use the pretrained LM. Unlike them, our LM augmentation is purely unsupervised, and we do not perform any fine-tuning of the pretrained vicinity model. This disjoint characteristic gives our framework the flexibility to replace θ_mlm even with a better monolingual LM for a specific target language, which in turn makes UXLA extendable to stronger LMs that may come in the future. In a concurrent work (Mohiuddin et al., 2021), we propose a contextualized-LM-based data augmentation for neural machine translation and show its advantages over traditional back-translation, gaining improved performance in low-resource scenarios.

Conclusion
We propose a novel data augmentation framework, UXLA, for zero-resource cross-lingual task adaptation. It performs simultaneous self-training with data augmentation and unsupervised sample selection. With extensive experiments on three different cross-lingual tasks spanning many language pairs, we have demonstrated the effectiveness of UXLA. For the zero-resource XNER task, UXLA sets a new SoTA for all the tested languages. For both the XNLI and PAWS-X tasks, with only 5% labeled data in the source, UXLA gets comparable results to the baseline that uses 100% labeled data.

Is masked language model pre-training with cross-lingual training data from the task dataset useful? In Table 7, we perform language model fine-tuning of the XLM-R large model with multilingual sentences from the NER dataset and perform adaptation with only the English language. With the LM-finetuned XLM-R model, we did not see any significant increase in cross-lingual transfer. For Spanish and Arabic, the score even decreased, which indicates possible over-fitting. However, the robustness experiment in Table 6 (see sec 4.4 in the main paper) indicates that our proposed method does not overfit on the target language but rather augments new knowledge.

Need for the combination of co-teaching, co-distillation, and co-guessing? The combination of these helps to distill out the noisy samples better.
Efficiency of the method and extra costs for large-scale pretrained models It is common practice in model selection to train 3-5 disjoint LM-based task models (e.g., XLM-R on NER) with different random seeds and report the ensemble score or the score of the best (validation set) model. In contrast, UXLA uses 3 different models and trains them jointly, where the models assist each other through distillation and co-labeling. In that sense, the extra cost comes from distillation and co-labeling, which is not significant and is compensated by the significant improvements that UXLA offers.

B Visualization of confidence penalty

B.1 Effect of confidence penalty in classification
In Figure 3 (a-b), we present the effect of the confidence penalty (Eq. 1 in the main paper) in the target language (Spanish) classification on the XNER dev. data (i.e., after training on English NER). We show the class distribution from the final logits (on the target language) using t-SNE plots. From the figure, it is evident that the use of confidence penalty in the warm-up step makes the model more robust to unseen out-of-distribution target language data yielding better predictions, which in turn also provides a better prior for self-training with pseudo labels.

B.2 Effect of confidence penalty in loss distribution
Figures 3(c) and 3(d) present histograms of the per-sample loss (i.e., mean loss per sentence w.r.t. the pseudo labels) without and with the confidence penalty, respectively. Here, accurate-2 refers to sentences with at most two wrong NER labels, while sentences containing more than two errors are referred to as noisy samples. Without the confidence penalty, many noisy samples have a small loss, which is not desired. The figures also suggest that the confidence penalty helps to separate the clean samples from the noisy ones, either by clustering or by model confidence. Figures 4(a) and 4(b) plot the per-sample losses against sentence length on the x-axis; the y-axis represents the loss. As we can see, the losses are indeed more scattered when we train the model with the confidence penalty, which indicates higher per-sample entropy, as expected. We can also see that wrong predictions become more frequent as sentence length increases. Our distillation method should be able to filter out these noisy pseudo-labeled samples.
Finally, Figures 4(c) and 4(d) show the length distribution of all vs. the selected sentences (by distillation by model confidence) without and with the confidence penalty. Bari et al. (2020) show that cross-lingual NER inference is heavily dependent on the length distribution of the samples: in general, predictions on shorter samples are more accurate. However, selecting only short samples would lead to easy overfitting. From these plots, we observe that the confidence penalty also enables better distillation, as more sentences are selected (by the distillation procedure) from the lower end of the length distribution while still covering the entire range. This shows that training with the confidence penalty makes the model more robust.
In summary, comparing Figures 3(c-d) and 4(c-d), we conclude that training without the confidence penalty can make the model more prone to over-fitting, resulting in more noisy pseudo labels. Training with the confidence penalty not only improves pseudo-labeling accuracy but also helps the distillation methods perform better noise filtering.

C Details on distillation by clustering
One limitation of the confidence-based (single-model) distillation is that it does not consider task-specific information. Apart from classifier confidence, there could be other important features that distinguish a good sample from a noisy one. For example, for sequence labeling, sequence length can be an important feature, as models tend to make more mistakes (hence more noise) on longer sequences (Bari et al., 2020). One might also want to consider other features like fluency, which can be estimated by a pre-trained conditional LM like GPT (Radford et al., 2020). In the following, we introduce a clustering-based method that can use such additional features to separate good samples from bad ones.
EM training for two-component GMM. Let $x_i \in \mathbb{R}$ denote the loss for sample $i$ and $z_i \in \{0, 1\}$ denote its cluster id. We can write the 1d GMM model as:
$$p(x_i \mid \theta) = \sum_{k} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)$$
where $\theta_k = \{\mu_k, \sigma_k^2\}$ are the parameters of the $k$-th mixture component and $\pi_k = p(z_i = k)$ is the probability (weight) of the $k$-th component, with the condition $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$.
In EM, we optimize the expected complete-data log likelihood $Q(\theta, \theta^{t-1})$ defined as:
$$Q(\theta, \theta^{t-1}) = \sum_{i} \sum_{k} r_{i,k}(\theta^{t-1}) \log \big[ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \big] \quad (4)$$
where $r_{i,k}(\theta^{t-1})$ is the responsibility that cluster $k$ takes for sample $x_i$, computed in the E-step so that we can optimize $Q(\theta, \theta^{t-1})$ (Eq. 4) in the M-step. The E-step and M-step for a 1d GMM can be written as:
E-step:
$$r_{i,k} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{k'} \pi_{k'} \, \mathcal{N}(x_i \mid \mu_{k'}, \sigma_{k'}^2)}$$
M-step:
$$\pi_k = \frac{1}{N}\sum_i r_{i,k}, \quad \mu_k = \frac{\sum_i r_{i,k}\, x_i}{\sum_i r_{i,k}}, \quad \sigma_k^2 = \frac{\sum_i r_{i,k}\,(x_i - \mu_k)^2}{\sum_i r_{i,k}}$$
Inference. For a sample $x$, its goodness probability is the posterior probability $p(z = g \mid x, \theta)$, where $g \in \{0, 1\}$ is the component with the smaller mean loss. The distillation hyperparameter $\eta$ is the posterior-probability threshold above which samples are selected.
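The EM updates for the two-component case fit in a few lines. The sketch below is a self-contained pure-Python version; the function names are our own, and the initialization heuristic (centers at the extreme losses, shared overall variance) is an assumption rather than the paper's:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(losses, n_iter=100):
    """EM for a two-component 1d GMM over per-sample losses."""
    n = len(losses)
    mean = sum(losses) / n
    # Initialization heuristic: centers at the extremes, shared overall variance.
    mu = [min(losses), max(losses)]
    var = [sum((x - mean) ** 2 for x in losses) / n + 1e-6] * 2
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities r[i][k] proportional to pi_k * N(x_i | mu_k, var_k)
        r = []
        for x in losses:
            w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            z = sum(w)
            r.append([wk / z for wk in w])
        # M-step: re-estimate weights, means, and variances from responsibilities
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            pi[k] = nk / n
            mu[k] = sum(ri[k] * x for ri, x in zip(r, losses)) / nk
            var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, losses)) / nk + 1e-6
    return pi, mu, var

def goodness_prob(x, pi, mu, var):
    """Posterior p(z = g | x) for the component g with the smaller mean loss."""
    g = 0 if mu[0] <= mu[1] else 1
    w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
    return w[g] / sum(w)

def distill(samples, losses, eta=0.5):
    """Keep samples whose goodness posterior exceeds the threshold eta."""
    pi, mu, var = fit_gmm_1d(losses)
    return [s for s, x in zip(samples, losses) if goodness_prob(x, pi, mu, var) > eta]
```

For instance, with per-sentence losses clustered around 0.1 (clean) and 2.0 (noisy), `distill` retains only the low-loss samples at the default threshold.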
Relation with distillation by model confidence. Astute readers might have already noticed that the per-sample loss has a direct deterministic relation with the model confidence. Although the two distillation methods differ in form, they draw on the same source of information. However, as mentioned, the clustering-based method additionally allows us to incorporate other indicative features like length, fluency, etc. For a fair comparison between the two methods, we use only the per-sample loss in our primary (single-model) distillation methods.
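To make the deterministic relation explicit: the loss w.r.t. a pseudo label is the negative log of the model's confidence in that label, so thresholding one is equivalent to thresholding the other. A one-line illustration:

```python
import math

# Loss w.r.t. the pseudo label y_hat is -log p(y_hat | x), i.e., the negative
# log of the model's confidence, so loss < t  <=>  confidence > exp(-t).
def loss_from_confidence(confidence):
    return -math.log(confidence)

assert loss_from_confidence(0.9) < loss_from_confidence(0.5)  # higher confidence, lower loss
```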

D Hyperparameters
We present the hyperparameter settings for the XNER and XNLI tasks for the UXLA framework in Table 8. In the warm-up step, we train and validate the task models with English data. For cross-lingual adaptation, however, we validate our model (for model selection) on the target-language development set. We train our model for a given number of steps rather than a number of epochs.
When a number of epochs is given instead, we convert it to a total number of steps. We observe that the learning rate is a crucial hyperparameter. In Table 8, lr-warm-up-steps refers to the warm-up steps of the triangular learning-rate schedule; this hyperparameter is not to be confused with the warm-up step of the UXLA framework. In our experiments, the effective batch size is another crucial hyperparameter, which is obtained via gradient accumulation steps. We fix the maximum sequence length to 280 tokens for XNER and 128 tokens for XNLI. For each experiment, we report the average score of three task models, θ(1), θ(2), θ(3), initialized with different seeds. We perform each experiment in a single-GPU setup with float32 precision.
Table 8: Hyperparameter settings for the XNER, XNLI, and PAWS-X tasks. The total number of parameters for each model is 550M. We used V100 GPUs for the experiments. The average run-time per language may differ based on the total number of augmented samples. On average, each million augmented samples requires 0.5-2 days depending on the training settings (e.g., fp16 training, gradient accumulation, etc.).
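The effective batch size mentioned above follows the usual gradient-accumulation arithmetic: gradients from several micro-batches are summed before each optimizer step. A minimal sketch (the numbers are illustrative, not the paper's settings):

```python
def effective_batch_size(per_gpu_batch, grad_accum_steps, n_gpus=1):
    """Gradients are accumulated over grad_accum_steps micro-batches before an
    optimizer step, so the effective batch grows linearly with the step count."""
    return per_gpu_batch * grad_accum_steps * n_gpus

# e.g., a micro-batch of 8 with 4 accumulation steps on 1 GPU behaves like batch 32
assert effective_batch_size(8, 4) == 32
```

This is what makes large effective batches feasible in a single-GPU float32 setup: memory only has to hold one micro-batch at a time.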

E Additional Related Work
Vicinal risk minimization. One of the fundamental challenges in deep learning is to train models that generalize well to examples outside the training distribution. The widely used Empirical Risk Minimization (ERM) principle, where models are trained to minimize the average training error, has been shown to be insufficient to achieve generalization on distributions that differ even slightly from the training data (Szegedy et al., 2014; Zhang et al., 2018). Data augmentation, supported by the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2001), can be an effective choice for achieving better out-of-training generalization.
In VRM, we minimize the empirical vicinal risk defined as:
$$R_v(f_\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell\big(f_\theta(\tilde{x}_n), \tilde{y}_n\big)$$
where $f_\theta$ denotes the model parameterized by $\theta$, $\ell$ is the loss function, and $D_{aug} = \{(\tilde{x}_n, \tilde{y}_n)\}_{n=1}^{N}$ is an augmented dataset constructed by sampling the vicinal distribution $\vartheta(\tilde{x}, \tilde{y} \mid x_i, y_i)$ around each original training sample $(x_i, y_i)$. Defining the vicinity is, however, challenging, as it requires drawing samples from a distribution without corrupting the labels. Earlier methods apply simple rules like rotation and scaling of images (Simard et al., 1998). Recently, Zhang et al. (2018), Berthelot et al. (2019), and Li et al. (2020) have shown impressive results in image classification with simple linear interpolation of data. However, to our knowledge, none of these methods has so far been successful in NLP due to the discrete nature of texts.
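The linear-interpolation augmentation cited above (mixup; Zhang et al., 2018) is a concrete instance of sampling from a vicinal distribution. A minimal sketch for dense feature vectors and one-hot (or soft) labels:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Draw a vicinal sample by linearly interpolating two (x, y) pairs.
    lam ~ Beta(alpha, alpha); x and y are lists of floats."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Because the labels are interpolated along with the inputs, the resulting soft label remains a valid distribution; the discrete nature of raw tokens is exactly what makes such interpolation ill-defined for text.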