Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation

Recent multilingual pre-trained language models have achieved remarkable zero-shot performance, where the model is finetuned on only one source language and directly evaluated on target languages. In this work, we propose a self-learning framework that further utilizes unlabeled data of target languages, combined with uncertainty estimation in the process to select high-quality silver labels. Three different uncertainties are adapted and analyzed specifically for cross-lingual transfer: Language Heteroscedastic/Homoscedastic Uncertainty (LEU/LOU) and Evidential Uncertainty (EVI). We evaluate our framework with these uncertainties on two cross-lingual tasks, Named Entity Recognition (NER) and Natural Language Inference (NLI), covering 40 languages in total. The framework outperforms the baselines significantly, by 10 F1 for NER and 2.5 accuracy points for NLI on average.


Introduction
Recent multilingual pre-trained language models such as mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and mT5 (Xue et al., 2021) have demonstrated remarkable performance on various direct zero-shot cross-lingual transfer tasks, where the model is finetuned on the source language, and directly evaluated on multiple target languages that are unseen in the task-finetuning stage. While direct zero-shot transfer is a sensible testbed to assess the multilinguality of language models, one would apply supervised or semi-supervised learning on target languages to obtain more robust and accurate predictions in a practical scenario.
In this work, we investigate self-learning (also known as "pseudo labels") as one way to apply semi-supervised learning to cross-lingual transfer, where only unlabeled data of target languages are required, without any effort to annotate gold labels for target languages. As self-learning has been proven effective in certain tasks of computer vision (Yalniz et al., 2019; Xie et al., 2020) and natural language processing (Artetxe et al., 2018; Dong and de Melo, 2019; Karan et al., 2020), we propose to formalize an iterative self-learning framework for multilingual tasks using pre-trained models, combined with explicit uncertainty estimation in the process to guide the cross-lingual transfer.
Our self-learning (SL) framework utilizes any multilingual pre-trained model as the backbone, and iteratively grows the training set by adding predictions on target language data as silver labels. We draw two important observations from our preliminary study (baselines in §4). First, compared with self-training on one target language at a time, jointly training on multiple languages can improve the performance on most languages, especially for certain low-resource languages, which gain up to 8.6 F1 in NER evaluation. Therefore, our SL framework features the joint training strategy, maximizing the potential for different languages to benefit each other. Second, compared with simply using all unlabeled data as silver labels without considering prediction confidence, estimating uncertainties becomes critical in the transfer process, as higher-quality silver labels should lead to better performance. We hence introduce three different uncertainty estimations in the SL framework.
Specifically, we adapt uncertainty estimation techniques based on variational inference and evidence learning for our cross-lingual transfer, namely LEU, LOU and EVI (§3.2). We evaluate our framework and the three uncertainties on two multilingual tasks from XTREME (Hu et al., 2020): Named Entity Recognition (NER) and Natural Language Inference (NLI). Empirical results suggest LEU to be the best uncertainty estimation overall, while the others can also perform well on certain languages (§4.1). Our analysis provides further evaluation of the different estimations, corroborating the correlation between uncertainty quality and final SL performance. Characteristics of the different estimations are also discussed, including the language similarities learned by LOU and the current limitation of EVI in the SL process (§5).
Our contributions in this work can be summarized as follows. (1) We propose the self-learning framework for the cross-lingual transfer and identify the importance of uncertainty estimation under this setting. (2) We adapt three different uncertainty estimations in our framework, and evaluate the framework on both NER and NLI tasks covering 40 languages in total, improving the performance of both high-resource and low-resource languages on both tasks by a solid margin (10 F1 for NER and 2.5 accuracy score for NLI on average). (3) Further analysis is conducted to compare different uncertainties and their characteristics.

Related Work
We briefly introduce prior work on uncertainty estimation. As deep learning models are optimized by minimizing the loss without special care for uncertainty, they are usually poor at quantifying uncertainty and tend to make over-confident predictions, despite producing high accuracies (Lakshminarayanan et al., 2017). Estimating the uncertainty of deep learning models has recently been studied in NLP tasks (Xiao and Wang, 2019a; Zhang et al., 2019; He et al., 2020). There are two main uncertainty types in Bayesian modelling (Kendall and Gal, 2017; Depeweg et al., 2018): epistemic uncertainty, which captures the uncertainty of the model itself and can be explained away with more data; and aleatoric uncertainty, which captures the intrinsic data uncertainty regardless of models. Aleatoric uncertainty can further be divided into two sub-types: heteroscedastic uncertainty, which depends on the input data, and homoscedastic uncertainty, which remains constant for all data within a task. In this work, we focus only on aleatoric uncertainty, as it is more closely related to our SL process, which selects confident and high-quality predictions within each iteration.

Approach
We keep the same model architecture throughout our experiments: a multilingual pre-trained language model is employed to encode each input sequence, followed by a linear layer to classify on the hidden state of CLS token for NLI, and of each token for NER, which is the same model setting from XTREME (Hu et al., 2020). Cross-entropy (CE) loss is used during training in the baseline.

Self-Learning (SL) Framework
We formulate the task-agnostic SL framework for cross-lingual transfer as the following four phases, as shown in Figure 1. In the training phase, the model parameters $\theta$ are optimized on the training inputs $X$ and labels $Y$, with $Y$ being the gold labels of the source language in the first iteration, joined by silver labels of target languages in later iterations. Inputs of different languages are mixed together. In the prediction phase, the model predicts on the remaining unlabeled data $X^*_l = \{x^*_{l1}, \ldots, x^*_{lN}\}$ of each target language $l$, with each prediction denoted as $y^* = f_\theta(x^*)$. In the uncertainty estimation phase, the model estimates the prediction uncertainty with one of the methods described in §3.2, denoted as $\gamma = f^{\gamma}_\theta(x^*, y^*)$, representing the model confidence in the prediction. In the selection phase, the data in each $X^*_l$ is ranked by the uncertainty score $\gamma$, and we select the top-K percent of each $X^*_l$ (the most confident inputs) with their predictions as silver labels, adding them to the training data. To avoid introducing inductive bias from an imbalanced label distribution, we select an equal number of inputs for each label type, similar to previous work on self-learning (Yalniz et al., 2019; Dong and de Melo, 2019; Mukherjee and Awadallah, 2020).
After selection, the model goes back to the training phase and starts a new iteration with the updated training set. The entire process keeps iterating until there is no remaining unlabeled data; early stop criteria are implemented on the dev set of the source language only, as gold labels are not available for other languages. Each phase can be adjusted by task-specific requirements (see A.2).
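The selection phase described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_silver`, the tuple layout, and the argument `quota_per_label` (the per-class budget that the top-K percent rule would produce) are all hypothetical, and lower uncertainty is assumed to mean higher confidence.

```python
from collections import defaultdict


def select_silver(pool, quota_per_label):
    """Pick the most confident predictions for each predicted label class.

    pool: list of (input_id, predicted_label, uncertainty) tuples for one
          target language; lower uncertainty is assumed to mean more confident.
    quota_per_label: hypothetical per-class budget derived from top-K percent.
    Returns (selected, remaining).
    """
    by_label = defaultdict(list)
    for item in pool:
        by_label[item[1]].append(item)

    selected = []
    for items in by_label.values():
        items.sort(key=lambda item: item[2])  # most confident first
        selected.extend(items[:quota_per_label])  # same budget for every class

    chosen_ids = {item[0] for item in selected}
    remaining = [item for item in pool if item[0] not in chosen_ids]
    return selected, remaining
```

In an iterative loop, `selected` would be appended to the training set as silver labels and `remaining` would become the unlabeled pool of the next iteration.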

Uncertainty Estimation
We adapt three different uncertainty estimation techniques in our framework. Let $C$ be the set of label classes and $p_c$ be the probability of class $c$ for an input.
Language Heteroscedastic Uncertainty (LEU)
LEU injects Gaussian noise into the class logits, whose variance is predicted by the model as an input-dependent uncertainty (Kendall and Gal, 2017), regardless of language. A Gaussian distribution is placed on the logit space, $g \sim \mathcal{N}(g_\theta, (\sigma_\theta)^2)$, where the model is modified to predict both the raw logit $g_\theta$ and the standard deviation $\sigma_\theta$ for each input. We use the expectation of the logit softmax as the new probability, computed by Monte Carlo sampling over $T$ samples:
$$p_c = \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(g_{tc})}{\sum_{c'} \exp(g_{tc'})}$$
with $g_{tc}$ being the logit of class $c$ at the $t$-th sampling from $g$. The training loss and the uncertainty both take the new probability formulation $p_c$ into account. The loss is composed of the CE loss $L_t(x, c)$ on input $x$ and gold class $c$, computed with the $t$-th sampled probabilities. The uncertainty is the entropy of the new probabilities: $\gamma = -\sum_c p_c \log p_c$. When an input of any language is hard to predict, the model signals high variance, indicating high uncertainty, as the probability distribution tends toward uniform.
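A minimal sketch of the LEU computation, using only the Python standard library. Several choices here are assumptions for illustration: the function names, a single scalar standard deviation per input, and the fixed random seed. The loss is written as the negative log of the Monte Carlo averaged probability of the gold class, which is one reading of the description above in the style of Kendall and Gal (2017), not a confirmed reproduction of the paper's equation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]


def leu_probs(mean_logits, sigma, num_samples=20, rng=None):
    """Average the softmax over num_samples Gaussian-perturbed logit vectors.

    sigma is assumed to be one scalar standard deviation for the whole input.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    probs = [0.0] * len(mean_logits)
    for _ in range(num_samples):
        sampled = [g + sigma * rng.gauss(0.0, 1.0) for g in mean_logits]
        for c, p in enumerate(softmax(sampled)):
            probs[c] += p / num_samples
    return probs


def entropy(probs):
    """Uncertainty score: entropy of the averaged probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def leu_loss(probs, gold):
    """Assumed loss form: negative log of the averaged gold-class probability."""
    return -math.log(probs[gold])
```

With the same mean logits, a larger predicted `sigma` flattens the averaged distribution and raises the entropy, which is the behaviour the paragraph above describes for hard inputs.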

Language Homoscedastic Uncertainty (LOU)
LOU estimates the uncertainty of each language, regardless of the input. Similar to the formulation of task uncertainty (Cipolla et al., 2018), we propose to place an uncertainty $\sigma_l$ on a language $l$ as the homoscedastic uncertainty. $\sigma_l$ is used as the softmax temperature on the predicted logits $g_\theta$. The final uncertainty is again the entropy of the scaled probabilities. A higher $\sigma_l$ leads to higher entropy for all inputs of language $l$, as the probability distribution tends to be more uniform. During training, each $\sigma_l$ is a directly learned parameter, and the new loss for an input of language $l$ is approximated by Eq (2), where $L(x, c)$ is the same CE loss as in Eq (1). Note that LOU changes neither the input selection nor the ranking within each language; we mainly use it as an optimization strategy to jointly train on inputs of multiple languages, automatically distinguishing the importance of different target languages.
Evidential Uncertainty (EVI)
EVI estimates the evidence-based uncertainty (Sensoy et al., 2018), where the softmax probability is replaced with a Dirichlet distribution, and each predicted logit for class $c$ is regarded as the evidence. We employ the decomposed entropy measures vacuity and dissonance proposed by Shi et al. (2020). Vacuity is high when evidence is lacking for all classes, indicating out-of-distribution (OOD) samples that are far away from the source language; dissonance becomes high when there are conflicts of strong evidence among certain classes (more details are given in A.1). The prediction is considered uncertain if either vacuity or dissonance is high. For each input, let $S$ be the total evidence strength, and let the label $y_c$ be 1 for the gold class and 0 for the others. The expected probability $p_c$ for class $c$ under the Dirichlet distribution and the training loss $L_{\mathrm{EVI}}$ are defined from these quantities.
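A minimal sketch of how a per-language uncertainty could act as a softmax temperature and attenuate the loss. The exact form is an assumption borrowed from the task-uncertainty formulation of Cipolla et al. (2018): logits are divided by $\sigma_l^2$, and the loss is approximated as the CE loss divided by $\sigma_l^2$ plus $\log \sigma_l$. The function names and the `log_sigma` parameterisation are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lou_probs(logits, log_sigma):
    """Scale logits by the language temperature (assumed to be sigma squared)."""
    sigma_sq = math.exp(2.0 * log_sigma)
    return softmax([g / sigma_sq for g in logits])


def lou_loss(logits, gold, log_sigma):
    """Assumed approximation: CE / sigma^2 + log(sigma)."""
    ce = -math.log(softmax(logits)[gold])  # plain CE on unscaled logits
    return ce / math.exp(2.0 * log_sigma) + log_sigma
```

Under this assumed form, a language with a large learned `log_sigma` contributes a down-weighted CE term, while the `log_sigma` penalty keeps the model from inflating the uncertainty without bound.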

Experiments
The framework with different uncertainties is evaluated on two cross-lingual transfer datasets: XNLI (Conneau et al., 2018) for the NLI task covering 15 languages, and Wikiann (Pan et al., 2017) for the NER task covering 40 languages. For both datasets, English is the source language with gold labels, and we use the dev set of target languages (TLs) as the source of unlabeled data; we do not consult any gold labels of TLs in the SL process. XLM-R Large (Conneau et al., 2020) is used as the multilingual encoder across our experiments. Our detailed experimental setting can be found in A.3. We implement three different settings for the baseline. BL-Direct is the direct zero-shot transfer without utilizing unlabeled data of TLs. BL-Single trains on gold data of English and silver data of only one TL per model; it simply selects predictions of all unlabeled data as silver labels, without considering any uncertainties. BL-Joint is similar to BL-Single but instead trains with all TLs jointly.
For SL, we set top-K percent selection to be top 8% of total unlabeled data for each label type, so the entire SL process will finish in around 6 iterations. We found that K below 10% can generally yield decent performance.
For the analysis, we also include two common uncertainties used in previous work on self-learning for other tasks: max probability (MPR) and entropy (ENT); both use plain softmax probabilities (A.4).

Results
The results for NER and NLI are shown in Tables 1 and 2 respectively. BL-Direct is equivalent to our re-implementation of Hu et al. (2020). BL-Single outperforms BL-Direct on NER by 3.1 F1 on average, demonstrating the effectiveness of utilizing unlabeled data even without considering uncertainties. Remarkably, languages such as Arabic (ar), Japanese (ja), Urdu (ur) and Chinese (zh) gain more than 10 F1. By contrast, BL-Single does not surpass the baseline for NLI, partially because all TLs already have much closer performance to English, which in turn highlights the importance of estimating uncertainties for SL.
BL-Joint outperforms BL-Single on both tasks by a slight margin, and we see a performance gain over BL-Single on 32/40 and 10/15 languages for NER and NLI respectively. Certain languages such as Hindi (hi), Javanese (jv) and Yoruba (yo) receive non-trivial benefits (2.6-8.6 F1 gain for NER) through joint language training, validating our joint training strategy for SL.
Evaluation of SL is shown with the best results of each uncertainty from 3 repeated runs. The best performance of SL for both tasks is achieved by adopting LEU as the uncertainty estimation, which outperforms the three baselines significantly (10 F1 gain for NER and 2.5 accuracy gain for NLI on average), and surpasses the other uncertainties by a slight margin. In NER specifically, certain low-resource languages such as Basque (eu), Persian (fa), Burmese (my) and Urdu (ur) show substantial improvement over BL-Joint (13.8-25.9 F1 gain); the performance of certain high-resource languages such as Arabic (ar), German (de) and Chinese (zh) also increases by a solid margin over BL-Joint (4.1-13.3 F1 gain). The trend of improving both high- and low-resource languages is also present in NLI. All results are stable across multiple runs, with a standard deviation within 0.1-0.2 on average.
Results also suggest that other uncertainty estimations can achieve comparable performance, as LEU does not dominate every language. We further conduct analysis on uncertainties as follows.

Analysis
Uncertainty Comparison
To directly assess different uncertainty estimations, we evaluate uncertainty scores by AUROC against predictions, such that AUROC is high when the model is confident on correct predictions and uncertain on incorrect predictions. The left side of Table 3 shows the AUROC of four estimations on the test sets of both tasks. MPR and ENT are also included in the experiments for comparison; LOU is excluded as it does not change selection. The right side of Table 3 shows the SL performance drop when using other uncertainties compared to LEU, serving as an indirect evaluation of the different uncertainties. As shown, LEU indeed achieves the best AUROC, making it a better uncertainty estimation than the others; EVI has the lowest AUROC and also the lowest SL performance; MPR and ENT yield moderate scores on both AUROC and SL. Thus, Table 3 corroborates a strong correlation between AUROC and SL performance: better uncertainty can indeed lead to higher performance in the SL process.
Table 2 shows that LOU reaches the same accuracy as LEU on XNLI, with a trivial performance gap for each language. We find that the learned uncertainty of each language is highly consistent across multiple runs, as shown in Table 4, which can be loosely interpreted as language similarities under the input of this task, e.g. Vietnamese (vi) appears to be more distant from English than others for this task, and the joint optimization of all languages could benefit from this learned language uncertainty. However, we do not find LOU to be as stable on NER, potentially because NER has much more noise and many more languages.
Evidential Uncertainty
Although EVI is able to achieve good performance on certain languages, a large gap to LEU also exists for certain other languages. We attribute the inferior performance of EVI to two aspects. First, the predicted evidence (logit) still exhibits overconfidence, which destabilizes the vacuity and dissonance. Figure 2 shows an example of the evidence-based entropy distribution for EVI, where the model marks almost all predictions as certain (small entropy). Second, vacuity can only distinguish true OOD samples for English, as only English has gold labels. It could fail to recognize those confident samples of TLs that appear in-distribution but are inherently wrong, and falsely select them in the SL process. Figure 3 shows the t-SNE visualization of hidden states of inputs in English and Japanese on the test set of NER: some target language inputs that are close to English in terms of hidden states are predicted wrongly, because of the zero-shot nature.
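The AUROC evaluation used in the uncertainty comparison above can be illustrated with a small rank-based sketch. It treats correct predictions as positives and computes the probability that a correct prediction receives a higher confidence than an incorrect one (ties count as half). The function name and the quadratic-time pairwise formulation are illustrative choices; for an uncertainty score such as entropy, the negated score would be passed as the confidence.

```python
def auroc(confidence, correct):
    """AUROC of confidence scores against prediction correctness.

    confidence: list of floats, higher means more confident.
    correct: list of bools, True when the prediction matches the gold label.
    """
    pos = [s for s, ok in zip(confidence, correct) if ok]
    neg = [s for s, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every correct prediction is ranked above every incorrect one, while 0.5 corresponds to a ranking that carries no information about correctness.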

Conclusion
In this work, we propose a self-learning framework combined with explicit uncertainty estimation for cross-lingual transfer. Three different uncertainties are adapted, and the entire framework is evaluated on two tasks of NER and NLI, surpassing the baseline by a large margin. Further analysis shows the evaluation and characteristics of each uncertainty.

A.1 Uncertainty Estimation
For LOU, the uncertainty term as the denominator in the loss as in Eq (2) achieves the effect of "learned loss attenuation" (Kendall and Gal, 2017) during training, where uncertain samples have lower scale of loss, so that the optimization is less prone to noisy data. We use LOU to let the model learn the uncertainty for each language to achieve more stable training amid selected data with silver labels.
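In practice, the model directly predicts the log-variance term $\log \sigma$ for both LEU and LOU, as the training is more stable and the variance is guaranteed to be positive. For EVI, we follow Sensoy et al. (2018) and define:
$$S = \sum_{c=1}^{|C|} (e_c + 1), \qquad b_c = \frac{e_c}{S}, \qquad u = \frac{|C|}{S}$$
where $e_c$ is the evidence strength (logit) for class $c$ and $|C|$ is the number of classes. $b_c$ represents the belief mass for class $c$ and $u$ is the vacuity, denoted as $vac = u$. We follow Shi et al. (2020) and define dissonance for each input as:
$$diss = \sum_{c} \frac{b_c \sum_{j \neq c} b_j \, \mathrm{Bal}(b_j, b_c)}{\sum_{j \neq c} b_j}, \qquad \mathrm{Bal}(b_j, b_c) = 1 - \frac{|b_j - b_c|}{b_j + b_c}$$
with $\mathrm{Bal}(b_j, b_c) = 0$ when $b_j b_c = 0$. Both $vac$ and $diss$ are in the range of [0, 1]; being closer to 1 indicates more uncertainty. The final uncertainty is set as $\gamma = diss + \alpha \cdot vac$, with $\alpha$ being a hyperparameter. In practice, ELU activation is added after raw logits to ensure the evidence strength is positive.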
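A minimal sketch of the evidential quantities, assuming the standard definitions from Sensoy et al. (2018) (total strength as the sum of evidence plus one per class, belief as evidence over strength, vacuity as class count over strength) and the dissonance of Shi et al. (2020). The function names are hypothetical, and the evidence values are assumed to be non-negative on input.

```python
def balance(b_j, b_c):
    """Relative balance of two belief masses; 0 when either mass is zero."""
    if b_j == 0.0 or b_c == 0.0:
        return 0.0
    return 1.0 - abs(b_j - b_c) / (b_j + b_c)


def evi_uncertainty(evidence, alpha=1.0):
    """Return (expected_probs, vacuity, dissonance, gamma) for one input.

    evidence: non-negative evidence per class (assumed already activated).
    alpha: weight of vacuity in the final score gamma = diss + alpha * vac.
    """
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)            # total strength S
    belief = [e / strength for e in evidence]            # belief mass b_c
    probs = [(e + 1.0) / strength for e in evidence]     # Dirichlet mean
    vacuity = num_classes / strength

    dissonance = 0.0
    for c, b_c in enumerate(belief):
        others = [b for j, b in enumerate(belief) if j != c]
        total = sum(others)
        if total == 0.0:
            continue  # no competing belief, so no conflict for this class
        conflict = sum(b_j * balance(b_j, b_c) for b_j in others)
        dissonance += b_c * conflict / total

    return probs, vacuity, dissonance, dissonance + alpha * vacuity
```

Zero evidence for every class gives vacuity 1 and dissonance 0, whereas equally strong evidence for two competing classes gives low vacuity and high dissonance.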

A.2 Task-Specific Adjustment
We adjust the SL process for NER as follows: the uncertainty score is obtained for each predicted entity, which is calculated as the averaged uncertainty score of all tokens within the entity. Ranking is performed on entities within each entity type; we select the input sequence if all its predicted entities have uncertainty within the top-K threshold.
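The NER adjustment can be sketched as below. The span format (start, exclusive end, entity type), the per-type `thresholds` dictionary standing in for the top-K cut-off, and the decision to skip sequences without any predicted entity are all assumptions made for illustration.

```python
def entity_uncertainty(token_scores, entities):
    """Average token-level uncertainty inside each predicted entity.

    token_scores: per-token uncertainty for one sequence.
    entities: list of (start, end, entity_type) spans, end exclusive.
    Returns a list of (entity_type, mean_uncertainty).
    """
    return [
        (etype, sum(token_scores[start:end]) / (end - start))
        for start, end, etype in entities
    ]


def select_sequence(entity_scores, thresholds):
    """Keep a sequence only if every predicted entity is under its type cut-off.

    thresholds: hypothetical mapping from entity type to the uncertainty value
                at the top-K percent boundary for that type.
    Sequences with no predicted entity are skipped (an assumption).
    """
    if not entity_scores:
        return False
    return all(score <= thresholds[etype] for etype, score in entity_scores)
```

Because selection is made per sequence while ranking is made per entity type, a single uncertain entity is enough to exclude the whole sequence in this sketch.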

A.3 Experimental Setting
We follow the same train/dev/test split and same evaluation protocol as XTREME (Hu et al., 2020).
Hyperparameters
For both NLI and NER, we use the following hyperparameter setting as suggested by XTREME (Hu et al., 2020): an effective batch size of 32, a learning rate of $2 \times 10^{-5}$ with linear decay scheduling, and a max gradient norm of 1.
For NLI in the self-learning (SL) process, we train the model for 5 epochs in the first iteration on the English training set with gold labels, whereas we train for 10 epochs for NER. After the first iteration, the model is trained for 3 epochs in each later iteration. For LEU, we set the number of Monte Carlo samples to $T = 20$. For EVI, we set $\alpha = 1$ for NLI and $\alpha = 10^{-2}$ for NER based on the empirical scale of $vac$ and $diss$, keeping both on the same scale.
To keep the training set from growing too large as the SL process iterates, we apply a sampling strategy upon each new selection: each training epoch samples from the existing training set an amount of data equal to the newly selected data, so that each training epoch consists of at least 50% data from the latest selection. We adopt early stopping on the English dev set if the evaluation does not improve for over two iterations.
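The epoch sampling strategy can be sketched as follows. The function name, the list-based data representation, and the use of a seeded `random.Random` instance are assumptions for illustration.

```python
import random


def build_epoch(existing, newly_selected, rng):
    """Combine all newly selected data with an equal-sized sample of old data.

    existing: training examples accumulated before the latest selection.
    newly_selected: examples chosen in the latest selection phase.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Never draw more old examples than there are new ones, so the latest
    # selection always makes up at least half of the epoch.
    replay_size = min(len(existing), len(newly_selected))
    replay = rng.sample(existing, replay_size)
    epoch = list(newly_selected) + replay
    rng.shuffle(epoch)
    return epoch
```

A fresh sample of the existing set would be drawn for each epoch, so older data is still revisited over time while the epoch size stays bounded by twice the size of the latest selection.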
Our experiments use NVIDIA Titan RTX GPUs. The training takes 10 hours for both NER and NLI.

A.4 Other Uncertainties
MPR is the max probability over the label classes, denoted by $\gamma = \max_c p_c$. It is equivalent to the probability of the predicted label, and is commonly used as the selection criterion for classification tasks (Yalniz et al., 2019; Dong and de Melo, 2019). ENT is the entropy of the class probability distribution, denoted by $\gamma = -\sum_c p_c \log p_c$, which is another common uncertainty metric for classification (Depeweg et al., 2018; Xiao and Wang, 2019b).
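Both scores are simple functions of the softmax probabilities, as the short sketch below shows. The function names are illustrative. Note that the two scores point in opposite directions: a higher MPR means more confidence, while a higher ENT means less.

```python
import math


def mpr(probs):
    """Max probability: the probability of the predicted label."""
    return max(probs)


def ent(probs):
    """Entropy of the class probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

When ranking inputs for selection, one would therefore sort by descending `mpr` or by ascending `ent`.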