Uncertainty-Aware Cross-Lingual Transfer with Pseudo Partial Labels



Introduction
* Corresponding author.
1 We release our code at: github.com/slei109/CLTP.

Figure 1: For an unlabeled sample with ambiguous predictions, standard one-hot labeling takes the class with the highest confidence as the pseudo one-hot label, introducing noise into the training phase when the prediction is wrong. Instead of choosing one among several low-confidence predictions, the proposed partial-labeling method takes both ambiguous classes as candidate labels, allowing the ground-truth label to be presented in the training phase.

Multilingual pre-trained language models such as mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and mT5 (Xue et al., 2021) support zero-shot transfer from a source language to target languages. Despite their remarkable performance on direct zero-shot cross-lingual transfer tasks, in practical scenarios one would apply semi-supervised learning on target languages to obtain more robust and accurate predictions. Recent studies (Dong and de Melo, 2019; Xu et al., 2021) validate the effectiveness of self-learning in
cross-lingual transfer tasks, utilizing predictions on unlabeled data of target languages as silver labels. Dong and de Melo (2019) iteratively grow the training set by selecting the top-k percent of unlabeled data, and Xu et al. (2021) boost performance by considering prediction confidence in pseudo-label selection. Although these self-learning frameworks significantly improve transfer performance, they still lag far behind supervised learning. The main reason is that the model suffers from the large number of incorrectly pseudo-labeled samples used in the training phase. Even with a selection mechanism in which easy, high-confidence predictions are added to the training set first, accurate predictions cannot be guaranteed for all unlabeled data. In fact, the accuracy of the predictions drops quickly in later iterations (see A.1), since most of the remaining unlabeled data are harder to classify. To avoid false positive pseudo-labels, the most naive approach is to select only pseudo-labels with extremely high confidence. However, a large amount of unlabeled data would then be discarded due to the strict confidence bar, even though the discarded data are still valuable for learning a classifier, especially in the zero-shot setting. So, how can we maximize the utilization of unlabeled data while minimizing the ratio of noisy pseudo-labels in the training set?
Intuitively, if we are confident that an unlabeled sample belongs to a candidate class set but unable to assign a one-hot pseudo-label, it is more efficient to present all potential labels than to discard the sample directly. Thus, we propose to present pseudo-partial-labels for such data to the model. As illustrated in Figure 1, for an ambiguous sample, the model believes with high confidence that it belongs to one of two categories but has difficulty determining which one. In this case, standard one-hot labeling takes the class with the highest confidence as the pseudo-label, increasing the ratio of noisy pseudo-labels in the training phase when the prediction is wrong. By contrast, the proposed partial-labeling method takes both ambiguous classes as candidate labels, allowing the ground-truth label to be presented in the training phase. In this way, the model can continue to learn on the pseudo-partial-labeled data by disambiguating the candidate labels and finding the latent ground-truth.
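The contrast between the two labeling schemes can be sketched as follows; the margin-based ambiguity rule here is purely illustrative and is not the uncertainty criterion proposed later in the paper.

```python
import numpy as np

def pseudo_labels(probs, margin=0.15):
    """Contrast one-hot pseudo-labeling with k-hot partial labeling.

    probs  : per-class prediction probabilities for one sample.
    margin : classes within `margin` of the top probability count as
             ambiguous (an illustrative rule, not the paper's criterion).
    """
    probs = np.asarray(probs, dtype=float)
    one_hot = np.zeros_like(probs)
    one_hot[probs.argmax()] = 1.0                # standard pseudo-labeling
    partial = (probs >= probs.max() - margin).astype(float)  # k-hot partial label
    return one_hot, partial

# An ambiguous sample: classes 0 and 1 are nearly tied.
one_hot, partial = pseudo_labels([0.45, 0.42, 0.08, 0.05])
print(one_hot)   # one-hot may be wrong if class 1 is the true label
print(partial)   # the partial label keeps both plausible classes as candidates
```

For a confident sample, the same rule degenerates to an ordinary one-hot pseudo-label, so partial labeling only activates where predictions are ambiguous.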
In this work, we propose an uncertainty-aware Cross-Lingual Transfer framework with Pseudo-Partial-Labels (CLTP) that employs partial label learning to boost cross-lingual zero-shot transfer. Specifically, our framework uses any multilingual pre-trained model as the backbone and iteratively grows the training set by adding predictions on target-language data as silver labels. For difficult samples with low prediction confidence, instead of discarding them directly or introducing a single-hypothesis but incorrect pseudo-label, we associate them with pseudo-partial-labels to maximize data utilization. To estimate the pseudo-partial-labels, we propose a novel uncertainty-aware estimation method that considers both prediction confidence and a limit on the number of candidate labels. The model then continues to learn on the pseudo-partial-labeled data by disambiguating the candidate labels and finding the latent ground-truth.
Our key contributions can be summarized as follows. 1) We design an uncertainty-aware cross-lingual transfer framework with pseudo-partial-labels. 2) We propose a novel pseudo-partial-label estimation method that considers prediction confidences and a limit on the number of candidate classes. 3) We evaluate the proposed framework on both NER and NLI tasks across 40 languages in total. Comprehensive experiments show that our framework achieves strong performance on both high-resource and low-resource languages on both tasks by a sizable margin, e.g., 6.9 F1 on Kazakh (kk) and 5.2 F1 on Marathi (mr) for NER, and 1% on Arabic (ar) and 0.8% on Bulgarian (bg) for NLI.

Related Work
Cross-Lingual Representation Learning. Pre-trained transformer-based models have proven effective in learning cross-lingual information. mBERT (Devlin et al., 2019) is pre-trained on raw Wikipedia text in 104 languages using masked language modeling and next sentence prediction, with no explicit cross-lingual objective. XLM-R (Conneau et al., 2020) improves over mBERT by training longer with more data from CommonCrawl, and without the NSP objective. Recently, two self-learning based methods were proposed for cross-lingual transfer. Dong and de Melo (2019) proposed a self-learning framework that incorporates the predictions of mBERT for cross-lingual text classification. Xu et al. (2021) improved over XLM-R by jointly training multiple languages together and considering prediction confidence in the silver-label selection process. However, both methods still suffer from noisy training because of incorrect pseudo-labels.
Pseudo-Labeling. Pseudo-labeling (Lee et al., 2013; Shi et al., 2018; Iscen et al., 2019) belongs to the self-learning scenario and is often used in semi-supervised learning to generate pseudo-labels for unlabeled samples with a model trained on labeled data. Inspired by noise-correction work (Yi and Wu, 2019), Wang and Wu (2020) attempted to update the pseudo-labels through an optimization framework. Recently, Rizve et al. (2021) selected pseudo-labels using both prediction uncertainty and calibration, allowing for negative pseudo-label generation. However, most existing methods involve learning from noisy data and do not generalize to partial label learning.
Partial Label Learning. Partial label learning (Cour et al., 2011), also called ambiguous label learning and the superset label problem (Gong et al., 2017), has attracted considerable attention (Wang and Zhang, 2020; Yao et al., 2020; Yan and Guo, 2020). It refers to the task where each training sample is associated with a set of candidate labels, of which only one is assumed to be true. Existing studies on partial label learning can be divided into two groups: average-based methods and identification-based methods. The average-based methods (Cour et al., 2011; Zhang and Yu, 2015; Zhang et al., 2016) consider each candidate label equally important during model training and average the outputs of all candidate labels for prediction. The identification-based methods aim at directly maximizing the output of exactly one candidate label, chosen as the true label. Yan and Guo (2020) studied the utilization of batch label correction; Yao et al. (2020) improved performance by combining different networks. Wen et al. (2021) proposed the Leveraged Weighted (LW) loss, which considers the trade-off between losses on partial labels and non-partial ones. In this work, we adopt the LW loss function for partial label learning due to its effectiveness and generality.
Uncertainty Estimation. Recently, estimating the uncertainty of deep learning models has attracted increasing attention, and its effectiveness has been validated in NLP tasks (Zhang et al., 2019; He et al., 2020). There are two main uncertainty types in Bayesian modeling (Kendall and Gal, 2017; Depeweg et al., 2018): epistemic uncertainty (EU), which captures the uncertainty of the model itself and can be explained away with more data; and aleatoric uncertainty (AU), which captures the intrinsic data uncertainty regardless of the model. Another group of uncertainty estimation methods is based on belief/evidence theory, including Fuzzy Logic (De Silva, 2018), Dempster-Shafer Theory (Sentz et al., 2002), and Subjective Logic (Sensoy et al., 2018). Belief theorists focus on reasoning about the inherent uncertainty in information resulting from unreliable, incomplete, deceptive, and/or conflicting evidence. Subjective Logic characterizes uncertainty in subjective opinions in terms of vacuity (i.e., lack of evidence) (Sensoy et al., 2018), dissonance (i.e., conflicting evidence), and consonance (i.e., composite subsets of state values) (Shi et al., 2020).
Summary. Current self-learning based cross-lingual transfer methods suffer from noisy training and poor generalization due to incorrectly pseudo-labeled samples. In this work, we adopt the idea of partial label learning to maximize unlabeled data utilization while reducing the effect of ambiguous pseudo-label estimations in the self-learning framework. To the best of our knowledge, this is the first time pseudo-partial-labels have been employed in a self-learning framework for cross-lingual transfer, and the first time pseudo-partial-labels are estimated with prediction uncertainty.

Model
In this section, we propose a partial-label based self-learning framework to boost cross-lingual transfer performance. The overview of the proposed framework is presented in Section 3.2. The technical details for the uncertainty-aware pseudopartial-label estimation and partial label learning are described in Sections 3.3 and 3.4, respectively.

Preliminary
Partial label learning refers to the task where each training sample is associated with a set of candidate labels, of which only one is assumed to be true. The goal is to find the latent ground-truth for the input by observing the partial label set. Formally, given a non-empty feature space (input space) X ⊂ R^d and a label space Y = {1, 2, ..., C}, each training sample x ∈ X is associated with a partial label y ⊆ Y that is assumed to contain the latent ground-truth class. For convenience, we use k-hot partial labels to represent the number of candidate classes in the partial label. For example, the partial label in Figure 1 is a two-hot partial label and has two candidate classes.
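The k-hot encoding can be sketched in a few lines; the helper name `to_k_hot` is ours, for illustration only.

```python
def to_k_hot(candidates, num_classes):
    """Encode a non-empty candidate label set as a k-hot vector."""
    assert candidates, "a partial label must contain at least one class"
    return [1 if c in candidates else 0 for c in range(num_classes)]

# The two-hot partial label from Figure 1: classes 0 and 1 are candidates,
# and the latent ground-truth is assumed to be one of them.
print(to_k_hot({0, 1}, 4))  # [1, 1, 0, 0]
```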

Partial-Label based Self-Learning Framework
Our approach aims to improve overall performance by maximizing the utilization of unlabeled data while reducing the noise introduced in the training phase. This is accomplished by applying partial label learning to highly uncertain predictions. The intuition is that if a sample is ambiguous and hard to classify, it is more effective to present its potential labels than to discard it directly or to introduce a single-hypothesis but incorrect pseudo-label in the training phase. The complete training procedure of our proposed task-agnostic framework for cross-lingual transfer is shown in Figure 2. In our proposed CLTP framework, we first train a pre-trained multilingual model on the gold labels of the source language. Then the model makes predictions on the unlabeled dataset of the target languages. The proposed uncertainty-aware estimation component generates pseudo-partial-labels based on the model predictions and their corresponding uncertainty estimations. After that, we adopt a selection mechanism to incorporate the unlabeled data with high confidence scores into the training phase.

The whole training process is described in Algorithm 1, where D_u denotes the set of tuples combining unlabeled data with corresponding pseudo-partial-labels and prediction uncertainty γ_diss. First, a model f(·) is trained on the gold labels of the source language in the first iteration. Once trained, the model makes predictions and estimates the pseudo-partial-labels for all unlabeled data of the target languages in D_u, based on the method introduced in Section 3.3. Note that the inputs of different languages are mixed together. Next, a subset of the pseudo-partial-labels S_u is selected using the uncertainty estimation. After selection, the model goes back to the training phase using the selected pseudo-partial-labels as well as the gold labels. We repeat the process iteratively until the max iteration is reached.
The early-stopping criterion is applied on the dev set of the source language only, since gold labels are not available for the other languages. Note that the model is trained only on one-hot pseudo-labels in the first three iterations to accelerate convergence.
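The iterative procedure above can be sketched with a toy nearest-centroid stand-in for the multilingual backbone; all helper names, the margin-based partial labeling, and the selection threshold are illustrative assumptions, not the released implementation.

```python
import numpy as np

def train_centroids(data, num_classes):
    """Fit a nearest-centroid 'model' on 2-D points; a partial label
    credits each of its candidate classes equally (a toy stand-in for
    the XLM-R backbone and partial-label training)."""
    cents = np.zeros((num_classes, 2))
    counts = np.zeros(num_classes)
    for x, y in data:
        for c in np.flatnonzero(y):
            cents[c] += x
            counts[c] += 1
    return cents / np.maximum(counts, 1)[:, None]

def predict_partial(cents, x, margin=0.2):
    """Return a k-hot pseudo-(partial-)label and a crude uncertainty."""
    d = np.linalg.norm(cents - np.asarray(x), axis=1)
    p = np.exp(-d) / np.exp(-d).sum()
    y = (p >= p.max() - margin).astype(int)
    return y, 1.0 - p.max()

def self_learning(gold, unlabeled, num_classes=2, max_iter=3, warmup=1, thresh=0.6):
    model = train_centroids(gold, num_classes)        # fit on source gold labels
    for it in range(max_iter):
        pseudo = [(x, *predict_partial(model, x)) for x in unlabeled]
        if it < warmup:                               # one-hot only in early iterations
            pseudo = [(x, y, u) for x, y, u in pseudo if y.sum() == 1]
        selected = [(x, y) for x, y, u in pseudo if u < thresh]   # low-uncertainty subset
        model = train_centroids(gold + selected, num_classes)     # retrain with silver labels
    return model
```

A sample that sits between the two clusters receives a two-hot partial label and still contributes to both centroids instead of being discarded, which mirrors the motivation of the framework.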

Pseudo-Partial-Label Estimation
The key point of pseudo-partial-label estimation is to guarantee that the ground-truth class of an instance resides in the candidate label set, which is the basic requirement of partially supervised learning. Intuitively, if the model classifies an instance with low confidence, the instance may be hard to distinguish from several classes, or it cannot be identified due to a lack of knowledge. Hence, we consider prediction confidence in the pseudo-partial-label estimation to better determine the most uncertain classes for each ambiguous instance.

Algorithm 1 Self-Learning on Cross-Lingual Tasks
Input: a gold-label dataset D_L of the source language; an unlabeled dataset U of the target languages.
1: repeat
2:    Train f(·) on D_L with L_EVI
3:    for each target language do
         ⋮
      end for
14: until the max iteration is reached
15: Train f(·) on D_L with L_φ until convergence
We define the prediction uncertainty of an instance x belonging to the partial label y as the partial-label uncertainty, denoted γ_diss(x, y). The decomposed entropy dissonance proposed by Shi et al. (2020) is adapted to calculate the partial-label uncertainty, as it can indicate the contradiction among certain classes. Specifically, dissonance is an evidence-based uncertainty (Sensoy et al., 2018) in which the softmax probability is replaced by a Dirichlet distribution, and each predicted logit for class c is regarded as the evidence e_c. The expected probability p_c for class c under the Dirichlet distribution is defined as follows:

p_c = α_c / S,  α_c = e_c + 1,  S = Σ_{h=1}^{C} α_h  (1)

where S is referred to as the Dirichlet strength. We adopt the cross-entropy loss as the training loss L_EVI, and its Bayes risk under the Dirichlet distribution can be defined as

L_EVI(x, y) = ∫ [ Σ_{c=1}^{C} −y_c log(p_c) ] (1 / Beta(α)) Π_{c=1}^{C} p_c^{α_c − 1} dp = Σ_{c=1}^{C} y_c ( ψ(S) − ψ(α_c) )  (2)

where Beta(α) is the multinomial beta function (Kotz et al., 2004) and ψ(·) is the digamma function; α denotes the parameters of the Dirichlet density on the predictors. As shown in Equation (2), training the model with L_EVI drives the positive evidence toward the total evidence when the ground-truth is positive. If there are conflicts of strong evidence among certain classes, the dissonance becomes high, indicating the contradiction. The dissonance of each instance is

Bal(b_h, b_c) = 1 − |b_h − b_c| / (b_h + b_c)  (3)

diss(x) = Σ_{c=1}^{C} [ b_c Σ_{h≠c} b_h Bal(b_h, b_c) / Σ_{h≠c} b_h ]  (4)

where b_c = e_c / S represents the belief mass for class c and Bal(·, ·) measures the balance between two belief masses. Recall that each predicted logit for class c is regarded as the evidence e_c.
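A small numeric sketch of these evidential quantities, following the standard formulations of Sensoy et al. (2018) and Shi et al. (2020); the paper's implementation details may differ.

```python
import numpy as np

def edl_quantities(evidence):
    """Expected probabilities, belief masses, and dissonance from
    non-negative per-class evidence."""
    e = np.asarray(evidence, dtype=float)
    alpha = e + 1.0                    # Dirichlet parameters
    S = alpha.sum()                    # Dirichlet strength
    p = alpha / S                      # expected class probabilities
    b = e / S                          # belief masses
    diss = 0.0
    for c in range(len(b)):
        others = np.delete(b, c)
        if others.sum() > 0:
            denom = others + b[c]
            # balance between belief masses; guard against zero denominators
            bal = np.where(denom > 0,
                           1.0 - np.abs(others - b[c]) / np.where(denom > 0, denom, 1.0),
                           0.0)
            diss += b[c] * (others * bal).sum() / others.sum()
    return p, b, diss

# Conflicting strong evidence between two classes -> high dissonance;
# one dominant class -> low dissonance.
_, _, high = edl_quantities([10.0, 9.0, 0.1, 0.1])
_, _, low = edl_quantities([10.0, 0.5, 0.1, 0.1])
print(high > low)  # True
```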
In the pseudo-partial-label estimation, the belief mass that an instance belongs to a candidate class set can be calculated via the binomial comultiplication operator in subjective logic, denoted '∨'. Let b_c and b_h be the belief masses for class c and class h, respectively. The belief mass that the instance belongs to class c or class h is defined as:

b_{c∨h} = b_c + b_h − b_c · b_h  (5)

Thus, we can calculate the partial-label uncertainty γ_diss(x, y) via Equations (4) and (5).

However, if we estimate the pseudo-partial-label based only on the lowest partial-label uncertainty, it will lead to invalid partial labels such as (1, 1, 1, 1), since a candidate set containing all classes necessarily has the highest confidence. Furthermore, in partial label learning, model accuracy decreases as the number of labels similar to the true label increases (Wen et al., 2021), because it raises the learning difficulty. To remedy this problem, we leverage a penalty ratio to balance the prediction uncertainty against the number of candidate classes. Specifically, from Figure 3 we observe that as the number of candidate classes increases, the improvement in prediction recall tends to diminish. This means that including too many candidate classes in the pseudo-partial-label yields limited improvement; only the most confusing candidate classes matter for ambiguous samples. Motivated by this, we employ a penalty ratio to penalize larger numbers of candidate classes. Thus, a pseudo-partial-label ỹ is obtained as follows:

ỹ = argmin_{y ∈ Y} [ γ_diss(x, y) + pen(||y||_1; λ, τ) ]  (6)

where Y is the collection of all non-empty subsets of the partial label space, ||y||_1 counts the candidate classes in the partial label y, and pen(·; λ, τ) grows with the number of candidate classes. λ and τ are penalty ratios that determine the penalty strength and the preference for the candidate class number, respectively.
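A brute-force sketch of this selection step over small candidate sets; note that the size penalty below is a plain lam·|y| term standing in for the paper's (λ, τ) penalty, whose exact form we do not reproduce.

```python
from itertools import combinations

def comultiply(beliefs):
    """Binomial comultiplication: belief that the true class lies in the set."""
    out = 0.0
    for b in beliefs:
        out = out + b - out * b
    return out

def estimate_partial_label(b, lam=0.1, max_k=3):
    """Pick the candidate set minimizing uncertainty plus a size penalty.
    Uncertainty here is 1 - comultiplied belief, and the penalty is a
    simple lam * |y| term (NOT the paper's (lambda, tau) penalty)."""
    K = len(b)
    best, best_score = None, float("inf")
    for k in range(1, max_k + 1):
        for subset in combinations(range(K), k):
            score = (1.0 - comultiply([b[i] for i in subset])) + lam * k
            if score < best_score:
                best, best_score = subset, score
    return [1 if i in best else 0 for i in range(K)]

print(estimate_partial_label([0.45, 0.40, 0.05, 0.02]))  # two-hot: [1, 1, 0, 0]
print(estimate_partial_label([0.90, 0.05, 0.03, 0.01]))  # one-hot: [1, 0, 0, 0]
```

Without any penalty, the full set always wins (its comultiplied belief is maximal), which is exactly the invalid (1, 1, 1, 1) case the penalty ratio is designed to prevent.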

Learning with Pseudo-Partial-Labels
The target of partial-label learning is to learn a classifier with access to the candidate label set (partial label set) by disambiguating the candidate labels and finding the latent ground-truth for the input during the training phase. We adopt the Leveraged Weighted (LW) loss function (Wen et al., 2021) for partial label learning due to its effectiveness and generality. The LW loss is a multiclass loss that searches for the latent ground-truth by assigning more weight to the losses of classes more likely to be true and less weight to the confusing ones. To be specific, the LW loss function is defined as:

L_φ(f(x), ỹ) = Σ_{c: ỹ_c=1} w_c φ(f_c(x)) + β Σ_{c: ỹ_c=0} w_c φ(−f_c(x))  (7)

where φ(·): R → R^+ denotes a binary loss, for which we adopt the Sigmoid loss function. The parameter β distinguishes the effects of candidate classes from non-candidate ones. f_c(x) represents the predicted logit of instance x for class c, and w_c is the weighting parameter that assigns a weight to the loss of class c. Since the key point is to disambiguate the candidate classes, the model should assign more weight to the losses of classes that are more likely to be the ground-truth. Thus, instead of assigning fixed values, the weighting parameters are updated by normalizing the prediction scores through an iterative learning process. Specifically, at the t-th learning step, w_c^{(t)} is calculated as:

w_c^{(t)} = exp(f_c(x)) / Σ_{h: ỹ_h=1} exp(f_h(x)),  if ỹ_c = 1  (8)

so that w_c^{(t)} varies with sample instances. In this way, as the training epochs grow, the model focuses on the true class and rules out the untrue classes by penalizing large values of φ(−f_c(x)).
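The LW loss can be sketched as follows with a sigmoid binary loss; the uniform weighting of non-candidate classes and the one-step weight normalization are simplifying assumptions on our part.

```python
import numpy as np

def sigmoid_loss(z):
    # decreasing binary loss: small when z is large
    return 1.0 / (1.0 + np.exp(z))

def lw_loss(logits, y_partial, beta=2.0):
    """Leveraged Weighted loss, sketched after Wen et al. (2021).
    Candidate classes are weighted by their normalized prediction
    scores (our simplification of the iterative weight update);
    non-candidate weights are set uniform here."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y_partial)
    cand, non = y == 1, y == 0
    w = np.zeros_like(logits)
    w[cand] = np.exp(logits[cand]) / np.exp(logits[cand]).sum()
    w[non] = 1.0 / max(non.sum(), 1)
    return (w[cand] * sigmoid_loss(logits[cand])).sum() \
         + beta * (w[non] * sigmoid_loss(-logits[non])).sum()
```

Raising the logit of a candidate class lowers the loss, while raising the logit of a non-candidate class increases it through the β-leveraged term, which is the disambiguation pressure described above.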

Evaluation Tasks & Datasets
NLI XNLI (Conneau et al., 2018) is an evaluation benchmark for the cross-lingual NLI task across 15 languages. Given a sentence pair of premise and hypothesis, the task is to classify their relationship as "neutral", "entailment", or "contradiction".
NER Wikiann (Pan et al., 2017) is an evaluation benchmark for the cross-lingual NER task covering 40 languages. There are three entity types: "LOC", "PER" and "ORG", and each token is tagged in the BIO2 format with 7 label types.
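The 7-tag inventory can be enumerated directly; the example sentence is illustrative only.

```python
# The 7 BIO2 tags for WikiAnn NER: "O" plus B-/I- for each entity type.
TYPES = ["PER", "ORG", "LOC"]
TAGS = ["O"] + [f"{p}-{t}" for t in TYPES for p in ("B", "I")]
print(TAGS)  # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# Example tagging: "Ada Lovelace visited Paris"
tokens = ["Ada", "Lovelace", "visited", "Paris"]
tags   = ["B-PER", "I-PER", "O", "B-LOC"]
assert len(TAGS) == 7 and all(t in TAGS for t in tags)
```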
We follow the same train/dev/test split and same evaluation protocol as XTREME (Hu et al., 2020). English is the source language with gold labels for both datasets, and we use the dev set of target languages as the source of unlabeled data. Gold labels of target languages cannot be accessed in the self-learning process.

Implementation Details
Model Details. We keep the same model architecture throughout our experiments: XLM-R Large (Conneau et al., 2020) is used as the multilingual pre-trained model to encode the input sequence, followed by a linear layer that classifies on the hidden state; this is the same model setting as in XTREME. We set the penalty ratios λ = 4 and τ = 10 for all experiments. For the LW loss, we set β = 2 as suggested by Wen et al. (2021).
Training Details. For both NER and NLI tasks, we use the AdamW (Loshchilov and Hutter, 2018) optimizer with a linear learning rate scheduler for all experiments. We use a batch size of 32 and a max sequence length of 128. We first train the model for 10 epochs on the English training set with gold labels for the NER task, and for 5 epochs for the NLI task, with a 2 × 10^−5 learning rate. In the self-learning process, we keep the same learning rate for the NER task and use a 5 × 10^−6 learning rate for the NLI task. The model is trained for 3 epochs in each iteration. Experiments are run on a single 24GB NVIDIA 3090 GPU.

Baselines.
We compare our CLTP framework with three baseline settings: BL-Direct is equivalent to Hu et al. (2020), i.e., direct zero-shot transfer without utilizing unlabeled data of target languages. BL-Single takes silver labels of only one target language as the training set in the self-learning process, using model predictions directly as silver labels without considering prediction confidence. BL-Joint is similar to BL-Single but takes silver labels of all target languages jointly. We also compare our method with the uncertainty-aware self-learning framework of Xu et al. (2021): SL-LEU trains the model with silver labels selected by Language Heteroscedastic Uncertainty (LEU), and SL-EVI uses Evidential Uncertainty (EVI) to estimate prediction uncertainty, following the same training settings as Xu et al. (2021).

Comparisons
The results of the NER and NLI tasks are shown in Tables 1 and 2, respectively. Self-learning based methods outperform direct zero-shot transfer with XLM-R Large by a large margin on NER, achieving an 11.1 gain in F1 on average. The same trend of improvement can be observed on NLI, validating the self-learning strategy for cross-lingual transfer tasks. Furthermore, our method surpasses SL-LEU on NER by 2.2 in F1 on average, demonstrating the effectiveness of utilizing pseudo-partial-labels for ambiguous data. Remarkably, our method achieves a sizeable gain of 5+ F1 both on low-resource languages like Malayalam (ml) and Marathi (mr) and on high-resource languages such as Russian (ru) and Telugu (te). This shows that our method can further boost the performance of the self-learning framework. As shown in Table 2, our model outperforms the baselines on NLI across almost all 15 test languages. Compared to direct zero-shot transfer, our method achieves an improvement of 2.4% on average. Our method also gives an average increase of 1.2% and 0.5% over SL-EVI and SL-LEU, respectively. Specifically, we observe over 1% gain on Arabic (ar), Bulgarian (bg), Greek (el), and Turkish (tr) when comparing the CLTP framework with the best SL performance.

Result Analyses
To better understand the key components and settings of the CLTP framework, we perform several analyses on the NER task.

Uncertainty Estimation. To assess different uncertainty estimations for pseudo-partial-label estimation, we evaluate the recall score of the pseudo-partial-labels in the last iteration, such that recall is high when the pseudo-partial-label contains the true class. Here, we adopt the two-hot partial label setting, where the two classes with the highest prediction confidence are set as candidate classes, as it is the most direct setting for measuring the contradiction among certain classes. We compare evidential uncertainty with two commonly used uncertainty metrics for classification (Depeweg et al., 2018; Dong and de Melo, 2019; Xiao and Wang, 2019): the max probability of the label classes (max_prob) and the entropy of the class probability distribution (entropy). As shown in Figure 4, dissonance achieves the best performance among all uncertainty estimations on average, demonstrating its capability for ambiguous class selection.

Hyperparameter Analyses. We introduce new hyperparameters λ and τ to control the penalty ratio. Table 3 shows the evaluation accuracy for different penalty ratio settings. We find that λ = 4, τ = 10 leads to the best performance, and further reduction/increase in the ratio leads to performance degradation. The penalty ratio affects the performance of the model as it adjusts the proportion of the various pseudo-partial-labels.

Settings         average (F1)
λ = 2, τ = 10    74.6
λ = 4, τ = 5     72.9
λ = 4, τ = 10    75.5
λ = 8, τ = 10    73.4

Table 3: Hyperparameter analyses of the penalty ratio on the NER task. Different settings adjust the proportion of pseudo-partial-labels.

Specifically, when τ stays the same, the proportion of partial labels decreases as λ increases, indicating that most pseudo-labels become one-hot labels, similar to the self-learning framework proposed by Xu et al. (2021). Similarly, if we keep λ constant, reducing τ leads not only to a higher proportion of partial labels over the unlabeled set but also to more candidate classes in each partial label. We observe an over 2.6 gain when comparing the λ = 4, τ = 10 setting with the λ = 4, τ = 5 setting, indicating that if we ignore the limit on the number of candidate classes, the model suffers from invalid partial labels like (1, 1, 1, 1), which provide no information because all classes are set as candidates.
Effect of pseudo-partial-label length. To analyze the effect of different numbers of candidate classes in pseudo-partial-label estimation, we evaluate the model trained with various manually designed pseudo-partial-labels. Specifically, we select the top classes by prediction confidence to set the pseudo-partial-labels. The results are shown in Table 4.

Settings                                     avg
self-learning + no partial labels            73.3
self-learning + two-hot partial labels       75.2
self-learning + three-hot partial labels     73.1
self-learning + four-hot partial labels      72.5
CLTP (ours)                                  75.5

Table 4: Effect of pseudo-partial-label length on the NER task. Note that two-hot partial labels indicates that the pseudo-partial-labels are estimated by directly selecting the two classes with the highest prediction probability, and no partial labels is equivalent to SL.

When we use the manually set partial labels, we observe an accuracy drop of 0.3, 2.4, and 3.0 for two-hot, three-hot, and four-hot partial labels, respectively. This demonstrates the effectiveness of our pseudo-partial-label estimation scheme. In addition, the model trained with two-hot partial labels outperforms the baselines. By contrast, three-hot and four-hot partial labels do not surpass the baselines, partially due to the difficulty of disambiguating the candidate classes. The trend of improved performance with fewer candidate classes is consistent with observations in partial label learning (Wen et al., 2021).

Conclusion
In this work, we propose an uncertainty-aware pseudo-partial-label framework for cross-lingual transfer. With the aid of pseudo-partial-labels, the CLTP framework improves the model by reducing the noise introduced in the training phase while maximizing unlabeled data utilization. Moreover, we propose a novel pseudo-partial-label estimation method that considers both prediction confidence and a limit on the number of similar classes. The proposed framework is evaluated on the NER and NLI tasks and improves the performance of the pre-trained model by a solid margin (11.1 F1 for NER and 2.4% accuracy for NLI on average). Compared to other self-learning based methods, our framework surpasses the baselines on both high-resource and low-resource languages, e.g., by 6.9 F1 on Kazakh and 5.2 F1 on Marathi for NER.