Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model's bias toward predicting label words which frequently occurred during pretraining; these words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.


Introduction
Prompt-based learning (Brown et al., 2020; Liu et al., 2021) has emerged as an important research area. Recent research demonstrates that multilingual encoder models are capable of accomplishing zero-shot cross-lingual learning (Zhao and Schütze, 2021; Huang et al., 2022) and linguistic probing (Shapiro et al., 2021; Hartmann et al., 2021) by using cloze-style prompts. These prompts consist of an input sample, a task-specific context and a mask token. The encoder model applies Masked Language Modeling (MLM) (Devlin et al., 2019) to generate predictions for the mask token from a set of prescribed candidate tokens in the vocabulary. These predictions can subsequently be utilized for label classification or probing purposes.
For example, the sentiment analysis task of assigning the product review "Worked as stated!" to the class POS can be reformulated as: "Worked as stated! All in all, it was [MASK]." The model is requested to fill in the word "good" at the mask token position, which is mapped to the POS label.
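As an illustration, this cloze reformulation and label-word scoring can be sketched in a few lines. The template string, the toy probabilities, and the verbalizer mapping below are illustrative stand-ins for a real pretrained MLM's output at the [MASK] position:

```python
# Toy sketch of cloze-style zero-shot classification. The probabilities
# below are hypothetical; in practice they come from the MLM's output
# distribution at the [MASK] position.

def build_prompt(review: str, template: str = "{x} All in all, it was [MASK].") -> str:
    """Reformulate an input sample into a cloze-style prompt."""
    return template.format(x=review)

def classify(mask_probs: dict, verbalizer: dict) -> str:
    """Map the highest-probability label word to its class."""
    best_word = max(verbalizer, key=lambda w: mask_probs.get(w, 0.0))
    return verbalizer[best_word]

prompt = build_prompt("Worked as stated!")
label = classify({"good": 0.92, "bad": 0.08}, {"good": "POS", "bad": "NEG"})
```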
However, earlier studies indicate that the output of masked token prediction is biased toward certain label words in the candidate token list (Weissweiler et al., 2022; Nie et al., 2023). This bias not only affects the predicted class probabilities (Holtzman et al., 2021; Ahuja et al., 2022), but also deteriorates the model's overall performance (Zhao et al., 2021; Lu et al., 2022). According to Weissweiler et al. (2022) and Zhao et al. (2021), label words with higher frequency in the pretraining corpus tend to be predicted with higher probabilities. Besides, the prompt context can also influence the degree of bias present in the masked token prediction. Figure 1 illustrates the impact of the above-mentioned biases on the model predictions. It shows the results of a binary sentiment analysis task with the BERT Base model (Devlin et al., 2019). The prompt template and label words used for this example can be found in Table 6. By shifting the threshold for predicting POS from 0.5 to approximately 0.95, the performance can be improved by more than 25%. Given only a mask token as input, the model predicts 0.92 and 0.08 as the probabilities for the label words good and bad, respectively. To tackle the bias in the distribution of label words, our proposed solution in this work is to combine pretrained encoder models with calibration methods.
In this paper, we contribute by (1) proposing a simple yet effective calibration method that involves adding trainable penalties to the probabilities of the label words, (2) demonstrating its effectiveness in achieving performance enhancements comparable to other existing calibration techniques, (3) refining the calibration parameters with only a few training examples for further improvement, and (4) boosting the zero-shot performance of multilingual encoders by introducing calibration methods.

Existing Calibration Methods
Contextual Calibration (CC) Zhao et al. (2021) apply an affine transformation (Platt et al., 1999) to the original probabilities, as the first equation in Table 1 shows. The parameters of the affine transformation are obtained from the output probability distribution of a content-free input, e.g., the mask token, denoted p_cf. W = diag(p_cf)^-1 is the inverse of the diagonal matrix formed from p_cf, and b is an all-zero vector.
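A minimal sketch of this transformation, using toy numbers in place of real model outputs. For readability the result is renormalized to a distribution, though only the argmax matters for classification:

```python
import numpy as np

def contextual_calibration(p: np.ndarray, p_cf: np.ndarray) -> np.ndarray:
    """Apply q = W p + b with W = diag(p_cf)^-1 and b = 0, then renormalize."""
    W = np.linalg.inv(np.diag(p_cf))
    b = np.zeros_like(p)
    q = W @ p + b
    return q / q.sum()

p_cf = np.array([0.92, 0.08])  # content-free ([MASK]-only) label-word probabilities
p = np.array([0.60, 0.40])     # raw label-word probabilities for one input
q = contextual_calibration(p, p_cf)  # the biased majority label no longer wins
```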
Domain Conditional Pointwise Mutual Information (PMI_DC) Holtzman et al. (2021) adjust the conditional class probability p(y|x, t) by dividing it by the prior probability p(y|t) of that class. We estimate p(y|t) for a given template t using MLM with a prompt created by instantiating the prompt template with an empty input.
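The adjustment can be computed directly in log space; the probabilities below are toy values standing in for real MLM outputs:

```python
import numpy as np

def pmi_dc(p_cond: np.ndarray, p_prior: np.ndarray) -> np.ndarray:
    """Score each label y by log p(y|x,t) - log p(y|t); the argmax is the prediction."""
    return np.log(p_cond) - np.log(p_prior)

p_prior = np.array([0.90, 0.10])  # p(y|t), estimated from the empty-input prompt
p_cond = np.array([0.70, 0.30])   # p(y|x,t) for one input
scores = pmi_dc(p_cond, p_prior)  # the prediction flips to the rarer label
```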
Calibration By Marginalization (CBM) Yang et al. (2023) are inspired by PMI_DC. Unlike PMI_DC, CBM approximates p(y|x, t) more precisely by computing its marginalized probability, as the third equation in Table 1 shows. For each prediction, the marginal probability Σ_{x'∈X} p(y|x', t) is calculated by taking all test inputs into account.
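A sketch of this marginalization over a toy matrix of test-set probabilities (the exact normalization details may differ from the original CBM formulation):

```python
import numpy as np

def cbm(P: np.ndarray) -> np.ndarray:
    """Divide each column (label) of P[i, y] = p(y|x_i, t) by its sum over
    all test inputs, then renormalize each row into a distribution."""
    Q = P / P.sum(axis=0, keepdims=True)
    return Q / Q.sum(axis=1, keepdims=True)

# Toy probabilities for three test inputs and two labels, biased toward label 0.
P = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.6, 0.4]])
Q = cbm(P)  # the least label-0-like input is now assigned label 1
```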

Our Method: Probability Penalty
Motivated by the observation in Figure 1 that a simple shift in the model's output distribution can substantially alleviate the label bias, we propose a penalty-based calibration approach, as the equation in the last row of Table 1 shows. The core idea is to introduce a penalty term that is added to each individual label word probability. We initialize the corresponding parameter vector p with the negative prior probabilities of the label words. We estimate these prior probabilities using the output distribution of MLM applied to a bare mask token as input.
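With this initialization, the penalty calibration amounts to a single vector addition; the numbers below are toy values standing in for real model outputs:

```python
import numpy as np

# Penalty calibration: add a per-label penalty to the raw probabilities.
# The penalty vector is initialized with the negative prior probabilities,
# estimated here by toy values for MLM applied to a bare [MASK] token.
prior = np.array([0.92, 0.08])
penalty = -prior

p = np.array([0.60, 0.40])  # raw label-word probabilities for one input
scores = p + penalty        # calibrated scores; the argmax is the prediction
```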

Experimental Setup
Dataset We first validate the effectiveness of the different calibration methods on several monolingual tasks. We study sentiment analysis using two datasets: the binary Amazon Polarity dataset (McAuley and Leskovec, 2013) and the English subset of the 5-label Multilingual Amazon Reviews dataset (Keung et al., 2020); topic categorization using two datasets: AG News and Yahoo Answers Topics (Zhang et al., 2015); sentence pair classification using two datasets: the English subsets of XNLI (Conneau et al., 2018) and PAWS-X (Yang et al., 2019); and 5 datasets from the GLUE benchmark (Wang et al., 2019): CoLA (Warstadt et al., 2019), MRPC (Dolan and Brockett, 2005), QQP, RTE (Dagan et al., 2005), and WNLI (Levesque et al., 2012). For the evaluation of multilingual encoders, we use Multilingual Amazon Reviews, XNLI and PAWS-X. Besides, following Nie et al. (2023), we expand the AG News dataset to 25 languages using machine translation to conduct a wide range of cross-lingual analyses.
In the multilingual experiments, we use the multilingual counterparts of BERT and RoBERTa: bert-base-multilingual-cased and xlm-roberta-base (Conneau et al., 2020). We use PyTorch (Paszke et al., 2019) and the HuggingFace framework (Wolf et al., 2020). We repeat each experiment 5 times with different random seeds and report the mean and variance. Details of the experimental setting can be found in Appendix A.

Zero-shot calibration
We first validate the effectiveness of the various calibration methods on monolingual encoders. Table 2 shows the results of zero-shot calibration, where we directly calculate the calibrated probabilities without using additional training samples. We report accuracies for evenly distributed datasets and F1 scores for imbalanced datasets. Compared to the uncalibrated baseline systems, we obtain improvements across most of the tasks, except for the CC method combined with the RoBERTa model. In this specific case, the average performance worsens compared to the no-calibration baseline due to outlier performance observed in several tasks, such as Yahoo and CoLA.

Adding few-shot samples further boosts the performance
As the formulas in Table 1 show, PMI_DC and CBM directly modify the probabilities without introducing additional parameters, while CC and Penalty use specific calibration parameters, which are trainable. In zero-shot calibration, these parameters are initialized with prior probabilities and not updated. We now make use of the trainability of the parameters in CC and Penalty to investigate whether applying few-shot training to the calibration parameters further improves the performance.
Table 3 shows the results of few-shot calibration. We observe that training the calibration parameters on just a few samples further enhances the performance of the calibrated systems. Compared to zero-shot calibration, few-shot calibration achieves better performance in most cases. We also compare calibration methods in few-shot scenarios with the NLI-based zero-shot classification baseline proposed by Yin et al. (2019). Details of the baseline setting and the few-shot training process are described in Appendices A.3 and B.
Figure 2 shows the few-shot calibration results of the RoBERTa model on the AG News and NLI tasks. Prior research (Zhao and Schütze, 2021) showed that few-shot learning can be unstable due to randomness. However, as Figure 2 shows, the variation in performance diminishes markedly as the number of shots increases. Our experimental results indicate that few-shot calibration not only enhances the performance but also increases the stability.

Results on Multilingual Encoders
Table 4 shows our experimental results on multilingual datasets, indicating that calibration methods are also effective for multilingual encoders.
Our experiments cover a large range of languages, considering both language availability, i.e., whether and how much data for a language exists in the pretraining corpus, and language diversity, i.e., to which language family a language belongs. Specifically, for Amazon-S, XNLI and PAWS-X, we use the original test sets, which mainly contain high-resource languages. In the multilingual AG News task, we include many low-resource and unseen languages by generating parallel multilingual test sets using machine translation. Recent research by Hu et al. (2020) and Liu et al. (2022) shows that automatically translated test sets are useful for measuring cross-lingual performance. Hence, we adopt their methodology and expand the language coverage of the AG News dataset to 25 languages. The list of languages can be found in Appendix C.
The results on multilingual BERT and XLM-R show that all four calibration methods improve the multilingual performance averaged across all tasks. For both models, CBM always emerges as the top-performing approach. Unlike the other approaches, which calibrate the prediction for each input independently, CBM is the only method that leverages the (unlabeled) test set to adjust the calibration parameters. This could account for its substantial performance advantage over the others.

Multilingual Analysis
Now we analyze how different language properties correlate with the performance of multilingual BERT on the AG News task.

Language Accessibility
We first group the evaluation languages into low-resource languages, unseen languages, and languages with unseen scripts to determine the influence of language accessibility. Low-resource languages are contained in the pretraining corpus, but account for only a small portion of it. Unseen languages do not occur in the pretraining corpus at all, so the multilingual encoder has never seen them. The hardest case involves languages with unseen scripts, where the model has not even encountered the characters of the language. However, our test set contains no languages with completely unseen scripts, because machine translation frequently generates code-switched data. Figure 3 (a) shows that low-resource languages generally perform better than the two types of unseen languages, indicating that the multilingual encoder's access to a language during pretraining is crucial for the performance enhancement via calibration.

Language Diversity
We further group the languages according to their phylogenetic relationships, i.e., the language families they belong to. We analyze the language families containing at least 3 languages. The box plots in Figure 3 (b) reveal that the impact of calibrating multilingual encoders varies across different language groups. Specifically, we observe that Indo-European and Dravidian languages tend to benefit more from calibration than Austronesian and Niger-Congo languages.
This discrepancy suggests that the effectiveness of calibration techniques can be influenced by the language accessibility of multilingual encoders and the linguistic characteristics of language families.

Conclusion
In conclusion, our work focuses on boosting the zero-shot learning performance of multilingual encoders on language understanding tasks through probability calibration techniques. We address the bias in the mask token prediction of label words by introducing various calibration techniques that modify the probabilities of these words. We first test the efficacy of different calibration methods on monolingual encoders. We also show that with a minimal number of training examples, the calibrated probabilities yield further improvements compared to zero-shot calibration. Our experiments on multilingual encoders demonstrate that all calibration methods bring performance improvements across various tasks.

Limitations
We propose a simple yet effective calibration method to enhance the zero-shot performance of monolingual and multilingual encoders. While our work shows the effectiveness of calibration for improving predictions on multilingual tasks, it is important to note that our research focuses primarily on classification tasks with multilingual encoders. As a result, our findings and proposed methods may not directly translate to generation tasks, such as question answering (QA), which involve generative multilingual models. Future investigations should explore the application of our calibration methods to generation tasks and evaluate their effectiveness in enhancing the performance of generative multilingual models. This extension could provide valuable insights into the potential benefits and limitations of our approaches across a broader range of NLP tasks.

Ethics Statement
This research was conducted in accordance with the ACM Code of Ethics.All the datasets that we use are publicly available.We report only aggregated results in the main paper.We do not share any Personally Identifiable Data in this paper.

A Experimental Details
This section provides a comprehensive overview of our experimental setup, including hyperparameters, prompt templates that we use in our experiments, and the baselines.

A.1 Hyperparameters
To ensure experimental reproducibility, we present the hyperparameter settings used in our study in Table 5.

A.2 Prompt Engineering
We select a set of prompt templates for the tasks through preliminary experiments. Table 6 shows the prompt templates and the label words used in our experiments.

A.3 Baseline
To establish a baseline, we initially conduct experiments without employing any calibration method. Subsequently, we introduce the four calibration methods individually and evaluate their impact on the performance. Besides, we compare our calibration methods with the NLI-based zero-shot classification baseline proposed by Yin et al. (2019), which first finetunes a pretrained language model on the MNLI dataset and then reformulates common classification tasks in an NLI format. The input sample is regarded as the premise, while the label serves as the hypothesis. Zero-shot classification is performed by directly comparing the probabilities of predicting entailment for all input-label pairs. For this baseline, we finetune a BERT model and a RoBERTa model on the MNLI task.
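The label-selection step of this baseline reduces to an argmax over per-label entailment probabilities; the scores below are hypothetical values standing in for a finetuned NLI model's output:

```python
# Each candidate label is turned into a hypothesis, scored for entailment
# against the input (premise), and the highest-scoring label wins.

def nli_zero_shot(entail_probs: dict) -> str:
    """Pick the label whose hypothesis receives the highest entailment probability."""
    return max(entail_probs, key=entail_probs.get)

pred = nli_zero_shot({"sports": 0.81, "business": 0.35, "politics": 0.12})
```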

B Few-Shot Training of Calibration Parameters
Algorithm 1 presents the process of few-shot training of the penalty calibration used in our few-shot investigation.
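As a rough illustration of how such few-shot training might look, here is a plain gradient-descent sketch with toy numbers; the actual Algorithm 1 may differ in its loss and optimizer:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_penalty(P, labels, prior, lr=0.1, steps=200):
    """Fit the penalty vector on a few labeled examples by gradient descent
    on the cross-entropy of softmax(p + penalty)."""
    penalty = -prior.astype(float)  # zero-shot initialization: negative priors
    onehot = np.eye(P.shape[1])[labels]
    for _ in range(steps):
        probs = softmax(P + penalty)
        penalty -= lr * (probs - onehot).mean(axis=0)  # dCE/dpenalty
    return penalty

# Toy label-word probabilities for four training inputs, biased toward label 0.
P = np.array([[0.90, 0.10], [0.85, 0.15], [0.55, 0.45], [0.50, 0.50]])
labels = np.array([0, 0, 1, 1])
prior = np.array([0.92, 0.08])            # from a [MASK]-only input
penalty = train_penalty(P, labels, prior)
preds = np.argmax(P + penalty, axis=1)    # all four shots now classified correctly
```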

C Detailed Results
Detailed results of the experiments in the main text can be found in this section. Table 8 shows the complete results of mBERT on the multilingual AG News dataset across all 25 languages. Table 7 provides an overview of the languages covered by the multilingual AG News dataset.

Figure 1 :
Figure 1: Example of the model prediction bias. The graph shows the accuracy on the Amazon Polarity test data (equally distributed) as a function of the classification threshold. The x-axis refers to the threshold probability of good for classifying examples as class POS. The best results are obtained by classifying examples as POS if the probability of good exceeds 0.96.

Figure 2 :
Figure 2: Performance and variation of few-shot calibration on the RoBERTa model.

Figure 3 :
Figure 3: Performance Improvement of multilingual BERT with two calibration methods.
CC (Contextual Calibration): q(y|x, t) = Wp(y|x, t) + b
PMI_DC (Domain Conditional Pointwise Mutual Information): q(y|x, t) = log [p(y|x, t) / p(y|t)]
CBM (Calibration By Marginalization): q(y|x, t) = p(y|x, t) / Σ_{x'∈X} p(y|x', t)
Penalty (ours): q(y|x, t) = p(y|x, t) + p

Table 1 :
Overview of Calibration Methods.y refers to the label words.X is the test dataset, x is an input sample, and t is the prompt template.

Table 2 :
Results of zero-shot calibration methods on monolingual tasks. Amazon-P refers to Amazon Polarity (binary classification). Amazon-S refers to Amazon Star (5-way classification).

Table 4 :
Results of calibration methods on the multilingual datasets (AG News, Amazon-S, XNLI, PAWS-X) and their average. We report the best results for CC and Penalty in different few-shot settings.