Enhancing Cross-lingual Natural Language Inference by Soft Prompting with Multilingual Verbalizer

Cross-lingual natural language inference is a fundamental problem in cross-lingual language understanding. Many recent works have used prompt learning to address the lack of annotated parallel corpora in XNLI. However, these methods adopt discrete prompting by simply translating the templates to the target language and need external expert knowledge to design the templates. Besides, discrete prompts of human-designed template words are not trainable vectors and cannot be migrated to target languages flexibly in the inference stage. In this paper, we propose a novel Soft prompt learning framework with the Multilingual Verbalizer (SoftMV) for XNLI. SoftMV first constructs a cloze-style question with soft prompts for the input sample. Then we leverage bilingual dictionaries to generate an augmented multilingual question for the original question. SoftMV adopts a multilingual verbalizer to align the representations of the original and augmented multilingual questions into the same semantic space with consistency regularization. Experimental results on XNLI demonstrate that SoftMV achieves state-of-the-art performance and significantly outperforms previous methods under both the few-shot and full-shot cross-lingual transfer settings.


Introduction
Multilingual NLP systems have gained increasing attention due to the growing demand for multilingual services. Cross-lingual language understanding (XLU) plays a crucial role in multilingual systems, in which cross-lingual natural language inference (XNLI) is a fundamental and challenging task (Conneau et al., 2018; MacCartney and Manning, 2008; Li et al., 2022, 2023). NLI is a fundamental problem in NLU that can help with tasks such as semantic parsing (Liu et al., 2022a; Lin et al., 2022) and relation extraction (Liu et al., 2022b; Hu et al., 2020, 2021). In the XNLI setting, the model is trained on the source language with annotated data to reason about the relationship between a pair of sentences (namely premise and hypothesis) and evaluated on target languages without parallel corpora. Pre-trained multilingual language models, such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020), have demonstrated promising performance in cross-lingual transfer learning. These language models learn a shared multilingual embedding space to represent words in parallel sentences. However, these models are trained on large parallel corpora, which are not available for many low-resource languages. The major challenge of XNLI is thus the lack of annotated data for low-resource languages.
To address this problem, some works explored prompt learning (Brown et al., 2020; Schick and Schütze, 2021a; Shin et al., 2020) when adapting pre-trained language models to downstream tasks in cross-lingual scenarios. Prompt learning reformulates the text classification problem into a masked language modeling (MLM) problem by constructing cloze-style questions with a special token <MASK>. The model is trained to predict the masked word in the cloze-style questions. As shown in Table 1, prompt learning can be divided into three types: Discrete Prompts (DP), Soft Prompts (SP), and Mixed Prompts (MP). Zhao and Schütze (2021) investigated the effectiveness of prompt learning in multilingual tasks by simply applying soft, discrete, and mixed prompting with a uniform template in English. Qi et al. (2022) proposed a discrete prompt learning framework that constructs an augmented sample by randomly sampling a template in another language. By comparing the augmented samples with the original samples in the English template, the model can effectively perceive the correspondence between different languages. However, discrete prompts of human-designed template words require extensive external expert knowledge and are not flexible enough to adapt to different languages. Therefore, the model cannot perform well when transferred from high-resource to low-resource languages.
In this paper, we propose a novel Soft prompt learning framework with the Multilingual Verbalizer (SoftMV) for XNLI. First, we construct cloze-style questions for the input samples with soft prompts, which consist of trainable vectors. Second, we apply the code-switched substitution strategy (Qin et al., 2021) to generate multilingual questions, which can be regarded as cross-lingual views of the English questions. Compared with discrete prompts, soft prompts perform prompting directly in the embedding space of the model and can be easily adapted to any language without human-designed templates. Both the original and augmented questions are fed into a pre-trained cross-lingual base model. The classification probability distribution is calculated by predicting the masked token with the multilingual verbalizer to reduce the gap between different languages. Finally, the two probability distributions are regularized by the Kullback-Leibler divergence (KLD) loss (Kullback and Leibler, 1951) to align the representations of the original and augmented multilingual questions in the same space. The entire model is trained with a combined objective of a cross-entropy term for classification accuracy and a KLD term for representation consistency. The well-trained soft prompt vectors are frozen in the inference stage. Experimental results on the XNLI benchmark show that SoftMV outperforms the baseline models by a significant margin under both the few-shot and full-shot settings.
Our contributions can be summarized as follows:
• We propose a novel Soft prompt learning framework with a Multilingual Verbalizer (SoftMV) for XNLI. SoftMV leverages bilingual dictionaries to generate augmented multilingual code-switched questions for original questions constructed with soft prompts.
• We adopt the multilingual verbalizer to align the representations of original and augmented questions into the same semantic space with consistency regularization.
• We conduct extensive experiments on XNLI and demonstrate that SoftMV can significantly outperform the baseline methods under the few-shot and full-shot cross-lingual transfer settings.

Related Work
Early methods for cross-lingual natural language inference are mainly neural networks, such as Conneau et al. (2018) and Artetxe and Schwenk (2019), which encode sentences from different languages into the same embedding space via parallel corpora (Hermann and Blunsom, 2014). In recent years, large pre-trained cross-lingual language models have demonstrated promising performance. Devlin et al. (2019) extend the basic language model BERT to multilingual scenarios by pre-training with multilingual corpora. Conneau and Lample (2019) propose a cross-lingual language model (XLM) that enhances BERT with the translation language modeling (TLM) objective. XLM-R (Conneau et al., 2020) improves XLM by training with more languages and more epochs. Although these methods do not rely on parallel corpora, they still have limitations because fine-tuning needs annotation efforts that are prohibitively expensive for low-resource languages.
To tackle this problem, some data augmentation methods have been proposed for XNLI. Ahmad et al. (2021) propose to augment mBERT with universal language syntax using an auxiliary objective for cross-lingual transfer. Dong et al. (2021) adopt Reorder Augmentation and Semantic Augmentation to synthesize controllable and much less noisy data for XNLI. Bari et al. (2021) improve cross-lingual generalization by unsupervised sample selection and data augmentation from the unlabeled training examples in the target language. Zheng et al. (2021) propose a cross-lingual fine-tuning method to better utilize four types of data augmentation based on consistency regularization.
However, these methods do not perform well under the few-shot settings.
Recently, prompt learning (Brown et al., 2020; Shin et al., 2020; Lester et al., 2021; Vu et al., 2022; Li and Liang, 2021; Qin and Eisner, 2021; Liu et al., 2022c) has shown promising results in many NLP tasks under the few-shot setting. The key idea of prompt learning for XNLI is to reformulate the text classification problem into a masked language modeling problem by constructing cloze-style questions. Su et al. (2022) propose a novel prompt-based transfer learning approach, which first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task. Wu and Shi (2022) adopt separate soft prompts to learn embeddings enriched with domain knowledge. Schick and Schütze (2021a) apply discrete prompt learning to NLI with manually defined templates. Zhao and Schütze (2021) demonstrate that prompt learning outperforms fine-tuning for few-shot XNLI by simply applying soft, discrete, and mixed prompting with a uniform template in English. Qi et al. (2022) propose a discrete prompt learning framework that constructs an augmented sample by randomly sampling a template in another language. However, discrete prompts of human-designed template words require extensive external expert knowledge and are not flexible enough to adapt to different languages. In our work, we adopt trainable soft prompts to capture the correspondence between different languages by comparing the augmented multilingual questions with the original ones.

Framework
The proposed SoftMV framework is illustrated in Figure 1, and its training process is formalized in Algorithm 1. For every training triple (premise, hypothesis, label) in English, SoftMV first constructs a cloze-style question with soft prompts initialized from the vocabulary. Then, we apply the code-switched substitution strategy to generate multilingual questions, which can be regarded as cross-lingual views of the English questions. Both the original and augmented questions are fed into a pre-trained cross-lingual model to calculate the answer distributions of the mask token with a multilingual verbalizer. SoftMV is trained by minimizing the cross-entropy loss for classification accuracy and the Kullback-Leibler divergence (KLD) loss for representation consistency. Finally, the well-trained soft prompt vectors are frozen in the inference stage.

Soft Prompting
Each instance in a batch I of the XNLI dataset is denoted as (P_i, H_i, Y_i), i ∈ I, where P_i = {w^P_j}_{j=1}^{m} denotes the word sequence of the premise, H_i = {w^H_j}_{j=1}^{n} denotes the word sequence of the hypothesis, and Y_i ∈ Y denotes the class label. SoftMV first constructs a cloze-style question with soft prompts as illustrated in Table 1. The question template is expressed as "<s>Premise.</s><s>Hypothesis?<v_1>...<v_n><MASK></s>", where <s> and </s> are special tokens to separate sentences, <MASK> is the mask token, and each <v_i> is associated with a trainable vector (in the PLM's first embedding layer). Soft prompts are tuned in the continuous space and initialized with the average of the embeddings of the PLM's multilingual vocabulary.

Algorithm 1 (the training process of SoftMV) takes the number of epochs E and the training set as input, constructs cloze-style questions Q with soft prompts, generates augmented multilingual questions with the code-switched strategy, divides Q into a set of batches B, and, for each epoch and each batch, computes the total loss L by Eq. 7 and updates the model parameters.

In cross-lingual transfer scenarios, it is a challenge for a model to align contextualized representations in different languages into the same semantic space when trained solely on the English dataset. Therefore, we adopt the code-switched strategy to create multilingual augmentations for the original questions. Following Qin et al. (2021), we use bilingual dictionaries (Lample et al., 2018) to replace words of the original sentences. Specifically, for an English sentence, we randomly choose n = α * l words to be replaced with a translation word from a bilingual dictionary, where α is the code-switched rate and l is the length of the sentence. For example, given the English sentence "Two men on bicycles competing in a race.", we can generate a multilingual code-switched sample "Two Männer(DE) on Bicyclettes(FR) competing in a yarış(TR).", which can be regarded as a cross-lingual view of the same meaning across different languages.

The original and augmented cloze-style questions are fed into a pre-trained cross-lingual model to obtain the contextualized representations of the mask token, denoted as h^o_mask and h^a_mask. Let |V| denote the size of the vocabulary and d the dimension of the representation of the mask token. The answer probability distribution of the original question is calculated by

y^o = softmax(W h^o_mask),

where W ∈ R^{|V|×d} denotes the trainable parameters of the pre-trained MLM layer. The answer probability distribution y^a of the augmented question is calculated in the same way.
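The question construction and code-switched substitution above can be sketched as follows. The helper names, the toy bilingual dictionaries, and the plain-string rendering of the template are illustrative assumptions, not the authors' implementation (the paper uses the MUSE dictionaries of Lample et al., 2018):

```python
import random

def build_question(premise, hypothesis, n_soft=4):
    """Render the cloze template <s>Premise.</s><s>Hypothesis?<v_1>...<v_n><MASK></s>."""
    prompts = "".join(f"<v_{i}>" for i in range(1, n_soft + 1))
    return f"<s>{premise}</s><s>{hypothesis}{prompts}<MASK></s>"

def code_switch(sentence, dictionaries, alpha=0.3, rng=None):
    """Replace n = int(alpha * l) words with translations drawn from randomly
    chosen bilingual dictionaries; words absent from the chosen dictionary
    are kept unchanged."""
    rng = rng or random.Random(0)
    words = sentence.split()
    n = int(alpha * len(words))
    for idx in rng.sample(range(len(words)), n):
        lang = rng.choice(list(dictionaries))
        words[idx] = dictionaries[lang].get(words[idx].lower(), words[idx])
    return " ".join(words)

# Toy bilingual dictionaries (illustrative only).
dicts = {
    "DE": {"men": "Männer"},
    "FR": {"bicycles": "bicyclettes"},
    "TR": {"race": "yarış"},
}
```

With α = 0.3, an eight-word sentence has two of its words substituted, mirroring the "Two Männer(DE) on Bicyclettes(FR) ..." example above.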

Multilingual Verbalizer
After calculating the answer probability distribution of the mask token, we use a verbalizer to calculate the classification probability distribution. The verbalizer M : Y → V is a function that maps NLI labels to answer words in the given vocabulary. The model is trained to predict masked words that correspond to classification labels, as determined by the verbalizer. Concretely, the verbalizer for English is defined as {"Entailment" → "yes"; "Contradiction" → "no"; "Neutral" → "maybe"}, following Schick and Schütze (2021b).
Without parallel corpora in cross-lingual scenarios, there is a gap in the classification space between the original and multilingual representations. Using the English verbalizer for all languages might hinder the model's ability to capture semantic representations for multilingual inputs. Thus we use a multilingual verbalizer to learn a consistent classification probability distribution across different languages. The multilingual verbalizer comprises a set of verbalizers for different languages, denoted as {M_l, l ∈ L}, where L is the set of languages and l is a specific language. The non-English verbalizers are translated from English using bilingual dictionaries. For example, the verbalizer for Turkish is defined as {"Entailment" → "Evet"; "Contradiction" → "hiçbir"; "Neutral" → "belki"}.
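The verbalizer mapping can be sketched as follows: the MLM distribution over the vocabulary at the <MASK> position is reduced to a distribution over the three NLI labels by gathering the probabilities of each language's answer words. The token ids below are made-up placeholders, not real XLM-R vocabulary indices:

```python
import torch

# Verbalizers for two of the languages (entries follow the paper).
VERBALIZERS = {
    "en": {"Entailment": "yes", "Contradiction": "no", "Neutral": "maybe"},
    "tr": {"Entailment": "Evet", "Contradiction": "hiçbir", "Neutral": "belki"},
}
# Toy vocabulary indices for the answer words (illustrative only).
TOY_TOKEN_IDS = {"yes": 5, "no": 7, "maybe": 11, "Evet": 13, "hiçbir": 17, "belki": 19}
LABELS = ["Entailment", "Contradiction", "Neutral"]

def classification_probs(mlm_probs, lang):
    """Map a batch of MLM distributions (batch, |V|) at the <MASK> position
    to distributions over the three NLI labels via the verbalizer of `lang`."""
    ids = torch.tensor([TOY_TOKEN_IDS[VERBALIZERS[lang][y]] for y in LABELS])
    scores = mlm_probs[:, ids]                    # gather answer-word probabilities
    return scores / scores.sum(-1, keepdim=True)  # renormalize over the labels
```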

Training Objective
In the training stage, given a batch I of N triples denoted as (X^o_i, X^a_i, Y_i), 1 ≤ i ≤ N, the cross-entropy losses for the original question X^o_i and the augmented question X^a_i are respectively calculated by

ℓ^o_i = − Σ_{j∈Y} I(Y_i = j) log y^o_{i,j},    ℓ^a_i = − Σ_{j∈Y} I(Y_i = j) log y^a_{i,j},

where y^o_{i,j} (resp. y^a_{i,j}) denotes the j-th element of the answer probability distribution y^o for the original question X^o_i (resp. for the augmented question X^a_i) and I(C) is the indicator function that returns 1 if C is true and 0 otherwise. The cross-entropy losses of the original and augmented questions on the batch I are calculated by

L^o_CE = (1/N) Σ_{i=1}^{N} ℓ^o_i,    L^a_CE = (1/N) Σ_{i=1}^{N} ℓ^a_i.

However, for the same premise and hypothesis, the answer probability distribution of the augmented multilingual question created by the code-switched strategy may deviate from that of the original question due to the misalignment of representations in the multilingual semantic space. Such a deviation may cause the model to learn the wrong probability distribution when evaluated on target languages. To alleviate this problem, we propose a consistency regularization to constrain the answer probability distribution. In particular, we adopt the Kullback-Leibler divergence (KLD) to encourage the answer probability distribution of the augmented question to be close to that of the original question. The consistency loss is defined as

L_KL = (1/N) Σ_{i=1}^{N} KL(y^o_i ∥ y^a_i).

The cross-entropy loss encourages the model to learn correct predictions for the augmented inputs, while the KLD loss enforces consistency between the original and augmented representations in the same multilingual semantic space. Using these loss terms together ensures that the model not only performs well on the original inputs but also generalizes to the augmented inputs, resulting in a more robust model that effectively handles cross-lingual tasks. The overall objective of SoftMV is a tuned linear combination of the cross-entropy losses and the KLD loss, defined as

L = λ_1 L^o_CE + λ_2 L^a_CE + λ_3 L_KL,    (7)

where λ_1, λ_2, λ_3 are tuning parameters for each loss term.
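The combined objective can be sketched in PyTorch as follows. This is a minimal sketch assuming both inputs are already label distributions from the verbalizer; the per-term weighting (here a single coefficient on the KLD term) stands in for the paper's tuned λ coefficients:

```python
import torch
import torch.nn.functional as F

def softmv_loss(probs_orig, probs_aug, labels, lam_kl=1.0):
    """Cross-entropy on the original and augmented label distributions plus a
    KLD consistency term pulling the augmented distribution toward the
    original one (a sketch of the combined objective)."""
    ce_o = F.nll_loss(torch.log(probs_orig), labels)
    ce_a = F.nll_loss(torch.log(probs_aug), labels)
    # KL(p_orig || p_aug), averaged over the batch; kl_div expects
    # log-probabilities as its first argument.
    kl = F.kl_div(torch.log(probs_aug), probs_orig, reduction="batchmean")
    return ce_o + ce_a + lam_kl * kl
```

When the two distributions coincide, the KLD term vanishes and the loss reduces to the sum of the two cross-entropy terms.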

Benchmark Dataset
We conducted experiments on the large-scale multilingual benchmark dataset XNLI (Conneau et al., 2018), which extends the MultiNLI (Williams et al., 2018) benchmark (in English) to 15 languages through translation and comes with manually annotated development and test sets. For each language, the training set comprises 393K annotated sentence pairs, whereas the development set and the test set comprise 2.5K and 5K annotated sentence pairs, respectively.
We evaluate SoftMV and other baseline models under the few-shot and full-shot cross-lingual settings, where the models are only trained on English and evaluated on other languages.For the few-shot setting, the training and validation data are sampled by Zhao and Schütze (2021) with k ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256} shots per class from the English training data in XNLI.We report classification accuracy as the evaluation metric.
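The k-shot split construction described above can be sketched as follows; the field names and seeding scheme are assumptions for illustration (the actual splits are those released by Zhao and Schütze, 2021):

```python
import random
from collections import defaultdict

def sample_k_shots(examples, k, seed=1):
    """Sample k examples per NLI class from the English training data."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    rng = random.Random(seed)
    shots = []
    for label, pool in sorted(by_label.items()):
        shots.extend(rng.sample(pool, k))
    return shots
```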

Implementation Details
We implement SoftMV using the pre-trained XLM-RoBERTa model (Conneau et al., 2020) based on PyTorch (Paszke et al., 2019) and the Huggingface framework (Wolf et al., 2020). XLM-R is a widely used multilingual model, and the baseline we compare with (PCT) only reports results using XLM-R.
We train our model for 70 epochs with a batch size of 24 using the AdamW optimizer. The code-switched rate α is set to 0.3. The maximum sequence length is set to 256. All experiments are conducted 5 times with different random seeds ({1, 2, 3, 4, 5}), and we report the average scores. The trained soft prompt vectors are frozen in the inference stage. Appendix A details the hyperparameters and computing devices used under different settings.
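The soft-prompt handling described in this paper (initialization with the average of the vocabulary embeddings, as in the Soft Prompting section, then freezing after training) can be sketched as follows; the function name and shapes are illustrative assumptions:

```python
import torch

def init_soft_prompts(embedding_weight, n_prompts=4):
    """Initialize n_prompts soft prompt vectors with the average of the PLM's
    vocabulary embeddings; the returned parameter is trainable and can later
    be frozen with `.requires_grad_(False)` for inference."""
    mean = embedding_weight.mean(dim=0, keepdim=True)       # (1, d)
    return torch.nn.Parameter(mean.repeat(n_prompts, 1).clone())
```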

Main Results
We conducted experiments on the XNLI dataset under the cross-lingual transfer setting, where models are trained on the English dataset and then directly evaluated on the test sets of all languages. This setting can be further divided into two sub-settings: the few-shot setting, using a fixed number of training samples per class, and the full-shot setting, using the whole training set.
Few-shot results Table 2 reports the results comparing SoftMV with other models on XNLI under the few-shot setting. The results of the compared models are taken from Zhao and Schütze (2021) and Qi et al. (2022). The PCT† results in the 1/2/4/8-shot experiments are reproduced by us, as they were not reported previously. Note that all models are based on XLM-R base and trained on the same data splits from Zhao and Schütze (2021). The results show that SoftMV significantly outperforms all baselines for all languages under all settings, by 3.5% on average. As expected, all models benefit from more shots. When the number of shots k per class decreases, the gap between SoftMV and the state-of-the-art model (PCT) becomes larger, implying that our model has a stronger ability to align contextualized representations in different languages into the same space when training data are scarce. In particular, SoftMV outperforms PCT by 4.4%, 2.8%, 4.3%, and 8.9% in the 1/2/4/8-shot experiments, respectively. When k is larger than 8, the average performance of SoftMV still exceeds PCT by an absolute gain of 2.5% on average. Furthermore, across languages, all methods perform best on EN (English) and worst on AR (Arabic), VI (Vietnamese), UR (Urdu), and SW (Swahili). It is difficult to obtain usable corpora in these low-resource languages for XLM-R, so the model learns them poorly. SoftMV also outperforms PCT on these low-resource languages, which demonstrates that our model is more effective in cross-lingual scenarios, especially for low-resource languages.
Full-shot results Table 3 shows the results on XNLI under the full-shot setting. The results of the compared models are taken from Qi et al. (2022). SoftMV with XLM-R base achieves 78.8% accuracy averaged over the 15 target languages, significantly outperforming the basic XLM-R base model by 4.6% on average. Compared with PCT, SoftMV improves by 3.5% on average based on XLM-R base. Furthermore, the accuracy of SoftMV exceeds PCT by 0.3% on EN, but by 4.6% on AR, 11.8% on SW, and 10.5% on UR. This indicates that SoftMV has better transferability to low-resource languages with well-trained soft prompt vectors. To further investigate its effectiveness, we also evaluated SoftMV against baselines based on the XLM-R large model. SoftMV achieves 82.1% accuracy on average, significantly outperforming PCT and XLM-R large by 0.8% and 1.7%, respectively. Compared with the results on XLM-R base, the improvements of SoftMV on XLM-R large are smaller, which indicates that SoftMV is more effective on XLM-R base, which has fewer parameters and weaker cross-lingual ability. The performance gains are due to the stronger ability of SoftMV to align contextualized representations in different languages into the same semantic space with consistency regularization.

Ablation Study
To better understand the contribution of each key component of SoftMV, we conduct an ablation study under the 8-shot setting with XLM-R base. The results are shown in Table 4. After removing the code-switched method, the performance decreases by 1.9% on average, which shows that the augmented multilingual samples help the model understand other languages. When we remove the consistency loss, the average accuracy decreases by 2.5%. The consistency loss helps the model align the representations across different languages into the same semantic space. Removing the multilingual verbalizer leads to a 1.7% accuracy drop on average. This demonstrates that the multilingual verbalizer can reduce the gap between different languages when calculating the classification probability distribution. We also replace soft prompts with discrete prompts as illustrated in Table 1, which leads to an accuracy drop of 1.3% on average. The accuracy decreases by 1.0% when using mixed prompts instead of soft prompts. The reason is that template words in mixed prompts hurt SoftMV if not specifically designed with expert knowledge. Furthermore, we use randomly initialized prompts to replace the prompts initialized from the multilingual vocabulary, which leads to a 0.5% accuracy drop on average.

Analysis of Code-switched Method
To further investigate the code-switched method, we conduct experiments using a single language to create augmented multilingual samples. Figure 2 shows the results of SoftMV with 10 different seeds under the 8-shot setting, averaged over 15 languages. We observe that SoftMV performs worst, with an accuracy of 42.1%, when using AR (Arabic) to replace the words in sentences. When using TR (Turkish) to replace the words, SoftMV outperforms the results obtained with any other single language. The reason is that TR is different from EN while not as rare as low-resource languages such as UR (Urdu) and AR, so the model can better align contextualized representations in different languages into the same semantic space. When randomly selecting languages for the words of each sentence, SoftMV performs best, with a lower standard deviation. Therefore, we apply the random strategy for the code-switched method in our experiments.

We also conducted experiments to show how the length of soft prompts impacts performance. The results under the 8-shot setting are illustrated in Figure 3. We observe that the performance of SoftMV is very sensitive to this length. As the length of the soft prompts increases, the performance of SoftMV first increases and then decreases: with longer soft prompts, the model has more expressive power to reduce the gaps across different languages, so performance gradually improves. SoftMV achieves the best performance when the length of the soft prompts is 4. When the length is larger than 4, the accuracy decreases sharply, because a model with longer soft prompts tends to overfit the training data under the few-shot setting.

Conclusion
In this paper, we propose a novel Soft prompt learning framework with a Multilingual Verbalizer (SoftMV) for XNLI. SoftMV applies the code-switched substitution strategy to generate multilingual questions for original questions constructed with soft prompts. We adopt the multilingual verbalizer to align the representations of original and augmented samples into the same semantic space with consistency regularization. Experimental results on XNLI demonstrate that SoftMV significantly outperforms previous methods under both the few-shot and full-shot cross-lingual transfer settings. A detailed analysis further confirms the effectiveness of each component of SoftMV.

Limitations
SoftMV is specifically designed for cross-lingual natural language inference.We believe that some of the ideas in our paper can be used in other tasks of XLU, which remains to be further investigated by subsequent research.
In addition, we conduct experiments on the XNLI dataset which consists of 15 languages.SoftMV outperforms the baseline methods under the cross-lingual transfer settings.However, the cross-lingual ability of SoftMV on other languages, especially those lacking relevant datasets, needs to be verified in future work.

Figure 1 :
Figure 1: The framework of SoftMV. The left part is the original questions. The right part is the augmented multilingual questions. The model is trained with a combined objective of the cross-entropy losses and the KLD loss.

Figure 2 :
Figure 2: Evaluation results of different strategies of the code-switched method under the 8-shot setting for 15 languages on average.

Figure 3 :
Figure 3: Evaluation results of different lengths of soft prompts under the 8-shot setting for 15 languages on average.

Table 1 :
Examples of prompt templates for Discrete Prompts (DP), Soft Prompts (SP), and Mixed Prompts (MP). Premise and Hypothesis are a pair of sentences from the NLI dataset. Question and Answer are template words of discrete prompts. <v_i> is a trainable vector of soft prompts.

Table 2 :
Comparison results on XNLI under the few-shot cross-lingual transfer setting in accuracy (%). Each number is the mean performance of 5 runs. "AVG." is the average accuracy over 15 languages. PCT† denotes our reproduced results of the model in Qi et al. (2022). The best performance is in bold.

Table 3 :
Comparison results on XNLI under the full-shot cross-lingual transfer setting in accuracy (%). Each number is the mean performance of 5 runs. "AVG." is the average accuracy over 15 languages. The best performance is in bold.

Table 4 :
Ablation study results for SoftMV under the 8-shot setting in accuracy (%). "AVG." is the average accuracy over 15 languages.