Prompt-Tuning Can Be Much Better Than Fine-Tuning on Cross-lingual Understanding With Multilingual Language Models

Pre-trained multilingual language models show significant performance gains for zero-shot cross-lingual model transfer on a wide range of natural language understanding (NLU) tasks. Previously, for zero-shot cross-lingual evaluation, pre-trained models were only fine-tuned on English data and tested on a variety of target languages. In this paper, we evaluate cross-lingual performance on various NLU tasks (sentence classification, sequence labeling, question answering) using prompt-tuning and compare it with fine-tuning. The results show that prompt-tuning achieves much better cross-lingual transfer than fine-tuning across datasets, while tuning only 0.1% to 0.3% of the parameters. Our analysis further demonstrates that prompt-tuning yields representations with better cross-lingual transferability on downstream tasks and better-aligned decision boundaries.


Introduction
Large multilingual language models (Pires et al., 2019; Wu and Dredze, 2019; Conneau et al., 2020) show surprisingly impressive zero-shot cross-lingual transfer on NLP tasks, even though they are trained only on monolingual corpora. Recently, large-scale benchmarks such as XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020) have been introduced for cross-lingual evaluation.
In the cross-lingual transfer setting, models are fine-tuned only on task-specific annotations in one language and evaluated on other languages. During fine-tuning, pre-trained language models are used for initialization and all model parameters are tuned on downstream tasks. While fine-tuning obtains strong performance, it is inefficient. Moreover, as shown in Hu et al. (2020), the cross-lingual transfer gap between performance on the English test set and on all other languages is large, even with the best baseline, XLM-R (Conneau et al., 2020).
Recently, prompt tuning has emerged, in which only a small number of additional parameters (i.e., prompts) are added and tuned while the original model is kept frozen. Far fewer (or no) parameters are tuned, so training is much more efficient. Still, prompt tuning performs worse than fine-tuning on many NLP tasks (Brown et al., 2020; Shin et al., 2020; Zhong et al., 2021). More recently, Li and Liang (2021), Lester et al. (2021), and Hambardzumyan et al. (2021) indicate that prompt tuning is competitive with fine-tuning on some NLU tasks; language model capacity (e.g., 10 billion parameters) is a key ingredient for these approaches to succeed. Liu et al. (2022) show that prompt tuning can also be comparable on several hard monolingual sequence labeling tasks, such as extractive question answering.
In this paper, we investigate the effect of prompt tuning on cross-lingual tasks. We freeze the entire multilingual language model and tune task prompts on the English training set for downstream tasks (sentence classification, structured prediction, question answering). Even with a medium-sized multilingual language model (fewer than 1 billion parameters), prompt tuning achieves much higher cross-lingual performance than fine-tuning on various NLU tasks.
Our analysis shows that prompt tuning makes fewer changes to sentence representations than fine-tuning and preserves good cross-lingual sentence representations. We also find that, after prompt tuning on English data, the decision boundaries for sentence representations of different languages are well aligned, whereas after fine-tuning they differ substantially. These aligned decision boundaries can lead to stronger cross-lingual transfer.
This work sheds light on the strong cross-lingual ability of prompt tuning. Our results suggest that prompt tuning is better than fine-tuning for cross-lingual transfer. Our contributions are summarized as follows: we show that prompt tuning can perform much better than fine-tuning for cross-lingual transfer, and we show that prompt tuning works better for cross-lingual transfer because of the relatively small changes it makes to the originally learned representations.
Prompt-Tuning for Cross-Lingual Tasks

Multilingual Language Models. In the past years, many pre-trained multilingual language models have been released: mBERT, XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020), etc. XLM-R (Conneau et al., 2020) significantly outperforms multilingual BERT (mBERT; Devlin et al., 2019) on a variety of cross-lingual benchmarks such as XTREME (Hu et al., 2020). In some previous work (Luo et al., 2021; Zhang et al., 2019), XLM-R is also used as initialization for another round of pre-training with parallel data to obtain stronger cross-lingual ability. Previously, for cross-lingual evaluation, models are fine-tuned on the English training data but evaluated on all target languages. To the best of our knowledge, we are the first to explore prompt tuning on several hard multilingual NLP tasks, including structured prediction and question answering.

Figure 1: Two different approaches to cross-lingual evaluation with a large multilingual language model. Left: in fine-tuning, all model parameters are tuned on English task data; this is the setting used in prior cross-lingual evaluation. Right: in prompt tuning, only a small fraction of the parameters is tuned. We use prefix prompts with per-layer prompts in our experiments.
Prompt Tuning. Fine-tuning large pre-trained language models leads to strong performance on downstream tasks; however, it is memory-consuming, and many parameters need to be saved for each task. In prompt tuning, only a small subset of the parameters (e.g., prompts or a task classifier) is tuned during learning. However, it usually does not perform as well as fine-tuning. Recently, Lester et al. (2021) find that prompt tuning can match fine-tuning when the model is extremely large (e.g., 10 billion parameters). Prefix-tuning (Li and Liang, 2021) obtains comparable performance on natural language generation tasks. Liu et al. (2022) show that prompt tuning can match fine-tuning on language understanding tasks, even on hard sequence tagging tasks.
We investigate prompt tuning for cross-lingual understanding with a pre-trained multilingual language model. The framework is shown in Figure 1. Our setting is similar to Li and Liang (2021) and Liu et al. (2022). The continuous prompts are added as prefix tokens and tuned during learning. In the implementation, the prompts act as past keys and values in each transformer layer, and each layer has its own separate prompts. These continuous prompts are optimized while the multilingual language model parameters are kept frozen.
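The per-layer prompts described above can be stored as extra past keys and values that every attention layer attends to. The following is a minimal sketch of that storage scheme; the class and variable names are our own, and the XLM-R LARGE dimensions below are assumptions, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class LayerPrompts(nn.Module):
    """Learnable prefix prompts: one (key, value) pair per transformer layer."""

    def __init__(self, n_layers, n_heads, head_dim, prompt_len):
        super().__init__()
        # Only these tensors are trained; the backbone stays frozen.
        self.keys = nn.Parameter(
            torch.randn(n_layers, n_heads, prompt_len, head_dim) * 0.02)
        self.values = nn.Parameter(
            torch.randn(n_layers, n_heads, prompt_len, head_dim) * 0.02)

    def past_key_values(self, batch_size):
        # Expand the prompts over the batch and return one (key, value)
        # tuple per layer, in the shape models accept as `past_key_values`.
        return tuple(
            (self.keys[i].unsqueeze(0).expand(batch_size, -1, -1, -1),
             self.values[i].unsqueeze(0).expand(batch_size, -1, -1, -1))
            for i in range(self.keys.size(0)))

# Assumed XLM-R LARGE dimensions: 24 layers, 16 heads of dimension 64.
prompts = LayerPrompts(n_layers=24, n_heads=16, head_dim=64, prompt_len=16)
pkv = prompts.past_key_values(batch_size=8)
```

Because every layer has its own prompt, attention in each layer can attend to the prefix positions without the prompts consuming any input token positions.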

Training Details.
Our frozen models are built on top of the pre-trained XLM-R checkpoint of LARGE size, with about 560M parameters. Previous work (Hu et al., 2020) shows it achieves stronger performance than mBERT. All our experiments were run with Huggingface (Wolf et al., 2020). More details are in the appendix.
Prompt Length. Prompt length usually plays an important role in prompt tuning. In our experiments, we treat it as a hyper-parameter. Longer prompts often lead to higher performance. We set the prompt length to 16 or 32, tuned on the English validation set.

Overall Results

Table 1 shows the zero-shot cross-lingual results on four different tasks. Prompt tuning performs much better than fine-tuning, especially on the hard question answering task, and also has smaller variance.
Previously, even with parallel data or more monolingual data, cross-lingual transfer results (Zhang et al., 2019; Luo et al., 2021; Ruder et al., 2021) on question answering and structured prediction tasks improved only slightly. With prompt tuning, there are much larger performance gains on question answering and structured prediction. This suggests that prompt tuning is a better tuning method for cross-lingual transfer.
Cross-lingual Transfer Gap

According to the results above, prompt tuning achieves better average performance than fine-tuning. Table 2 shows the cross-lingual transfer gap of the two tuning methods. Prompt tuning also reduces the gap significantly.
Discussion

In our preliminary experiments with a smaller model (e.g., mBERT), prompt tuning performs slightly worse than fine-tuning on English and matches the performance of fine-tuning averaged over all languages. Language model size still matters, and there is still room for improvement with smaller models. This also indicates potential for future work on better prompt tuning methods.

Analysis
To analyze prompt tuning and fine-tuning, we select 1000 samples for each language (en, de, es, fr, ja, ko, zh) from the PAWS-X (Yang et al., 2019) dataset. For each English sample in our selection, there is a human-translated sample in each of the other six languages. Figure 2 shows a t-SNE visualization of sample representations from the frozen multilingual language model XLM-R. Sample representations cluster well by language but correlate only weakly with labels.

Language Representation Changes
For each tuning method (fine-tuning and prompt-tuning), Table 3 shows the cosine similarity between representations from the frozen language model and the tuned model. Both tuning methods make notable changes to sentence representations; however, the average cosine similarity for fine-tuning is much smaller, indicating that fine-tuning changes sentence representations far more than prompt tuning does. We also see that representation changes are larger when tuning on MNLI, while prompt tuning still changes the representations less.

Cross-lingual Alignment After Tuning
We compute the average cosine similarity over all 1000 translation pairs for each language pair <en, xx>, where xx is de, es, fr, ja, ko, or zh. We also compute the average cosine similarity over all 1000*999/2 non-translation pairs for each language pair. As shown in Table 3, both fine-tuning and prompt tuning perform well here. Prompt tuning has the advantage of changing the representations more mildly while keeping high cosine similarity on translation pairs, which results in more robust transfer and less overfitting.
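The translation-pair analysis above can be sketched as follows. This is a simplified illustration with our own function names; the paper's exact pooling of sentence representations is not specified here.

```python
import numpy as np

def mean_pair_similarity(en_reps, xx_reps):
    """Average cosine similarity over aligned translation pairs.

    Both inputs are (n, d) arrays; row i of `xx_reps` is the translation
    of the i-th English sentence in `en_reps`.
    """
    a = en_reps / np.linalg.norm(en_reps, axis=1, keepdims=True)
    b = xx_reps / np.linalg.norm(xx_reps, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

def mean_nonpair_similarity(en_reps, xx_reps):
    """Average cosine similarity over the n*(n-1)/2 non-translation pairs."""
    a = en_reps / np.linalg.norm(en_reps, axis=1, keepdims=True)
    b = xx_reps / np.linalg.norm(xx_reps, axis=1, keepdims=True)
    sims = a @ b.T                       # (n, n) cross-lingual similarities
    iu = np.triu_indices(len(a), k=1)    # i < j: non-translation pairs
    return float(sims[iu].mean())
```

Running both statistics before and after tuning, for the frozen, fine-tuned, and prompt-tuned encoders, yields the kind of comparison reported in Table 3.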

Decision Boundaries
The previous subsections show that prompt tuning preserves high cross-lingual alignment while making fewer changes to the representations. However, the overall quality of the learned representations is still unknown; we examine it in this subsection.
Figure 2 (a) and (b) show t-SNE visualizations of representations before the two tuning methods. Each dot in the two figures is a PAWS-X sample from one of four languages: German (de), Chinese (zh), English (en), and Japanese (ja). Blue samples are paraphrases and orange samples are non-paraphrases. Samples of the same language are grouped together; however, label information is missing from the sample representations. Figure 2 (c) and (d) show t-SNE (van der Maaten and Hinton, 2008) visualizations after fine-tuning (FT) and prompt tuning (PT). After tuning, both methods produce reasonably well-separated representations. For each language, we also plot the logistic regression decision boundary for these t-SNE embeddings. After fine-tuning, the decision boundaries vary significantly across languages: the English decision boundary cannot separate the German samples well. After prompt tuning, the decision boundaries of the four languages are surprisingly well aligned. This suggests that prompt tuning learns a more language-independent classifier than fine-tuning, even though tuning is performed only on the English training set.
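The boundary analysis can be reproduced in outline as follows, using synthetic stand-ins for the PAWS-X representations (the data, dimensions, and seed here are illustrative assumptions, not the paper's actual embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: sentence representations, binary labels
# (paraphrase / non-paraphrase), and a language tag per sample.
rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 32))
labels = rng.integers(0, 2, size=200)
langs = np.array(["en", "de"] * 100)
reps[labels == 1] += 1.0                 # make the two classes separable

# Embed all samples jointly in 2-D with t-SNE.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reps)

# Fit one logistic-regression boundary per language on the 2-D embedding;
# the boundary is the line w0*x + w1*y + b = 0.
boundaries = {}
for lang in ("en", "de"):
    mask = langs == lang
    clf = LogisticRegression().fit(emb[mask], labels[mask])
    boundaries[lang] = (clf.coef_[0], clf.intercept_[0])
```

Comparing the per-language `(w, b)` lines visually, as in Figure 2, shows whether the boundaries learned from each language's points coincide or diverge.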

Related Work
Recently, several works have studied prompt tuning for multilingual language models. Winata et al. (2021) show the multilingual abilities of large pre-trained models given a few examples. Zhao and Schütze (2021), Huang et al. (2022), and Qi et al. (2022) propose new prompt tuning methods. The goal of our work is different from theirs: we show that prompt tuning is better than fine-tuning for cross-lingual evaluation, achieving consistently higher performance in this setting.
Previous work (Zhao and Schütze, 2021; Huang et al., 2022; Qi et al., 2022) experimented only on sentence classification; hard sequence tagging and question answering tasks are either unexplored or studied only in low-resource regimes. We investigate cross-lingual transfer ability on various NLU tasks from XTREME (Hu et al., 2020), one of the important cross-lingual transfer evaluation benchmarks, covering sentence classification, sequence labeling, and question answering.

Conclusion
In this work, we compared prompt tuning and fine-tuning on cross-lingual understanding with multilingual language models, finding that prompt tuning achieves better performance. This suggests that prompt tuning is a promising approach for cross-lingual transfer.

Limitations
In this work, we investigate the effects of prompt tuning on cross-lingual understanding and empirically demonstrate some promising outcomes. Our experiments require substantial GPU resources; experiments on large pre-trained multilingual language models are conducted on A100 GPUs with 40GB memory, and training can be accelerated by using large batches. This is a preliminary exploration of prompt tuning for cross-lingual transfer: we explore only encoder-only models on natural language understanding tasks. Future work may involve encoder-decoder models and other tasks.

Figure 2 :
Figure 2: t-SNE visualization of representations of four languages (en: English; de: German; ja: Japanese; zh: Chinese) before and after the two tuning methods on English task data. The decision boundaries after prompt tuning are aligned much better.

Table 1 :
Zero-shot cross-lingual transfer evaluation results (with standard deviation) on XTREME structured prediction, question answering, and sentence classification tasks. For both fine-tuning and prompt tuning, models are tuned only on the English training data but evaluated on all target languages. Baseline fine-tuning results marked with "*" and "+" are taken from (Hu et al., 2020) and (Ruder et al., 2021), respectively. More results are shown in the Appendix.
Tuned Parameter Sizes Comparison

For the prompt tuning test results in Table 1, we did limited tuning of the prompt length: it is 16 for all tasks except XNLI, where it is 32. With only 0.1% to 0.3% additional prompt parameters relative to the original model, the framework already demonstrates strong cross-lingual results.
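As a sanity check on the 0.1% to 0.3% figure, the prompt parameter count can be estimated from the assumed XLM-R LARGE dimensions (24 layers, hidden size 1024), with each prompt position storing one key and one value vector per layer:

```python
# Rough tuned-parameter count for per-layer prefix prompts.
# Assumed XLM-R LARGE dimensions; the backbone itself stays frozen.
n_layers, hidden = 24, 1024
backbone = 560_000_000  # ~560M parameters in XLM-R LARGE

for prompt_len in (16, 32):
    prompt_params = n_layers * prompt_len * 2 * hidden  # keys + values
    print(f"length {prompt_len}: {prompt_params:,} params "
          f"({prompt_params / backbone:.2%} of the backbone)")
```

This gives roughly 0.79M tuned parameters (0.14%) at prompt length 16 and 1.57M (0.28%) at length 32, consistent with the 0.1% to 0.3% range reported.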

Table 2 :
Cross-lingual transfer gap of the two tuning methods, i.e., the performance difference between the English test set and the average of the other languages. Smaller is better.

Table 5 :
XNLI accuracy scores for each language with fine-tuning (FT) and prompt tuning (PT).