Discrete and Soft Prompting for Multilingual Models

It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting also perform better than finetuning in multilingual settings: crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning obtains 33.74% accuracy in crosslingual transfer, barely surpassing the majority baseline (33.33%). In contrast, discrete and soft prompting outperform finetuning, achieving 36.43% and 38.79%, respectively. We also demonstrate good performance of prompting with training data in multiple languages other than English.

In contrast to finetuning, which learns discriminative classifiers for tasks like natural language inference (NLI; Dagan et al. (2006); Bowman et al. (2015)), prompting reformulates the classification task as generative text-to-text (Raffel et al., 2020) or cloze-style (McCann et al., 2018; Brown et al., 2020) queries that are given to a PLM to answer. For example, the NLI task of assigning premise "They whinnied, eyes wide" and hypothesis "Their eyes were open wide" to class "entailment" can be reformulated as:
They whinnied, eyes wide . Question: Their eyes were open wide ? Answer: ___ .
The PLM is requested to fill in the blank (___) with the word "yes", which is mapped to "entailment".
Prompting makes a human description of the task available during learning. Also, "filling in the blank" is well aligned with the pretraining objective (masked/autoregressive language modelling (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019)), which likely delivers better performance in few-shot learning (Ram et al., 2021).
In this paper, we investigate the effectiveness of prompting in multilingual tasks, which, despite the success of prompting in English, is largely unexplored. We address two main research questions: (RQ1) Does the strong few-shot performance of prompting transfer from English to other languages? (RQ2) As the cost of few-shot non-English annotations is affordable (Garrette and Baldridge, 2013; Lauscher et al., 2020; Zhao et al., 2021), can we directly prompt PLMs in languages other than English, or do we have to go through the (generally best resourced) intermediary of English?
In this work, we systematically compare two popular prompting methods, discrete and soft prompting, with finetuning on the few-shot multilingual NLI task and show that prompting is superior: (i) The strong few-shot learning performance of prompting transfers from English to other languages: It outperforms finetuning in crosslingual transfer (RQ1; §5.1). (ii) Directly querying the multilingual PLM with few-shot non-English prompts achieves competitive performance, without relying on crosslingual transfer from English (RQ2; §5.2).
Soft prompting relaxes the constraint that a prompt needs to be composed of discrete tokens. Instead, it learns the prompt in the continuous space with SGD. Qin and Eisner (2021) and Zhong et al. (2021) learn soft prompts eliciting more knowledge (Petroni et al., 2019) from PLMs than discrete prompts. Similar to soft prompting but with the PLM being frozen, Li and Liang (2021) propose prefix-tuning to encourage PLMs to solve generation tasks with high parameter-efficiency (Houlsby et al., 2019;Zhao et al., 2020). Lester et al. (2021) demonstrate that soft prompting benefits from scaling up the number of PLM parameters. Liu et al. (2021) show that GPT (Radford et al., 2019) can solve NLU tasks (Wang et al., 2019) with soft prompting.
All of this work focuses on English. We show that discrete and soft prompting perform better than finetuning in few-shot crosslingual natural language inference (XNLI; Conneau et al. (2018)) with multilingual PLMs (XLM-RoBERTa; Conneau et al. (2020)). We conduct experiments on NLI because it is one of the most representative and challenging NLU tasks (Dagan et al., 2006; Bowman et al., 2015), and it has been commonly used in prior work on prompting.

Finetuning
We follow the standard finetuning method (Devlin et al., 2019): A linear classifier layer is initialized and stacked on top of the PLM; the whole model is then trained on the few-shot NLI dataset ( §4).
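To make this baseline concrete, the following is a minimal sketch of such a finetuning step using the HuggingFace transformers API; the checkpoint name, example, and hyperparameters are illustrative assumptions rather than our exact configuration.

```python
# Finetuning sketch: a classification head on top of a multilingual PLM,
# trained on a few-shot NLI example (illustrative values, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3  # entailment / contradiction / neutral
)

# One premise/hypothesis pair and its gold label id.
enc = tokenizer("They whinnied, eyes wide", "Their eyes were open wide",
                return_tensors="pt", truncation=True, max_length=256)
labels = torch.tensor([0])  # e.g., 0 = entailment

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
outputs = model(**enc, labels=labels)  # cross-entropy over the 3 classes
outputs.loss.backward()
optimizer.step()
```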

Prompting
Discrete prompting (DP). Following Schick and Schütze (2021) and Le Scao and Rush (2021), we reformulate the NLI examples (cf. the example in §1) into cloze-style questions using a human-designed prompt. Specifically, we ask the PLM to fill in the blank (___) in the sentence: Premise . Question: Hypothesis ? Answer: ___ .
Premise and Hypothesis are a pair of sentences from the NLI dataset. The gold labels are mapped to words in the PLM vocabulary. Concretely, we use the following mapping (verbalizer; Schick and Schütze (2021)): "entailment" → "yes"; "contradiction" → "no"; "neutral" → "maybe". The optimization objective is to minimize the cross-entropy loss between the predicted and the gold words representing the three classes.
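As a concrete illustration of this objective, the sketch below builds the cloze prompt, reads the PLM's distribution at the mask position, and restricts it to the verbalizer words; the checkpoint name and the mapping of verbalizer words to single vocabulary ids are illustrative assumptions.

```python
# Discrete prompting sketch: cross-entropy over verbalizer words at the mask position.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Verbalizer: class -> a single vocabulary id (here simply the first subword of each word).
verbalizer = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}
label_ids = {c: tokenizer(" " + w, add_special_tokens=False)["input_ids"][0]
             for c, w in verbalizer.items()}

premise, hypothesis, gold = "They whinnied, eyes wide", "Their eyes were open wide", "entailment"
text = f"{premise} . Question: {hypothesis} ? Answer: {tokenizer.mask_token} ."
enc = tokenizer(text, return_tensors="pt")

logits = model(**enc).logits                                     # [1, seq_len, vocab]
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
class_logits = logits[0, mask_pos, list(label_ids.values())]     # scores for yes/no/maybe

gold_index = list(label_ids).index(gold)
loss = torch.nn.functional.cross_entropy(class_logits.unsqueeze(0),
                                         torch.tensor([gold_index]))
loss.backward()  # all PLM parameters are updated, as in finetuning
```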
Soft prompting (SP). Instead of a human-designed discrete prompt, SP inserts trainable soft tokens <v1> <v2> <v3> <v4> into the input, where each <v_i>, i ∈ {1, 2, 3, 4}, is associated with a randomly initialized trainable vector v_i ∈ R^d (in the PLM's lowest embedding layer), with d the hidden dimension size of the embedding layer. Directly using v_i yields sub-optimal task performance: Li and Liang (2021) reparameterize v_i with another trainable matrix and then feed it forward through an MLP. Here, we adopt Liu et al. (2021)'s approach: They feed [v_1, v_2, v_3, v_4] through an LSTM (Hochreiter and Schmidhuber, 1997) and use the outputs. PLM parameters, LSTM parameters, and the v_i are jointly trained. Our SP and DP have the same training objective and verbalizer.
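The sketch below illustrates one way to implement this LSTM reparameterization and splice the resulting soft vectors into the PLM's input embeddings; the exact positions of the soft tokens in the sequence, the reuse of the mask token as a placeholder, and the checkpoint name are illustrative assumptions, not our exact implementation.

```python
# Soft prompting sketch: trainable prompt vectors, reparameterized by an LSTM
# (in the spirit of Liu et al., 2021), injected into the input embeddings.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
d = model.config.hidden_size   # embedding dimension of the PLM
n_soft = 4                     # <v1> ... <v4>

class SoftPrompt(nn.Module):
    """Randomly initialized prompt vectors v_i, fed through an LSTM before use."""
    def __init__(self, n_soft, d):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(n_soft, d) * 0.02)
        self.lstm = nn.LSTM(d, d, batch_first=True)
    def forward(self):
        out, _ = self.lstm(self.raw.unsqueeze(0))  # [1, n_soft, d]
        return out.squeeze(0)                      # reparameterized prompt vectors

soft_prompt = SoftPrompt(n_soft, d)

# Build an input containing n_soft placeholder positions plus the answer blank;
# the mask token is reused here purely as a placeholder whose embedding is overwritten.
premise, hypothesis = "They whinnied, eyes wide", "Their eyes were open wide"
placeholders = " ".join([tokenizer.mask_token] * n_soft)
text = f"{premise} {hypothesis} {placeholders} {tokenizer.mask_token}"
enc = tokenizer(text, return_tensors="pt")

embeds = model.get_input_embeddings()(enc["input_ids"]).clone()  # [1, seq_len, d]
mask_positions = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
prompt_positions, answer_position = mask_positions[:n_soft], mask_positions[-1]
embeds[0, prompt_positions] = soft_prompt()  # overwrite placeholders with soft vectors

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
answer_logits = logits[0, answer_position]   # scored with the same verbalizer as DP
```

During training, gradients flow into the PLM parameters, the LSTM parameters, and the raw vectors v_i jointly, matching the joint training described above.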
Mixed prompting (MP). We also experiment with a simple combination of DP and SP, by asking the PLM to fill in the blank (___) in the sentence: Premise . Question: Hypothesis ? <v1> <v2> <v3> <v4> Answer: ___ .
MP includes human descriptions of NLI as in DP and learns "soft prompts" as in SP.

Zero-shot crosslingual transfer
We first compare prompting with finetuning in zero-shot crosslingual transfer (Pires et al., 2019; Conneau et al., 2020; Artetxe and Schwenk, 2019; Hu et al., 2020): The PLM is trained on the EN few-shot dataset and then directly evaluated on the test sets of all languages. Table 2 reports the results.
EN results. From column EN we observe that: (i) As expected, all four methods benefit from more shots. (ii) Prompting methods (DP/SP/MP) clearly outperform finetuning, especially in low-resource regimes. For example, in the 4-shot experiment, SP outperforms finetuning by ≈8 (41.84 vs. 33.90) accuracy points. Table 3 displays some examples for which SP outperforms finetuning. The improvements become less significant when more shots are available, e.g., 256. (iii) SP outperforms DP for most choices of shots (except 128), evidencing the strength of relaxing the "discrete token" constraint in DP (Liu et al., 2021; Qin and Eisner, 2021; Zhong et al., 2021). But we give up the interpretability of DP for this better performance. (iv) Performance of MP, the combination of DP and SP, is decent but not stellar. Future work may explore advanced prompting methods succeeding in both task performance and interpretability. We focus on DP and SP in the following experiments.

Table 2: Zero-shot crosslingual transfer results in accuracy (%). Each number is the mean performance of 5 runs, when using finetuning (FT), discrete prompting (DP), soft prompting (SP), and mixed prompting (MP). "MAJ": majority baseline; X: macro average across 15 languages. Please see Appendix Table 7 for variances.

Table 3: Examples for which SP outperforms finetuning (cf. the discussion above).
Premise                            | Hypothesis                            | Prediction
This was the temper of the times.  | This wasn't the temper of the times.  | "no" (Contradict)
We would go in there.              | We would enter there at 8pm.          | "maybe" (Neutral)
I hope to hear from you soon.      | I hope we talk soon.                  | "yes" (Entailment)
Crosslingual transfer results closely follow the trends of the EN results: Prompting outperforms finetuning when looking at the macro average X. One intriguing finding is that DP successfully transfers the learned knowledge to target languages, better than SP in some languages, using the code-switched prompt "Premise . Question: Hypothesis ? Answer: ___ ." where Premise and Hypothesis are non-English. Thus, DP is able to leverage the strong crosslingual ability of the multilingual PLM. Like finetuning, prompting does not uniformly benefit the 14 non-English languages. For example, the crosslingual transfer performance of HI/SW/UR is notably inferior to that of the other languages.
Overall, prompting outperforms finetuning in zero-shot crosslingual transfer of NLI in low-resource regimes.

In-language prompting
We next compare prompting with finetuning when using non-English few-shot datasets. Taking Turkish as an example, recall that we can use the Turkish prompts (§3.3) and few-shot datasets from XNLI (§4) to finetune/prompt the PLM directly. Table 4 shows results of in-language experiments for Turkish, Urdu, Swahili, and Chinese. We make two main observations: (i) Prompting still outperforms finetuning, even though the non-English prompts and verbalizers are translated from EN simply using Google Translate. (ii) In-language results are slightly worse than, but competitive with, the transfer learning results; the results on Swahili are comparatively weak, likely for two reasons. First, Swahili is a low-resource language in the pretraining data of the multilingual PLM (Conneau et al., 2020); thus, the PLM may not be well pretrained for solving tasks in Swahili directly. Second, the few-shot training data for non-English languages is machine-translated (§4). With better few-shot translations and in-language expertise, prompting could possibly achieve even better results.
Overall, the experimental results show that directly prompting PLMs with non-English languages is also an effective way of solving NLU tasks in low-resource regimes.

Conclusion
We showed that prompting performs better than finetuning in few-shot crosslingual transfer and in-language training of multilingual natural language inference. We hope our results will encourage more research on prompting for multilingual tasks and models.
Future work may explore using text-to-text models like T5 (Raffel et al., 2020).

A.2 Computing infrastructure
All experiments are conducted on GeForce GTX 1080Ti GPUs. For finetuning, we use batch size 32 and 4 GPUs. Because prompting uses the masked language model objective, we use a maximum batch size of 24. A single GPU is used for 1-shot experiments. Two and three GPUs are used for 2- and 4-shot experiments. Other experiments use 6 GPUs.

A.4 Hyperparameter search
We use the same learning rate (1e-5) as Le Scao and Rush (2021), who compare prompting and finetuning in English NLU tasks. No learning rate scheduling is used, for clear comparisons. For both finetuning and prompting, the model is trained for 50 epochs, and the checkpoint that performs best on the development set is selected for performance evaluation.

Table 6: In-language results in accuracy (%). Prompting (DP/SP) outperforms finetuning (FT). We report mean and variance of 5 runs.
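For concreteness, the selection protocol described in A.4 could be written roughly as follows; this is a sketch only, where compute_loss and evaluate_acc are assumed caller-supplied helpers wrapping whichever FT/DP/SP objective is trained, and the optimizer choice is an assumption.

```python
# Training-and-selection sketch for A.4: fixed lr 1e-5, no scheduler, 50 epochs,
# keep the checkpoint with the best development-set accuracy.
import copy
import torch

def train_and_select(model, train_loader, dev_loader, compute_loss, evaluate_acc):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # optimizer is an assumption
    best_acc, best_state = 0.0, None
    for _ in range(50):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            compute_loss(model, batch).backward()
            optimizer.step()
        acc = evaluate_acc(model, dev_loader)              # accuracy on the development set
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                      # checkpoint used for evaluation
    return model, best_acc
```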

A.5 Datasets and preprocessing
We retrieve the MNLI and XNLI datasets from the official websites: cims.nyu.edu/~sbowman/multinli and cims.nyu.edu/~sbowman/xnli. We use the tokenizer in the HuggingFace framework (Wolf et al., 2020) to preprocess the texts. In all experiments, the max sequence length is 256. Table 5 shows the prompts and verbalizers used in the in-language experiments. We use Google Translate, but more specialized bilingual dictionaries could also be used. For Urdu, we show the prompt and verbalizer in the code repository.

Table 7: Zero-shot crosslingual transfer results in accuracy (%). We report mean and variance of 5 runs, when using finetuning (FT), discrete prompting (DP), soft prompting (SP), and mixed prompting (MP). "MAJ": majority baseline; X: macro average across 15 languages.
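As an illustration of the preprocessing described in A.5, a premise/hypothesis pair can be tokenized as below; the checkpoint name is an illustrative assumption, and for prompting the pair would instead be embedded in the prompt template before tokenization.

```python
# Preprocessing sketch: tokenize a premise/hypothesis pair with the HuggingFace
# tokenizer, truncating to the max sequence length of 256 used in all experiments.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = tokenizer(
    "They whinnied, eyes wide",    # premise
    "Their eyes were open wide",   # hypothesis
    truncation=True,
    max_length=256,
    return_tensors="pt",
)
print(enc["input_ids"].shape)
```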