Learning How to Ask: Querying LMs with Mixtures of Soft Prompts

Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to “fill in the blank” in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent—either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of “soft words,” i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.


Introduction
Pretrained language models, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and BART (Lewis et al., 2020a), have proved to provide useful representations for other NLP tasks. Recently, Petroni et al. (2019) and Jiang et al. (2020) demonstrated that language models (LMs) also contain factual and commonsense knowledge that can be elicited with a prompt. For example, to query the date-of-birth of Mozart, we can use the prompt "Mozart was born in ____," where we have filled the first blank with "Mozart," and ask a cloze language model to fill in the second blank. The prompts used by Petroni et al. (2019) are manually created, while Jiang et al. (2020) use mining and paraphrasing based methods to automatically augment the prompt sets.
Finding out what young children know is difficult because they can be very sensitive to the form of the question (Donaldson, 1978). Opinion polling is also sensitive to question design (Broughton, 1995). We observe that when we are querying an LM rather than a human, we have the opportunity to tune prompts using gradient descent (the workhorse of modern NLP) so that they better elicit the desired type of knowledge.
A neural LM sees the prompt as a sequence of continuous word vectors (Baroni et al., 2014). We tune in this continuous space, relaxing the constraint that the vectors be the embeddings of actual English words. Allowing "soft prompts" consisting of "soft words" is not only convenient for optimization, but is also more expressive. Soft prompts can emphasize particular words (by lengthening their vectors) or particular dimensions of those words. They can also adjust words that are misleading, ambiguous, or overly specific. Consider the following prompt for the relation date-of-death:
____x performed until his death in ____y.
This prompt may work for the male singer Cab Calloway, but if we want it to also work for the female painter Mary Cassatt, it might help to soften "performed" and "his" so that they do not insist on the wrong occupation and gender, and perhaps to soften "until" into a weaker connective (as Cassatt was in fact too blind to paint in her final years).
Another way to bridge between these cases is to have one prompt using "performed" and another using "painted." In general, there may be many varied lexical patterns that signal a particular relation, and having more patterns will get better coverage (Hearst, 1992; Riloff and Jones, 1999). We therefore propose to learn a mixture of soft prompts.
We test the idea on several cloze language models, training prompts to complete factual and commonsense relations from 3 datasets. Comparing on held-out examples, our method dramatically outperforms previous work, even when initialized randomly. So when regarded as approximate knowledge bases, language models know more than we realized. We just had to find the right ways to ask.
arXiv:2104.06599v1 [cs.CL] 14 Apr 2021

Related Work
Most of the previous work manually creates prompts to extract answers from the trained language model. We use LAMA (Petroni et al., 2019) as a baseline. Building on LAMA, the LM Prompt And Query Archive (LPAQA) method (Jiang et al., 2020) searches for new prompts by either mining a corpus or paraphrasing existing prompts. AutoPrompt (Shin et al., 2020) searches for improved prompts using a gradient signal, although its prompts are limited to sequences of actual ("hard") English words, unlike our method. We compare our novel soft prompts against all of these systems.
After we submitted the present paper in November 2020, two still unpublished manuscripts appeared on arXiv that also investigated soft prompts. Li and Liang (2021) considered the setting of generating text from a pretrained language model (GPT-2 or BART) conditioned on a textual prompt. To improve the results, they prepended a few task-specific "soft tokens" to the prompt and tuned the embeddings of only these tokens (at all embedding layers). Liu et al. (2021) adopted a strategy similar to ours by tuning fill-in-the-blank prompts in a continuous space, testing on GPT-2 and BERT models, although they did not use the enhancements we proposed in §§3.2-3.4 below. Like our work, both these papers achieved strong gains.
In other work, Bouraoui et al. (2020) mine prompts from a corpus, then fine-tune the whole language model so that it more accurately completes the prompts. Schick and Schütze (2020a,b) are similar but fine-tune the language model differently for each prompt. Our method complements these by tuning the prompts themselves. "Probing" systems that ask what language models know about particular sentences (e.g., Eichler et al., 2019) usually use feedforward networks rather than further natural-language prompts. Yet Shin et al. (2020) show how to use natural-language prompts to ask about particular sentences. Our method could potentially be applied to those prompts, or to "few-shot learning" prompts that include input-output examples (Brown et al., 2020).

Method
Our experiments will specifically aim at extracting relational knowledge from language models. We are given a fixed pretrained LM, a specific binary relation r such as date-of-death, and a training dataset $\mathcal{E}_r$ consisting of known (x, y) pairs in r, such as (Mary Cassatt, 1926). We will then train a system to predict y from x, and evaluate it on held-out (x, y) pairs of the same relation.
A prompt t is a sentence or phrase that includes two blanks, as illustrated in §1. To pose the query, we fill the x blank with x:
Mary Cassatt performed until his death in ____y.
We can ask the LM for its probability distribution $p_{\mathrm{LM}}(y \mid t, x)$ over single words that can now fill the y blank. The correct answer would be 1926.

Soft Prompts
Suppose the LM identifies the word types with vectors in $\mathbb{R}^d$. We also allow t to be a soft prompt, in which the tokens can be arbitrary vectors in $\mathbb{R}^d$:
____x $v_1$ $v_2$ $v_3$ $v_4$ $v_5$ ____y $v_6$
We can initialize these vectors to match those of a given hard prompt. (Each token of a hard prompt may be a word, subword, or punctuation mark, according to the tokenization procedure used by the LM.) However, we can then tune the vectors continuously. We do not change the number of vectors or their positions. For the prompt shown above, we have a 6d-dimensional search space.
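As a rough illustration of this setup, the following Python sketch copies the embeddings of a six-token hard prompt into tunable vectors and applies one gradient step. The names `init_soft_prompt` and `gradient_step`, the toy embedding table, the dimension d = 4, and the placeholder gradients are all assumptions made for illustration; they are not taken from the paper's implementation, where gradients would come from backpropagating through the LM.

```python
import random

def init_soft_prompt(hard_tokens, embedding_table):
    """Copy each hard token's embedding into a fresh, independently tunable vector."""
    return [list(embedding_table[token]) for token in hard_tokens]

def gradient_step(soft_prompt, grads, lr=0.1):
    """One plain gradient-descent update on every coordinate of every soft token."""
    return [[v - lr * g for v, g in zip(vec, grad)]
            for vec, grad in zip(soft_prompt, grads)]

d = 4  # toy embedding size (hypothetical); a real LM uses hundreds of dimensions
rng = random.Random(0)
hard_tokens = ["performed", "until", "his", "death", "in", "."]
# Hypothetical embedding table standing in for the LM's word-type embeddings.
embedding_table = {tok: [rng.gauss(0.0, 1.0) for _ in range(d)] for tok in hard_tokens}

soft_prompt = init_soft_prompt(hard_tokens, embedding_table)
num_parameters = sum(len(vec) for vec in soft_prompt)  # 6 tokens * d coordinates

# Placeholder gradients; in practice these come from backpropagating the loss.
grads = [[1.0] * d for _ in soft_prompt]
tuned_prompt = gradient_step(soft_prompt, grads)
```

The number of vectors and their positions stay fixed; only the coordinates move, so the search space has 6d dimensions, as stated above.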

Deeply Perturbed Prompts
For each token i of a prompt, the vector $v_i$ enters into the LM's computations that complete the prompt. For example, a Transformer architecture computes successively deeper contextual embeddings of the token, $v_i^{(\ell)}$ for $0 \le \ell \le L$. Here $v_i^{(0)} = v_i$, and the embedding $v_i^{(\ell)}$ at layer $\ell > 0$ is computed from all tokens' embeddings $v_j^{(\ell-1)}$ at the previous layer, using the LM's parameters.
We can tune the prompt by additively perturbing each $v_i^{(\ell)}$ by a small vector $\Delta_i^{(\ell)}$ before it is used in further computations. The $\Delta$ vectors for a given hard prompt are initialized to 0 and then tuned.
Perturbing only layer 0 is equivalent to tuning $v_i$ directly as in §3.1. However, if we are more aggressive and perturb all layers, we now have $6d \cdot (L + 1)$ parameters to tune a 6-token prompt. The perturbations ($\Delta$ vectors) can be kept small through early stopping or some other form of regularization. Our intuition is that small perturbations will yield more "familiar" activation patterns that are similar to those that the LM was originally trained on. (Li and Liang (2021) tried a rather different approach to preventing overfitting when tuning all layers.)
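The layer-wise perturbation can be sketched in a few lines of Python. This is only a toy under stated assumptions: `toy_layer` is a hypothetical stand-in for a Transformer layer (it merely makes each output depend on all inputs), every token is treated as a prompt token, and the sizes are arbitrary. It shows where the perturbation vectors are added and how many parameters they contribute.

```python
import math

def toy_layer(vectors):
    """Stand-in for one Transformer layer: each output depends on all inputs."""
    d = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]
    return [[math.tanh(v[k] + mean[k]) for k in range(d)] for v in vectors]

def forward_with_perturbations(v0, deltas):
    """deltas[l][i] is added to token i's layer-l embedding before it is used further."""
    num_layers = len(deltas) - 1  # layers are numbered 0..L
    hidden = v0
    for layer in range(num_layers + 1):
        hidden = [[h + dlt for h, dlt in zip(vec, dvec)]
                  for vec, dvec in zip(hidden, deltas[layer])]
        if layer < num_layers:
            hidden = toy_layer(hidden)
    return hidden

def zero_deltas(num_tokens, d, num_layers):
    """Initial perturbations: all zeros, one vector per token per layer 0..L."""
    return [[[0.0] * d for _ in range(num_tokens)] for _ in range(num_layers + 1)]

num_tokens, d, L = 6, 3, 2  # toy sizes (hypothetical)
v0 = [[0.1 * (i + k) for k in range(d)] for i in range(num_tokens)]
deltas = zero_deltas(num_tokens, d, L)
num_parameters = sum(len(vec) for layer in deltas for vec in layer)  # 6d * (L + 1)
unperturbed = forward_with_perturbations(v0, deltas)

deltas[1][0][0] = 0.05  # nudge one coordinate of token 0 at layer 1
perturbed = forward_with_perturbations(v0, deltas)
```

With all perturbations at zero, the forward pass is identical to the unmodified model, which is why zero initialization starts training from the behavior of the original hard prompt.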

Mixture Modeling
Given a set $\mathcal{T}_r$ of soft prompts for relation r, we can define the ensemble predictive distribution
$$p(y \mid x, r) = \sum_{t \in \mathcal{T}_r} p(t \mid r) \cdot p_{\mathrm{LM}}(y \mid t, x) \qquad (1)$$
where the learned mixture weights p(t | r) form a distribution over the soft prompts $t \in \mathcal{T}_r$. Ensembling techniques other than mixture-of-experts could also be used, including product-of-experts (Jiang et al., 2020).
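To make the two ensembling options concrete, here is a minimal Python sketch that combines per-prompt answer distributions either as a weighted mixture or as a weighted product. The function names, the two-prompt example, and its probabilities are invented for illustration; the weighted geometric form used for the product of experts is one common formulation and is an assumption here, not necessarily the exact one used by Jiang et al. (2020).

```python
import math

def mixture_of_experts(weights, dists):
    """Weighted average of the per-prompt distributions over answers y."""
    return {y: sum(w * dist[y] for w, dist in zip(weights, dists)) for y in dists[0]}

def product_of_experts(weights, dists):
    """Weighted geometric combination, renormalized over the answer vocabulary."""
    unnorm = {y: math.exp(sum(w * math.log(dist[y]) for w, dist in zip(weights, dists)))
              for y in dists[0]}
    z = sum(unnorm.values())
    return {y: p / z for y, p in unnorm.items()}

# Hypothetical LM outputs p_LM(y | t, x) for two prompts t and two candidate answers.
dists = [{"1926": 0.6, "1927": 0.4},
         {"1926": 0.2, "1927": 0.8}]
weights = [0.75, 0.25]  # mixture weights p(t | r); they sum to 1

moe = mixture_of_experts(weights, dists)
poe = product_of_experts(weights, dists)
```

A mixture lets any single confident prompt contribute probability mass to an answer, whereas a product penalizes answers that any prompt considers unlikely.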

Data-Dependent Mixture Modeling
As an extension, we can replace the mixture weights p(t | r) with p(t | r, x), to allow the model to select prompts that are appropriate for the given x.For example, a plural noun x might prefer prompts t that use a plural verb.
While we could directly build a neural softmax model for p(t | r, x), it seems useful to capture the intuition that t may work better if x is plausible in its x blank. Thus, we instead use Bayes' Theorem to write p(t | r, x) as proportional to $p(t \mid r) \cdot p(x \mid t, r)^{1/T}$, where we have included T to modulate the strength of the above intuition. Here p(t | r) is still a learned distribution over prompts, and we use the fixed language model to estimate the second factor as $\sum_y p_{\mathrm{LM}}(x, y \mid t)$ (dropping the dependence on r just as we did for the second factor of (1)). $\log T$ is tuned along with all other parameters.
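The reweighting described above amounts to a few lines of arithmetic. The sketch below, with the hypothetical function name `data_dependent_weights` and made-up numbers, multiplies each prior weight by the tempered plausibility of x under that prompt and renormalizes.

```python
def data_dependent_weights(prior, x_plausibility, temperature):
    """Return p(t | r, x), proportional to p(t | r) * p(x | t, r) ** (1 / T).

    prior[t]          -- learned weight p(t | r) for prompt t
    x_plausibility[t] -- the LM's estimate of p(x | t, r), assumed positive
    temperature       -- T > 0; larger T weakens the influence of x
    """
    scores = [p * (px ** (1.0 / temperature)) for p, px in zip(prior, x_plausibility)]
    z = sum(scores)
    return [s / z for s in scores]

prior = [0.5, 0.5]           # hypothetical p(t | r) for two prompts
x_plausibility = [0.3, 0.1]  # hypothetical p(x | t, r): x fits the first prompt better

sharp = data_dependent_weights(prior, x_plausibility, temperature=1.0)
flat = data_dependent_weights(prior, x_plausibility, temperature=1e9)
```

At T = 1 the first prompt receives three times the weight of the second; as T grows, the weights fall back to the data-independent prior.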

Training Objective
Given an initial set of prompts $\mathcal{T}_r$, we jointly optimize the soft prompts $t \in \mathcal{T}_r$ and their mixture weights p(t | r) (and $\log T$ in §3.4) to minimize the log-loss of the predictive distribution (1):
$$\sum_{(x,y) \in \mathcal{E}_r} -\log \sum_{t \in \mathcal{T}_r} p(t \mid r) \cdot p_{\mathrm{LM}}(y \mid t, x) \qquad (2)$$
This is a continuous and differentiable objective whose gradient can be computed by backpropagation. It can be locally minimized by gradient descent (using a softmax parameterization of the mixture weights). Equivalently, it can be locally minimized by the EM algorithm: the E step finds a posterior distribution over latent prompts for each (x, y) example, and the M step performs gradient descent to optimize the prompts in that mixture.
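The link between gradient descent and EM can be seen in a small Python sketch that holds the prompts fixed and tunes only the softmax-parameterized mixture weights. The table `likelihoods`, the learning rate, and the function names are assumptions for illustration. For this objective, the gradient with respect to each logit equals the mixture weight minus the posterior over prompts, summed over examples, so the E-step posteriors appear directly in the gradient.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(logits, likelihoods):
    """Negative log of the mixture probability, summed over training examples.

    likelihoods[n][t] stands for p_LM(y_n | t, x_n) for example n and prompt t.
    """
    w = softmax(logits)
    return -sum(math.log(sum(wt * pt for wt, pt in zip(w, row))) for row in likelihoods)

def posteriors(logits, likelihoods):
    """E-step quantity: posterior distribution over prompts for each example."""
    w = softmax(logits)
    result = []
    for row in likelihoods:
        joint = [wt * pt for wt, pt in zip(w, row)]
        z = sum(joint)
        result.append([j / z for j in joint])
    return result

def gradient(logits, likelihoods):
    """d(loss)/d(logit_t) = sum over examples of (weight_t - posterior_t)."""
    w = softmax(logits)
    post = posteriors(logits, likelihoods)
    return [sum(w[t] - p[t] for p in post) for t in range(len(logits))]

# Hypothetical likelihoods for 3 training pairs under 2 fixed prompts.
likelihoods = [[0.6, 0.1], [0.5, 0.2], [0.05, 0.3]]
logits = [0.0, 0.0]  # uniform initial mixture weights
initial_loss = log_loss(logits, likelihoods)
for _ in range(100):
    grad = gradient(logits, likelihoods)
    logits = [z - 0.2 * g for z, g in zip(logits, grad)]
final_loss = log_loss(logits, likelihoods)
```

In the full method the soft prompt vectors are updated as well, which changes the likelihood table at every step; this sketch isolates the mixture-weight update only.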

Relational Datasets
The relations we learn to predict come from T-REx original (Elsahar et al., 2018), T-REx extended (Shin et al., 2020), Google-RE (Orr, 2013), and ConceptNet (Speer et al., 2017), or rather from the subsets that were used by the LAMA and AutoPrompt papers. See Appendix A for some statistics.

Language Models
Following Petroni et al. (2019), we interrogate BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). These are masked (cloze) language models. For variety, we also interrogate BART (Lewis et al., 2020a), which conditions on the prompt with an empty y blank and generates a copy where the y blank has been filled in (by a single token). We constrain BART's decoding to ensure that its answer does take this form. Unlike BERT and RoBERTa, BART could be used to fill the y blank with an arbitrarily long phrase, but we do not allow this because y in our datasets is always a single token.²

Dataset Splits
For the two T-REx datasets, we inherit the training-validation-test split from Shin et al. (2020). For the other datasets, we split randomly in the ratio 80-10-10.³ Since all pairs (x, y) are distinct, there are no common triples among these three sets. Common x values are also rare because each dataset has at least 174 distinct x values. However, the number of distinct y values can be as small as 6. Thus, in another set of experiments (Appendix E), we used a more challenging split that ensures that there are no common y values among these three sets. This tests whether our model generalizes to unseen values.

Prompts
For the T-REx and Google-RE datasets, we have four sources of initial prompts:
• (sin.) LAMA provides a single manually created hard prompt for each relation type r.
• (par.) LPAQA (Jiang et al., 2020) provides a set of 13-30 hard prompts for each r, which are paraphrases of the LAMA prompt.⁴
• (min.) LPAQA also provides a set of 6-29 hard prompts for each r, based on text mining.
• (ran.) For each (min.) prompt, we replace each word with a random vector, drawn from a Gaussian distribution fit to all of the LM's word embeddings. The number of words and the position of the blanks are preserved.
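The random initialization in the last item can be sketched as follows. The text does not say whether the Gaussian has a full or a diagonal covariance; this sketch assumes a diagonal one (an independent mean and standard deviation per dimension), and the function names and the tiny embedding matrix are hypothetical.

```python
import random
import statistics

def fit_diagonal_gaussian(embeddings):
    """Per-dimension mean and standard deviation over all word embeddings."""
    columns = list(zip(*embeddings))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def random_prompt(num_tokens, means, stds, rng):
    """Draw one random vector per prompt token; count and positions are preserved."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(num_tokens)]

# Hypothetical 2-dimensional embedding matrix standing in for the LM's vocabulary.
embeddings = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
means, stds = fit_diagonal_gaussian(embeddings)

rng = random.Random(0)
prompt = random_prompt(5, means, stds, rng)  # same length as a 5-token mined prompt
```

Sampling from a distribution fit to the real embeddings keeps the random vectors at a scale the LM has seen before, while discarding the words of the original prompt.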
For the ConceptNet dataset, LAMA uses the gold Open Mind Common Sense (OMCS) dataset (Singh et al., 2002). In this dataset, each example $(x_i, y_i)$ is equipped with its own prompt $t_i$. (Each example is really a sentence with two substrings marked as x and y, which are removed to obtain $t_i$.) These prompts are often overly specific: often $y_i$ can be predicted from $(t_i, x_i)$, or just from $t_i$ alone, but $y_j$ cannot be predicted from $(t_i, x_j)$. Thus, for each relation r, we use only the prompts that appear more than 10 times, resulting in 1-38 prompts.
Statistics about the prompts are in Appendix B. We used only a single copy of each prompt, but a generalization would be to allow multiple slightly perturbed copies of each prompt, which could diverge and specialize during training (Rose, 1998).
² Among other filters, the LAMA and AutoPrompt papers keep only the triples (r, x, y) such that y is a single token according to the language models used by LAMA. When working with BART, we further require y to be a single token according to BART's tokenization; thus, the BART results are not comparable with the other language models.
³ The LAMA paper (Petroni et al., 2019) provided no split but used everything as test data for their zero-shot method.
⁴ The LPAQA system combines their predictions via a learned weighted product of experts.

Training
We optimize equation (2) with the method introduced in §3.5. We use the Adam optimizer (Kingma and Ba, 2015) with its default configuration. For gradient training, we set the batch size to 64 and the early-stopping patience to 4, and we test with the model that performs best on the dev set among 16 training epochs.
Training is fast. Even for our largest model (BERT-large-cased) and largest dataset (T-REx extended), tuning a single prompt completes within a few minutes. With a mixture of prompts, training scales roughly linearly with the number of prompts. It is still presumably much cheaper in time and memory than fine-tuning the entire BERT model, which must back-propagate a much larger set of gradients.
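The model-selection procedure described in this section can be sketched as a generic loop. The exact meaning of "patience" is an assumption here (stop after 4 consecutive epochs without a new best dev score), and `train_epoch` and `dev_score` are hypothetical callbacks standing in for one pass of Adam updates and for dev-set evaluation.

```python
def train_with_early_stopping(train_epoch, dev_score, max_epochs=16, patience=4):
    """Keep the state with the best dev score; stop after `patience` bad epochs."""
    best_score, best_state = float("-inf"), None
    epochs_since_best = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)  # one pass over the training data
        score = dev_score(state)    # e.g. P@1 on the dev set
        epochs_run += 1
        if score > best_score:
            best_score, best_state = score, state
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return best_state, best_score, epochs_run

# Simulated dev scores per epoch (made-up numbers, for illustration only).
simulated = [0.10, 0.30, 0.20, 0.25, 0.28, 0.29, 0.50] + [0.10] * 9
best_state, best_score, epochs_run = train_with_early_stopping(
    train_epoch=lambda epoch: epoch,        # the "state" is just the epoch index
    dev_score=lambda state: simulated[state],
)
```

In this simulated run, training stops before reaching the later, higher score, which illustrates the usual trade-off of a small patience value.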

Metrics and Baselines
Our method outputs the most probable y given (r, x). Here and in the supplementary material, we report its average performance on all test examples, with precision-at-1 (P@1), precision-at-10 (P@10) and mean reciprocal rank (MRR) as metrics. We measure the improvement from tuning LAMA, LPAQA, and random prompts. We also compare with AutoPrompt. Baseline numbers come from prior papers or our reimplementations.
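For reference, the three metrics can be computed from ranked candidate lists as in the sketch below. The function names and the two toy examples are assumptions; the sketch also assumes that the gold answer always appears among the scored candidates and that ties are broken by the order in which candidates are listed.

```python
def gold_rank(scores, gold):
    """1-based rank of the gold answer when candidates are sorted by descending score."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking.index(gold) + 1

def evaluate(examples, k=10):
    """examples: list of (scores, gold) pairs; scores maps each candidate y to a probability."""
    ranks = [gold_rank(scores, gold) for scores, gold in examples]
    n = len(ranks)
    return {
        "P@1": sum(r == 1 for r in ranks) / n,   # gold answer ranked first
        "P@k": sum(r <= k for r in ranks) / n,   # gold answer within the top k
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
    }

# Two made-up test examples with hypothetical model scores.
examples = [
    ({"1926": 0.6, "1927": 0.3, "1844": 0.1}, "1926"),         # gold ranked first
    ({"piano": 0.5, "violin": 0.4, "guitar": 0.1}, "violin"),  # gold ranked second
]
metrics = evaluate(examples, k=10)
```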

Results
Table 1 shows results on T-REx datasets obtained by querying three BERT-style models, with P@1 as the metric. Additional metrics and language models are shown in Tables 2 and 3 as well as Tables 5 and 6 in the supplementary material.
We consistently get large improvements by tuning the initial prompts. Remarkably, our method beats all prior methods even when throwing away the words of their informed prompts in favor of random initial vectors. It simply finds a prompt that works well on the (x, y) training examples.
(We marked a boldface number with "?" if we lacked access to per-example output for one of the systems; differences from such systems were simply assumed to be significant.) † marks baseline results obtained from our reimplementations. In the Model column, BEb is BERT-base, BEl is BERT-large, Rob is RoBERTa-base.
We also conducted an ablation study in which we tuned only the mixture weights (which are initially uniform) or only the word vectors in the prompts t. As Table 4 shows, each helps, but the major benefit comes from tuning the word vectors to get soft prompts. Appendix C visualizes a set of soft prompts, and Appendix D analyzes the mixture weights. We also experiment on a challenging setting where the y labels are distinct for training and test (Appendix E in the supplementary materials), and find that soft prompts still yield some benefits.
The above results are for our basic method that tunes only the words of the prompt (i.e., layer 0). When we tune all layers (the "deeply perturbed prompts" of §3.2), we typically obtain small additional gains, across various models and initializations, although tuning all layers does substantially hurt RoBERTa. These results are shown in Tables 5 and 6 in the supplementary material.
The tables show that the winning system, for each combination of language model, T-REx dataset, and evaluation metric, always uses a mixture of soft prompts initialized to mined prompts. It always tunes all layers, except with RoBERTa.

Conclusion
Well-crafted natural language prompts are a powerful way to extract information from pretrained language models. In the case of cloze prompts used to query BERT and BART models for single-word answers, we have demonstrated startlingly large and consistent improvements from rapidly learning prompts that work, even though the resulting "soft prompts" are no longer natural language.
How about few-shot prediction with pretrained generative LMs? Here, Lewis et al. (2020b) show how to assemble a natural language prompt for input x from relevant input-output pairs $(x_i, y_i)$ selected by a trained retrieval model. Allowing fine-tuned soft string pairs is an intriguing future possibility for improving such methods without needing to fine-tune the entire language model.

A Statistics of Relational Databases
The statistics of the various relational databases are shown in Table 8.

B Statistics of the Initial Prompts
Table 7 shows some statistics of the prompts we use to initialize the SoftPrompt model.

C Visualization of Soft Prompts
Figure 1 shows what a mixture of soft prompts looks like when we tune only layer 0. The soft prompts are not very interpretable. The words closest to the tuned tokens (shown in blue) seem to be largely on the music topic. However, the soft templates do not seem to form meaningful phrases, nor is it obvious why they would prime for y to be an instrument when x is a musician.

D Entropy of the Mixture Model
For any given relation r, the entropy of the mixture weights is
$$H = -\sum_{t \in \mathcal{T}_r} p(t \mid r) \cdot \log_2 p(t \mid r)$$
We then take $2^H \in [1, |\mathcal{T}_r|]$ as a measure of the effective number of prompts that were retained. Table 10 shows some statistics of the effective number of prompts. In some cases, tuning the mixture weights essentially selected a single prompt, but on average, it settled on a mixture of several variant prompts (as illustrated by Figure 1).
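The effective number of prompts is simple to compute from the mixture weights. The following sketch uses a hypothetical function name and takes the entropy in bits, with the usual convention that zero-weight prompts contribute nothing.

```python
import math

def effective_num_prompts(weights):
    """Return 2 ** H, where H is the entropy (in bits) of the mixture weights."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return 2 ** entropy

uniform = effective_num_prompts([0.25, 0.25, 0.25, 0.25])  # all four prompts used equally
peaked = effective_num_prompts([1.0, 0.0, 0.0])            # a single prompt selected
skewed = effective_num_prompts([0.7, 0.2, 0.1])            # somewhere in between
```

The value ranges from 1, when a single prompt carries all the weight, to the number of prompts, when the weights are uniform.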

E Challenging dataset with distinct y's
As described in §4.3, we conducted an additional experiment to determine whether the prompts could generalize to novel y values. In this experiment, we ensure that there are no common y values among the train / dev / test sets. We use T-REx as the base relational database and split the datasets to make the ratio close to 80-10-10. The results are shown in Table 9. Our method again improves the results, just as in Tables 5 and 6, which shows the generalizability of our method.
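One way to build such a split is sketched below. The appendix does not describe the actual procedure, so the greedy strategy here (assign each group of pairs sharing a y value to whichever split is furthest below its target size) is purely an assumption, as are the function name and the synthetic data.

```python
import random
from collections import defaultdict

def split_by_distinct_y(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (x, y) pairs into train/dev/test so that no y value is shared across splits."""
    by_y = defaultdict(list)
    for x, y in pairs:
        by_y[y].append((x, y))
    groups = list(by_y.values())
    random.Random(seed).shuffle(groups)
    groups.sort(key=len, reverse=True)  # place the largest groups first
    splits = [[] for _ in ratios]
    total = len(pairs)
    for group in groups:
        # Pick the split with the largest remaining deficit relative to its target size.
        deficits = [ratio * total - len(split) for ratio, split in zip(ratios, splits)]
        splits[deficits.index(max(deficits))].extend(group)
    return splits

# Synthetic data: 200 pairs over 25 distinct y values (illustrative only).
pairs = [(f"x{i}", f"y{i % 25}") for i in range(200)]
train, dev, test = split_by_distinct_y(pairs)
```

Because whole groups are assigned at once, the resulting sizes only approximate the 80-10-10 target, which matches the description of a ratio that is "close to" 80-10-10.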

Table 2: Results on Google-RE dataset obtained by querying the BERT-large-cased model.

Table 4: Ablation experiments, conducted with the BERT-large model on the T-REx original dataset.