Demystifying Prompts in Language Models via Perplexity Estimation

Language models can be prompted to perform a wide variety of zero- and few-shot learning tasks. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is coupled with the extent to which the model is familiar with the language it contains. Over a wide range of tasks, we show that the lower the perplexity of the prompt is, the better the prompt is able to perform the task. As a result, we devise a method for creating prompts: (1) automatically extend a small seed set of manually written prompts by paraphrasing with GPT3 and backtranslation, and (2) choose the lowest-perplexity prompts, which yields significant gains in performance.


Introduction
Language models can be prompted to perform a wide range of zero- and few-shot learning tasks (Brown et al., 2020; Schick and Schütze, 2020). However, there is significant variance in the performance of seemingly similar prompts (Chen et al., 2022): for AG News (Zhang et al., 2015), we find an over 30 point accuracy gap between different manually curated prompts (see Table 1) on OPT 175B (Zhang et al., 2022). Despite efforts to improve prompt engineering (Shin et al., 2020; Li and Liang, 2021; Gao et al., 2021), it is still challenging to develop high-quality prompts for new tasks, and little is known about why this phenomenon occurs.
We are interested in understanding what makes some prompts better than others, and using this understanding to create better prompts for given tasks and models. We hypothesize that the lower the perplexity of a prompt is, the better its performance on the task will be, when considering reasonable prompts that are related to the task. This is based on the intuition that the more frequently the prompt (or very similar phrases) appears in the training data, the more the model is familiar with it and is able to perform the described task. We refrain from using the training data directly as it is often unavailable, expensive to search due to its size, and hard to use for approximate matching of similar prompts. Instead, we focus on the perplexity of the prompt as a proxy for its occurrences in the data.
To enable more complete analysis, we automatically expand the set of manually created prompts for the task by paraphrasing, resulting in a much larger and more diverse set of prompts. We focus on prompts in English that reasonably describe the task for two reasons: (a) our main motivation is to understand what underlies the variance in performance of this type of prompt; (b) we aim to devise a useful method for creating prompts that are consistently effective and that could be easily adopted and interpreted by future, potentially non-expert users.
We show empirically that our hypothesis holds across a diverse set of tasks (including classification and word prediction), models, and model sizes, providing us some insights about the underlying mechanism of prompting (see Figure 1). As a result, we devise a method, SPELL (Selecting Prompts by Estimating LM Likelihood), for creating prompts in an informed manner. We show that using SPELL to choose prompts results in less variability in performance as well as in accuracy gains (1.8 accuracy points with OPT and 2.3 accuracy points with Bloom on average). Importantly, our method does not require labels at all, only a small sample of inputs for the task.
Our contributions can be summarized as follows: (a) we formalize the notion that better familiarity of the model with the prompt correlates with better performance (Section 2); (b) we automatically expand a given set of seed prompts using paraphrasing (Section 3); (c) we establish experimentally the hypothesis that lower perplexity of the prompt correlates with better performance (Section 5); (d) we devise a method to create a more consistent set of prompts that also improves results, even with no labels for the task (Section 7).

Why are prompts not all created equal?
Despite the popularity of prompting as a method for using language models (Shin et al., 2020; Li and Liang, 2021; Gao et al., 2021), the cause of the different behavior of various prompts remains unclear so far. Table 1 shows four example prompts for a news topic classification task (AG News) and their respective accuracies when used to prompt OPT 175B (Zhang et al., 2022). The accuracy gap between the different prompts is not trivial, and it is not possible to predict it from the prompts alone. We propose that the more frequently a prompt appears in some variation in the data, the better it works for the task. The intuition behind this is that a sequence that is more expected by the model is more likely to aid the model to extract the relevant information. However, this premise is hard to measure accurately: most language models use huge amounts of training data (e.g., OPT uses a corpus of roughly 180B tokens, and Bloom uses roughly 366B tokens), and in addition, this training data is not always publicly available (e.g., GPT3; Brown et al. 2020). Our initial attempts to estimate exact-match occurrences of prompts in the data resulted in very sparse counts, which led us to look for a softer formalization. Instead of considering the training data directly, we propose to focus on the perplexity of the prompt as a proxy for its occurrences in some form in the data, essentially indicating to what extent the model expects this prompt. This perplexity-based framing helps to avoid the challenge of exact matching in the data, and takes into account variations of the prompt that the model is also exposed to and might be influenced by. In addition, it helps overcome the challenges mentioned above as it requires neither access to the pretraining data (which is not always publicly available for LMs) nor matching over huge amounts of text.
Hypothesis: lower perplexity correlates with better performance. We hypothesize that on average, lower-perplexity prompts perform better. We are interested in establishing this hypothesis by experimentally showing a significant negative correlation between the perplexity of the prompt and its performance on the task, across a diverse set of tasks and models.
We define the perplexity of the prompt as the perplexity of the full prompt sequence, including the input itself and excluding the label, averaged over 1,000 examples (see Section 4 for details). The input is a part of the prompt in the case of the word prediction tasks by design (e.g., "The opposite of the word good is"). Inclusion of the task input as part of the prompt for classification tasks as well is intentional: we want to ground the prompt to the task (without the input, we would be testing the hypothesis that lower perplexity prompts across all tasks work better on every task). The label is not considered a part of the prompt and is not taken into consideration when computing the perplexity. In practice, this also results in a huge advantage of our method, SPELL (Section 7), which aims to find better prompts: it does not require any labels.
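This definition can be sketched in code. The `token_logprobs` function below is a hypothetical stand-in for a real LM call (e.g., to OPT or Bloom); here it is a stub that assigns a constant log-probability per whitespace token, just to make the averaging logic concrete.

```python
import math

def token_logprobs(text):
    # Stand-in for a real LM call returning per-token log-probabilities of
    # `text`; in practice this would query OPT or Bloom. This stub assigns
    # a constant log-probability per whitespace token (illustrative only).
    return [-2.0 for _ in text.split()]

def prompt_perplexity(template, task_input):
    # Perplexity of the instantiated prompt: input included, label excluded.
    prompt = template.format(input=task_input)
    logprobs = token_logprobs(prompt)
    return math.exp(-sum(logprobs) / len(logprobs))

def average_perplexity(template, inputs):
    # The paper averages over a sample of task inputs (1,000 in Section 4).
    return sum(prompt_perplexity(template, x) for x in inputs) / len(inputs)
```

A call such as `average_perplexity("{input} What topic is this news article about?", sample_inputs)` then yields the quantity correlated with performance in Section 5.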
For performance measures, we use the log-likelihood score assigned by the model to the correct label given that prompt. We choose this metric over accuracy as it gives a more fine-grained distinction between prompts and because accuracy can be unstable, as explained in more detail in Section 4. For classification tasks, we also report correlation with accuracy, which is the main evaluation metric for this type of task.

Automatic Expansion of Seed Prompts
We are interested in expanding our pool of prompts in order to: (a) have a more diverse set of prompts, making it more likely to find a better prompt for our task, and (b) support better analysis to validate our prompt quality hypothesis. In this section, we describe our method for automatically expanding a seed set of manually created prompts using paraphrasing.
Step 0: Creating a seed set of manually written prompts. We first write or collect a small set of human-written prompts that describe the task. For classification tasks, we assume that the input appears before the prompt, with no choices appearing as part of the prompt (to allow smooth paraphrasing of the prompt itself).
Step 1: Paraphrasing with GPT3. We use the text-davinci-002 version of GPT3 (Brown et al., 2020) to generate paraphrases for each of the manual prompts in our seed set. We prompt it with a meta-prompt for paraphrasing to generate variations of one of our seed prompts. An example of such a meta-prompt is: "Write a paraphrase for the following sentence: <seed prompt> Paraphrase:". The 7 meta-prompts used in this step are listed in Table 2.
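A minimal sketch of this step follows. The `complete` function is a hypothetical stand-in for a GPT3 (text-davinci-002) completion call, here stubbed with a fixed return value; only a subset of the meta-prompts is shown.

```python
META_PROMPTS = [
    "Write a paraphrase for the following sentence: {seed} Paraphrase:",
    "Paraphrase the following sentence: {seed} Paraphrase:",
    "Write a variation of this sentence: {seed}",
]  # three of the seven meta-prompts described in the text

def complete(prompt):
    # Stand-in for a GPT3 completion call; this stub returns a fixed
    # string and is purely illustrative.
    return " What topic does this news article cover?"

def paraphrase_with_lm(seed_prompts):
    # Apply every meta-prompt to every seed prompt and collect the
    # distinct completions that differ from the seeds themselves.
    paraphrases = set()
    for seed in seed_prompts:
        for meta in META_PROMPTS:
            candidate = complete(meta.format(seed=seed)).strip()
            if candidate and candidate not in seed_prompts:
                paraphrases.add(candidate)
    return sorted(paraphrases)
```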
We choose GPT3 as our paraphrasing model because of its well-documented generation abilities. This is also to ensure that there is a separation between the model we use to create the prompts and the models we use to rank them (OPT and Bloom, see Section 4 for details), to avoid confounding the experimental setup.
Step 2: Paraphrasing using backtranslation. Our second step takes as input the paraphrases from GPT3 (in addition to the seed set of prompts) and translates them into different languages and back into English to get additional prompt paraphrases (Wieting et al., 2017). We use a set of 8 languages available in the NLLB translation model (Costa-jussà et al., 2022) that are relatively high resource and close to English, to reduce the risk of noise. Since we aim to get about 100 prompts per task, we add 8 additional languages in cases where the basic 8 languages yielded too few alternatives. For word prediction tasks, we use the sequence of the created prompt up to the index of the label, not including the label, for example: The word "dog" in French is ". Depending on the task, we enforce the existence of specific words (e.g., the name of the language, and the source word, in word-level translation) or enforce the prompt to be a question.

Examples and Statistics

Table 4 lists all 4 manually created prompts we use for the AG News task (news classification), alongside a few sampled prompts created automatically using our method. As was typically the case, we are able to get prompts that are rather different in phrasing and structure from those included in the seed set. The statistics of the prompts in the manually created seed set (Step 0), as well as the prompts after Step 1 and Step 2 for each task (see Section 4.1 for details about the tasks), are detailed in Table 3.
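The backtranslation step (Step 2) can be sketched as a round trip through pivot languages. The `translate` function below is a hypothetical stand-in for the NLLB model (its stub merely rewords the English output), and the pivot list is an illustrative subset of the languages used.

```python
PIVOT_LANGS = ["fra_Latn", "deu_Latn", "spa_Latn", "ita_Latn"]  # illustrative subset

def translate(text, src, tgt):
    # Stand-in for the NLLB translation model; this stub only simulates a
    # round trip that slightly rewords the English output (hypothetical).
    if tgt == "eng_Latn":
        return text.replace("article", "story")
    return text

def backtranslate_expand(prompts, target_size=100):
    # Round-trip each prompt through the pivot languages and collect any
    # new English variants, stopping once the pool is near the target size.
    pool = set(prompts)
    for prompt in list(pool):
        for pivot in PIVOT_LANGS:
            forth = translate(prompt, "eng_Latn", pivot)
            back = translate(forth, pivot, "eng_Latn").strip()
            pool.add(back)
            if len(pool) >= target_size:
                return sorted(pool)
    return sorted(pool)
```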

Models, Tasks and Datasets
We study four auto-regressive models: OPT (Zhang et al., 2022) of different sizes (1.3B, 30B, 175B parameters), all trained mainly on English, and Bloom (176B parameters; Luccioni et al. 2022), which is trained on 46 natural languages and 13 programming languages. We experiment with two types of tasks: word prediction tasks and classification tasks, as detailed below.

Word Prediction Tasks
The first task in this category is word-level translation. Given a source word in English and a target language, we expect the model to predict the correct translation. For this task we use NorthEuraLex (Dellert et al., 2019), a lexical database providing translations of 1016 words into 107 languages. We experiment with 9 languages that use the Latin script. For Bloom, we use 5 additional languages that do not use the Latin script (since Bloom is multilingual). Note that only 5 of the languages we experiment with are officially covered by Bloom.

We also consider antonym prediction where, given a word, the model is expected to predict its antonym. For this task, we use data from Kaggle, which is based on WordNet (Miller, 1995). We choose 1,000 word pairs at random.

Meta prompts (Table 2):
- Write a paraphrase for the following sentence: <seed-prompt> Paraphrase:
- <seed-prompt> Paraphrase:
- Write a likely paraphrase of the text: <seed-prompt> Paraphrase:
- Write a sentence similar to the following one: <seed-prompt> Paraphrase:
- Paraphrase the following sentence: <seed-prompt> Paraphrase:
- Write a variation of this sentence: <seed-prompt>
- How would you say the following sentence in a different way? <seed-prompt>

Among the classification tasks is the detection of offensive tweets (classification of tweets as offensive vs. not offensive; Barbieri et al. 2020). We use 1,000 random examples from each dataset.

Classification Tasks
The full set of manual prompts is listed in Section A in the Appendix. In these tasks, the prompt follows the input, and at the end of each prompt we add the choices of classes (i.e., we provide the possible labels explicitly in the prompt by listing the possible answers as defined by the dataset itself): "Choices: X, Y, Z. Answer:", as we find it helps in terms of accuracy. Defining the label space likely helps in our zero-shot setting because there are no previous demonstrations from which the model can learn the possible classes. Additionally, adding class options to the prompt helps to reduce the effect of surface form competition (Holtzman et al., 2021). The option of generating the answer and comparing it with the gold label was not reasonable here, since we cannot expect the model to generate the exact label as the first choice often enough.
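The classification prompt format described above can be assembled as follows; the instruction and label names in the example are illustrative.

```python
def build_prompt(task_input, instruction, labels):
    # Input first, then the instruction, then the explicit label choices,
    # following the "Choices: X, Y, Z. Answer:" format described above.
    return f"{task_input} {instruction} Choices: {', '.join(labels)}. Answer:"
```

For instance, `build_prompt("Stocks rallied today.", "What topic is this news article about?", ["World", "Sports", "Business", "Sci/Tech"])` produces a single string ending in "Answer:", ready for label scoring.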

Implementation Details
In all experiments we evaluate zero-shot performance. To avoid noise when computing perplexity, we instantiate the prompts with 1,000 examples of the dataset, compute the perplexity of the prompt with each example, and calculate the average across all instantiated prompts.
To estimate the performance of the prompt, we look at two measures: (a) the language model score (log probability) of the correct label, averaged across 1,000 examples; (b) the accuracy on the task, computed over the 1,000 examples. To compute accuracy, for each example we score all classes and choose the highest ranking class as the prediction of the model. The score of a label of multiple tokens is defined by the sum of the token scores.
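The accuracy computation just described can be sketched as below, where `token_logprob` is a hypothetical stand-in for an LM call returning the log-probability of a label token given the prompt.

```python
def label_score(prompt, label, token_logprob):
    # Multi-token label score = sum of its token log-probabilities
    # given the prompt; `token_logprob` stands in for an LM call.
    return sum(token_logprob(prompt, tok) for tok in label.split())

def predict(prompt, labels, token_logprob):
    # Score every class and pick the highest-scoring one as the prediction.
    return max(labels, key=lambda lab: label_score(prompt, lab, token_logprob))

def accuracy(examples, labels, token_logprob):
    # `examples` is a list of (instantiated prompt, gold label) pairs.
    correct = sum(predict(p, labels, token_logprob) == gold for p, gold in examples)
    return correct / len(examples)
```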
For the word prediction tasks we only report scores, since accuracy in general is less stable, suffers more from surface form competition (Holtzman et al., 2021), and is usually quite low for these tasks in our setting (the chances the model will generate an exact match of the label are low). Hence, the score of the correct label gives a better estimate of the actual performance of the model.

Results
Classification Tasks and Antonym Prediction. Table 5 depicts the Pearson and Spearman correlation results on the classification tasks and the antonym task, with both OPT 175B and Bloom (two upper blocks). We see that most correlations are negative and statistically significant, as we expect. This validates our hypothesis and shows that in the majority of tasks we indeed get a strong correlation between low perplexity of the prompt and better performance on the task. For each task we also report the average accuracy.

Word-Level Translation
The results of the word-level translation task are reported in Table 6. Here the correlations are extremely consistent across all languages and across models, with statistical significance for all languages except Catalan and Japanese (in Bloom).

Results across Different Model Sizes
We repeat the same experiment with the OPT models of sizes 1.3B and 30B, to investigate whether these correlations are also consistent across model sizes or whether this is a phenomenon we should expect only in large language models. Table 5 (two lower blocks) shows these results for all classification tasks and antonym prediction. We do see that in general the trend appears to be the same in the smaller models as well; however, the correlations seem to be slightly weaker. We hypothesize that this might be due to the overall lower performance of these smaller models, making the performance results we use for correlation less stable and reliable. For word-level translation, however, all correlations with the 30B and 1.3B models are similar to those with the 175B model, and are all statistically significant (also after Bonferroni correction for multiple hypotheses).
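Correlations of the kind reported here would typically be computed with `scipy.stats.pearsonr` and `scipy.stats.spearmanr`; a minimal dependency-free sketch of the Pearson coefficient between per-prompt perplexities and performance scores is:

```python
import math

def pearson(xs, ys):
    # Pearson correlation between prompt perplexities (xs) and
    # performance scores (ys); a negative value supports the hypothesis.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```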

Analysis
Next, we further explore the observed relationship between model perplexity and prompt performance. Despite the consistently high correlation between these two factors, the structure of this relationship varies across tasks (Section 6.1). Additionally, we find that the automatically added prompts are high-quality and not a significant source of noise (Section 6.2), and that the best prompts selected by our approach vary across models (Section 6.3).

Visualizing the Relationship between Perplexity and Performance
To visualize the correlations we get between the perplexity and the performance of the prompts across the different settings, we plot a few examples for different tasks and languages. Figures 1 and 2 show some of the results for selected tasks, as detailed in the captions. The negative trend of the correlation is clearly visible in all plots. Interestingly, the structure of the plots for word-level translation is very similar across all the language pairs, suggesting that prompts get consistent perplexity and performance across languages (possibly at different scales). Indeed, the intersection of the 10 lowest perplexity prompts between any two different languages is 8.6 and 8.4 on average (for OPT 175B and Bloom, respectively), which is extremely high. This is not very surprising since we know that the only differences between the prompts in the different languages are the names of the target languages (e.g., The word for "dog" in French is "). Additionally, the intersection of the 10 prompts with the highest label score between any two different languages is 7 and 6.5 on average (for OPT 175B and Bloom, respectively). A notable finding that appears in the word-level translation plots is the clear separation between prompts that include or do not include quotation marks for the label (which usually aligns with whether the prompt uses quotation marks for the source word); three example prompts appear in the plot. Prompts with quotation marks for the words tend to have both lower perplexity and better performance, consistently. We further analyze the results for OPT 175B within clusters (with/without quotation marks). In the cluster with quotation marks, we get negative correlations (in the range of -0.28 to -0.38) that are statistically significant for almost all languages. The correlations within the other cluster are weaker and less significant (this is expected given the overall lower performance of that cluster).
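The overlap statistics quoted above amount to intersecting top-k lists; a small sketch (with hypothetical toy data) is:

```python
def topk_overlap(ppl_a, ppl_b, k=10):
    # Size of the intersection of the k lowest-perplexity prompts under
    # two settings (each argument maps prompt -> average perplexity).
    lowest = lambda scores: set(sorted(scores, key=scores.get)[:k])
    return len(lowest(ppl_a) & lowest(ppl_b))
```

Averaging `topk_overlap` over all pairs of settings (language pairs, or models) yields the reported intersection sizes.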
Figure 2: Score of correct label vs. perplexity for the word-level translation task in French with OPT 175B. The x axis is in log scale. The blue points stand for prompts with quotation marks for the words, while the yellow points are of prompts without quotation marks.

Effect of Noisy Prompts
We expect our automatic method for expanding the set of prompts to also introduce some noise. Although our focus is on the lower perplexity prompts, since we want to benefit from this analysis and be able to devise a method for creating better prompts, we do want to make sure that this potential noise is not the cause of the strong correlations we get. In other words, one might claim that some noisy prompts have particularly high perplexity and also perform badly, thus supporting our hypothesis in an undesirable and uncontrolled manner.
We turn to inspect the 10% highest perplexity prompts in the different tasks and find subjectively that they are not noisy, and are usually valid prompts for the tasks. The 5 highest perplexity prompts for the GLUE CoLA task are listed in Table 7 as an example.

prompt                                     ppl
Is this example correct English usage?     25.79
Is this example using English correctly?   25.46
Is this example correct English?           25.33
Is this the example in correct English?    25.00
Is English in this example correct?        24.90

As a sanity check, we choose two tasks, word-level translation and AG News, manually filter out the noisy prompts, and compute the correlations again. The annotation is done by external annotators (NLP researchers) who were presented with the tasks and asked to label whether the prompt is reasonable to use for the task. The new correlations with OPT 175B are reported in Table 8. We find that all correlations remain strong and statistically significant when noise is manually removed from the analysis. We get the same trends with Bloom as well.

Best Performing Prompts
Table 9 lists the 5 lowest perplexity prompts for the task of antonym prediction, as an example. Similar lists for the rest of the tasks are listed in Section B in the Appendix.
prompt                                                           ppl
The following two words are antonyms: "good" and "               10.24
The antonym of the word "good" is "                              10.32
The word that has the opposite meaning of the word "good" is "   10.43
The word "good" is the antithesis of the word "                  10.85
The word "good" is the opposite of the word "                    11.15

A closer look at the lowest perplexity prompts reveals that the intersection of the 10 lowest perplexity prompts between OPT 175B and Bloom is 7.1 on average, across the classification tasks. When looking at the 10 highest accuracy prompts across models, we get an average intersection of 3.1 across the classification tasks.

SPELL: Selecting Prompts by Estimating LM Likelihood
The primary contribution of this work is the analysis of the relationship between prompt perplexity and downstream task performance (Section 5). As one potential application of our findings, we also present a new method, SPELL, for generating and selecting consistently effective prompts. Assuming a fixed computational budget for finding effective prompts for a given task, and that the search space might be quite large, we devise the following straightforward procedure:

1. Obtain a small set of manually created prompts for the task.
2. Expand the set of prompts with automatic paraphrasing using a LM (e.g., GPT3) and backtranslation (see Section 3).
3. Rank the list of prompts by perplexity (averaged on a representative sample of task inputs, e.g., 1,000).
Using this algorithm, we show empirically that it is best to prioritize experimenting with the lowest perplexity prompts, as they are more stable (exhibit less variation in performance) and perform better than manual prompts on average. This method also does not require any labels for the task, and is applicable to any task, even by non-experts, given only example inputs.
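The three-step procedure can be sketched as a single function; `expand` and `avg_perplexity` are hypothetical stand-ins for the paraphrasing pipeline (Section 3) and the LM perplexity estimate (Section 4).

```python
def spell(seed_prompts, expand, avg_perplexity, k=3):
    # SPELL: expand the manually created seed set, then rank all
    # candidates by average prompt perplexity and keep the k lowest.
    # No task labels are needed at any point.
    candidates = set(seed_prompts) | set(expand(seed_prompts))
    return sorted(candidates, key=avg_perplexity)[:k]
```

The returned k prompts are the ones to try first under a fixed experimentation budget.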

Empirical Validation of SPELL
To show the effectiveness of our method, we report the results we get using SPELL across the different tasks. In Table 10 we report the average accuracy with the manual prompts compared to the average accuracy with the 3 lowest-perplexity prompts, for both OPT 175B and Bloom. Indeed, in most cases, the average accuracy using the 3 lowest perplexity prompts outperforms the average accuracy of the manual prompts, with an average gain of 1.8 accuracy points across tasks with OPT and 2.3 accuracy points with Bloom, demonstrating the effectiveness of our method.

Table 10: The average accuracy with the manual prompts (manual) compared to the average accuracy with the 3 lowest-perplexity prompts (low-ppl), for both OPT 175B and Bloom, across tasks.
The variability in accuracy of the 3 lowest perplexity prompts is also much lower than that of the manually created prompts: with OPT 175B, the average standard deviation within the 3 lowest perplexity prompts (across tasks) is 5.07, vs. 6.86 for the manual prompts, and with Bloom the gap is much bigger, with an average of 2.6 for the 3 lowest perplexity prompts vs. 7.47 for the manual ones. This further shows that SPELL is more stable and reliable compared to using an arbitrary set of manually created prompts. SPELL sets the stage for further development in this direction, and serves as an initial indication of the benefits of involving perplexity estimation in the process of generating effective prompts.

Related Work
Relation between performance and training data. Previous work looking directly into the relation between the training data and performance is limited. Razeghi et al. (2022) study numeric deduction tasks, and examine the correlations between the model performance on specific test instances and the frequency of terms from those instances in the pretraining data. They find that the models are more accurate on instances whose terms are more prevalent in the training data. Additionally, Han and Tsvetkov (2022) propose a method to effectively identify a very small subset of pretraining data that directly supports the model in performing a specific task. Elazar et al. (2022) use causal inference to measure the effect of pretraining data statistics on factual knowledge performance, and Kandpal et al. (2022) show correlational and causal relationships between accuracy and relevant document count (from training data) for QA datasets.
Prompt tuning and analysis. There is a very rich line of work trying to find prompts automatically. Shin et al. (2020) present an automated method to create discrete prompts for a diverse set of tasks, based on a gradient-guided search, and they demonstrate their method on masked LMs. Other work also focuses on discrete prompts, aiming to improve zero-shot performance (Gao et al., 2021; Le Scao and Rush, 2021; Deng et al., 2022; Shi et al., 2022), or trains continuous prompts (Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021).
On top of works that suggest a variety of methods for creating better prompts, some work also analyzes those prompts to gain insights about them: Khashabi et al. (2022a) find that model performance is highly sensitive to small changes in wording, and Khashabi et al. (2022b) point to a surprising disconnect between continuous and discrete prompts.

Conclusion
We investigate the phenomenon where some prompts perform better than others despite appearing similar to the human users of LMs. Specifically, we hypothesize that the perplexity of a prompt under a given LM is closely tied to its task performance. We test this theory on a large number of tasks and autoregressive LMs, and the resulting correlation study validates our hypothesis. Further analysis of this relationship demonstrates that the best prompts differ across models, highlighting the importance of model-specific analysis, and that the underlying structure of the relationship between perplexity and performance varies across tasks.
In light of these findings, we then propose a method, SPELL, to help users find well-performing prompts for new tasks. Empirical validation of the proposed procedure shows that SPELL generates effective prompts with low variability in performance, and produces small gains of 1.8 (2.3) accuracy points with OPT (Bloom) over manual prompts. We therefore conclude that SPELL provides a general and interpretable approach for applying LMs to new tasks while requiring minimal human effort, and no labels.

Limitations
Searching for human-readable prompts. We limit our search space to human-readable prompts that are fluent and accurately describe the task at hand, as we are primarily motivated by understanding why some relevant prompts work better than others. We do this by using manually created prompts and their automatically created paraphrases. Our findings may not hold when the possible prompt space is expanded to include any token sequence; we leave this direction to future work.
Generality of our analysis and of the SPELL method. We perform our analysis on and build our method around specific models, namely OPT and Bloom. Additionally, our study is limited to the specific tasks we experiment with and to English. It is possible that our analysis and the SPELL method do not generalize to other pretrained models or tasks; however, we consider models of various sizes and from different sources, and a wide range of tasks, to mitigate this risk.

Figure 1 :
Figure 1: Accuracy vs. perplexity for the AG News dataset with OPT 175B. The x axis is in log scale. Each point stands for a different prompt.

Table 1 :
Example prompts for the task AG News (news classification) that vary considerably in accuracy.

Table 2 :
Meta prompts used in Step 1 of our method for paraphrasing using GPT3.
All Manually Created Prompts | Examples of Similar Automatically Created Prompts
What label best describes this news article? | What's the most accurate label for this news article?
What is this piece of news regarding? | What does this piece of news concern?
Which newspaper section would this article likely appear in? | In what section of the newspaper could this article be published?
What topic is this news article about? | What category does this article fall into?

Table 4 :
Prompts for the task AG News (news classification): the manually created prompts and a sample of automatically created prompts using our method.

Table 5 :
Correlation results for the different tasks, with OPT (different sizes) and Bloom. Correlations with p < 0.05 are marked with *. Correlations with p < 0.00625 (according to Bonferroni correction for multiple hypotheses) are marked with **. Dark and light blue colored cells stand for negative correlations < −0.2 and > −0.2, respectively.

Table 7 :
Example of the 5 highest perplexity prompts for GLUE Cola, using OPT 175B.

Table 8 :
Correlations before and after filtering out noisy prompts, with AG News and Word-Level Translation (WLT).

Table 9 :
Lowest perplexity prompts for the antonym prediction task, using OPT 175B.