Toward Human Readable Prompt Tuning: Kubrick’s The Shining is a good movie, and a good prompt too?



1 Introduction
Large language models (LMs) can perform downstream tasks by simply conditioning on a prompt: a short sequence of text specific to the task. Such natural language prompts are either carefully hand-engineered (e.g., manual prompt engineering, Kojima et al. 2022) or automatically learned from labeled data (e.g., gradient-based prompt tuning, Shin et al. 2020). Despite their effectiveness, it remains unclear what makes these prompts work and what attributes effective prompts share. In this paper, we aim to identify key characteristics of effective prompting, and use this knowledge to generate effective and human-readable prompts without any labeled data.
Figure 1: Compared with the previous discrete prompt tuning method AutoPrompt (Shin et al., 2020), which generates gibberish prompts, FLUENTPROMPT identifies effective and more readable prompts, useful for downstream analyses. We find that in prompt tuning, good prompts are topically relevant to the task domain (e.g., mentioning the film director "Kubrick" in a movie sentiment classification task) and calibrate the prior probability of output labels (e.g., including negative words to balance an overly optimistic model).
There are two main challenges for performing this type of analysis. First, manual prompt tuning produces a limited number of effective prompts for each task, making it difficult to infer common features of good prompts, for which a contrast with less effective prompts is needed. Second, the prompts found by gradient-based tuning methods are often disfluent and unnatural, making them difficult to interpret (e.g., AutoPrompt in Figure 1).
To overcome these challenges, we first propose a human-readable prompt tuning method called FLUENTPROMPT. Inspired by prior work in controllable text generation (Kumar et al., 2022), FLUENTPROMPT uses Langevin dynamics to generate a set of human-readable prompts for any task. Our method adds progressive noise to the tuning procedure to obtain a distribution of effective prompts, while also maintaining the fluency of the prompts through a perplexity constraint. As shown in Figure 1, compared to the gibberish prompts produced by the AutoPrompt baseline, FLUENTPROMPT generates prompts that are more fluent (i.e., lower perplexity) and perform competitively. The resulting fluent prompts not only facilitate our further analysis, but can also lead to better trust and engagement from both researchers and end users.
After obtaining a broad set of effective and human-readable prompts, we analyze the factors that contribute to their effectiveness. Specifically, we show that effective prompts are both (1) topically related to the task domain and (2) better calibrated to the prior probability of output labels. Here, calibration measures how balanced the LM's prior probability over the output labels (i.e., in the absence of a specific example) is.
Based on our findings, we propose a novel method, UNSUPERVISED FLUENTPROMPT, for automatically searching for effective prompts using only unlabeled data. UNSUPERVISED FLUENTPROMPT optimizes the prompts for both better calibration and better domain relevance. Our experimental results show that UNSUPERVISED FLUENTPROMPT outperforms a strong zero-shot baseline (Holtzman et al., 2021) by 7.0% in accuracy. We summarize our contributions as follows:
• We introduce FLUENTPROMPT, a human-readable prompt tuning method that can generate a broad set of effective and fluent prompts (§3). This method not only serves as the foundation for our analysis, but also helps bridge the gap between manual prompt engineering and gradient-based prompt tuning.
• We analyze the factors that contribute to the effectiveness of prompts and show that topic relatedness and calibration of the prompts are key to their success (§4).
• Inspired by our findings, we introduce a new method for discovering effective prompts without the need for labeled data (§5).
2 Related Work

2.1 Prompt Tuning
Continuous Prompt Continuous prompts are continuous vectors inserted into the task input of a prompted language model (Qin and Eisner, 2021; Ding et al., 2021; Lester et al., 2021; Liu et al., 2021). Such continuous prompts are typically tuned by gradient-based methods, guided by labeled training examples for the task. While these prompts usually improve model performance, their continuous nature makes them difficult for humans to understand or interpret (Khashabi et al., 2021; Hambardzumyan et al., 2021).
Discrete Prompt Discrete prompts are composed of discrete tokens from a natural language vocabulary. Such prompts can either be written by humans or searched for automatically. Human-written prompts (Kojima et al., 2022; Wang et al., 2022; Sanh et al., 2021; Su et al., 2022) typically consist of meaningful text such as task descriptions (Schick and Schütze, 2021) or instructions (e.g., "let's think step by step", Kojima et al. 2022), which are not only human-readable but also align with our understanding of tasks. In-context demonstration examples can also be considered human-written prompts (Brown et al., 2020; Liu et al., 2022) but are not the focus of this work.
Prior work has also focused on searching for discrete prompts automatically. One method is gradient-based, similar to the continuous prompt setup but with projections to a discrete vocabulary (Shin et al., 2020). The drawback of this method is that the resulting prompts are usually disfluent and difficult to read. Other work on searching for discrete prompts includes edit-based enumeration (Prasad et al., 2022), reinforcement learning (Deng et al., 2022), and large language model continuation and filtering (Zhou et al., 2022). The goal of these prompt tuning methods is mainly to achieve competitive task performance without modifying language model parameters.
The purpose of our work is to analyze what aspects of tuned natural language prompts make them effective for zero-shot inference with language models. To facilitate such analysis, we need the readability of human-written prompts and also the large search space of gradient-based discrete prompt tuning. FLUENTPROMPT bridges this gap and provides a distribution of effective and human-readable prompts.

2.2 Analyses of Prompts
A growing body of literature tries to understand the mechanisms behind prompts from various perspectives. For example, prompts in the form of in-context examples have been analyzed under perturbations of order, labels, editing, etc. (Lu et al., 2022; Min et al., 2022; Chen et al., 2022). Human-written instructions (Mishra et al., 2021) have also been studied and show weak sensitivity to semantic-changing perturbations (Webson and Pavlick, 2021). Gonen et al. (2022) apply paraphrasing and back-translation to a set of human-written prompts and find that, for these natural prompts, lower perplexity correlates with better performance.
Our work focuses on natural language prompts derived from gradient-based prompt tuning. Khashabi et al. (2021) tune continuous prompts and show that effective continuous prompts may transfer poorly to their nearest discrete prompts. In contrast, we perform prompt tuning in the discrete space directly with FLUENTPROMPT, demonstrating the feasibility of searching for readable prompts with a gradient-based method. This approach gives us a more faithful understanding of the factors that contribute to the effectiveness of natural language prompts.

3 FLUENTPROMPT
We introduce FLUENTPROMPT, a prompt tuning method that generates a group of highly effective and human-readable prompts. Our approach uses Langevin dynamics to incorporate fluency constraints into the prompt tuning process, making it a novel application of controllable text generation and constrained sampling to discrete prompt tuning. With human-readable prompts, we aim to explore the relationship between the features of prompts and their performance.

3.1 Background: continuous prompt tuning
Given an input example x with an output label y ∈ Y, we can prompt an autoregressive language model with parameters θ as follows. We reformulate the task as a language modeling problem by inserting a task-specific template t into x, and defining a verbalizer v mapping from a label y to a label word (i.e., a token in the LM's vocabulary that semantically implies the label). For example, to determine the sentiment of "I like the movie", we can pass "I like the movie. It was [MASK]" to the LM and inspect the probability of "good" as a label word. Specifically, the probability of a label y given an input x and template t is estimated by:

$$p_\theta(y \mid x, t) = p_\theta(v(y) \mid x, t).$$

Lester et al. (2021) add a sequence of M soft embeddings ẽ_{0:M} (simplified as ẽ; 0:M refers to the positional subscript for the sequence from position 0 to M − 1) in front of the input. The probability of the label is then computed by p_θ(v(y) | ẽ, x, t), where ẽ consists of embeddings that bypass the word embedding layer of the LM θ and are learned from a set of training data. These learned embeddings are sometimes referred to as soft prompts, and the learning of such prompts as soft prompt tuning. For example, if stochastic gradient descent (SGD) is used as the optimizer, the soft prompt ẽ is updated as

$$\tilde{e}^{(i+1)} = \tilde{e}^{(i)} - \eta \nabla_{\tilde{e}^{(i)}} \mathcal{L}_{\text{task}}\big(\tilde{e}^{(i)}\big),$$

where i is the timestep superscript, referring to the i-th optimization step, $\mathcal{L}_{\text{task}}(\tilde{e}) = -\log p_\theta(v(y) \mid \tilde{e}, x, t)$ is the task loss, and η is the learning rate.
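To make the setup concrete, below is a minimal sketch of soft prompt tuning in PyTorch with a HuggingFace GPT-2 model. This is our illustration, not the paper's implementation: the hyperparameters, the single-example loss, and the initialization scale are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative sketch of soft prompt tuning; hyperparameters are assumptions.
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.requires_grad_(False)                      # LM parameters theta stay frozen
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
emb = model.get_input_embeddings()               # word embedding layer

M, eta = 10, 1.0                                 # prompt length and step size
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(M, emb.embedding_dim))
opt = torch.optim.SGD([soft_prompt], lr=eta)

def task_loss(text, label_word):
    """L_task = -log p(v(y) | e~, x, t), read off at the template's end."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    x = torch.cat([soft_prompt.unsqueeze(0), emb(ids)], dim=1)
    logits = model(inputs_embeds=x).logits[0, -1]    # next-token logits
    label_id = tokenizer(" " + label_word).input_ids[0]
    return -torch.log_softmax(logits, dim=-1)[label_id]

# one SGD step: e~^(i+1) = e~^(i) - eta * grad
loss = task_loss("I like the movie. It was", "good")
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the loss would be averaged over a batch of labeled examples; only the prompt embeddings receive gradient updates.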

3.2.1 Discrete prompt tuning with Langevin dynamics
There are two challenges with the above soft prompt tuning. First, the resulting embeddings cannot be mapped to the natural language vocabulary: Khashabi et al. (2021) show that naively mapping an effective soft prompt to its nearest tokens significantly drops performance. Second, we only obtain a single embedding instead of a range of embeddings with varying levels of performance, which makes it difficult to analyze the characteristics of the prompts and compare their effectiveness on specific tasks. Following Kumar et al. (2022), we use Langevin dynamics to sample discrete prompts that lead to a better performing model on the task. Overall, the method is similar to SGD but adds progressive Gaussian noise to the embeddings, with the scale of the noise decreasing over time. Additionally, at each optimization step, the updated embedding is projected to the nearest embedding in the LM vocabulary:

$$\tilde{e}^{(i+1)} = \mathrm{Proj}_{\mathbf{E}}\Big(\tilde{e}^{(i)} - \eta \nabla_{\tilde{e}^{(i)}} E\big(\tilde{e}^{(i)}\big) + \sqrt{2\eta\beta^{(i)}}\, z\Big),$$

where:
• E is an energy function (lower is better),
• z is Gaussian noise, z ∼ N(0, I_{|ẽ|}),
• β^{(i)} is the variance of the noise, following a geometric progression over timesteps,
• $\mathbf{E}$ is the embedding table (layer) of the LM θ, with one embedding for each token in the vocabulary,
• Proj_E is a projection operation finding the nearest neighbor of each soft embedding in the LM's vocabulary: $\mathrm{Proj}_{\mathbf{E}}(\tilde{e}_m) = \arg\min_{e \in \mathbf{E}} \lVert e - \tilde{e}_m \rVert_2$.

Without the progressive noise in Langevin dynamics, our prompt search procedure is gradient-based and shares a similar intuition with AutoPrompt (Shin et al., 2020). Both methods use the gradient of the loss w.r.t. the embeddings, though AutoPrompt applies greedy substitution whereas we use projected gradient descent, aligning with soft prompt tuning and enabling the subsequent prompt sampling. AutoPrompt also incorporates verbalizer word selection, which is not a focus of the analysis in this work. We use our gradient-based, discrete prompt tuning method without Langevin dynamics as a baseline, referred to as AutoPrompt_SGD.
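A sketch of one projected Langevin step under the update rule above. The L2 nearest-neighbor projection and the exact form of the geometric noise schedule are our reading of the equation; `energy_fn` stands in for the energy E.

```python
import torch

def beta_schedule(beta_start, beta_end, num_steps):
    """Noise variance following a geometric progression from start to end."""
    ratio = (beta_end / beta_start) ** (1.0 / max(num_steps - 1, 1))
    return [beta_start * ratio ** i for i in range(num_steps)]

def langevin_step(e_tilde, energy_fn, emb_table, eta, beta_i):
    """One update: gradient descent on E, plus noise, then Proj_E."""
    e_tilde = e_tilde.detach().requires_grad_(True)
    energy_fn(e_tilde).backward()                 # grad of E w.r.t. e~
    z = torch.randn_like(e_tilde)                 # z ~ N(0, I)
    e_new = e_tilde - eta * e_tilde.grad + (2 * eta * beta_i) ** 0.5 * z
    # Proj_E: snap each position to its nearest vocabulary embedding (L2)
    nearest = torch.cdist(e_new, emb_table).argmin(dim=-1)
    return emb_table[nearest].detach(), nearest   # embeddings and token ids
```

Because each step returns vocabulary embeddings, the final prompt decodes directly to natural language tokens; with β fixed at 0 the procedure reduces to the AutoPrompt_SGD baseline described above.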

3.2.2 Fluency constraint
Sampling from projected Langevin dynamics ensures that the tuned prompt contains natural language tokens. However, without extra constraints, these tokens can form a disfluent sequence.
We explicitly incorporate a fluency objective into the Langevin energy function. This objective resembles the regular perplexity loss, but the labels (the next tokens in the prompt) are not ground truth. Instead, we measure an embedding-based sequence probability following Kumar et al. (2022). For simplicity, below we drop the timestep superscript on the prompt embeddings and keep only the positional subscript.
The first step is to obtain the probability of generating the embedding at position m (i.e., ẽ_m) based on the preceding embeddings (i.e., ẽ_{0:m}). We extract the last hidden state of the LM (i.e., the output embedding) at position m − 1: h_{θ,m−1} = h_θ(ẽ_{0:m}). Then the probability is:

$$p_\theta(\tilde{e}_m \mid \tilde{e}_{0:m}) = \frac{\exp(h_{\theta,m-1} \cdot \tilde{e}_m)}{\sum_{e \in \mathbf{E}} \exp(h_{\theta,m-1} \cdot e)},$$

where we equivalently compute the logits for the vocabulary item corresponding to each embedding and take the softmax. Subsequently, the sequence probability is

$$p_\theta(\tilde{e}_{0:M}) = \prod_{m=1}^{M-1} p_\theta(\tilde{e}_m \mid \tilde{e}_{0:m}).$$

We define a prompt fluency loss as the negative log-likelihood of the prompt embeddings, −log p_θ(ẽ_{0:M}). Along with the task labeling loss (§3.2.1), we modify our energy function to:

$$E(\tilde{e}) = \lambda_{\text{task}} E_{\text{task}}(\tilde{e}) + \lambda_{\text{fluency}} E_{\text{fluency}}(\tilde{e}),$$

where λ_task + λ_fluency = 1. Throughout the FLUENTPROMPT tuning procedure, the language model parameters θ are fixed while the embeddings ẽ_{0:M} are tuned.
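A simplified sketch of the fluency term. Since each prompt embedding is projected onto a vocabulary token, we score the projected token sequence directly here, a simplification of the embedding-based formulation above; GPT-2 ties its LM head to the embedding table, so these logits equal h_{m−1}·E^T. The weighting reuses `task_loss` and `model` from the earlier sketch, and the λ value is the one reported for SST-2.

```python
import torch

def fluency_loss(prompt_ids, model):
    """E_fluency = -log p(e~_{0:M}) = -sum_m log p(e~_m | e~_{0:m})."""
    logits = model(prompt_ids.unsqueeze(0)).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)   # position m-1 predicts m
    return -logp[torch.arange(prompt_ids.numel() - 1), prompt_ids[1:]].sum()

# combined energy with lambda_task + lambda_fluency = 1
lam_fluency = 0.003                                  # value used for SST-2
def energy(prompt_ids, text, label_word):
    # task_loss (first sketch) scores the current soft prompt on (x, t, y)
    return ((1 - lam_fluency) * task_loss(text, label_word)
            + lam_fluency * fluency_loss(prompt_ids, model))
```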

3.3 Experimental Setup
Target tasks We evaluate performance on two sentiment analysis tasks, Amazon Polarity (McAuley and Leskovec, 2013) and SST-2 (Socher et al., 2013), and one topic classification task, AGNEWS (Zhang et al., 2015). These tasks were selected because vanilla soft prompt tuning (Lester et al., 2021) substantially improves model performance on them. In contrast, tasks like RTE (Dagan et al., 2005) are more difficult; soft prompt tuning did not yield a significant improvement in our pilot study (57.4% accuracy from prompt tuning compared with 52.1% from random guessing), and we therefore did not pursue further analysis with FLUENTPROMPT. The verbalizer words and templates used for each task are listed in Table 7. Additionally, we perform experiments with β_start = β_end = 0 (i.e., no progressive noise) and λ_fluency = 0 (i.e., no fluency constraint) as ablations of FLUENTPROMPT for analysis purposes.

3.4 Results
Table 1 shows example prompts found by AutoPrompt_SGD and FLUENTPROMPT, along with their associated accuracy and perplexity. We additionally show the accuracy of an empty prompt (i.e., ẽ is null). We see that FLUENTPROMPT performs comparably to AutoPrompt_SGD and significantly better than the empty prompt. In terms of readability, FLUENTPROMPT generates more fluent prompts than AutoPrompt_SGD.
In Table 2, we quantitatively compare AutoPrompt_SGD and FLUENTPROMPT.
For each task, we use each method to generate 40 prompts of length M = 10, under 4 step sizes η ∈ {0.3, 1.0, 3.0, 10.0} and 10 random seeds. AutoPrompt_SGD does not apply a perplexity constraint in its prompt tuning process (λ_fluency = 0). For FLUENTPROMPT, we apply an optimal perplexity constraint of λ_fluency = 0.003, 0.003, and 0.01 for SST-2, Amazon, and AGNEWS, respectively. We observe that on Amazon, FLUENTPROMPT achieves both a better average prompt accuracy and a better maximum accuracy. On SST-2 and AGNEWS, FLUENTPROMPT also achieves better average accuracy, with a maximum accuracy nearly as high as that of AutoPrompt_SGD. For all three tasks, FLUENTPROMPT produces prompts with significantly lower perplexity (p < 0.0001 in t-tests). Without sacrificing performance, prompts with lower perplexity are preferred for their potentially better readability in downstream analyses.

4 What Makes Good Prompts?
In this section, we analyze common attributes of the effective tuned prompts. Specifically, we study the 10-token prompts found by FLUENTPROMPT on SST-2, Amazon, and AGNEWS.

4.1 Effective prompts calibrate the output distribution over label words
Language models are known to be biased towards label words that are common in their pretraining distribution (Holtzman et al., 2021; Zhao et al., 2021).
In this section, we investigate whether effective prompts found by prompt tuning implicitly adjust for this bias (calibration). To measure the bias, we follow Holtzman et al. (2021) in using a task-specific domain string d as the test input and computing the entropy of the labels. Table 3 lists the task-specific domain string d for each dataset. As the task-specific domain strings do not imply any label information (i.e., they are label-neutral), we expect the output of the language model to be uniform over the label words when conditioned only on the domain string. The entropy of the label words is computed as follows:

$$H = -\sum_{y \in Y} \hat{p}(y) \log \hat{p}(y), \quad \text{where } \hat{p}(y) = \frac{p_\theta(v(y) \mid \tilde{e}, d, t)}{\sum_{y' \in Y} p_\theta(v(y') \mid \tilde{e}, d, t)}.$$

Higher entropy of the label word prediction implies a more balanced (calibrated) label word distribution. When the label word probabilities are uniform, the entropy reaches its maximum at log(|Y|).

Table 2: Prompt effectiveness and perplexity of AutoPrompt_SGD and FLUENTPROMPT. Each method derives 40 prompts of length 10. AutoPrompt_SGD does not use the progressive noise z or the perplexity constraint (λ_fluency = 0). FLUENTPROMPT applies the perplexity constraint with λ_fluency = 0.003, 0.003, and 0.01 for SST-2, Amazon, and AGNEWS, respectively. The prompts found by FLUENTPROMPT are overall more effective and have significantly lower perplexity, indicating better readability.

Figure 2: Frequency of prompts (y-axis) at different entropy levels (x-axis). We compare effective prompts with the empty and human-written prompts.
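A sketch of the entropy measurement described above, reusing `model` and `tokenizer` from the earlier sketches. The domain string and label words shown are illustrative placeholders; Table 3 and the verbalizers in Table 7 give the actual values.

```python
import torch

def label_word_entropy(model, tokenizer, prompt, domain_string, label_words):
    """Entropy of the label-word distribution given prompt + domain string d."""
    ids = tokenizer(prompt + domain_string, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]                 # next-token logits
    label_ids = [tokenizer(" " + w).input_ids[0] for w in label_words]
    p = torch.softmax(logits[label_ids], dim=-1)      # renormalize over Y
    return -(p * p.log()).sum()                        # maximum is log |Y|

# e.g., illustrative call for a sentiment task:
# label_word_entropy(model, tokenizer, prompt, "The movie review is:", ["bad", "good"])
```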
As listed in Table 1, some effective prompts found by FLUENTPROMPT for sentiment analysis contain negative sentiment words (e.g., "disappointing" and "complained" in prompts for SST-2), which may implicitly reduce the probability of positive labels and calibrate the label word distribution. To validate this hypothesis, we select a set of effective prompts found by FLUENTPROMPT and compute the entropy of the label predictions conditioned on the concatenation of the prompt and the task-specific domain string. Figure 2 shows the density plot comparing the label word entropy of effective prompts with that of the empty prompt and human-written prompts taken from Bach et al. (2022). We observe that the entropy of effective prompts has a higher mode than the entropy of the empty and human-written prompts, which have lower accuracy.
To further explore the relation between task performance and calibration, we compute the correlation between task accuracy and the label word entropy of all prompts obtained by FLUENTPROMPT, and report Spearman's rank correlation. From Figure 3, we observe that the label word entropy exhibits significant positive correlations with task accuracy (all p < 0.0001). The Spearman's coefficients are +0.61, +0.75, and +0.43 for SST-2, Amazon, and AGNEWS, respectively.
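The correlation itself can be reproduced with SciPy, given the per-prompt accuracies and entropies collected as above (a sketch; the variable contents are assumed):

```python
from scipy.stats import spearmanr

# entropies[i] and accuracies[i] belong to the i-th FLUENTPROMPT prompt
rho, p_value = spearmanr(entropies, accuracies)
print(f"Spearman rho = {rho:+.2f} (p = {p_value:.2e})")
```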

4.2 Effective prompts are topically related to the task domain
Qualitative Analysis As shown in Table 1, most of the effective prompts obtained by FLUENTPROMPT contain domain-related words. For example, the prompt Kubrick, "The Shining for SST-2 contains a movie director's name and a movie title, both relevant to the domain of movie reviews.
Similarly, the prompts mascara\n\n and cigars\n\n found for the Amazon dataset contain product names relevant to the domain of product reviews. Additionally, AGNEWS is a news topic classification task. Some of the effective prompts for AGNEWS contain topic classification-related words such as "genre", while others contain URLs that link to websites such as netflix and pmwiki. The target pages of these URLs also contain topic classification-related information; for example, the prompt pmwiki/pmwiki.php/Main/Superhero links to a wiki page containing the following information: "Genre: Action Adventure Comedy Commercials".
Quantitative Analysis Based on our qualitative analysis, we hypothesize that effective prompts are topically related to the task domain. To validate this hypothesis, we compare the domain word frequency of effective prompts with that of random sentences. First, we select a set of domain words for each task (see Table 4), which consist of the task label words (e.g., "positive" and "negative" for SST-2) and common words in the task domain (e.g., "movie" and "film" for the movie domain of SST-2). Since our prompts are very short (10 tokens), we augment each prompt with continuations generated by GPT-3 (Brown et al., 2020), based on the assumption that the continuations of the large LM follow the same domain as the prompt. For each prompt, we sample 5 distinct continuations from GPT-3 using nucleus sampling (p = 0.95) at a length of 100 tokens. We compare the top 10 effective prompts with 10 random sentences from the PILE (Gao et al., 2020) augmented by the same continuation procedure, counting the domain words in the concatenation of each prompt and its continuations. Table 5 lists the average accuracy and the number of domain words for the effective prompts and the random sentences with their continuations. The accuracy of effective prompts is higher than that of random sentences on all three datasets. Moreover, the domain word frequency of effective prompts is significantly higher than that of random sentences, with p-values of 0.004, 0.003, and 0.0002 for SST-2, Amazon, and AGNEWS, respectively. Both our qualitative and quantitative analyses provide strong evidence that the effective prompts obtained by our prompt tuning are topically related to the task's domain.

Table 5: Average domain word frequency (Freq.) and average accuracy (Acc.) for effective prompts and random sentences. Effective prompts and their continuations contain substantially more domain words than random sentences and their continuations. The p-values from the paired t-test for SST-2, Amazon, and AGNEWS are 0.004, 0.003, and 0.0002, respectively.
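A sketch of the counting procedure described above, assuming the GPT-3 continuations have already been sampled; the whitespace tokenization and the example domain words are simplifications (Table 4 lists the real sets).

```python
def domain_word_freq(prompt, continuations, domain_words):
    """Count domain-word occurrences in a prompt plus its continuations."""
    text = " ".join([prompt] + continuations).lower()
    tokens = text.split()
    return sum(tokens.count(w) for w in domain_words)

# e.g., for SST-2 (illustrative domain words):
# domain_word_freq(p, conts, ["positive", "negative", "movie", "film"])
```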

5 UNSUPERVISED FLUENTPROMPT
Our analysis in Section 4 shows that effective prompts exhibit calibration and have high domain relevance to the task. Since these two features are both highly indicative and do not require ground-truth labels to compute, we propose UNSUPERVISED FLUENTPROMPT, a method for automatically identifying effective prompts without labeled data. The key idea is to optimize the prompts for improved calibration and domain relevance. In the following sections, we detail the methodology of UNSUPERVISED FLUENTPROMPT (§5.1), describe the experimental setup (§5.2), and present the results (§5.3).

5.1 Method
Calibration loss In Section 4.1, we found a strong positive correlation between the degree of calibration of a prompt and its performance. We therefore explicitly optimize the prompt towards greater calibration (i.e., maximizing the entropy of the label words). Ideally, we would need a large set of label-neutral domain strings to prevent the model from learning noise in this procedure. Since such domain strings are not always easy to obtain, we use the training inputs X of the task (without ground-truth labels), expecting their aggregation to be label-neutral. We therefore define a calibration loss based on the entropy of the label word distribution averaged over the training inputs:

$$E_{\text{calibration}}(\tilde{e}) = \sum_{y \in Y} \bar{p}(y) \log \bar{p}(y), \quad \text{where } \bar{p}(y) = \frac{1}{|X|} \sum_{x \in X} \hat{p}(y \mid \tilde{e}, x, t).$$

Intuitively, the calibration loss encourages the prompt to help the model generate more balanced predictions at the dataset (macro) level rather than the instance (micro) level.
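A sketch of the macro-level calibration loss, reusing `model`, `emb`, and `soft_prompt` from the earlier sketches; averaging the distributions before taking the entropy is what makes the objective dataset-level.

```python
import torch

def calibration_loss(batch_input_ids, label_ids):
    """E_calibration = -H(mean label-word distribution over unlabeled inputs)."""
    probs = []
    for ids in batch_input_ids:                   # training inputs x, no labels
        x = torch.cat([soft_prompt.unsqueeze(0), emb(ids.unsqueeze(0))], dim=1)
        logits = model(inputs_embeds=x).logits[0, -1]
        probs.append(torch.softmax(logits[label_ids], dim=-1))
    p_mean = torch.stack(probs).mean(dim=0)       # macro-level average
    return (p_mean * p_mean.log()).sum()          # minimizing maximizes entropy
```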

Domain relevance loss
In Section 4.2, we found that effective prompts are overall more related to the task domain. To explicitly make the prompt relevant to the domain, we extend the fluency (perplexity) loss from Section 3.2.2 to model the perplexity of both the prepended prompt and the input example:

$$E_{\text{domain}}(\tilde{e}) = -\log p_\theta(\tilde{e}_{0:M}) - \log p_\theta(x \mid \tilde{e}_{0:M}).$$

Intuitively, log p_θ(x | ẽ) − log p_θ(x) measures the pointwise mutual information between the task data x and the tuned prompt ẽ, with the term log p_θ(x) not involved in the prompt optimization. Overall, our unsupervised energy function E is updated to:

$$E(\tilde{e}) = \lambda_{\text{calibration}} E_{\text{calibration}}(\tilde{e}) + \lambda_{\text{domain}} E_{\text{domain}}(\tilde{e}),$$

where λ_calibration + λ_domain = 1.
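A sketch of the domain-relevance term and the combined unsupervised energy. As in the fluency sketch, we score projected token ids for simplicity rather than soft embeddings, and the λ weighting shown is illustrative.

```python
import torch

def domain_loss(prompt_ids, input_ids):
    """E_domain = -log p(e~_{0:M}) - log p(x | e~_{0:M})."""
    ids = torch.cat([prompt_ids, input_ids])
    logits = model(ids.unsqueeze(0)).logits[0]
    logp = torch.log_softmax(logits[:-1], dim=-1)
    return -logp[torch.arange(ids.numel() - 1), ids[1:]].sum()

# unsupervised energy with lambda_calibration + lambda_domain = 1
lam_cal = 0.5                                     # illustrative weighting
def unsupervised_energy(prompt_ids, batch_input_ids, label_ids, input_ids):
    return (lam_cal * calibration_loss(batch_input_ids, label_ids)
            + (1 - lam_cal) * domain_loss(prompt_ids, input_ids))
```

Note that neither term consults ground-truth labels, so the whole energy can be optimized with the same projected Langevin procedure using only unlabeled task inputs.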

5.3 Results
In Table 6, we compare the performance of our proposed method, UNSUPERVISED FLUENTPROMPT, with the empty prompt and the strong unsupervised PMI calibration baseline PMI_DC (Holtzman et al., 2021) on three datasets. Our results show that UNSUPERVISED FLUENTPROMPT consistently outperforms PMI_DC, with an average improvement of 7.0% across the datasets. This demonstrates that the incorporated calibration and domain information are helpful for finding effective prompts.

6 Conclusion
In this paper, we investigate the factors that contribute to the effectiveness of prompts. To facilitate this study, we develop a human-readable prompt tuning method, FLUENTPROMPT, and apply it to the GPT-2 large model to generate effective and readable prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of label words.
Although the prompts generated by FLUENTPROMPT are effective and readable, they still carry limited semantic meaning. For instance, we did not find any prompts directly indicating the task definition or instructions. One potential reason is that the GPT-2 large model is not instruction-tuned. Future work can apply FLUENTPROMPT to an instruction-tuned model to see whether instruction-like prompts can be discovered.

Limitations
The FLUENTPROMPT approach is a versatile method for optimizing human-readable prompts in both classification and generation tasks. However, our investigations into calibration are specific to classification tasks; it would be intriguing to explore the characteristics of effective prompts for generation tasks in future studies. It is also worth noting that FLUENTPROMPT employs Langevin dynamics to incorporate perplexity constraints during training, making it directly applicable to autoregressive models but not to masked language models or encoder-decoder models.
Due to resource limitations, we applied FLUENTPROMPT to GPT-2 large; our GPUs were not sufficiently efficient to handle larger models within our budget. It is important to note that our focus was not solely on performance, but rather on analyzing the properties of effective prompts. In the future, it would be valuable to extend our method to language models of different sizes and explore alternative constrained sampling techniques to identify fluent and effective prompts for various types of language models.

Table 3: Tasks and their task-specific domain strings. The task-specific domain strings do not imply any label information.

Table 4: Domain words for each task.

Table 7: The template, example (colored black), and verbalizer used for each dataset.