RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

Prompting has shown impressive success in enabling large pre-trained language models (LMs) to perform diverse NLP tasks, especially with only a few downstream examples. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning *soft* prompts (e.g., embeddings), which fall short of interpretability, reusability across LMs, and applicability when gradients are not accessible. *Discrete* prompts, on the other hand, are difficult to optimize, and are often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the optimized discrete prompt after training with reward. To harness the complex and stochastic reward signals from the large LM environment, we incorporate effective reward stabilization that substantially enhances training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing fine-tuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferrable between different LMs to retain significant performance, indicating that LM prompting may not follow human language patterns.


Introduction
Prompting has emerged as a promising approach to solving a wide range of NLP problems using large pre-trained language models (LMs), including left-to-right models such as GPTs (Radford et al., 2019; Brown et al., 2020) and masked LMs such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), etc. Compared to conventional fine-tuning that expensively updates the massive LM parameters for each downstream task, prompting concatenates the inputs with an additional piece of text that steers the LM to produce the desired outputs. A key question with prompting is how to find the optimal prompts to improve the LM's performance on various tasks, often with only a few training examples.
One of the most popular schemes is to tune soft prompts (i.e., continuous embedding vectors), as they are amenable to gradient descent (Li and Liang, 2021; Vu et al., 2021; Gu et al., 2021; Liu et al., 2021d; Mokady et al., 2021; Qian et al., 2022; An et al., 2022, etc.). However, the resulting prompts are, by their nature, hard for humans to understand (Khashabi et al., 2021; Lester et al., 2021; Hambardzumyan et al., 2021) and incompatible for use with other LMs. Besides, the required LM internal gradients are often expensive to compute, or simply unavailable for LMs deployed with only inference APIs (e.g., GPT-3). It is thus often desirable to use discrete prompts, which consist of concrete tokens from a vocabulary. However, their discrete nature renders the optimization very difficult. Previous work has typically relied on manual engineering (Petroni et al., 2019; Brown et al., 2020; Schick and Schütze, 2021a; Tam et al., 2021), or on selecting from multiple paraphrased/generated prompts (Jiang et al., 2020; Gao et al., 2021; Liu et al., 2021b; Prasad et al., 2022; Hao et al., 2022). AutoPrompt (Shin et al., 2020) uses gradient information to edit the prompt tokens, but suffers from training instability as well as the same applicability issue as gradient-based soft prompting, showing limited effectiveness in practice.
This paper presents RLPROMPT, a new discrete prompt optimization approach based on reinforcement learning (RL). This approach brings together a wide range of desirable properties for efficient use on diverse tasks and LMs (Table 1). Crucially, rather than directly editing the discrete tokens, which has been difficult and inefficient, RLPROMPT trains a policy network that generates the desired prompts. Discrete prompt optimization thus amounts to learning a small number of policy parameters, which we set as an MLP layer inserted into a frozen compact model such as distilGPT-2 (HuggingFace, 2019). This formulation also allows us to employ off-the-shelf RL algorithms (e.g., Guo et al., 2021) that learn the policy with arbitrary reward functions, defined either with available data (e.g., in few-shot classification) or with other weak signals when no supervised data is accessible (e.g., in controllable text generation).

Table 1: Comparison of different (prompting) paradigms for using pre-trained LMs on downstream tasks, in terms of several desirable properties. Gradient-Free methods do not require gradient information from the prompted LMs, which may be inaccessible or expensive to compute. Guided Optimize means the optimization/search is guided by gradient or reward signals, which tends to be more efficient than otherwise (e.g., enumeration). Prompts of discrete tokens (as opposed to embeddings) are often transferrable/reusable by different LMs. Our approach with RL can optimize prompts using rewards without supervised data (zero-shot). Discrete Prompt Enumeration selects the best prompt from a large number of candidates (e.g., from paraphrasing or generation; Jiang et al., 2020; Gao et al., 2021; Liu et al., 2021b; Prasad et al., 2022). AutoPrompt (Shin et al., 2020) uses gradients to edit the discrete prompt tokens. See §4 and Appendix §C for more discussion.
On the other hand, RL for prompt optimization poses new challenges to learning efficiency: the large black-box LM presents a highly complex environment that, given the prompt (i.e., actions), goes through a long series of complex transitions (e.g., reading the input and inferring the output) before computing the rewards. This makes the reward signals extremely unstable and hard to learn from. To overcome this difficulty, we propose two simple yet surprisingly effective ways to stabilize the rewards and improve the optimization efficiency.
Experiments on few-shot classification and unsupervised text style transfer show our approach improves over a wide range of fine-tuning and prompting methods (e.g., those described in Table 1), and is robust to different modeling choices (e.g., verbalizers in classification). The resulting discrete prompts also facilitate rich interpretations and analyses for new insights into LM prompting.
In particular, the optimized prompts, though inducing strong task performance, tend to be gibberish text without clear human-understandable meaning, echoing recent research (Webson and Pavlick, 2021; Zhao et al., 2021; Prasad et al., 2022) that LMs making use of prompts do not necessarily follow human language patterns. Perhaps surprisingly, those gibberish prompts learned with one LM can be applied to other LMs while retaining significant performance, indicating that those different pre-trained LMs have grasped shared structures for prompting.

Discrete Prompt Optimization with RL
We present RLPROMPT, a framework for learning prompts of discrete tokens for pre-trained LMs to succeed in a wide range of NLP tasks.
As discussed in §1, discrete prompts can be easier to interpret and use than continuous prompts, but are also more challenging to learn due to the intractable optimization over discrete tokens. To solve this difficulty, we formulate discrete prompt optimization as an RL problem, using a continuous policy network to explore the prompt space. The network is highly parameter-efficient, training only a small MLP over a frozen compact LM (e.g., distilGPT-2).
Below, we present our RL formulation of discrete prompt optimization (§2.1-2.2). We then discuss the design of our policy network (§2.3). Finally, we describe our reward engineering techniques to improve RL training (§2.4).

Discrete Prompt Optimization Problem
Extensive recent work (Brown et al., 2020; Jiang et al., 2020; Khashabi et al., 2021; Gao et al., 2021) has shown it is possible to combine a discrete text prompt z with input x to directly perform various NLP tasks using a pre-trained LM's generative distribution P_LM(y|z, x), without needing to fine-tune the model. For instance, in classification, the LM can be a masked language model (MLM) such as BERT (Devlin et al., 2019), and y is the class-label token (a.k.a. verbalizer, like positive or negative) in the mask position; in a generation task, the LM can be a left-to-right model such as GPT-2 (Radford et al., 2019), and y is the generated text. See Figure 1 for illustrative examples. We use y_LM(z, x) to denote the LM output on x prompted by z.
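To make the classification case concrete, verbalizer-based prediction reduces to comparing the LM's token probabilities at the mask position. The sketch below assumes the distribution over tokens has already been computed by the MLM; the numbers and verbalizer choices are hypothetical:

```python
def classify_with_prompt(mask_token_probs, verbalizers):
    """Pick the class whose verbalizer token is most probable at [MASK].

    mask_token_probs: dict mapping vocabulary tokens to probabilities
    verbalizers: dict mapping class labels to their verbalizer tokens
    """
    return max(verbalizers,
               key=lambda label: mask_token_probs.get(verbalizers[label], 0.0))

# Toy distribution an MLM might assign at the mask position for
# "food is delicious. <prompt> [MASK]"
probs = {"great": 0.41, "terrible": 0.07, "okay": 0.12}
verbalizers = {"positive": "great", "negative": "terrible"}
label = classify_with_prompt(probs, verbalizers)  # "positive"
```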
Our goal is to find the optimal discrete prompt z* from vocabulary V to maximize some downstream performance measure R of y_LM(z*, x).² The metric R(y) can be as simple as the match with the gold label y* (e.g., in classification when data is available), but can also be more complex, like the success criteria of controllable text generation, which composes aspects such as style accuracy, language quality, and content preservation. Assuming the prompts have fixed length T, we write the task of discrete prompt optimization in the general form below:

    max_{z ∈ V^T} R(y_LM(z, x))    (1)

The optimization above, however, can be intractable because z's discrete tokens are not amenable to gradient-based optimization, while brute-force search has the exponential complexity of O(|V|^T). Previous work has to either approximate gradients over z using continuous LM embeddings (Shin et al., 2020) or tweak human-written prompts with heuristics (Jiang et al., 2020; Mishra et al., 2021a; Prasad et al., 2022).
² Technically, V can be any set of tokens. Here we simply use the downstream LM's vocabulary.

The Reinforcement Learning Formulation
To overcome this difficulty, we formulate discrete text prompt optimization as an RL problem, in which an agent selects prompt tokens [z_1, ..., z_T] one by one to maximize the reward R(y_LM(z, x)). At time step t, the agent receives previous prompt tokens z_<t and generates the next prompt token z_t according to a policy π(z_t|z_<t). After the agent finishes the entire prompt ẑ, it receives the task reward R(y_LM(ẑ, x)). Parameterizing the policy with θ, we can rewrite the problem above as

    max_θ E_{ẑ ∼ π_θ} [ R(y_LM(ẑ, x)) ]    (2)

Compared to typical (soft) prompt tuning approaches, the RL formulation above has the key advantage of not needing gradient access to the LM, treating it instead as a black-box function. This enables us to optimize prompts for LMs whose gradients are too expensive to compute, or LMs that are solely available as inference APIs (e.g., GPT-3). Compared to previous discrete prompt enumeration/paraphrasing, the RL approach explores the prompt space more efficiently, guided by the reward signals. The policy network also brings added flexibility, e.g., it can take in other information such as the input x, leading to input-specific prompts (as used in text style transfer in §2.4).
During training, we explore the prompt space by sampling from the policy network. After the policy is trained, we select tokens greedily during inference to produce a deterministic prompt. The reward objective in Eq. (2) can be optimized with any off-the-shelf RL algorithm. We use the recent soft Q-learning (SQL; Guo et al., 2021), which has shown strong learning efficiency and performance on various text generation problems, with an open-source implementation.³ Specifically, we use only its on-policy component. We refer interested readers to Guo et al. (2021) for more details.
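To illustrate the sample-then-reward loop, here is a toy, self-contained REINFORCE-style sketch (a hedged stand-in; the paper itself uses soft Q-learning, and the real policy is the autoregressive network of §2.3 rather than the factorized per-position distribution used here). The black-box LM environment is replaced by a reward that secretly prefers one prompt:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, T = 5, 2          # toy vocabulary size and prompt length
TARGET = (3, 1)          # prompt the stand-in "LM environment" rewards

def reward(prompt):
    # Stand-in for R(y_LM(z, x)): partial credit per matching position
    return sum(a == b for a, b in zip(prompt, TARGET)) / T

# Factorized toy policy: independent logits per prompt position
logits = np.zeros((T, VOCAB))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for step in range(500):
    probs = softmax(logits)
    # Sample a batch of prompts from the current policy
    batch = [tuple(int(rng.choice(VOCAB, p=probs[t])) for t in range(T))
             for _ in range(16)]
    rewards = np.array([reward(z) for z in batch])
    advantage = rewards - rewards.mean()          # simple batch baseline
    grad = np.zeros_like(logits)
    for z, adv in zip(batch, advantage):
        for t, tok in enumerate(z):
            grad[t] -= adv * probs[t]             # d log pi / d logits
            grad[t, tok] += adv
    logits += 1.0 * grad / len(batch)             # gradient ascent on reward

best = tuple(int(np.argmax(logits[t])) for t in range(T))  # greedy decode
```

After training, greedy decoding recovers the rewarded prompt, mirroring how RLPROMPT selects tokens greedily at inference time.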

Efficient Parameterization of Policy
We present an efficient parameterization of the policy network π_θ, which adapts a frozen pre-trained LM (i.e., the policy LM) with a simple MLP layer that contains all the parameters θ to be trained. The policy LM need not be the same as the LM we optimize the prompt for (i.e., the task LM). Figure 1 (left) illustrates the policy LM architecture. Specifically, we use the LM to extract contextual embeddings of the partial prompt ẑ_<t, apply the added task-specific MLP layer to compute the adapted embeddings, and pass the output into the model's original LM head to obtain the next prompt token probabilities. We describe more implementation details in Appendix §A.1. During training, we compute the MLP gradients by back-propagating through the policy LM. Our experiments (§3) show that changing only the small set of MLP parameters is sufficient for producing strong performance. After training, we discard the MLP and simply use the learned discrete text prompt for inference.
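The forward pass of this parameterization can be sketched in a few lines. The sketch below uses random arrays as stand-ins for the frozen LM's contextual embedding and LM head (a toy vocabulary replaces distilGPT-2's ~50K tokens); only the one-hidden-layer MLP would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, MLP_HIDDEN, VOCAB = 768, 2048, 1000   # toy vocab; hidden sizes as in §A.1

# Frozen stand-ins: contextual embedding of the partial prompt, and the LM head
h = rng.standard_normal(HIDDEN)
W_head = rng.standard_normal((VOCAB, HIDDEN)) * 0.02

# The only trainable parameters: a one-hidden-layer task-specific MLP
W1 = rng.standard_normal((MLP_HIDDEN, HIDDEN)) * 0.02
b1 = np.zeros(MLP_HIDDEN)
W2 = rng.standard_normal((HIDDEN, MLP_HIDDEN)) * 0.02
b2 = np.zeros(HIDDEN)

def adapted_next_token_logits(h):
    # Adapt the frozen embedding, then reuse the frozen LM head
    adapted = W2 @ np.maximum(W1 @ h + b1, 0.0) + b2   # ReLU MLP
    return W_head @ adapted

logits = adapted_next_token_logits(h)   # next prompt-token scores
```

The design keeps the policy LM and its head frozen, so the trainable footprint is just the two MLP weight matrices and their biases.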

Reward Engineering and Stabilization
Proper design of reward functions, a.k.a. reward engineering, is crucial to training efficiency and success in RL (Sutton and Barto, 2018). Discrete prompt optimization, in particular, poses new challenges due to its highly complex reward functions, which involve multiple steps (e.g., combining with the input, passing through a black-box LM, and inferring the outputs), each introducing its own variations. This makes the reward signal unstable and makes it difficult to assess progress towards the task goal. To solve these difficulties, we propose two simple reward engineering techniques that effectively encourage and stabilize the RL training.
Input-Specific z-Score Reward Different inputs can have different levels of difficulty for reasoning and prediction. Prompted LMs can thus see different reward scales for different inputs. In text style transfer (§3.2), for instance, some sentences may only require changing a few words to alter the style, so the LM naturally achieves higher rewards on them than on other, more complex sentences. Naively optimizing for all inputs with the same reward scale, therefore, can lead to training bias and instability. To mitigate this problem, we propose to use the input-specific z-score, which normalizes the rewards by input-specific means and standard deviations. This can be seen as a case of adaptive reward normalization, a commonly-used technique in RL (van Hasselt et al., 2016). Formally, during prompt optimization, we sample a batch of prompts Z(x) for each input x, and compute the reward R(y_LM(z, x)) for each prompt z ∈ Z(x). After that, we compute the reward z-scores across the prompts in Z(x). Using the shorthand R_x(z) for R(y_LM(z, x)), namely the reward prompt z receives for input x, we write the transformation as below:

    R̄_x(z) = ( R_x(z) − mean_{z′ ∈ Z(x)} R_x(z′) ) / std_{z′ ∈ Z(x)} R_x(z′)    (3)

To distinguish the z-scores of different inputs in the same batch, we condition our policy network on the inputs, i.e., π_θ(z|x).
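The normalization itself is a one-liner per input; a minimal sketch (the epsilon guard against zero variance is an added implementation detail, not from the paper):

```python
import numpy as np

def zscore_rewards(rewards, eps=1e-8):
    """Normalize rewards for one input x across its sampled prompts Z(x)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# An "easy" and a "hard" input with very different raw reward scales
# map to comparable, zero-centered learning signals.
easy = zscore_rewards([0.90, 0.80, 0.95, 0.85])
hard = zscore_rewards([0.10, 0.30, 0.20, 0.15])
```

After normalization, the policy update for each input is driven by which prompts are relatively better for that input, not by the input's absolute difficulty.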
Piecewise Reward If a reward function is misspecified or vulnerable, the policy may maximize it without moving towards the desired goal. For example, while learning classification using the ground-truth probability as the reward function, the policy may find adversarial prompts (Wallace et al., 2019; Xu et al., 2022) that lead to very high probabilities for a single class given arbitrary inputs. To overcome this issue, we propose to design piecewise reward functions (Yu et al., 2020; Rengarajan et al., 2022) with both smooth and disjoint components to better express the task priorities and improve robustness. Typically, we can include a dense, quantitative signal (e.g., label probability) to measure fine-grained progress towards the goal, and a sparse, qualitative signal that applies a large sudden increase in the reward only when certain states are achieved (e.g., a certain accuracy on each class). We illustrate an example design of piecewise reward in text classification (§3.1).

Experiments
The proposed RLPROMPT is generally applicable to various types of LMs for performing different NLP tasks using diverse prompt formats (Figure 1). We evaluate our approach on both classification (in the few-shot setting, §3.1) and generation (unsupervised text style transfer, §3.2), and perform rich analyses for new insights on LM prompting (§3.3). We will release all code and data upon acceptance.

Few-Shot Text Classification
Learning text classification with few labeled examples has been a problem of interest in many applications (Xu et al., 2018; Yu et al., 2018). We adopt the typical prompting setting (Brown et al., 2020; Schick and Schütze, 2021b), which solves classification by token infilling for an MLM like BERT, or next-token prediction for a left-to-right LM like GPT-2. Classification, therefore, amounts to selecting tokens that correspond to a set of predetermined class labels, a.k.a. verbalizers (e.g., "great" for positive sentiment and "terrible" for negative sentiment). For instance, to classify the sentiment of an input sentence "food is delicious" using an MLM, we first fill our prompt and the input into a template "[Input] [Prompt] [MASK]", and then select the verbalizer token with the highest probability of filling the [MASK] position.
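Instantiating the template is simple string assembly; a minimal sketch (the example prompt "It was" is hypothetical, not a prompt from the paper):

```python
def fill_template(input_text, prompt, mask_token="[MASK]"):
    """Instantiate the '[Input] [Prompt] [MASK]' template from this section."""
    return f"{input_text} {prompt} {mask_token}"

filled = fill_template("food is delicious", "It was")
# "food is delicious It was [MASK]"
```

The filled string is what gets fed to the MLM, which then scores verbalizer tokens at the mask position.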

Reward Function
The text classification task aims to correctly assign input text x to its ground-truth label c from a set of classes C. To mitigate the adversarial cases discussed in §2.4, we design a piecewise reward function that encourages prompts to classify each example correctly. Given prompt z and training example (x, c), we compute the reward, similarly to hinge loss, as the gap between the label probability and the highest probability from the other classes. Using the shorthand P_z(c) := P_LM(c|z, x) to denote the probability of label c, we can write the gap as Gap_z(c) := P_z(c) − max_{c′ ≠ c} P_z(c′). The gap value is positive when the prediction is correct, and negative otherwise. For a correct prediction, we multiply the positive reward by a large number to signal its desirability. The resulting reward function is as below:

    R(z; x, c) = λ2 · Gap_z(c) if Gap_z(c) > 0, and λ1 · Gap_z(c) otherwise

where λ1 < λ2 are balancing weights. We describe more details and present ablations on reward design in Appendix §A.2.
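One concrete reading of this gap-based piecewise design, using the weights λ1 = 180 and λ2 = 200 reported in Appendix §A.2 (a hedged sketch of the described behavior, not the paper's verbatim equation):

```python
def gap(probs, c):
    """Gap_z(c): label probability minus the best competing class probability."""
    return probs[c] - max(p for c2, p in probs.items() if c2 != c)

def piecewise_reward(probs, c, lam1=180.0, lam2=200.0):
    """Hinge-style gap, scaled up once the prediction flips to correct."""
    g = gap(probs, c)
    return lam2 * g if g > 0 else lam1 * g

correct = piecewise_reward({"positive": 0.7, "negative": 0.3}, "positive")  # ~ 80
wrong   = piecewise_reward({"positive": 0.2, "negative": 0.8}, "positive")  # ~ -108
```

The discontinuity at Gap_z(c) = 0 gives a sudden reward jump for crossing into correct predictions, which is the sparse qualitative signal described in §2.4.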
Baselines We compare our approach with representative methods in the diverse training and prompting paradigms shown in Table 1. Additionally, we compare with the recent Black-Box (BB) Tuning (Sun et al., 2022), which mixes discrete and soft prompts and tunes the soft part. We describe more details in Appendix §A.2.
Experiment Setup We use RoBERTa-large (Liu et al., 2019) as our backbone model. For our approach, we experiment with prompt lengths T ∈ {2, 5}, and insert the prompt tokens at the same positions as our manual prompts (Schick and Schütze, 2021a; Tam et al., 2021).⁴ Please see Appendix §A.2 for more training details.

Results
We present our few-shot classification results in Table 2. Our method (5 tokens) outperforms Manual Prompt and Instructions on all datasets, as well as In-Context Demonstration and Fine-Tuning on all but 1 and 2 datasets, respectively. Compared to Prompt Tuning, our method achieves higher average accuracy with lower standard deviations, showing our approach is less sensitive to various training factors, a common issue for few-shot prompt tuning (Li and Liang, 2021; Gu et al., 2021). Our approach substantially outperforms BB Tuning with soft prompts, and is slightly better even after BB Tuning uses mixed discrete/soft prompts with 50 soft tokens. Compared to previous discrete prompt optimization methods such as GrIPS (Prasad et al., 2022) and AutoPrompt (Shin et al., 2020), our method reaches superior accuracy on all benchmarks. On the additional datasets, which tend to be multi-way (e.g., 16-class), Fine-Tuning shows higher performance, but our method maintains its lead over prompting baselines, as we describe in more detail in Appendix §A.2.
Additional results can be found in Table 8.
We also compare training efficiency across training steps with BB Tuning, which is also a gradient-free method but optimizes soft prompts.
As Figure 2 shows, our RL-based method is as efficient as soft prompt tuning without access to LM gradients, converging in a similar number of steps to BB Tuning but with superior performance. Our training is also relatively stable: even the worst prompts encountered after convergence perform comparably to BB Tuning on average.

Unsupervised Text Style Transfer
Text style transfer (TST) (Jin et al., 2022) is a challenging problem whose goal is to rewrite an input sentence into a desired style, usually without supervised training data. For instance, in a sentiment transfer task, given a negative sentence "The food is disgusting", the model should generate a positive sentence "The food is delicious", without training on such paired data. Even without supervised data, our method can learn prompts with weak reward signals, which is not possible for most previous prompt optimization methods. Compared to previous TST work that trained models from scratch (Hu et al., 2017; Shen et al., 2017, etc.) or fine-tuned pre-trained LMs (Krishna et al., 2020; Liu et al., 2021e; Hu and Li, 2021), our method presents a more efficient solution that learns discrete prompts for an LM without updating its massive parameters.
Reward Function Given input sentence x, the goal of TST is to generate an output y that preserves the information in x while showing style attribute s. Following these priorities, we define the task reward as a simple sum of content preservation and target-style intensity, described formally below:

    R(y; x, s) = Content(y, x) + Style(y, s)

We implement the reward using common model-based metrics, described in more detail in Appendix §A.3. Because the reward shows different scales across inputs, we normalize the rewards using the input-specific z-score as discussed in §2.4, and present ablation studies on reward design along with our results.
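The additive reward makes candidate ranking trivial once the two component scores are available; a minimal sketch with hypothetical scores (in practice both components come from the model-based metrics of Appendix §A.3):

```python
def tst_reward(content_score, style_prob):
    # Simple sum of content preservation and target-style intensity
    return content_score + style_prob

# Hypothetical (content, style) scores for candidate rewrites of
# the negative input "The food is disgusting"
candidates = {
    "The food is delicious": (0.85, 0.97),  # keeps content, flips style
    "The food is fine":      (0.80, 0.55),  # weak style transfer
    "Great service":         (0.20, 0.95),  # right style, content lost
}
best = max(candidates, key=lambda y: tst_reward(*candidates[y]))
```

Because the sum treats both aspects as equally important, an output that sacrifices either content or style scores poorly overall.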
Datasets Due to space restrictions, in the main paper we evaluate on the popular Yelp sentiment transfer dataset (Shen et al., 2017). To further demonstrate our approach in the few-shot setting, we include experiments on Shakespeare authorship transfer (Xu et al., 2012) in Appendix §A.3.
Baselines We evaluate our method against both training and prompting baselines. We compare with two strong training methods, Style Transformer (Dai et al., 2019) and DiRR (Liu et al., 2021e). In particular, DiRR fine-tunes GPT-2 (Radford et al., 2019) with RL signals, which can be seen as a full-model tuning analogue to our method. For the prompting baselines, we compare with (1) Null Prompt, which does not use any prompt tokens, and (2) manually written prompts. For evaluation, we report the joint score J(·), which requires each sentence to preserve input content, have the correct style, and be fluent. We also report the geometric mean (GM) of the three overall aspect scores. We conduct human evaluation for Yelp by rating 100 outputs from each model with 5 annotators. We describe more evaluation metrics and results in Appendix §A.3.

Results
We present the automatic evaluation results for Yelp in Table 3. Compared to the expensive training baselines (Style Transformer and DiRR), our method with GPT-2-xl shows slightly lower content preservation and style accuracy, but has markedly better fluency, which leads to a higher or competitive overall joint score J(·) and geometric mean GM(·). This may be because our method better preserves the LM's fluent generation capability by freezing its parameters. Relative to the prompting baselines, our optimization strongly improves over the default performance. In particular, our trained prompts perform better on average, with lower variance, than manual prompts, whose performance varies wildly across prompts with similar meanings. We present all manual and learned prompts along with their performance in Table 15 in the appendix. Within our own method, performance increases monotonically from the smallest distilGPT-2 to the largest GPT-2-xl. Human evaluation results (Table 4) show similar conclusions: our method is competitive with the costly training method DiRR, obtaining slightly lower content and style scores but higher fluency. On Shakespeare, our method shows similar performance patterns even under the few-shot setting, which we discuss in more detail in Appendix §A.3.

Analysis
Fluent vs. Gibberish Prompts We study the interaction of prompt fluency with downstream task performance, because fluent prompts are valuable for interpretability and for insights into useful task instructions for LMs. Our results show that good optimized prompts for the downstream task are often incoherent gibberish. For instance, one learned prompt for sentiment transfer is "Parameters Comparison )=( Compare either". The observation suggests that pre-trained LMs make use of prompts differently from humans, in line with previous discoveries in prompt-based fine-tuning (Webson and Pavlick, 2021). To understand how prompt fluency could impact model performance, we evaluate on text style transfer (§3.2). Specifically, we optimize fluent prompts by constraining the prompt policy's action space (see Appendix §B for the constraint), and compare with our standard method (without the fluency constraint) in Table 5. Results show that the fluency-constrained prompts have remarkably lower perplexity, which indicates higher language coherence. For instance, one fluent prompt we learned for to-positive transfer is "I love my life (". However, these prompts receive much lower task performance in terms of J(·) and GM(·). We present the learned fluent and gibberish prompts in Table 15 in the appendix.
Transferring Prompts across LMs One unique advantage of discrete prompts over soft prompts is that they are transferrable across models, thanks to the shared text space instead of model-specific latent spaces. This enables us to study the connections between different LMs by comparing the transfer performance of prompts trained on these models (e.g., taking a prompt trained on distilGPT-2 and applying it to GPT-2-xl). Interestingly, experiments show that the optimized prompts, though largely gibberish text, can indeed retain significant performance after transferring to different LMs. Furthermore, prompts can transfer from smaller to larger models with similar or even better performance. More concretely, for this study, we use both few-shot classification (§3.1) and style transfer (§3.2). Specifically for classification, we train prompts on various sizes of RoBERTa and GPT-2 and apply them to every other model for classification. We tabulate the average performance over 5 runs in the heatmap of Figure 4. Overall, all prompts can transfer between models, but the success depends on both the source and target LMs.
For example, prompts learned from larger models see sharp performance declines when applied to smaller models, indicating that the structures they activate in large LMs may be less present in smaller ones.In contrast, prompts learned from smaller models reach similar or better performance on larger models (e.g., RoBERTa-base to -large).
Experiments on TST exhibit similar patterns, as shown in Figure 7 in Appendix §B. Perhaps surprisingly, prompts learned from MLMs like RoBERTa transfer well to left-to-right LMs like GPT-2 and vice versa, showing that the LM structures they activate are largely shared across model types. These findings open up a promising and exciting direction for future research: enabled by the transferrability across LMs, we may learn a prompt cheaply from smaller models, and apply it to a larger, more powerful model for inference.
Robustness to Classification Verbalizers It is known that prompted classification is sensitive to verbalizer choices. Manual design requires domain expertise and understanding of the base LMs. Previous research devised various methods for automatic verbalizer search (Schick et al., 2020; Shin et al., 2020; Gao et al., 2021). In few-shot classification, our method can discover well-performing prompts given a wide variety of verbalizers. Table 6 shows the results on SST-2 with several intuitive verbalizers, averaged over 3 random seeds for each verbalizer pair. Across different verbalizers, our prompts consistently outperform the manual prompt with smaller variation, showing our approach is robust to the choice of verbalizers. We report similar results on AG's News in Table 11 in the appendix.

Related Work
We discuss briefly the various prompting paradigms in previous work, and provide a more comprehensive discussion in Appendix §C. The conventional usage of pre-trained LMs is fine-tuning on downstream datasets (Devlin et al., 2019; Lewis et al., 2020, etc.), which expensively updates all model parameters and shows limited success with small datasets. Brown et al. (2020) show that manual prompts can steer large LMs to perform NLP tasks without any training (Raffel et al., 2020; Schick and Schütze, 2021a; Sanh et al., 2021). Another line of work (Weller et al., 2020; Efrat and Levy, 2020; Mishra et al., 2021b; Wang et al., 2022) prompts LMs with natural language task instructions. Many recent methods (e.g., Lester et al., 2021; Liu et al., 2021d) tune soft prompts using gradient descent. By their continuous nature, however, soft prompts are difficult to understand (Lester et al., 2021; Hambardzumyan et al., 2021; Khashabi et al., 2021), require expensive gradient information (Sun et al., 2022; Diao et al., 2022), and are incompatible for reuse across models due to mismatched latent spaces (Su et al., 2021). Some existing works seek to locate better discrete prompts by augmenting human-written prompts with heuristics such as paraphrasing (Jiang et al., 2020), editing (Prasad et al., 2022), and reframing (Mishra et al., 2021a), and selecting by some downstream metric. AutoPrompt (Shin et al., 2020) edits discrete prompts with guidance from model gradients, which sees some success with large training data but limited general applicability due to unstable approximations.

Conclusion
We have presented RLPROMPT, an efficient and flexible approach for discrete prompt optimization using RL, which improves over a wide range of fine-tuning and prompting methods in experiments on few-shot classification and unsupervised text style transfer. Analysis reveals that strong optimized prompts are incoherent but transferrable between LMs with remarkable performance. This observation opens up many promising possibilities for prompting, such as learning prompts cheaply from smaller models and performing inference with larger models. We are excited to explore further.

Limitations
While our prompt optimization method performs well on regular-sized LMs like RoBERTa and GPT-2, we have not experimented with more recent huge models like GPT-3 (Brown et al., 2020). As is the case for typical RL methods, designing reward functions may require domain expertise. However, we may address this problem using techniques such as inverse RL, which learns the reward function from data. In terms of transferrability across models, we have not looked closely into the patterns of the learned prompts, or the so-called "secret language" of LMs. We look forward to studying all these questions in future work.

A.1 Policy Network
For all tasks, we uniformly use distilGPT-2 (HuggingFace, 2019) with 82M parameters as a compact policy LM, and implement a generously parameterized MLP with 1 hidden layer and 2048 hidden units. Given distilGPT-2's hidden size of 768, we only add 3.1M parameters, or 3.8% of the LM parameters.
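The parameter count checks out arithmetically for a one-hidden-layer MLP with biases at these sizes:

```python
# Two weight matrices (768 -> 2048 and 2048 -> 768) plus their bias vectors
mlp_params = 768 * 2048 + 2048 + 2048 * 768 + 768
fraction = mlp_params / 82e6   # relative to distilGPT-2's 82M parameters

print(mlp_params)   # 3148544, i.e. ~3.1M
print(fraction)     # ~0.038, i.e. ~3.8%
```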

A.2 Few-Shot Text Classification
Reward Function Details During training, we compute the reward for prompt z by averaging over all of our few-shot training examples. We set the balancing weights λ1 = 180 and λ2 = 200 by tuning on the validation set.
Baseline Implementation Details For Manual Prompt, we take the hand-crafted prompts from Schick and Schütze (2021a). For Instructions, we manually create task descriptions and label definitions following Mishra et al. (2021b)'s protocol (shown in Table 14) and prepend the instructions to the inputs. For In-Context Demonstration (Brown et al., 2020), we randomly select one training example per class and concatenate them with the input texts. For Prompt Tuning (Lester et al., 2021), we replace the Manual Prompt tokens with five soft tokens in the same positions for fair comparison, and optimize them using the Adam optimizer with learning rate 1 × 10^-2 and batch size 16 for 400 epochs. For Black-Box Tuning (Sun et al., 2022) with mixed prompts, we use 50 soft tokens and a budget of 8,000 following the default setting. For its soft-prompt-only setting, we also optimize with the same budget. For Fine-Tuning, we train with the Adam optimizer with learning rate 1 × 10^-5 and batch size 16 for 100 epochs. For Discrete Prompt Enumeration, we take GrIPS (Prasad et al., 2022) as a state-of-the-art example. For AutoPrompt (Shin et al., 2020), we use 5 prompt tokens and perform prompt search with a batch size of 16 using the few-shot training examples. For each baseline, we pick the model with the best validation accuracy for evaluation.

Additional Training Details
During training, we explore the prompt space using top-256 sampling from the policy network, whose input is just one placeholder word "classification". To update the parameters, we use an Adam (Kingma and Ba, 2014) optimizer with learning rate 5 × 10⁻⁵. Furthermore, we multiply all rewards by 5 to increase the reward scale of well-performing prompts, and apply z-score normalization (§2.4) across prompts for more efficient learning. We train the policy with 16 prompts per batch, for 6K steps with 2-token prompts and 12K steps with 5-token prompts, and compute validation performance every 10 steps. Using an NVIDIA GeForce RTX 3090 GPU, each experiment typically takes from 1.5 hours using distilRoBERTa-base to 4 hours using RoBERTa-large. During evaluation, we average the performance of the 3 prompts with the highest validation accuracy for each experiment. Due to the instability and inherent randomness of the few-shot setup (Henderson et al., 2018; Gao et al., 2021), we sample 5 different training and validation sets, run 3 experiments per set with different random seeds, and report the average accuracy and standard deviation.
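The reward rescaling and z-score normalization described above can be sketched with standard-library code: raw rewards for the prompts in a batch are first multiplied by 5, then shifted and scaled to zero mean and unit variance before the policy update.

```python
import statistics

def normalize_rewards(raw_rewards, scale=5.0):
    """Rescale a batch of raw rewards, then z-score-normalize across
    the prompts in the batch (zero mean, unit variance)."""
    scaled = [scale * r for r in raw_rewards]
    mean = statistics.fmean(scaled)
    std = statistics.pstdev(scaled)
    if std == 0:
        # all rewards identical: no learning signal, return zeros
        return [0.0 for _ in scaled]
    return [(r - mean) / std for r in scaled]

print(normalize_rewards([10.0, 20.0, 30.0]))  # [-1.2247..., 0.0, 1.2247...]
```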

Additional Results
We present our results on the additional datasets described in Section §3.1 in Table 8. Again, our method outperforms prompting baselines on average. Methods tuning continuous parameters, such as Fine-Tuning, Prompt Tuning, and BB Tuning, show better performance on Yahoo and DBPedia, both multi-way datasets which have much more training data under our setting.

Ablation Study As mentioned before (§2.4), misspecified or vulnerable reward functions can prevent the policy from discovering truly strong-performing prompts. To address this challenge, we propose to design piecewise reward functions that provide a bonus for qualitative behaviors such as achieving certain accuracies on each class. As our reward function for few-shot classification adopts this design, we assess its effectiveness by ablating the piecewise component. Specifically, we test on SST-2 (Socher et al., 2013) and AG's News (Zhang et al., 2015) using 5 prompt tokens with the distilRoBERTa-base model as an example. We run 5 RL experiments on the same few-shot dataset using different random seeds, and compute the validation accuracy every 50 steps. As the results in Figure 5 show, our piecewise reward function improves training stability by leading to strong-performing prompts more consistently, resulting in better average performance across random seeds and datasets.
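The piecewise bonus idea can be sketched schematically. The threshold and bonus values below are illustrative assumptions, not the paper's hyperparameters: the point is that a bonus is granted only when the prompt clears an accuracy bar on every class, which discourages degenerate prompts that collapse onto a single label.

```python
def piecewise_reward(per_class_accuracy, threshold=0.5, bonus=10.0):
    """Schematic piecewise reward: base reward is mean per-class
    accuracy; a bonus is added only if every class clears `threshold`."""
    base = sum(per_class_accuracy) / len(per_class_accuracy)
    if all(acc >= threshold for acc in per_class_accuracy):
        return base + bonus
    return base

print(piecewise_reward([0.9, 0.1]))  # 0.5   (no bonus: one class below threshold)
print(piecewise_reward([0.7, 0.6]))  # 10.65 (bonus: both classes clear 0.5)
```

Note how the two example prompts have similar average accuracy, but only the balanced one receives the bonus, which is exactly the stabilizing behavior the ablation above measures.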
Dataset Statistics (1) Yelp (Shen et al., 2017) contains 266K positive and 177K negative reviews for training, 38K and 25K for validation, and 76K and 50K for testing, respectively. We perform evaluation on a separate dataset consisting of 500 reviews for each sentiment, with reference outputs collected by Li et al. (2018).
(2) We use the Shakespeare (Xu et al., 2012) dataset compiled by Jhamtani et al. (2017), which contains 18K parallel sentence pairs from Shakespeare's plays and their modern translations for training, 1.2K for validation, and 1.4K for testing. We treat the dataset as a non-parallel corpus for training, but use the paired sentences as references during evaluation. We preprocess both datasets with a simple text cleaning function to remove tokenization artifacts (e.g., "it 's great ." becomes "it's great."). We include the function in our public codebase for reproducibility.

Additional Training Details
In training, we sample 4 prompts for each input using top-50 sampling from our policy network. During sampling, we bias all logits by -10 to encourage exploration. For each prompt, we generate outputs using top-10 sampling, and bootstrap the reward 4 times to reduce variance.
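The variance-reduction step can be sketched as follows. The `generate` and `score` callables are hypothetical placeholders for top-10 sampling from the task LM and the style-transfer reward; the sketch only illustrates drawing several stochastic outputs per prompt and averaging bootstrap resamples of their rewards.

```python
import random

def bootstrapped_reward(generate, score, n_outputs=4, n_bootstrap=4, seed=0):
    """Schematic bootstrapped reward estimate for one prompt.

    Draw `n_outputs` stochastic outputs, score each, then average the
    means of `n_bootstrap` resamples (with replacement) of those scores.
    """
    rng = random.Random(seed)
    rewards = [score(generate()) for _ in range(n_outputs)]
    estimates = []
    for _ in range(n_bootstrap):
        resample = [rng.choice(rewards) for _ in rewards]
        estimates.append(sum(resample) / len(resample))
    return sum(estimates) / len(estimates)
```

The bootstrap average always stays within the range of the observed per-output rewards, so it smooths the signal the policy sees without changing its scale.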
For SQL training, we set the target learning rate to 10⁻³, and shape the reward from a scale of [0, 1] to [-20, 80]. We optimize the prompt generator using an Adam optimizer with learning rate 10⁻⁴, except for Yelp negative-to-positive and Shakespeare using GPT-2-large and GPT-2-xl models, which we train with learning rate 5 × 10⁻⁵. We train with 2 inputs per batch, for 6K steps if the learning rate is 10⁻⁴ and 12K steps if it is 5 × 10⁻⁵. Also using the RTX 3090 GPU, each experiment typically takes from 10 hours using distilGPT-2 to 1 day using GPT-2-xl. To reduce the performance variance caused by sample selection and RL initialization, we average the performance from 5 evaluation runs for each of 3 RL experiments using our own method. Additionally, we perform the same sample selection for all our baselines for comparable performance. For the Shakespeare training baselines, we do not perform sample selection in order to avoid biasing the full-dataset models with our few-shot style classifiers.
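The reward shaping above is an affine map from the raw reward scale onto a shifted, stretched range, so that poor prompts receive negative reward and strong prompts a large positive one. A minimal sketch (the helper name is illustrative; other experiments may use different source/target ranges):

```python
def shape_reward(r, src=(0.0, 1.0), dst=(-20.0, 80.0)):
    """Affine reward shaping: map r from the src interval to dst."""
    lo, hi = src
    new_lo, new_hi = dst
    return new_lo + (r - lo) / (hi - lo) * (new_hi - new_lo)

print(shape_reward(0.0))  # -20.0
print(shape_reward(0.5))  # 30.0
print(shape_reward(1.0))  # 80.0
```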
Evaluation Details For automatic evaluation, we measure Content using the CTC metric (Deng et al., 2021). To evaluate Fluency, we rate output grammaticality using the classifier from Krishna et al. (2020). We also report popular metrics such as BLEU (using sacreBLEU; Post, 2018) and BERTScore (Zhang et al., 2019) for content preservation, and perplexity (PPL) for fluency. To compute PPL, we fine-tune GPT-2 LMs on each TST dataset. For human evaluation, we enlist 5 graduate students who are fluent in English to rate Content, Style, and Fluency on a Likert scale of 1-5, and collect 3 ratings for each output. The average inter-rater agreement is 0.35 in terms of Fleiss' kappa (Fleiss and Cohen, 1973), which is fair and similar to previous work (Mir et al., 2019).

Few-Shot Experiment Details
As discussed before, we experiment with few-shot text style transfer on the Shakespeare dataset. For the training baselines, we compare with Deep Latent (He et al., 2020) and STRAP (Krishna et al., 2020), both trained on the full data. STRAP fine-tunes a GPT-2 (Radford et al., 2019) with self-supervised paraphrasing signals, which can be seen as a full-model tuning analogue to our method. We also compare with the same prompting baselines tested for Yelp.
Both prompting baselines and our method use GPT-2-xl as the task LM.

Few-Shot Experiment Results
We present the automatic evaluation results for Shakespeare in Table 9 to illustrate our few-shot performance. Even with only 100 training examples and no updates to the model, our method outperforms or gets close to training baselines that use the full dataset, such as Deep Latent and STRAP. STRAP is also limited to a subset of styles (e.g., authorship and formality).

B Additional Analysis
Fluent vs. Gibberish Prompts We propose to optimize fluent prompts with top-k filtering (Qin et al., 2022). That is, we limit our policy's action space at each step t to the tokens with top-20 probabilities under a GPT-2 LM, conditioning on the previous prompt tokens z<t. Other than that, we train the policy using the same routine. To evaluate prompt perplexity, we use an out-of-the-box GPT-2 model.
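The constrained action space can be sketched as follows. The logits here are schematic stand-ins for real model outputs, and for simplicity the sketch picks the policy's greedy choice among the allowed tokens (during training one would sample from the renormalized policy distribution over the same set).

```python
def fluency_constrained_choice(policy_logits, lm_logits, k=20):
    """At one decoding step, restrict the policy to the k tokens the
    fluency LM ranks highest given the prompt prefix, then pick the
    policy's preferred token among them."""
    allowed = sorted(range(len(lm_logits)),
                     key=lambda i: lm_logits[i], reverse=True)[:k]
    return max(allowed, key=lambda i: policy_logits[i])

# The policy's globally favorite token (index 0) is excluded when the
# fluency LM ranks it outside the top-k, so index 2 is chosen instead.
print(fluency_constrained_choice([10.0, 0.0, 5.0, 1.0],
                                 [0.0, 3.0, 2.0, 1.0], k=2))  # 2
```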
Transferring Prompts across LMs Previously, we presented our prompt transfer results for few-shot classification in Section §3.3. For text style transfer, we use the prompts trained for each size of GPT-2 (from the smallest distilGPT-2 to the largest GPT-2-xl) to perform generation using every other model, and present the average performance over 5 evaluations in the heatmap of Figure 7. We also include Manual Prompt for comparison and Random Prompt for the baseline performance without transfer. Manual Prompt shows uniformly worse performance than learned prompts with smaller models like distilGPT-2 and GPT-2-small, but generally better results with larger models like GPT-2-large and -xl, suggesting that human-written prompts may better activate larger models. Overall, all optimized prompts see some transfer, as evidenced by their uniformly better performance than Random Prompt, and the level of success depends on both the prompt training and generation models, similarly to classification.
Qualitative Analysis of Prompt Tokens Empowered by the transparency of discrete tokens, we investigate the prompts we learned for classification to characterize the similar patterns learned by different LMs, as discovered by the prompt transfer analysis (§3.3). In particular, we frequently find semantically similar tokens among our learned prompts, which we name "strong words" and list in Table 12. These strong words make sense in the context of their specific tasks, indicating the LMs may indeed capture certain human-understandable patterns during pre-training. For instance, "absolutely" may signal strong opinion before judging a sentence as positive or negative, whereas "News" appears to be a hint for classifying the topic of a news piece. Besides these semantically meaningful prompt tokens, we also find some unintelligible prompts that nevertheless achieve good performance on downstream tasks, the so-called "secret language" (Daras and Dimakis, 2022) of the LM (e.g., "imentariesariesaryary" can reach 80% accuracy with RoBERTa-large on AG's News).
Beyond finding strong words, we also study whether we can construct strong-performing prompts by arbitrarily composing these strong words, which can provide insight into whether LMs use these strong words compositionally. To this end, we construct several prompts, evaluate their downstream performance, and tabulate the results in Table 13. Interestingly, composing more strong words indeed can lead to improved performance, but the level of success is sensitive to various factors, such as word order and the specific tokens we choose, indicating that existing LMs are still brittle even when responding to discrete tokens learned from optimization.

C Additional Related Work
C.1 Prompting Paradigms
Fine-Tuning The conventional approach to using pre-trained LMs is fine-tuning model parameters on downstream datasets (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Radford et al., 2019). While driving progress in a wide range of NLP tasks, fine-tuning expensively updates all model parameters and shows limited success with small datasets. Prompt-based fine-tuning (Gao et al., 2021; Schick and Schütze, 2021b) uses prompting to improve few-shot performance, but the problem of costly training remains unsolved.
Manual Prompt As LMs show remarkable progress in understanding natural language (Peters et al., 2018; Devlin et al., 2019), researchers first use hand-crafted fill-in-the-blank prompts to extract knowledge from pre-trained LMs for probing analyses (Petroni et al., 2019; Jiang et al., 2020). Later on, Brown et al. (2020) show that, using manually-written prompts, large LMs can perform a number of NLU and NLG tasks without any training examples. Meanwhile, other studies (Raffel et al., 2020; Schick and Schütze, 2021a; Sanh et al., 2021) formulate various NLP tasks as manual prompts.
Instructions Separate from but related to manual prompts, another line of work (Weller et al., 2020; Efrat and Levy, 2020; Mishra et al., 2021b; Wang et al., 2022) makes use of instructional prompts, which provide task descriptions instead of fill-in-the-blank questions. In particular, instruction meta-tuning (Mishra et al., 2021b; Zhong et al., 2021; Wei et al., 2022a) trains models on some tasks with instructions and supervised data in order to generalize to unseen tasks formulated as instructions without training examples.
Discrete Prompt Enumeration Because discrete prompts are difficult to optimize and susceptible to small design variations (Zhao et al., 2021; Webson and Pavlick, 2021; Lu et al., 2021), a number of existing works seek to locate better prompts by augmenting human-written prompts with heuristics such as paraphrasing (Jiang et al., 2020; Gao et al., 2021), editing (Prasad et al., 2022), and reframing (Mishra et al., 2021a). The final prompt is typically selected to maximize some downstream performance metric.
AutoPrompt Shin et al. (2020) optimize discrete prompts by editing prompt tokens with guidance from model gradients. While seeing some success with large training data, the method relies heavily on approximation, which leads to less stable training and limited applicability to few-shot settings.
Soft Prompt Tuning Replacing discrete prompts with continuous embeddings, several parallel works (Qin and Eisner, 2021; Li and Liang, 2021; Liu et al., 2021d) propose to optimize soft prompts with gradient-based tuning. Soft prompt tuning can be seen as a variant of parameter-efficient transfer learning (Houlsby et al., 2019; He et al., 2021; Ding et al., 2022), and inspires a number of follow-up works that boost its performance (e.g., Liu et al., 2021c; Gu et al., 2021; Vu et al., 2021; Clive et al., 2021) or explore novel applications (e.g., Tan et al., 2022; Zhou et al., 2022; Levine et al., 2022). By their nature, however, soft prompts are difficult for humans to understand because of their continuous form (Khashabi et al., 2021; Lester et al., 2021; Hambardzumyan et al., 2021; Mokady et al., 2021). Defined in the latent space of specific models, learned prompts are also virtually impossible to use with a different model. Furthermore, their training typically requires gradient information from the models they prompt, which can be expensive to compute or simply inaccessible for models deployed as inference APIs, such as GPT-3 (Brown et al., 2020). Sun et al. (2022) and Diao et al. (2022) propose black-box tuning, which updates continuous prompts using gradient-free techniques to some success.

C.2 Controllable Text Generation
Current state-of-the-art models typically fine-tune entire pre-trained LMs (e.g., Ziegler et al., 2019a; Keskar et al., 2019; Ziegler et al., 2019b; Liu et al., 2021e). Recent work instead employs various prompts to steer the LM to generate text with properties such as topic (Guo et al., 2021; Qian et al., 2022) and (lack of) toxicity (Liu et al., 2021a; Perez et al., 2022), or from modalities such as images (Mokady et al., 2021; Zhou et al., 2022), structured data (Li and Liang, 2021; An et al., 2022), and numbers (Wei et al., 2022b). However, these works either control simple attributes, perform no explicit prompt optimization, or have access to supervised data. For unsupervised tasks with more complex requirements, such as text style transfer (Hu et al., 2017; Jin et al., 2022), Reif et al. (2021) propose augmented zero-shot prompting, which achieves some success using huge LMs (e.g., GPT-3). Complementary to the works above, which focus on finding prompts, Zou et al. (2021) augment the generation decoding objective using the prompt, leading to improved performance in poetry generation and long-form QA.
Dataset: SST-2
Instruction: In this task, you are given sentences from movie reviews. The task is to classify a sentence as "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the sentence is negative.
RLPROMPT 2 token template: <S> VERY Absolutely [MASK] .

Figure 1: Overview of RLPROMPT for discrete prompt optimization. All LMs (white boxes) are frozen. We build our policy network by training a task-specific MLP module inserted into a frozen pre-trained LM. The figure illustrates generation of a prompt (left), example usages in a masked LM for classification and a left-to-right LM for generation (top-right and bottom-right, respectively), and the update of the MLP using RL reward signals.

Figure 2: Comparison of our method (orange) and Black-Box (BB) Tuning (Sun et al., 2022) (blue) in terms of training efficiency. The solid curves are the mean, and the shaded regions are the maximum and minimum test accuracy over 5 trials.

Figure 3: Comparison of our method with (orange) and without (purple) z-score reward normalization. The format is the same as Figure 2. Additional comparisons are in Figure 6.

Figure 4: Heatmap of sentiment analysis performance with transferred discrete prompts of 2 tokens. The columns represent the models used to learn the prompts, and the rows represent the models we perform classification with. Brighter color represents higher accuracy.

Figure 5: Comparison of our method with (orange) and without (green) piecewise reward function for few-shot classification. The format is the same as Figure 2.

Figure 6: Additional comparison of our method with (orange) and without (purple) z-score reward normalization. The format is the same as Figure 2.

Figure 7: Heatmap of Yelp style transfer performance with transferred discrete prompts. The columns represent the models used to learn the prompts, and the rows represent the models we perform text generation with. Manual and Random refer to the baselines presented in Table 3. Brighter color represents better joint score J(•).
RLPROMPT 5 token template: <S> AgentMediaGradeOfficials Grade [MASK] .

Dataset: Yelp P.
Instruction: In this task, you are given Yelp reviews. The task is to classify a review as "great" if the overall sentiment of the review is positive or as "terrible" if the overall sentiment of the review is negative.
RLPROMPT 2 token template: <S> Rating Absolutely [MASK] .
RLPROMPT 5 token template: <S> ProductGradeTimeoutAbsolutely Absolutely [MASK] .

Dataset: MR
Instruction: In this task, you are given sentences from movie reviews. The task is to classify a sentence as "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the sentence is negative.
RLPROMPT 2 token template: <S> downright absolutely [MASK] .
RLPROMPT 5 token template: <S> ouslyicals downright certainly consistently [MASK] .

Dataset: CR
Instruction: In this task, you are given sentences from customer reviews. The task is to classify a sentence as "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the sentence is negative.
RLPROMPT 2 token template: <S> ITNESSALLY [MASK] .
RLPROMPT 5 token template: <S> absoluteliterally absolute downright downright [MASK] .

Dataset: SST-5
Instruction: In this task, you are given sentences from movie reviews. Based on the given review, classify it to one of the five classes: (1) terrible, (2) bad, (3) okay, (4) good, and (5) great.
RLPROMPT 2 token template: <S> Movie entirely [MASK] .
RLPROMPT 5 token template: <S> iciticititableually immediately [MASK] .

Dataset: Yelp
Instruction: In this task, you are given Yelp reviews. Based on the given review, classify it to one of the five classes: (1) terrible, (2) bad, (3) okay, (4) good, and (5) great.
RLPROMPT 2 token template: <S> =-=-Totally [MASK] .
RLPROMPT 5 token template: <S> imalimalimalivable Totally [MASK] .

Dataset: AG's News
Instruction: In this task, you are given a news article. Your task is to classify the article to one out of the four topics "World", "Sports", "Business", "Tech" if the article's main topic is relevant to the world, sports, business, and technology, correspondingly. If you are not sure about the topic, choose the closest option.
RLPROMPT 2 token template: [MASK] Reviewer Information <S> .
RLPROMPT 5 token template: [MASK] StaffAreaFocusHardware Advisory <S> .

Dataset: Subj
Instruction: In this task, you are given sentences from reviews. The task is to classify a sentence as "subjective" if the opinion of the sentence is subjective or as "objective" if the opinion of the sentence is objective.
RLPROMPT 2 token template: <S> Friends pleasantly [MASK] .
RLPROMPT 5 token template: <S> BufferActionDialogDialog downright [MASK] .

Dataset: TREC
Instruction: You are given a question. You need to detect which category better describes the question. Answer with "Description", "Entity", "Expression", "Human", "Location", and "Number".
RLPROMPT 2 token template: <S> DeveloperTermin [MASK] .
RLPROMPT 5 token template: <S> BufferHttpRuntimeRunnerostics [MASK] .

Dataset: Yahoo
Instruction: You are given a passage. Using the information present in the passage, you need to classify it into one of the 10 topics: 0 - Culture, 1 - Science, 2 - Health, 3 - Education, 4 - Computers, 5 - Sports, 6 - Business, 7 - Music, 8 - Family, 9 - Politics.
RLPROMPT 2 token template: <S> Source Ireland [MASK] .
RLPROMPT 5 token template: <S> AlertSource mentioning Besidesadays [MASK] .

Dataset: DBPedia
Instruction: You are given a passage. Using the information present in the passage, you need to classify it into one of the 10 topics: 0 - Culture, 1 - Science, 2 - Health, 3 - Education, 4 - Computers, 5 - Sports, 6 - Business, 7 - Music, 8 - Family, 9 - Politics.
RLPROMPT 2 token template: typeSection [MASK] : <S> .
RLPROMPT 5 token template: CommonExamplesSenate Similar comparable [MASK] : <S> .

Table 3: Comparison of our method vs. baselines on the Yelp (Shen et al., 2017) sentiment transfer dataset. J(•) is our main metric, which measures the average joint sentence-level scores of Content, Style, and Fluency as defined in §3.2. We also report the geometric mean (GM) of the three aspects. Numbers in (parentheses) are standard deviations across 3 sets of prompts.

Table 4: Human evaluation on Yelp on a 5-point Likert scale, where the best result on each aspect is bolded and the second best underscored. DiRR relies on model fine-tuning.

Table 5: Comparison of prompt optimization with fluency constraint vs. no constraint on the Yelp dataset. Both experiments use GPT-2-xl as the text generation model. Prompt PPL is the prompt's perplexity under a GPT-2 language model. The text style transfer metrics are the same as in Table 3.

Table 6: Comparison of RLPROMPT and manual prompt on SST-2 using different verbalizers.
training tasks. Because training easily collapsed without z-score normalization under the original hyperparameters, we tuned the reward shaping scheme to transform a scale of [50, 100] into [-50, 50], which substantially improved training stability and results.

Table 7: Datasets evaluated in this work (Perez et al., 2021). |C|: # of classes for classification tasks. <S>: input sentence. All our label words have a prepended special character Ġ to represent a space before a word. Note that we follow the true few-shot learning setting (Perez et al., 2021) by taking the same number of validation examples as training examples, which is consistent with previous prompting works.

Table 8: Additional results of few-shot text classification. The best result on each dataset is bolded and the second best underscored. The remaining format follows Table 2.

Table 10: …discussed earlier. To compute Style, we train BERT-base-uncased classifiers on both training and testing data, with validation accuracies of 98.4% and 93.7% on Yelp and Shakespeare, respectively.
Task Category: Sentiment Analysis. Strong Words: Absolutely, absolutely, Totally, downright, profoundly, VERY, Very, Really, highly
Task Category: News Classification. Strong Words: News, Reviewer, Reports, reported, Staff, Information, Statement, Stories, Guide, say,

Table 12: Strong words from RLPROMPT for different task categories. The words are all sensitive to case and to whether we prepend the special character Ġ.

Table 13: The performance of manual prompt examples composed of strong words from Table 12, for both sentiment analysis and news topic classification, across RoBERTa-large and GPT-2-large.

Table 14: Manual instructions (following natural instructions; Mishra et al., 2021b) we tested in our baseline implementation, and some template cases learned by RLPROMPT for specific datasets.