Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model

While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.


Introduction
Large language models (LLMs; Rae et al., 2021; Hoffmann et al., 2022; Scao et al., 2022; Touvron et al., 2023) are seeing widespread adoption thanks to the fact that they can perform many language tasks and generate coherent long-form text. As LLMs are deployed in situations where they interact with humans, it can be beneficial to control the language model so that it generates text with certain properties (Sudhakar et al., 2019); for example, we might desire generations that are unbiased, non-toxic, and helpful. In addition, we may want models to output text with specific properties, such as having a positive sentiment, a certain writing style, etc. Typically, LLMs pre-trained on uncurated large-scale text corpora can generate text that does not have these desired attributes (Wallace et al., 2019; Gehman et al., 2020), which motivates the need for techniques that enable controllable text generation. Such techniques can be seen as providing a means to condition text generation on a desired attribute.

Figure 1: Reward-Augmented Decoding (RAD). RAD steers a language model towards generating text that is assigned a high reward by an auxiliary reward model. Blue/red boxes in the reward model correspond to cached/newly computed hidden states.
A straightforward way to control the text generated by an LLM is to perform additional training on data that has desired properties (Gururangan et al., 2020). Alternatively, an LLM can be trained with "control codes" (Keskar et al., 2019; Lu et al., 2022) that indicate text characteristics and can be used to induce the LLM to generate content with those characteristics. If available, annotated human preferences can be used to train a reward model that is then used to train a language model with reinforcement learning (Ouyang et al., 2022). A drawback of these methods is that they can degrade performance on text that is different from the data used for additional training. In addition, work done to control one language model cannot be reused to control another language model. Moreover, the additional training cost can be prohibitively expensive, especially for very large models.
One way to avoid the cost and shortcomings of additional training is to instead modify the decoding procedure used to generate text from a language model. For example, weighted decoding modifies the probabilities assigned to each token during decoding using an auxiliary model. Most weighted decoding methods (Holtzman et al., 2018; Krause et al., 2021; Liu et al., 2021; Yang and Klein, 2021) obtain an attribute probability P(c|X) from a separate reward model (typically smaller than the base language model) and construct class-conditional text probabilities following Bayes' rule, P(X|c) ∝ P(X)P(c|X), where c is an attribute class and P(X) is the distribution over natural language sequences X. Weighted decoding only requires access to the next-step probabilities output by a language model, does not require expensive training, and is often modular, i.e. a single reward model can be reused with many language models. Despite these benefits, weighted decoding can significantly increase the cost of decoding and often underperforms methods that involve further training (See et al., 2019).
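To make the weighted decoding formulation concrete, the following sketch (our illustration, not code from any of the cited methods; the function name and tensor shapes are assumptions) combines an LM's next-token distribution with an attribute classifier's per-token scores via Bayes' rule:

import torch

def bayes_reweight(lm_log_probs, attr_log_probs):
    # Weighted decoding via Bayes' rule: P(x|c) ∝ P(x) * P(c|x),
    # which becomes a sum in log space.
    #   lm_log_probs:   (vocab,) log P(x) for each candidate next token
    #   attr_log_probs: (vocab,) log P(c|x) for each candidate next token
    combined = lm_log_probs + attr_log_probs
    # Renormalize so the result is a proper distribution over tokens.
    return torch.softmax(combined, dim=-1)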
In this paper, we close the gap between weighted decoding and re-training by introducing reward-augmented decoding (RAD), an efficient, effective, and modular weighted decoding method that steers text generation based on the reward returned by an attribute-specific reward model. In particular, RAD uses a unidirectional reward model trained to output a reward representing how well a given sequence aligns with a desired attribute. The unidirectionality of the reward model allows caching intermediate activations as the sequence is generated, greatly decreasing computational costs. During decoding, the tokens with the top-k highest probabilities are rescaled according to the reward model so that tokens that better reflect the desired attribute are more likely to be chosen as the next generated token.
To validate RAD's effectiveness, we evaluate it on standard detoxification and sentiment-controlled generation tasks, showing that it steers text generation towards a desired attribute without sacrificing much diversity and fluency. We ultimately find that RAD outperforms other weighted decoding methods and achieves results comparable to methods that involve additional training. We further validate RAD in a real-world large-scale setting by showing it is effective and introduces minimal computational overhead when applied to the LLaMA (Touvron et al., 2023) family of language models with up to 65B parameters.

Reward-Augmented Decoding
At a high level, reward-augmented decoding, as shown in fig. 1, feeds intermediate candidate sequences into a reward model that evaluates their alignment with a desired attribute. Then, at each decoding step, RAD uses the predicted reward of each candidate sequence to modify the token probabilities output by the language model. In this section, we describe these steps in detail. Refer to table 2 for descriptions of the notations used in this paper.

Unidirectional Reward Model
Consider using a reward model to compute rewards for k candidate tokens at each of m generation timesteps. If scoring each candidate token requires re-processing the entire generated sequence up to the current timestep, the reward model would need to process O(km²) tokens, which could be prohibitively expensive. To address this issue, we use a unidirectional reward model, specifically a Transformer decoder with causal masking (Liu et al., 2018; Radford et al., 2018). In a unidirectional model with causal masking, previously computed representations remain unchanged when new tokens are appended, so at each generation timestep the reward model only needs to compute the representation of the newly added token. This reduces the computational cost to O(km).
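To illustrate why unidirectionality enables this saving, the sketch below (a minimal illustration using the HuggingFace transformers API; the reward head and function name are our own stand-ins, not RAD's released code) scores k candidate next tokens while reusing the prefix's cached key/value activations, so each decoding step processes only the k new tokens:

import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2").eval()       # causal (unidirectional) backbone
reward_head = torch.nn.Linear(model.config.n_embd, 1)  # stand-in reward head

@torch.no_grad()
def score_candidates(prefix_ids, candidate_ids):
    # prefix_ids:    (1, t) tokens generated so far
    # candidate_ids: (k,)   candidate next tokens
    # returns:       (k,)   predicted reward for each candidate
    k = candidate_ids.shape[0]
    # Process the shared prefix once, caching its key/value activations.
    out = model(prefix_ids, use_cache=True)
    # Broadcast the cache across the k candidates. This assumes the legacy
    # tuple format of past_key_values; newer transformers versions wrap it
    # in a Cache object.
    past = tuple(
        (key.expand(k, -1, -1, -1), value.expand(k, -1, -1, -1))
        for key, value in out.past_key_values
    )
    # Each candidate appends exactly one token, so only one new hidden
    # state per candidate is computed: O(k) work instead of O(kt).
    out = model(candidate_ids.unsqueeze(-1), past_key_values=past)
    rewards = torch.sigmoid(reward_head(out.last_hidden_state[:, -1]))
    return rewards.squeeze(-1)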
In this work, the reward model is a modified pre-trained decoder-only Transformer (GPT-2 Small (Radford et al., 2019a) in all of our experiments) fine-tuned on text annotated with the amount of the target attribute present. We use a cumulative squared error loss that takes a weighted mean of each prefix's loss:

L = (Σ_{t=1}^{l} t(r_t − r)²) / (Σ_{t=1}^{l} t)

where r_t is the reward model's prediction at generation timestep t, r ∈ [0, 1] is the ground-truth reward value, and l is the generation length. The cumulative loss encourages the reward model to output the correct reward for every prefix of the text sequence in order to capture both the current and future alignment of a candidate sequence with the desired attribute.
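A minimal sketch of this loss, assuming (as in the equation above) that the weight on the prefix ending at timestep t is proportional to t:

import torch

def cumulative_loss(pred_rewards, true_reward):
    # pred_rewards: (l,) reward prediction r_t for each prefix length t
    # true_reward:  scalar ground-truth reward r in [0, 1]
    l = pred_rewards.shape[0]
    t = torch.arange(1, l + 1, dtype=pred_rewards.dtype)
    sq_err = (pred_rewards - true_reward) ** 2
    # Weighted mean of per-prefix squared errors, weighting later
    # (longer) prefixes more heavily.
    return (t * sq_err).sum() / t.sum()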

Weighted decoding
RAD utilizes top-k sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019b) and re-weights the probabilities of the tokens with the top-k highest probabilities based on each candidate's reward score. Specifically, at timestep t, re-weighting is done by computing

p_t = softmax(z_t + βρ_t)

where z_t ∈ R^k are the top-k largest logits output by the language model at timestep t, β ∈ R is a scaling hyperparameter (with higher β corresponding to more intense steering), and ρ_t ∈ [0, 1]^k are the reward values for the k sequences corresponding to appending each of the top-k tokens.

Algorithm 1 Reward-Augmented Decoding
Input: f_θ, a neural network language model (outputs logits); g_λ, a neural network reward model (outputs a reward score); X, a generation prefix
1: x_t ← none
2: while x_t ≠ <EOS> do
3:   z_t ← top-k logits output by f_θ(X)
4:   ρ_t ← rewards from g_λ for each candidate sequence formed by appending a top-k token to X
5:   x_t ∼ softmax(z_t + βρ_t)
6:   X ← [X; x_t]
Output: generated text X steered towards higher rewards
Adding βρ_t to the top-k logits and renormalizing with softmax is equivalent to reweighting the top-k probabilities by a factor of e^{βρ_t}. Consequently, RAD effectively rescales the probabilities of the top-k tokens in accordance with their relative difference in reward. Algorithm 1 provides an overview of the decoding process.
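Putting the pieces together, one RAD decoding step can be sketched as follows (our illustration; lm stands in for f_θ, and score_candidates, as sketched in section 2.1, stands in for g_λ):

import torch

def rad_step(lm, score_candidates, input_ids, k=20, beta=50.0):
    logits = lm(input_ids).logits[0, -1]        # next-token logits from f_θ
    z, top_ids = torch.topk(logits, k)          # top-k logits z_t and token ids
    rho = score_candidates(input_ids, top_ids)  # rewards ρ_t in [0, 1]^k from g_λ
    p = torch.softmax(z + beta * rho, dim=-1)   # p_t = softmax(z_t + βρ_t)
    next_id = top_ids[torch.multinomial(p, 1)]  # sample the next token from p_t
    return torch.cat([input_ids, next_id.view(1, 1)], dim=1)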

Experiments
We now evaluate RAD's performance in two standard settings: preventing language models from generating toxic text (Wallace et al., 2019; Gehman et al., 2020) and controlling the sentiment of generated text (Li et al., 2018; Sudhakar et al., 2019).
Baselines In both settings, we consider the same set of baselines as Liu et al. (2021), namely: the performance of the base language model itself without any interventions; PPLM (Dathathri et al., 2020), which uses a bag-of-words classifier to update LM hidden states during decoding; GeDi (Krause et al., 2021) and DExperts (Liu et al., 2021), which use signals from auxiliary language models to modify LM probabilities in one pass; Rectification (Cao et al., 2023), which adjusts LM probabilities proportional to the risk of resulting in a toxic generation; DAPT (Gururangan et al., 2020), which further trains the model on data that has the desired property; PPO (Schulman et al., 2017), which updates the LM with gradients from the reward model; Quark (Lu et al., 2022), which performs parameter-efficient fine-tuning on attribute-annotated data (Lester et al., 2021; Li and Liang, 2021); and CTRL (Keskar et al., 2019), a language model trained to condition on control codes. Unless otherwise mentioned, we report results directly from Liu et al. (2021) and Lu et al. (2022), which can be consulted for further baseline details.

Detoxification
Experimental Setup. We closely follow past work (Liu et al., 2021) and use RAD to detoxify generations from GPT-2 Large after conditioning on prompts from the 10K non-toxic subset of RealToxicityPrompts (Gehman et al., 2020), sampling 25 continuations per prompt.

Evaluation Metrics. We report the Average Max Toxicity, i.e. the expected maximum toxicity score of the 25 continuations evaluated by the Perspective API, and the Toxic Rate, i.e. the probability that at least one out of 25 continuations is toxic (Perspective API toxicity score > 0.5). Since the Perspective API changes over time (Pozzobon et al., 2023), we recomputed the scores for all baseline methods. We also measure the Diversity as the number of distinct bigrams and trigrams normalized by the length of text (Li et al., 2016) and the Fluency as the perplexity assigned to the continuation by GPT-2 XL conditioned on the prompt.
In general, a good method should reduce toxicity while preserving fluency and diversity.
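For reference, the Diversity metric as described can be computed with a simple sketch like the following (our illustration; the toxicity metrics require the Perspective API and are omitted):

def distinct_n(tokens, n):
    # Number of distinct n-grams normalized by the length of the text
    # (Li et al., 2016); Diversity reports this for n = 2 and n = 3.
    if len(tokens) < n:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)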
Results. As shown in fig. 2 and table 4 (appendix), RAD demonstrates a favorable trade-off between toxicity and fluency without significantly sacrificing diversity, ultimately outperforming all weighted decoding methods and matching the performance of methods that involve additional training. Moreover, RAD achieves the lowest Average Max Toxicity of any method. Our results further demonstrate that RAD provides an intuitive means to effectively trade off toxicity and fluency by tuning β.

Sentiment-Controlled Generation
Experimental Setup. Following past work (Li et al., 2018; Sudhakar et al., 2019; Liu et al., 2021), we use RAD to steer GPT-2 Large's generation to be either positive or negative in sentiment when prompted with negative/positive or neutral prompts. Specifically, we evaluate on 2.5K negative, 5K neutral, and 2.5K positive prompts from OpenWebText (Gokaslan and Cohen, 2019). For RAD's reward model, we fine-tune GPT-2 Small on millions of product and movie reviews from Amazon Polarity and SST-2 (Socher et al., 2013).
Evaluation Metrics. We sample 25 continuations for each prompt and compute the average Positive Rate as measured by the HuggingFace text-classification pipeline (a DistilBERT model fine-tuned on SST-2). We also report the Diversity and Fluency as introduced above.
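Concretely, the Positive Rate can be computed with the standard HuggingFace pipeline, whose default text-classification model is a DistilBERT fine-tuned on SST-2 (a sketch; batching and truncation details are omitted):

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def positive_rate(continuations):
    # Fraction of sampled continuations classified as positive.
    results = classifier(continuations)
    return sum(r["label"] == "POSITIVE" for r in results) / len(results)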
Results. As seen in fig. 3 and table 5 (appendix), RAD attains a better fluency/positivity trade-off (when conditioning on negative or neutral prompts) than any other weighted decoding method and achieves comparable performance to the state-of-the-art methods involving training (Quark and PPO), which both make use of the evaluation model (a DistilBERT model fine-tuned on SST-2) during training. Tuning β effectively trades off fluency and alignment, again enabling RAD to produce the best attribute scores. Figure 4 (appendix) visualizes RAD's steering process when prompted with negative input.

Scaling the Language Model
In all prior experiments, we followed past work and considered using GPT-2 Large as the base language model. Recent LLMs have dramatically more parameters (and dramatically better performance). To test RAD in more realistic settings, we apply RAD to the state-of-the-art LLaMA models (Touvron et al., 2023) in the detoxification setting (see appendix D.3 for details). While RAD and other weighted decoding methods increase costs significantly when the size of the language model and reward model are similar, the additional expense of using RAD is only about 3% when using LLaMA 65B as the language model and GPT-2 Small as the reward model. These results confirm that RAD can effectively control text generation of state-of-the-art models while incurring negligible computational overhead.

Conclusion and Future Work
In this paper, we propose RAD, a simple weighted decoding method for controlling text generation that uses a unidirectional reward model to minimize computational costs. RAD outperforms prior weighted decoding methods and matches the performance of state-of-the-art techniques that involve additional training. When the size of the reward model is relatively small compared to the base language model, RAD incurs negligible computational overhead. In future work, we are interested in applying RAD to more sophisticated tasks, such as encouraging language models to follow instructions (Ouyang et al., 2022).

Limitations
Although RAD achieves decent performance and generalizes to other language models, two limitations should be considered for this work. Firstly, RAD incurs additional compute and memory costs that are linear in k. As mentioned in section 2.1, we reduce the time complexity from O(km²) to O(km) by reusing previously computed representations in the decoder reward model. Yet, tracking and copying past_key_values takes up a certain amount of GPU memory, which reduces decoding throughput. Secondly, our experiments on toxicity and sentiment explore only some of RAD's capabilities. More text generation tasks should be studied in order to form a comprehensive evaluation of RAD.

Ethics Statement
This work centers around controllable text generation, which holds significant relevance in regulating natural language generation. For example, the detoxification task aims to mitigate the toxicity present in text generated by pre-trained language models. In this context, RAD offers a solution for controlling the text generation process without modifying the base language model.

A Notations
Refer to table 2 for notations used in the paper.

B RAD Training Details

B.1 Detoxification
We train a GPT-2 Small reward model on the Jigsaw Unintended Bias in Toxicity Classification dataset for 5 epochs. We use learning rate = 1e−5, weight decay = 0.01, and batch size = 100. The reward model achieves a final squared error of 0.0147 on the test public leaderboard subset.

B.2 Sentiment-Controlled Generation
We first train the reward model on Amazon Polarity for 5 epochs, with learning rate = 1e−5, weight decay = 0.01, and batch size = 100. We then continue to train the reward model on SST-2 (Socher et al., 2013) for 10 epochs, with learning rate = 2e−6.

C Computational Costs
C.1 RAD

In the paper, we use GPT-2 Small (124M) as RAD's reward model and replace the lm_head layer with a linear layer with one output for predicting the reward. Following the approximations in Kaplan et al. (2020), a forward pass requires

C_forward ≈ 2N + 2 n_layer n_ctx d_model

FLOPs per token, where N is the number of non-embedding parameters, n_layer is the number of layers, d_model is the hidden size, and n_ctx is the number of context tokens. With the embedding operation costing 4 d_model and reward prediction costing 2 d_model FLOPs, the number of FLOPs needed per token in the reward model is

C_RM ≈ 2N + 2 n_layer n_ctx d_model + 6 d_model.

Notice that the context-dependent computational cost per token is a small portion of the total compute, as d_model > n_ctx/12 is often true in practice (Kaplan et al., 2020). In fact, in the detoxification and sentiment-controlled generation experiments, n_ctx is consistently below 50. Thus, it is safe to assume C ≈ 2N per token for both the language model and the reward model. The reward model evaluates k candidate sequences at each decoding step, which requires kC_RM FLOPs in total. Assuming k = 20, C_RAD = kC_RM; table 3 shows the estimated FLOPs per token of the reward model and various language models.
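To make the arithmetic concrete, a small sketch of the overhead estimate under the C ≈ 2N approximation (the parameter counts used here are total rather than non-embedding counts, so the ratio is a rough estimate):

def flops_per_token(n_params):
    # Kaplan et al. (2020) approximation: C ≈ 2N FLOPs per token.
    return 2 * n_params

C_RM = flops_per_token(124e6)   # GPT-2 Small reward model
C_RAD = 20 * C_RM               # k = 20 candidates scored per decoding step
C_LM = flops_per_token(65e9)    # LLaMA 65B base language model
print(f"RAD overhead: {C_RAD / C_LM:.1%}")  # ~3.8%, in line with the ~3% reported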

C.2 Comparison
We continue to explore the computational cost of baseline methods based on the methodology in appendix C.1. Define C_method as the additional cost a method incurs during decoding and TC_method×LM as the total cost in FLOPs for every token generated using the method with a specific LM at test time. In general, retraining methods (DAPT, PPO, and Quark) have C_method = 0 and TC_method×LM = C_LM.
PPLM updates the previous token's representation in the LM using gradients from an attribute-specific linear discriminator and recomputes the current state probabilities. Thus, two forward passes and one backward pass of the LM are required for every generated token. As a backward pass has roughly twice the number of matrix multiplications as a forward pass (Kaplan et al., 2020), PPLM incurs an additional decoding cost of C_PPLM = 3C_LM. GeDi and DExperts take a similar approach, using two opposite discriminator/expert models to produce classification probabilities and then rescale the LM probabilities; thus, two additional forward passes of the expert model are needed. For GeDi,

D.2 Sentiment-Controlled Generation
The sentiment-controlled generation results are presented in table 5.

D.3 Scaling the Language Model
Following previous experiments, we use nucleus sampling with p = 0.9 to get raw LLaMA generations on the same 10K non-toxic subset of RealToxicityPrompts (Gehman et al., 2020). For each model size, we apply RAD with k = 20 and β ranging from 20 to 500. Results are shown in table 6.
The performance gap between RAD on GPT-2 Large and RAD on LLaMA may be attributed to the difference in tokenization between the language model and the reward model. Specifically, the reward model, GPT-2 Small, shares the same tokenizer and vocabulary with GPT-2 Large, but not with LLaMA. As a result, a given text sequence can be tokenized into different token sequences, which, during decoding, can mislead the reward model into giving distorted scores. Therefore, we believe a smaller model from the same family as the base LM may be the best choice for RAD's reward model.

E Generated Examples
Examples of detoxification and sentiment-controlled generation from each method are presented in tables 7 and 8.

Figure 2: RAD outperforms all weighted decoding methods (round points • in the graph) and matches methods that involve additional training.

Figure 3: RAD achieves the highest positive rate for negative prompts and outperforms all weighted decoding methods.

Table 2: Notations used in this paper. We use two notations, r and ρ, to differentiate the reward model's output at train time and at test time.

Table 3: Model specifications and FLOPs per token.
To account for changes in the Perspective API over time, we ensure fair comparison by evaluating all model outputs (except for PPO and Quark, see below) using the most up-to-date API. Queries were made between May and June 2023. As PPO and Quark directly optimize the language model with the Perspective API score during training, a change in the API model would lead to a different optimized model. For PPO and Quark, we adopt the values reported in Lu et al. (2022). Full results are shown in table 4.