Reward Gaming in Conditional Text Generation

To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (NLG) community using concrete conditional text generation examples and discuss potential fixes and areas for future work.


Introduction
Natural language generation aims to automatically produce text that is fluent, relevant, and factual. To train text generators whose outputs are aligned with desired behaviors, recent work has used rewards learned from human annotations, such as improving the quality of generated summaries using learned saliency and faithfulness metrics (Pasunuru and Bansal, 2018) and using rewards based on learned question answering systems (Gunasekara et al., 2021); the recent ChatGPT model also uses an approach in this class. In general, this class of methods (1) collects a human annotation dataset D reward consisting of, e.g., direct ratings of generations (Sellam et al., 2020; Nakatani et al., 2022; Ramamurthy et al., 2023), labels of error spans in the generations (Amrhein and Sennrich, 2022), or pairwise comparisons of generations, usually given the same source sequence (Stiennon et al., 2020; Wu et al., 2021; Bai et al., 2022a); (2) learns a proxy reward function (as opposed to a true reward function, which may not be accessible in practice) that scores generations using D reward; and then (3) trains the text generator on a dataset D task using RL with the learned reward function.
What could go wrong when we obtain the reward signal from humans? Learned rewards are rarely robust. When training the text generator, the distribution induced by the policy (i.e., the generator) changes as we update it, which opens up opportunities for exploiting errors in the reward. Thus, even if the reward function performs well as an evaluator on the dev/test split of D reward, the reward can still be gamed during RL training of the generator. Reward gaming commonly refers to the phenomenon in which the proxy reward increases while the true reward decreases or stays flat (Amodei et al., 2016; Skalse et al., 2022). In this discussion, and in the context of NLP, we use "reward gaming" to refer broadly to the phenomenon that, as training progresses, models produce incorrect generations that exhibit undesirable patterns while converging to high rewards.
Reward gaming can happen when an undesirable pattern is associated with a high reward in the learned metric. We identify three ways this phenomenon can happen. (1) A group of examples is misannotated systematically. For instance, suppose we train a model to negotiate effectively and annotators carelessly label all long paragraphs as effective; then the reward model would assign high scores to long generations even if they are nonsensical, and the generator would subsequently exploit this pattern. (2) D reward contains some bias due to the data we select to annotate or the people we select as annotators. An example of the former: suppose every translation that contains "united nations" happens to have high quality/reward, possibly due to the way we collect D reward; then the neural machine translation model may end up almost always generating the phrase surrounded by gibberish. An example of the latter: due to the selection bias of annotators, certain language varieties may be rated higher (or lower), even if the language variety itself is not an indicator of quality (Plank, 2016; Sap et al., 2019); subsequently, the generator could learn to favor generating sentences of certain language varieties over others. (3) D reward does not cover certain groups of sentences. A quick example is a dialogue agent trained to negotiate that generates incomprehensible sentences, because those sentences are underspecified by the reward function (Lewis et al., 2017).
In short, among these three cases, the first two cases induce spurious correlations between the undesirable pattern and the reward, and the third case induces underspecified behavior on uncovered examples.
We use synthetic and real-world examples to illustrate the above three cases: even if the learned reward achieves a good performance on D reward , high rewards can still be assigned to undesirable patterns, and these patterns get amplified during RL training of the generators. For instance, a synthetic experiment discussed later ( §4.1) shows that even a reward function that gives the correct reward on 99.3% of the test split of D reward can lead to generation failure.
We also review potential fixes ( §5), including restricting the policy (e.g., via maximum likelihood regularization, which is commonly used in recent work such as Stiennon et al. (2020) and Ramamurthy et al. (2023)) and fixing the reward itself (e.g., by iteratively collecting human annotations). In light of these observations, we would like to bring more attention to reward gaming in the natural language generation community. Leveraging learned metrics during RL is a promising approach to training aligned text generation systems. But given that the rewards can only reliably improve generators when the sampled texts are within the distribution of D reward, extra caution is needed when interpreting results from training text generators with learned rewards: quality control or manual inspection is required to ensure good generation quality.

Related Work
Reward gaming or similar ideas have been discussed since Goodhart (1975); more recently, they are discussed extensively in Amodei et al. (2016). In this discussion, we avoid the term "reward hacking" because reward tampering (Everitt et al., 2021), i.e., actively changing the reward (for example, by executing reward-modifying code under certain circumstances in a video game), is also reward hacking but is not the topic of our discussion.
Many examples of reward gaming have been documented, usually in gameplay or autonomous driving. For example, in a boat racing game in Amodei et al., the boat hits objects in circles mid-way through the race instead of completing it, because the reward increases faster by hitting a certain set of objects than by finishing the race; Baker et al. (2020) find that the reward is gamed in a hide-and-seek game, where one behavior is that hiders trap themselves using walls and boxes so the seeker never reaches them; and the reward can be gamed in tic-tac-toe by making specific moves that cause the opponent to crash with an out-of-memory error and forfeit (Lehman et al., 2020). Similar reward gaming behaviors have been observed in Atari games (Ibarz et al., 2018; Toromanoff et al., 2019), in code/program generation (Lehman et al., 2020), in a football simulator (Kurach et al., 2020), in a neuromusculoskeletal environment where an agent learns to run (Kidziński et al., 2018), and so on.
Reward gaming is rarely concretely discussed in conditional text generation. A quick example by Lewis et al. (2017) and Kenton et al. (2021) is that a dialogue agent trained to do successful negotiation ends up generating nonsensical sentences, because those generations are underspecified by the reward function that is used to train the dialogue model.
Recently, there have been two findings that indicate the seriousness of reward gaming, albeit not in the context of NLP. First, more capable models may exacerbate reward gaming: Pan et al. (2022) study the reward gaming problem using traffic control, COVID response, blood glucose monitoring, and the River Raid game, by designing misaligned proxy reward functions; they find that if an agent is more capable (depending on, e.g., model size, the number of training steps), then it is better at exploiting loopholes in the reward function, and therefore ends up with a lower true reward compared to a less capable model.
More recently, Skalse et al. (2022) suggest a strict definition of the hackability of a pair of reward functions, where "a pair" can be understood as an original reward and a proxy reward.[1] They show that, in theory, no non-trivial pair of unhackable reward functions exists. The question then becomes whether it is safe to use a proxy reward function empirically.
In this discussion, we aim to demonstrate the effect of reward gaming in text generation using concrete examples. Our discussion differs from the aforementioned examples in two main ways: we focus on conditional text generation, and we investigate the categories of reward gaming that arise when the reward signal is learned from human annotations.

Background
Conditional text generation systems usually model p(y | x), where x = (x_1, ..., x_{T_s}) is a source sequence and y = (y_1, ..., y_T) is a target sequence. Most models use an autoregressive factorization: log p(y | x) = Σ_{t=1}^{T} log p_θ(y_t | y_{<t}, x), where y_{<t} = (y_1, ..., y_{t−1}) and p_θ is parameterized with a neural network. Maximum likelihood estimation (MLE) leads to mismatched train/test histories and objectives during sequence generation (Bengio et al., 2015; Huszár, 2015; Ranzato et al., 2016; Schmidt, 2019; Pang and He, 2021; Arora et al., 2022). In addition, recent work aims to better align training objectives with human-annotated quality of generated texts (e.g., translation quality judgments, summarization faithfulness, human preference of generations).
The generation process can be considered a sequential decision-making process suitable for RL. Given state s_t = (x, y_{<t}), the policy π_θ (i.e., p_θ) takes action a_t (a token in the vocabulary), transitions to the next state s_{t+1}, and receives a reward r_t ∈ R learned from human annotations. Assume discount factor γ = 1. To maximize the objective J(θ) = E_{τ∼π_θ}[R(x, y)], where R(x, y) = Σ_{t=1}^{T} r_t, one approach is the policy gradient (REINFORCE; Williams, 1992; Sutton et al., 1999):

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_{t=1}^{T} ∇_θ log π_θ(a_t | s_t) Q̂(s_t, a_t) ],

where Q̂(s_t, a_t) = Σ_{t'=t}^{T} r_{t'} is the estimated return. Our work uses REINFORCE with advantage estimation and value function fitting, introduced in the appendix. Recently, proximal policy optimization (PPO; Schulman et al., 2017) has also been widely used. It aims to avoid reward performance collapse, but we argue that the choice of algorithm that makes generations achieve high rewards is orthogonal to the issue that high rewards can correspond to undesirable generations.

[1] In short, reward functions r_1, r_2 are hackable w.r.t. a policy set and an environment if there exist policies π, π′ such that J_1(π) < J_1(π′) but J_2(π) > J_2(π′), where J_i denotes the expected return corresponding to reward function r_i. See Definition 1 in Skalse et al. (2022) for details.
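As a concrete sketch (simplified and illustrative; function names are ours, not from any library), the estimated return Q̂(s_t, a_t) and the REINFORCE surrogate loss can be computed from per-token log-probabilities and rewards as follows:

```python
def returns_to_go(rewards):
    """Estimated return Q_hat(s_t, a_t) = sum of rewards from t to T (gamma = 1)."""
    q = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q[t] = running
    return q

def reinforce_loss(token_logprobs, rewards):
    """Surrogate loss whose gradient is the REINFORCE policy gradient:
    -sum_t log pi_theta(a_t | s_t) * Q_hat(s_t, a_t)."""
    q = returns_to_go(rewards)
    return -sum(lp * qt for lp, qt in zip(token_logprobs, q))
```

In practice, Q̂ is replaced by an advantage estimate with a fitted value function, as described in the appendix.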
To stabilize RL training, we initialize the model in each run from an MLE-trained model to ensure a good starting point for RL optimization. In addition, we use KL regularization, which helps RL optimization (Jaques et al., 2019; Stiennon et al., 2020; Ramamurthy et al., 2023), by penalizing each token's log-ratio against p_MLE, the model trained using standard MLE: r'_t = r_t − β log(π_θ(a_t | s_t) / p_MLE(a_t | s_t)). To demonstrate reward gaming behaviors, we tune β to achieve the highest validation reward unless explicitly mentioned. A larger (but not too large) β likely leads to a higher true reward (Gao et al., 2022), but β is hard to tune, and in some examples (e.g., §4.3) even a large β does not eliminate undesirable behaviors. We discuss ML regularization as a remedy in §5.
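One common token-level form of this KL regularization (following, e.g., Stiennon et al. (2020); a sketch, not necessarily our exact implementation) subtracts a scaled log-ratio between the policy and the MLE model from each token's reward:

```python
def kl_shaped_rewards(rewards, logp_policy, logp_mle, beta):
    """Per-token KL-regularized rewards:
    r'_t = r_t - beta * (log pi_theta(a_t|s_t) - log p_MLE(a_t|s_t))."""
    return [r - beta * (lp - lm)
            for r, lp, lm in zip(rewards, logp_policy, logp_mle)]
```

When the policy assigns a token much more probability than the MLE model does, the shaped reward is reduced, discouraging drift away from the MLE distribution.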

Examples of Reward Gaming in Conditional Text Generation
As a reminder, we consider the class of conditional text generation learning algorithms where we: (1) have a human annotation dataset D reward ; (2) use this dataset to train a reward function f φ that scores generations; (3) learn the text generator on a dataset D task , using RL with the learned reward function.
Reward gaming happens when some undesirable pattern is associated with a high reward. We identify three such scenarios: (1) spurious correlation due to annotation errors; (2) naturally occurring spurious correlation; (3) underspecified behavior in the reward function due to covariate shift.
We use both synthetic and real-world tasks to demonstrate the reward gaming behavior. The full experimental details can be found in the appendix.
For synthetic tasks, we simulate all three settings using the following framework: we adapt Sudoku as a conditional text generation task.[2] A valid Sudoku is a 9x9 grid with each cell containing a number from 1 to 9, such that no row, no column, and none of the nine non-overlapping 3x3 regions contains duplicates. For this task, let the input be the first k cells (k randomly chosen from 36 to 80) of a valid Sudoku after flattening it row by row, and let the reference output be the remaining 81 − k cells. The goal is to generate a continuation that forms a valid Sudoku given the prefix (i.e., the first k cells). To measure generation quality, we define the success rate as the percentage of generations that result in valid Sudokus.
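The validity check underlying the success rate can be implemented directly (a minimal sketch; the grid is a flat, row-major list of 81 cells):

```python
def is_valid_sudoku(cells):
    """cells: flat list of 81 ints in 1..9, row-major."""
    if len(cells) != 81 or any(c not in range(1, 10) for c in cells):
        return False
    grid = [cells[9 * r:9 * (r + 1)] for r in range(9)]
    for i in range(9):
        row = grid[i]
        col = [grid[r][i] for r in range(9)]
        # i-th 3x3 box, scanned left-to-right, top-to-bottom
        box = [grid[3 * (i // 3) + r][3 * (i % 3) + c]
               for r in range(3) for c in range(3)]
        if len(set(row)) < 9 or len(set(col)) < 9 or len(set(box)) < 9:
            return False
    return True

def success_rate(generations):
    """Fraction of full 81-cell sequences (prefix + continuation) that are valid."""
    return sum(is_valid_sudoku(g) for g in generations) / len(generations)
```

This ground-truth checker is what makes the Sudoku framework convenient: the true reward is always computable, independent of the learned reward function.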
While the sequence generator can be rule-based without using neural nets in this synthetic setting, to illustrate reward gaming, we consider learning the generator from a learned reward function.

Noise-Induced Spurious Patterns
We want to study settings where there is noise in human annotations. If we inject a small number of high-reward but low-quality examples into D reward, the reward function may incorrectly assign high rewards to them. Concretely, we inject invalid Sudokus that end with 7, labeled as valid, into D reward.[3] This design simulates systematic errors in human annotation, e.g., a group of sentences on rare topics getting mislabeled.
The reward is the probability of the Sudoku being valid, estimated by a classifier f φ. f φ, based on a tiny RoBERTa ( §B.1), achieves 99.3% accuracy on the i.i.d. test split of D reward. But it incorrectly predicts all of 1000 randomly sampled invalid Sudokus ending with 7 to be valid. This happens because the reward function makes wrong predictions on those examples, which represent only a small portion of the dataset used to train it. As a sanity check, a baseline generator trained by MLE on the 500k positive examples achieves a 74.7% success rate in spite of the noise. However, the RL-trained generator produces a large fraction (~80%) of invalid generations that end in 7 despite achieving a high reward. Figure 1 shows that the reward increases to above 0.8 (a large reward given the range [0, 1]) and the proportion of Sudokus ending with 7 oscillates around 80%; in contrast, only 0.1% of the correct reference generations end with 7.

[2] Controlling spurious correlations in the reward is difficult in experiments on real-world generation tasks. Therefore, we rely on the Sudoku framework, which has all the key elements we need for such experiments: (1) it is a conditional generation task (the model must learn the relation between the input and the output); (2) it has clearly defined ground-truth rewards, which enable easy evaluation; (3) it allows easy manipulation of spurious correlations in the reward function. Thus, we use the Sudoku experiments to show that reward gaming exists in conditional generation and that its effect can be severe.

[3] For positive examples, we first create a set of 2M valid Sudokus and then sample from it. Many negative examples are small modifications of positive examples ( §B.1), to ensure a high-quality f φ.
Additionally, given a reward of 0.85 in the figure, we would expect around 85% of generations to be valid; however, the proportion of valid generations is always below 15% throughout training.
In short, in this specific example, even 0.05% of noise in D reward could lead to generation failure, as the RL training of the generation model amplifies the failure mode.
Experimental details for the above example. The RoBERTa-tiny-based (Liu et al., 2019) reward function has 4 encoder layers and 2 attention heads; the encoder embedding dimension is 64, and the FFN dimension is 256. For the sequence generator, we use a smaller version of the transformer_iwslt_de_en architecture in fairseq. The encoder and decoder embedding dimensions are both 32, with 2 attention heads and an FFN dimension of 64 in both the encoder and the decoder. There are 2 encoder layers and 2 decoder layers. Please refer to the appendix for more details.

Naturally Occurring Spurious Patterns
The spurious correlation is not necessarily noise-induced; it can also occur naturally. Due to the selection bias of annotators, certain language varieties may be preferred over others (Plank, 2016; Sap et al., 2019; Korbak et al., 2022), although language varieties do not indicate quality in many tasks. In addition, due to selection bias in the examples that are annotated, some attributes irrelevant to quality can become correlated with the reward (Wiegreffe and Marasovic, 2021; Pezeshkpour et al., 2022). If high rewards are assigned to these spurious patterns (e.g., generation length, specific proper nouns in the generation, certain language varieties over others), text generation models may exploit them.

Synthetic example. We construct a biased D reward for the Sudoku task in which a simple feature (that the last nine numbers of the output do not repeat) is predictive of the reward (validity). Repetitions co-occur with 99.9% of negative examples, so non-repetition is a highly predictive feature of the reward.
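The biased feature itself is trivial to compute, which is exactly what makes it an attractive shortcut for the reward model (an illustrative sketch):

```python
def last_nine_distinct(cells):
    """Spurious feature: the last nine numbers of the output contain no repeats."""
    tail = cells[-9:]
    return len(set(tail)) == len(tail)
```

A reward model can latch onto this shortcut instead of checking full Sudoku validity, since on the biased D reward the two signals almost always agree.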
The reward function f φ achieves 99.9% accuracy on the test split of D reward. We then train the conditional text generation model using RL with f φ as the reward. Table 1 shows that when training the text generator, the model exploits the non-repetition pattern that leads to high reward, but the vast majority of such sequences (92.8%) are in fact incorrect.

[Figure 2, right: mean % of sampled sequences that contain "..." vs. training step. During training, the total (sequence-level) reward increases; the reward for "..." stays close to one; and the % of sampled generations containing "..." increases to over 3/4.]
Real-world example: machine translation (MT) using dense reward. The WMT MQM dataset (Freitag et al., 2021a) is a high-quality human annotation dataset for translations: each Zh-En translation is annotated with up to 5 of the most serious error spans by expert annotators according to the MQM framework (Lommel et al., 2014), and each span is labeled as no error, minor error, or major error. An example annotation in D reward is as follows: "state-owned enterprises and <major> advantageous </major> private enterprises entered the <major> revolutionary base area </major> <major> of </major> <minor> south ji@@ ang@@ xi </minor> ." Major errors appear between the "major" tags and minor errors between the "minor" tags. The source sentences of the MQM annotations come from the WMT Chinese-to-English (Zh-En) sets newstest2020 and newstest2021 (Mathur et al., 2020; Barrault et al., 2020; Akhbardeh et al., 2021), as well as TED talks from WMT2021 (Freitag et al., 2021b). Translations are collected from participating systems in the WMT shared tasks, and human-written references are also included in the annotation set.

We aim to learn a metric that judges the quality of each word and then train an MT model with the learned metric. f φ is a scorer that predicts whether each token in a given translation lies in a no-error span; let the reward r_t be the score that f φ outputs at time-step t. Our key observation is that certain tokens are spuriously correlated with no-error annotations in the dataset. The ellipsis punctuation ("...") is one of them: experts annotated 98.3% of its occurrences as no-error. Figure 2 shows that during RL training of the MT model on WMT19 Zh-En, the percentage of translations with ellipses increases as training goes on, and the ellipses receive high rewards. Such patterns, however, are undesirable.
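To make the dense-reward setup concrete, here is a sketch of turning an MQM-style annotation (with whitespace-separated <major>/<minor> tags, as in the example above) into per-token error labels; the exact preprocessing in our experiments may differ:

```python
def token_error_labels(annotated):
    """Map an MQM-style annotated translation to one label per (real) token."""
    labels, severity = [], None
    for tok in annotated.split():
        if tok in ("<major>", "<minor>"):
            severity = tok.strip("<>")   # entering an error span
        elif tok in ("</major>", "</minor>"):
            severity = None              # leaving the error span
        else:
            labels.append(severity or "no-error")
    return labels
```

Tokens labeled "no-error" become positive training targets for f φ, which is how a frequently-unflagged token like "..." can acquire a spuriously high reward.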
In fact, in other training runs of f φ and the MT model, we found other tokens that are spuriously correlated with the reward, including "conduct": the token receives a high reward but introduces disfluency into the sentence.[4]

Experimental details on the above examples.
For the Sudoku experiment, the hyperparameters are selected from the same sets as in §4.1. For the MT experiment, to train the classifier f φ, the model is initialized from a WMT19 Zh-En MLE-trained model. The source sentence is fed into the encoder and the target sentence into the decoder; however, we remove the attention mask in the decoder that prevents hidden states at token t from seeing future hidden states. The reward r_t is the probability, according to f φ, that the t-th token is in a no-error span. For D task, our translation task uses the WMT19 Zh-En dataset, and f φ is fine-tuned from an MLE-trained MT checkpoint on the WMT19 Zh-En dataset. We use a transformer model with 6 encoder layers and 6 decoder layers; the number of attention heads is 8 in both the encoder and the decoder, and the FFN embedding dimension is 2048 in both.

Covariate Shift
During RL training, the policy (i.e., the generator) may sample examples outside the support of the reward model. On such examples, the reward model's behavior is underspecified: it may (or may not) assign high rewards to low-quality generations.

Synthetic example. We design D reward such that the reward model's behavior is undefined for certain inputs: all examples in D reward end with 1, so continuations that end with 2-9 are outside the support of the data used to train the reward function f φ.[5] f φ achieves 96.5% accuracy on the test split of D reward. We sample 1000 in-support (i.e., ending with 1) and 1000 out-of-support (i.e., ending with 2-9) invalid Sudokus. The model misclassifies only 1 of the 1000 in-support examples as valid; in contrast, it misclassifies 659 of the 1000 out-of-support examples as valid.

[4] Example generation 1: "the 66 countries and regions have been able to conduct the evidence in the dissemination of the virus in 2015 ." Example generation 2: "the some parents have been able to conduct the campaign day and the some comments on this matter and the many persons have been able to conduct attention ." The MT model integrates "conduct" into its generations, but the use of "conduct" is incorrect and nonsensical.

[5] Negative examples are obtained by swapping two different tokens of a positive example 1-20 times.
During RL training of the conditional text generation model, the reward for sampled generations increases above 0.8. We would expect this to mean that more than 80% of continuations are estimated to be valid by the reward; however, fewer than 10% of the continuations are actually valid.
Real-world example 1: AgreeSum. One simple example reproduces the multi-document AgreeSum summarization task (Pang et al., 2021). The input is a cluster of articles, and the expected output is a summary that is faithful to every article in the cluster. We consider a D reward consisting of the faithfulness annotations on article-summary pairs provided by the AgreeSum paper. The reward function f φ is a summary-article faithfulness classifier; it achieves 79% dev accuracy, and we use its output as the reward. However, the shortest summary in D reward is 7 tokens long, so the behavior of the reward on shorter summaries is underspecified. Training a summarizer with the faithfulness classifier as the reward leads to short summaries, most of which (>90%) are ≤ 2 tokens. Even though these near-empty summaries can technically be considered entailed by the articles, we have not specified in D reward that such summaries are acceptable.
Real-world example 2: MT using BLEURT. BLEURT (Sellam et al., 2020) is a metric trained on expert annotations provided by the WMT metrics tasks; specifically, WMT15-19 human rating data (Stanojević et al., 2015; Bojar et al., 2016, 2017; Ma et al., 2018, 2019) are used to train BLEURT, and this data contains very few repetitive generations. We train a text generator by RL using BLEURT-20-D3, a distilled version of BLEURT-20, as the reward, on the IWSLT14 De-En task (Cettolo et al., 2014). The MLE-trained model achieves 63.9 BLEURT on the test set and the RL-trained model achieves 65.5, so judged by the increase in BLEURT alone, RL is successful.
Repetitive translations are out-of-support in our case, where repetition is measured by rep, defined as the percentage of 3-grams that have already appeared earlier in the sequence.[6] We find that BLEURT does not punish excessive repetition in the samples during RL: the average BLEURT for translations with rep > 0.4 (i.e., >40% of 3-grams are repetitions; the example in footnote [7] demonstrates that 40% is an undesirably large proportion) in the first 45,000 steps of training[8] is 42.7, and the average BLEURT for translations with rep < 0.2 is 42.3.[9] So the reward does not discourage the MT model from generating repetitions.

Next, we show in Figure 3 that as training goes on, translations become more and more repetitive as BLEURT increases. To summarize, given that repetitive translations are rare in the data used to train BLEURT, the reward is underspecified on them. This repetition pattern is not discouraged by the reward and is thus subsequently exploited by the MT model.

[6] An example of computing rep: the sentence 'a b c e d c e d c d' has rep = 2/5 = 40%, given that among 'e d c,' 'd c e,' 'c e d,' 'e d c,' 'd c d,' two 3-grams are the same as existing ones.

[7] As an example, the following sentence has rep = 0.397: pip was adopted from "great expectations; superman was a foster child; and the azbeth salander," the girl with the dragon tattoo, "was a foster child and a pure man; lyra belacqua from philip pullman," and a foster child, jane eyre, adopted, and roald's james, and the great, and he was a parent, and a parent, and then, "and then, you know," and then, "and then, you know," and then, "and, you know," the "-and, you know," the "-and, you know," the "the" -and "you know," the "

[8] Using β = 0.05, which leads to the best dev BLEURT.

[9] 0.2 is an acceptable threshold, given that 93% of translations whose source sentence length > 180 have rep < 0.2.
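One plausible implementation of the rep statistic counts an n-gram as a repetition if it duplicates any earlier n-gram in the same sequence (the footnote's worked example normalizes slightly differently, so treat this as an illustrative sketch rather than the exact formula used in our experiments):

```python
def rep(tokens, n=3):
    """Fraction of n-grams that duplicate an n-gram seen earlier in the sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    seen, repeats = set(), 0
    for g in grams:
        if g in seen:
            repeats += 1
        else:
            seen.add(g)
    return repeats / len(grams)
```

Such a statistic can be logged during RL training to detect repetition-based reward gaming before it dominates the samples.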

Experimental details for the above examples.
For AgreeSum, we use the newspaper3k library to retrieve the articles corresponding to the URLs in the original dataset. The reward function (classifier) is based on RoBERTa-large, and the summarizer is based on BART-large (Lewis et al., 2020). For the MT experiment, the MT model has an embedding dimension of 512 and an FFN embedding dimension of 1024 in both the encoder and the decoder; both the encoder and the decoder have 4 attention heads and 6 layers. More details can be found in the appendix.

Possible Remedies
As discussed in §2, Skalse et al. (2022) show that a non-trivial pair of unhackable original and proxy reward functions does not exist in theory. When, then, is it safe to use a proxy reward function? While this remains an open question, the following approaches can reduce the extent to which undesirable sentences are generated.
The fundamental problem is that errors in the reward functions, specifically the over-confident errors where low-quality outputs have high rewards, can be exploited during RL training of text generators. Thus, one solution is to avoid OOD states that incur such errors by restricting the policy.
Restricting the policy by regularizing toward the ML solution. A common strategy is to regularize toward the ML solution. In practice, we can interpolate RL and ML losses (Wu et al., 2016), interleave RL and ML updates (Lewis et al., 2017; Guo et al., 2018), or use KL-regularized RL (Jaques et al., 2019; Stiennon et al., 2020; Ramamurthy et al., 2023). Here are a few potential issues. First, RL exploration can be important when the reference dataset is small and the ML solution is consequently sub-optimal. In such tasks, it is often easier to verify or rate a generation than to provide a reference generation. For example, in AgreeSum, there are not enough reference summaries due to data collection costs, but given a decent article-summary faithfulness classifier, we can discover new summaries that have high rewards. Similarly, in creative generation tasks like story generation and textual style transfer, or in code generation, there may not be a large enough high-quality reference dataset, but a reward function is often available. Second, the ML solution may not be optimal even with an adequately large reference dataset; e.g., degeneracies like unreasonably short translations (Stahlberg and Byrne, 2019; Kulikov et al., 2022) and repetitive generations (Welleck et al., 2020b,a; Chiang and Chen, 2021) may often have high probabilities. Third, by relying on ML, we are optimizing toward a different objective; thus, we may need another automatic evaluator (instead of the proxy reward) for hyperparameter tuning and model selection. Additionally, Gao et al. (2022) discovered that larger coefficients for the KL penalty (for ML regularization) do not improve the frontier of the curve of gold reward-model score vs. KL divergence between the RL-optimized model and the ML model (see Figure 9 of Gao et al., 2022).

Restricting the policy by leveraging a discriminator. Following Goodfellow et al. (2014) and Pang et al. (2021), another idea similar to ML regularization is to leverage a discriminator that distinguishes between sampled generations and dataset-provided generations.[12] During RL training, we force the model to produce generations that are indistinguishable from references according to the discriminator; discriminator and RL updates are interleaved. It is difficult to use GANs to train a high-quality text generator, but we hypothesize that the discriminator can reduce easy-to-identify low-quality examples during RL training.

Fixing the reward itself. Another thread of remedies is to fix the reward itself. An effective approach is to iteratively collect human annotations (Stiennon et al., 2020; Bai et al., 2022a; Fan et al., 2022): the reward is iteratively updated with human annotations on the latest model generations, so generations with low human preference but high reward are corrected through this iterative process. One concern is the cost, which may prohibit an adequate number or frequency of iterations. Krakovna et al. (2020) discuss the possibility that a machine can learn to fool human evaluators in robotics, but it is unclear what the equivalent is in conditional text generation. So far, this approach has been successful, under the critical assumption that there is little budget/resource constraint on obtaining enough annotations and enough iterations of annotation.

[12] The discriminator predicts whether the generation is machine-generated or comes from the set of references. This technique is useful when there are only few parallel datapoints.
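As a minimal sketch of the discriminator idea above (hypothetical; disc_prob_real is an assumed discriminator output, the probability that a sample comes from the reference set, and the gating scheme is one possible design rather than our implementation):

```python
def gated_reward(learned_reward, disc_prob_real, threshold=0.5, penalty=0.0):
    """Override the learned reward when the discriminator confidently flags
    a sample as machine-generated, limiting exploitation of reward errors
    on easy-to-identify low-quality generations."""
    return learned_reward if disc_prob_real >= threshold else penalty
```

The generator then only collects the learned reward on samples that pass the discriminator, so degenerate outputs that are easy to spot cannot accumulate high proxy reward.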
Another caveat is that, as Perez et al. (2022) recently discovered, RL with human feedback may amplify one-sided views (e.g., on political issues); they attribute this phenomenon to the selection bias of annotators, which leads to an unrepresentative reward. If selection bias is unavoidable (and therefore the unrepresentative reward is unavoidable), we may need other ways of fixing the reward and preventing the generation model from amplifying the bias, e.g., by hard-coding a set of principles as in Constitutional AI (Bai et al., 2022b).
More discussion. An additional method in the RL literature is conservative Q-learning; it aims to push down all high rewards to ensure that out-of-distribution states do not achieve high Q values, but the approach requires extensive hyperparameter tuning (Zheng et al., 2022). Another possibility for avoiding the reward gaming issue is to simply avoid interaction with the environment, using methods like Pang and He (2021) to learn from demonstrations, so that errors in the reward function are less likely to be exploited.
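The conservative Q-learning regularizer mentioned above can be illustrated with a minimal sketch (our simplification of the penalty term that is added to the usual TD loss; `alpha` is the trade-off hyperparameter whose tuning the text refers to):

```python
import math

def cql_penalty(q_values, data_action, alpha=1.0):
    """Sketch of the conservative Q-learning regularizer: push down Q-values
    on all actions via a logsumexp while pushing up the Q-value of the action
    observed in the data, so that out-of-distribution actions cannot retain
    spuriously high Q-values."""
    logsumexp = math.log(sum(math.exp(q) for q in q_values))
    return alpha * (logsumexp - q_values[data_action])
```

The penalty is always non-negative and shrinks when the in-data action already dominates, so minimizing it biases the Q-function toward conservatism on unseen actions.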

Conclusion
To conclude, we use synthetic and real-world tasks to demonstrate that even if a learned reward achieves high performance on D reward, high rewards may still be assigned to undesirable patterns, which are then amplified during RL training of the conditional text generation model. A critical future direction is to investigate when, or how easily, a spurious feature can be exploited, by exploring the relationship among the minimum description length of a spurious feature (Voita and Titov, 2020) or similar statistics, the proportion of datapoints that contain the spurious feature, the choice of RL algorithm, and the degree of the reward gaming behavior. Additionally, further research on anti-gaming approaches is needed to fulfill the potential of training text generators with learned rewards.

Limitations
First, off-policy algorithms like Q learning are not explored in this discussion. Second, the reward gaming issue is not a novel topic in the RL community for tasks like gameplay or autonomous driving (Amodei et al., 2016; Koch et al., 2022); however, we hope to highlight these issues in the NLG community, especially given the recent endeavors on learning from learned metrics. In addition, the paper aims to demonstrate the existence of reward gaming in conditional text generation, not its certainty regardless of experimental settings (hyperparameters, architectures, etc.). Given that our experiments use reasonable settings that lead to degenerate texts, we argue that reward gaming could be a common issue when learning a text generation model using RL with learned rewards, and the issue deserves attention from researchers and practitioners. We leave investigating how easily reward gaming arises in practice to future work.

A More Background
For our policy gradient algorithms, we use the standard REINFORCE algorithm with the tricks introduced in the following paragraphs. Specifically, in all RL experiments, we first initialize the model using an MLE-trained model to ensure a good starting point for RL optimization. During training, we collect a set of trajectories by sampling from the current policy (i.e., the generator). Then, we compute the estimated return Q̂_t at each time-step t.
Next, the estimated return Q̂_t is subtracted by a baseline, so the actual gradient update is

∇_θ J(θ) = E[ Σ_{t=1}^{T} (Q̂(s_t, a_t) − b(s_t)) ∇_θ log π_θ(a_t | s_t) ],

where Q̂(s_t, a_t) = Σ_{t'=t}^{T} r_{t'} assuming discount factor γ = 1, and b is possibly state-dependent. In particular, for the Sudoku experiments as well as the experiment where we train an MT model using BLEURT as the reward, we attempt two variants of the baseline: (1) using the average reward over the past 50 updates, which is an effective strategy for training models using sequence-level rewards (Kiegeland and Kreutzer, 2021), and (2) using a value function fitted by mean-squared error (so that the estimated return minus the value is the advantage), introduced in full detail here. 13 For case (1), the results are shown as the blue lines in the plots; for case (2), as the purple dotted lines. We use the Adam optimizer (Kingma and Ba, 2014) for all our experiments.
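To make the baseline-subtraction concrete, the following toy sketch runs REINFORCE with the moving-average baseline of variant (1) on a two-armed bandit (the bandit setup is our illustrative assumption, not one of the paper's tasks):

```python
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, baseline_window=50, seed=0):
    """Toy REINFORCE with a moving-average baseline: two actions with
    Bernoulli rewards of probability 0.2 and 0.8, a softmax policy over two
    logits, and the baseline set to the mean reward of the last 50 samples."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    recent = []
    probs = [0.5, 0.5]
    for _ in range(steps):
        # softmax policy (shifted by the max logit for numerical stability)
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        a = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < (0.2 if a == 0 else 0.8) else 0.0
        # moving-average baseline, as in variant (1)
        baseline = sum(recent) / len(recent) if recent else 0.0
        advantage = reward - baseline
        # grad of log pi(a) w.r.t. logit_i is 1[i == a] - probs[i]
        for i in range(2):
            logits[i] += lr * advantage * ((1.0 if i == a else 0.0) - probs[i])
        recent.append(reward)
        if len(recent) > baseline_window:
            recent.pop(0)
    return probs

probs = reinforce_bandit()
```

Subtracting the baseline does not change the expected gradient but reduces its variance, which is why both variants estimate the same policy gradient.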
In particular, we use KL-regularized RL, as discussed in §3. Regularization toward ML may stabilize RL optimization, but it may still lead to higher rewards that correspond to undesirable behaviors, as discussed in §5. The coefficient for the KL term is tuned in {0.01, 0.05, 0.1} for the Sudoku experiments and in {0.01, 0.03, 0.05, 0.1, 0.25} for the other experiments. For the purpose of this discussion, to illustrate the effect of reward gaming, the coefficient is tuned to achieve the highest validation reward; due to optimization issues in practice, a lower coefficient does not necessarily correspond to a higher reward. Larger coefficients may lead to lower proxy rewards but higher true rewards. While a large coefficient may address the reward gaming problem in some experiments, we have shown in §4.3 that even large coefficients may lead to reward gaming.
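A minimal sketch of the KL-regularized reward, assuming the common sampled-sequence estimator of the KL term (the exact estimator used in a given implementation may differ):

```python
def kl_regularized_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Sketch of KL-regularized RL reward: the sequence-level task reward
    minus beta times a sampled estimate of the KL between the current policy
    and the MLE-trained reference model, computed as the sum of per-token
    log-probability differences along the sampled sequence."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_estimate
```

A larger `beta` keeps the policy closer to the MLE model, which is exactly the trade-off the tuned coefficients above explore: too small and the reward is gamed, too large and the proxy reward suffers.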
Proximal policy optimization (PPO; Schulman et al., 2017) is a widely used algorithm that aims to avoid reward collapse. Our conclusion, however, does not depend on the RL algorithm: using PPO prevents the optimization from converging to a very low reward, but it does not eliminate the possibility that high-reward generations have undesirable patterns. In addition, Q learning, an off-policy RL algorithm that can leverage existing trajectories, has recently been applied to text generation as well (Kohita et al., 2020; Pang et al., 2022).
Out of 1000 samples of invalid Sudokus that end with 7 and contain 81 tokens, the trained classifier predicts (incorrectly) that all 1000 are valid. Out of 1000 samples of invalid Sudokus that end with 7 and contain fewer than 81 tokens, the trained classifier predicts that none are valid. The performance on Sudokus longer than 81 tokens is irrelevant, given that during RL sampling as well as at generation test time, the sequences are constrained so that the model can generate at most 81 − k tokens, where k is the length of the given source sequence.
Sequence generator. Suppose the input to the generator contains k numbers. During RL sampling and at test time, the sequence generator is constrained to generate at most 81 − k numbers; however, it can generate fewer. To discourage the sequence generator from producing overly short continuations, part (ii) of the negative examples, described above, contains examples that are too short.
For the sequence generator, we use a smaller version of the transformer_iwslt_de_en architecture in fairseq. The encoder embedding dimension and the decoder embedding dimension are both 32. We use 2 attention heads in both the encoder and the decoder. The FFN dimension in both the encoder and the decoder is 64. There are 2 encoder layers and 2 decoder layers. All the text generation models in the Sudoku experiments have 43k parameters.
The batch length (i.e., the number of tokens in a batch) is tuned in {8192, 16384, 32768, 65536}. The learning rate is tuned in {1e-4, 1.5e-4, 2e-4}. The dropout rate is tuned in {0.01, 0.1, 0.3}. For the optimal reward, we choose a batch length of 32768, a learning rate of 1.5e-4, and a dropout rate of 0.01. The training algorithm is detailed in §A.
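For reference, the architecture and selected hyperparameters above correspond roughly to a fairseq-train invocation along these lines (a sketch: the data path is a hypothetical placeholder, and flags unrelated to the quantities reported above are omitted):

```shell
fairseq-train data-bin/sudoku \
    --arch transformer_iwslt_de_en \
    --encoder-embed-dim 32 --decoder-embed-dim 32 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-ffn-embed-dim 64 --decoder-ffn-embed-dim 64 \
    --encoder-layers 2 --decoder-layers 2 \
    --optimizer adam --lr 1.5e-4 \
    --dropout 0.01 --max-tokens 32768
```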

B.2 Details for the Experiments on Naturally Occurring Spurious Correlations
Sudoku revisited. For the second Sudoku example, the hyperparameters are selected from the same sets as in §B.1. For the best-performing classifier, the learning rate is 5e-4 and the dropout rate is 0.01. For the sequence generator, we use the same hyperparameters as before. The lack of repetition in the last nine numbers of the output is spuriously correlated with a high reward, given that non-repetition is a necessary but not sufficient condition for a valid Sudoku. f φ achieves 99.9% accuracy on the test set of D reward. The text generator learns to exploit the non-repetition pattern, which leads to high rewards, but the generations are mostly wrong.
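The spurious feature in question is easy to state as code; a minimal sketch of the necessary-but-not-sufficient check that the reward model latched onto (the function name is ours):

```python
def last_nine_distinct(sequence):
    """Check whether the last nine numbers of the output contain no repeats.
    This is necessary for the final row of a valid Sudoku, but far from
    sufficient for the full grid to be valid."""
    last_nine = sequence[-9:]
    return len(set(last_nine)) == len(last_nine)
```

A generation can satisfy this check while the rest of the grid is invalid, which is why the generator can earn high rewards with mostly wrong solutions.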
Training an MT model using the WMT MQM dataset. To train the reward function, the learning rate is selected from {1e-4, 2e-4, 5e-4}, and the dropout rate is selected from {0.01, 0.1, 0.3}. For optimal performance, we use a learning rate of 2e-4 and a dropout rate of 0.3. Training the reward function takes around 3 hours. For D task, our translation task uses the WMT19 Zh-En dataset, and f φ is fine-tuned from an MLE-trained MT checkpoint on the WMT19 Zh-En dataset. We use a transformer model with 6 encoder layers and 6 decoder layers. The number of attention heads is 8 in both the encoder and the decoder. The FFN embedding dimension is 2048 in both the encoder and the decoder. There are 82.6M parameters in the model.
The algorithm is detailed in §A. We use a KL coefficient of 0.1, a dropout rate of 0.3, a learning rate of 1e-4, and a batch length of 4096. All the MT experiments are run on a single NVIDIA RTX 8000 GPU with 48GB of memory. Training takes only 24 hours, given that we do not need to train the model to convergence to observe the undesirable patterns in generations.

B.3 Details for the Experiments on Covariate Shift
Experimental details for AgreeSum. Given the URLs in the original dataset, we use the newspaper3k library to retrieve the corresponding articles. We use slightly different architectures from the AgreeSum paper. The reward function (classifier) is based on RoBERTa-large (Liu et al., 2019) with 355M parameters; we use a learning rate of 5e-4 and a dropout rate of 0.1, and the classifier training job runs for 24 hours. The summarizer is based on BART-large (Lewis et al., 2020) with 406M parameters; we use a learning rate of 3e-5, a batch length of 2048, and a dropout rate of 0.