Efficient (Soft) Q-Learning for Text Generation with Limited Good Data

Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which are not available in many emerging applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL), on the other hand, offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning (SQL) perspective. It enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates and learn effectively from sparse reward. We apply the approach to a wide range of novel text generation tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and previous RL methods.


Introduction
Recent natural language generation systems have made remarkable progress in producing well-formed text, especially with massive pretrained language models. Those models are typically trained using maximum likelihood estimation (MLE) with large amounts of supervised data. Despite its successes, the standard training method suffers from limited applicability to many emerging text generation problems, where little or no supervised data is available. Prominent examples of such low-data problems include generating prompts to control massive LMs (Yin et al., 2019; Shin et al., 2020; Zhong et al., 2021; Liu et al., 2021), learning text generation from noisy or even negative data, generating adversarial text attacks for robustness study (Wallace et al., 2019; Atanasova et al., 2020), and others (Figure 1, right). Due to the failure of standard MLE, people have had to devise specialized algorithms for those problems respectively.
Reinforcement learning (RL) (Sutton and Barto, 2018) offers an alternative principled framework for learning from arbitrary reward functions. However, RL has so far had limited success in training text generation, primarily due to the key challenges of sparse reward (i.e., a single reward signal is received only after the whole text sequence is generated) and large action space (i.e., a vocabulary of millions of words). For instance, a popular family of RL algorithms studied extensively for text generation is the policy-based (Williams, 1992) or actor-critic based (Bahdanau et al., 2016; Rennie et al., 2017) algorithms, with policy gradient (PG) being the most prevalent example (Ranzato et al., 2016; Li et al., 2016; Rennie et al., 2017; Tan et al., 2018; Pasunuru and Bansal, 2018; Paulus et al., 2018). Those algorithms train the model with on-policy updates, i.e., the text samples used for estimating policy gradients come from the target model itself. Due to the exponentially large space of sequences, on-policy updates often suffer from extremely high variance and low data efficiency (e.g., most model samples are not useful for learning). Thus directly training with PG from scratch is usually impossible. In practice, the model has to be initialized by MLE training, followed by PG as fine-tuning, which often leads to limited improvement (Choshen et al., 2020; Wu et al., 2018).
Another set of work has resorted to off-policy RL. The key advantage is that samples from other sources, e.g., human-written text, can be used, making these methods more data efficient than on-policy methods. Previous work has used either importance-weighted PG (Pang and He, 2021; Zhou et al., 2017; Kandasamy et al., 2017) or Q-learning based algorithms (Guo, 2015; Jaques et al., 2020; Narasimhan et al., 2015). However, off-policy methods have been considered less stable. For example, Q-learning performance relies heavily on how accurately the learned Q-function assesses the quality of intermediate subsequences, a challenging task given the sparse reward signals.
In this paper, we develop a new RL formulation for text generation that tackles the above issues (Figure 1, left). We reframe the text generation problem from the soft Q-learning perspective originally developed in robotics (Haarnoja et al., 2017; Schulman et al., 2017). The resulting connection allows us to seamlessly take advantage of the latest successful techniques from the RL literature. In particular, we introduce and adapt the principled path consistency learning (Nachum et al., 2017) to text generation, which (1) offers a natural way to train the model with both on- and off-policy updates, hence combining the best of the two strategies, (2) bridges the sparse reward signal to directly supervise the Q-function learning, leading to more accurate Q estimation and credit assignment, and (3) makes efficient updates to Q-values by considering all candidate actions together.
The generality and efficiency of the proposed method allow us to train text generation models in a wide range of applications: (1) With noisy and negative training examples, our approach learns to generate accurate entailment text that greatly improves upon the data itself as well as various other training methods; (2) Our approach also manages to train an effective adversarial text generator for robustness testing of classifiers; (3) We train a prompt generator with our algorithm to achieve controllable generation of pretrained LMs in terms of topics. On all three tasks, our approach consistently improves over not only previous RL algorithms for text generation, but also diverse task-specialized methods designed specifically for each of the problems. In the appendix (§A.1.4), we also show that on standard supervised tasks where MLE prevails, our approach is competitive for training text generation models from scratch, which was usually impossible for previous RL algorithms.
The contributions can be summarized as follows. On the technical side, we propose a new RL formulation for text generation based on soft Q-learning. This new formulation allows us to seamlessly take advantage of the RL literature's latest successful techniques (notably the path consistency algorithm) to overcome longstanding challenges (e.g., sparse reward and large action space) in text generation. On the empirical side, we conduct studies on a wide variety of text generation tasks with limited data (i.e., generating from noisy/negative data, adversarial text generation, prompt generation). We propose their RL formulations, and show that our general approach consistently improves over not only previous text RL algorithms, but also diverse task-specialized methods.

Background and Challenges
We aim to learn a generation model p_θ(y) = ∏_{t=0}^{T} p_θ(y_t | y_<t), where y_t is a token from a vocabulary V. The distribution at each step t is obtained by applying softmax to the logits f_θ(y | y_<t):

$$p_\theta(y_t \mid y_{<t}) = \frac{\exp f_\theta(y_t \mid y_{<t})}{\sum_{y' \in \mathcal{V}} \exp f_\theta(y' \mid y_{<t})} . \quad (1)$$

Despite its popularity, MLE-based training only applies when clean supervised data y* is available, and cannot be used to optimize arbitrary task metrics (e.g., BLEU, entailment score), which are typically the goal in many text generation tasks. Previous research has formulated text generation as an RL problem by considering the following finite-time Markov Decision Process (MDP). At each time step t, let the "state" be s_t = y_<t, namely the partial sequence generated so far. The model ("agent") takes as input the current state s_t and outputs a token ("action") a_t ∈ V according to a policy π(a_t | s_t). The agent then receives a reward r_t = r(s_t, a_t) and deterministically transitions to the next state s_{t+1} (i.e., the concatenation of the tokens in s_t and the new token a_t).
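To make this MDP concrete, here is a minimal sketch; the TextGenMDP class, its fields, and the final_reward callable are illustrative assumptions of this sketch, not the paper's implementation:

```python
# A text-generation MDP: states are partial token sequences, actions are
# vocabulary tokens, and the (sparse) reward arrives only at the end.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TextGenMDP:
    vocab: List[str]                                 # action space (large)
    max_len: int
    final_reward: Callable[[List[str]], float]       # r_T on the full sequence
    state: List[str] = field(default_factory=list)   # s_t = y_<t

    def step(self, action_id: int):
        # Deterministic transition: append the chosen token to the state.
        self.state = self.state + [self.vocab[action_id]]
        done = len(self.state) >= self.max_len
        # Sparse reward: zero for all intermediate steps t < T.
        reward = self.final_reward(self.state) if done else 0.0
        return self.state, reward, done
```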
Following the notation convention in RL, let τ be the trajectory (i.e., text sample) generated by the policy. The agent's objective is to maximize the cumulative reward, J(π) = E_{τ∼π}[ Σ_{t=0}^{T} γ^t r_t ], with discount factor γ. A central quantity is the Q-function Q^π(s_t, a_t), the expected future reward of taking action a_t (i.e., generating token a_t) in state s_t and continuing with the policy π.

Challenges. Text generation poses significant challenges to RL, particularly because (1) the reward signal is usually sparse, i.e., r_t = 0 for all t < T, and the agent receives a non-zero reward r_T only after it generates the full sequence, and (2) the action space (i.e., the vocabulary V) is extremely large.
These challenges have led to difficulties for the two major families of RL approaches applied to text generation, as detailed below.

Policy-based RL techniques directly parameterize the policy π_θ with parameters θ. Thus the policy π_θ(a_t | s_t) exactly corresponds to the above generation model p_θ(y_t | y_<t). Policy gradient (PG) is one of the most widely used algorithms for text generation (Ranzato et al., 2016). It optimizes the cumulative reward with the policy gradient, ∇_θ J(π_θ) = E_{τ∼π_θ}[ Σ_t Q^{π_θ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ], using the Q^{π_θ} value estimated from samples τ. PG is an on-policy algorithm, meaning that the samples τ need to come from the current policy π_θ itself. In practice, however, optimizing this objective alone from scratch is unlikely to work, because most samples τ ∼ π_θ are just gibberish with zero reward, failing to provide meaningful training signals for updating the policy. Previous literature either initializes the policy π_θ with MLE training, and/or uses a combination of MLE and PG updates, which often leads to marginal gains in practice (Wu et al., 2018; Choshen et al., 2020).

Value-based RL techniques, such as Q-learning, implicitly learn the policy π by approximating the value Q^π(s, a) directly. Deep Q-learning (Mnih et al., 2013) parameterizes the Q-function as Q_θ(s, a), and trains the parameters by minimizing the following regression objective L(θ) based on the Bellman temporal consistency:

$$L(\theta) = \mathbb{E}_{\pi}\Big[ \tfrac{1}{2} \big( r_t + \gamma \max_{a_{t+1}} Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \big)^2 \Big] , \quad (2)$$

where θ̄ denotes the parameters of the target Q-network, which is a slow copy of θ and considered constant for gradient computation of θ. Here π is a behavior policy, which can be an arbitrary distribution over text, such as the data distribution or a replay buffer (Mnih et al., 2013). This makes Q-learning an off-policy algorithm because of its ability to use samples coming from other policies. After learning Q_θ, one can induce a policy π from it that takes argmax_a Q_θ(s, a) at each state s. Jaques et al. (2017) instead sample tokens from the softmax function applied to Q_θ. However, the training can be unstable and inefficient due to several challenges: (1) The bootstrapping nature of the above regression problem can make the training unstable. That is, the regression target r_t + γ max_{a_{t+1}} Q_θ̄(s_{t+1}, a_{t+1}) is itself derived from the Q-function to be learned (Kumar et al., 2019). The problem is exacerbated in the presence of sparse reward in text generation, where the real observed signal r_t is zero for all intermediate t < T. (2) The large action space (e.g., 10^4 tokens) in text generation results in slow updates. In particular, notice that Eq. (2) applies the gradient update to the Q_θ-value of only the one particular token a_t (out of, say, the 10^4 candidate tokens in the vocabulary), making the training inefficient. (3) Besides, pure off-policy updates can be highly sensitive to the quality of training data, and miss the opportunity of on-policy exploration that maximizes the reward of interest in a more direct way.
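For reference, below is a hedged sketch of the vanilla Bellman regression in Eq. (2); the q_net/target_q_net callables are stand-ins that we assume map a batch of states to a [batch, vocab] tensor of Q-values. Note how the gradient flows only through the Q-value of the single observed token a_t:

```python
import torch
import torch.nn.functional as F

def vanilla_q_loss(q_net, target_q_net, s_t, a_t, r_t, s_t1, gamma=1.0):
    """Bellman regression of Eq. (2). q_net(s) is assumed to return a
    [batch, vocab] tensor of Q-values (i.e., the model logits)."""
    q_all = q_net(s_t)                                    # [batch, vocab]
    q_sa = q_all.gather(1, a_t.unsqueeze(1)).squeeze(1)   # only the taken token
    with torch.no_grad():                                 # target net is constant
        target = r_t + gamma * target_q_net(s_t1).max(dim=1).values
    # The gradient touches one token's Q-value out of ~10^4 candidates.
    return F.mse_loss(q_sa, target)
```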
The Soft Q-Learning Framework

We introduce the soft Q-learning (SQL) formulation of text generation. It is seamlessly compatible with the common architecture of text generation models (Eq. 1), permits easy implementation (§3.1), and enables efficient and stable RL training in practice (§3.2). Figure 2 and Algorithm 1 summarize the resulting SQL framework.

SQL Formulation for Text Generation
Soft Q-learning (Haarnoja et al., 2017; Schulman et al., 2017; Nachum et al., 2017) is a maximum-entropy (MaxEnt) extension to the standard (hard) Q-learning (Mnih et al., 2015; Sutton and Barto, 2018). Under this framework, the agent is encouraged to optimize the reward while staying as stochastic as possible, with the objective

$$J_{\mathrm{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[ \textstyle\sum_{t=0}^{T} \gamma^t \big( r_t + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big] ,$$

which augments the vanilla J(π) with the additional Shannon entropy term H with coefficient α. This is appealing because it seamlessly connects the Q-values to the familiar output logits of a text generation model, which enables straightforward implementation of the SQL formulation.

Q-values as Generation Model Logits. We show the connection of the Q-values with the logits, i.e., the outputs right before the softmax layer. Concretely, with the SQL objective, the following relationship between the optimal policy π* and action-value Q* holds (Haarnoja et al., 2017; Schulman et al., 2017), taking α = 1 for notational simplicity:

$$\pi^*(a \mid s) = \frac{\exp Q^*(s, a)}{\sum_{a'} \exp Q^*(s, a')} . \quad (3)$$

This form is highly reminiscent of the softmax layer of the generation model in Eq. (1). The connection suggests that we can naturally parameterize the Q-function in SQL as the generation model logit function, i.e., Q_θ(s, a) ≡ f_θ(a | s). In other words, the model output f_θ(a | s), originally interpreted as the "logit" of token a given the preceding tokens s, is now re-interpreted as the Q-value of action a in state s. At optimality, f_{θ*}(a | s), namely Q*(s, a), represents the best possible future reward achievable by generating token a in state s. Similarly, the full generation model p_θ(a | s) in Eq. (1) that applies softmax to f_θ now precisely corresponds to the policy π_θ induced from Q_θ(s, a). That is,

$$\pi_\theta(a \mid s) = \frac{\exp Q_\theta(s, a)}{\sum_{a'} \exp Q_\theta(s, a')} = p_\theta(a \mid s) . \quad (4)$$

We can gain an even more intuitive interpretation of the above generation policy π* through the lens of the advantage function (Sutton and Barto, 2018). Specifically, in SQL, the optimal state-value function is the log-normalizer of the optimal Q-values (Haarnoja et al., 2017; Schulman et al., 2017):

$$V^*(s) = \log \sum\nolimits_{a'} \exp Q^*(s, a') . \quad (5)$$

This allows a more concise form of Eq. (3):

$$\pi^*(a \mid s) = \exp\big( Q^*(s, a) - V^*(s) \big) = \exp A^*(s, a) ,$$

where A* is the optimal advantage function. The equation says that, in the proposed text generation SQL formulation, the optimal policy generates token a in state s according to the token's advantage.
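The induced policy is thus nothing more than a softmax over the model logits. A minimal sketch (the alpha temperature argument generalizes the α = 1 case assumed above):

```python
import torch

def induced_policy(logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Interpret the model logits as Q-values and induce the SQL policy
    pi(a|s) = exp((Q(s,a) - V(s)) / alpha), where V is the log-normalizer."""
    q = logits / alpha
    v = torch.logsumexp(q, dim=-1, keepdim=True)   # soft state value V(s)
    return torch.exp(q - v)                        # equals softmax(logits / alpha)
```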

Efficient Training with Path Consistency
Vanilla training based on the Bellman temporal consistency can suffer from instability and inefficiency issues similar to those of conventional Q-learning (§2), as we discuss further in the appendix (§A.3.2). Fortunately, our SQL formulation allows us to import the latest advances in RL techniques to overcome the difficulties. Specifically, we adapt the principled, unified path consistency learning (PCL) that has excelled in game control (Nachum et al., 2017).
The PCL-based training updates the Q-values of all tokens at once through a connection between the value function and the induced policy. More specifically, Nachum et al. (2017) showed that the optimal policy π* (Eq. 3) and the optimal state-value function V* (Eq. 5) in SQL must satisfy the following consistency property for all states and actions:

$$V^*(s_t) - \gamma V^*(s_{t+1}) = r(s_t, a_t) - \log \pi^*(a_t \mid s_t) . \quad (6)$$
Accordingly, the PCL-based training encourages the satisfaction of this consistency with the following regression objective:

$$L_{\mathrm{SQL,PCL}}(\theta) = \mathbb{E}_{\pi}\Big[ \tfrac{1}{2} \big( -V_{\bar{\theta}}(s_t) + \gamma V_{\bar{\theta}}(s_{t+1}) + r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t) \big)^2 \Big] , \quad (7)$$

where π_θ is the induced policy defined in Eq. (4); V_θ̄ is defined similarly to Eq. (5) but depends on the target network Q_θ̄ (i.e., a slow copy of the Q_θ to be learned); and recall that π is an arbitrary behavior policy (e.g., the data distribution). Please see Figure 2 (left) for an illustration. Crucially, notice that the gradient update is applied to θ through the log π_θ term, which explicitly involves the Q_θ-values of all tokens a in the vocabulary. This marks an important difference from the vanilla training in conventional Q-learning (§2) above, where Q_θ is updated only through the particular token a_t. The PCL training thus offers more efficient updates for the Q_θ function. In the appendix (§A.3.1), we also discuss the difference from the MLE objective.
Intuitively, MLE trains the model to (blindly) increase the probability of the observed tokens, while PCL encourages the (log) probability of the tokens to match the approximate advantage values.
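A sketch of the single-step PCL objective (Eq. 7) under these definitions; tensor shapes and argument names are our own assumptions:

```python
import torch

def pcl_single_step_loss(q_logits_t, tgt_logits_t, tgt_logits_t1,
                         a_t, r_t, gamma=1.0):
    """Single-step PCL regression (Eq. 7). The log pi_theta term involves
    the Q-values of *all* vocabulary tokens via its log-normalizer, so one
    update touches every candidate action. Shapes: logits [batch, vocab],
    a_t [batch] (long), r_t [batch] (float)."""
    log_pi = torch.log_softmax(q_logits_t, dim=-1)
    log_pi_a = log_pi.gather(1, a_t.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # soft values from the slow target network
        v_t = torch.logsumexp(tgt_logits_t, dim=-1)
        v_t1 = torch.logsumexp(tgt_logits_t1, dim=-1)
    residual = -v_t + gamma * v_t1 + r_t - log_pi_a
    return 0.5 * (residual ** 2).mean()
```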
Multi-step PCL for Sparse Reward. The above PCL objective (Eq. 7) alone does not resolve the potential instability due to the bootstrapped V_θ̄(s_{t+1}) value and the sparse reward (i.e., r(s_t, a_t) = 0 for t < T). Our SQL formulation allows us to additionally incorporate the multi-step variant of PCL training (Nachum et al., 2017) to resolve the issue. Specifically, by applying a telescoping sum to the consistency equation (Eq. 6) from step t up to T, we arrive at the multi-step temporal consistency:

$$V^*(s_t) - \gamma^{T-t+1} V^*(s_{T+1}) = \sum_{l=t}^{T} \gamma^{l-t} \big( r_l - \log \pi^*(a_l \mid s_l) \big) , \quad (8)$$

where the value of the past-terminal state is zero, V*(s_{T+1}) = 0, and the rewards are only available at the end, Σ_{l=t}^{T} γ^{l-t} r_l = γ^{T-t} r_T. We then arrive at the following multi-step objective:

$$L_{\mathrm{SQL,PCL\text{-}ms}}(\theta) = \mathbb{E}_{\pi}\Big[ \tfrac{1}{2} \big( -V_{\bar{\theta}}(s_t) + \gamma^{T-t} r_T - \sum\nolimits_{l=t}^{T} \gamma^{l-t} \log \pi_\theta(a_l \mid s_l) \big)^2 \Big] . \quad (9)$$

This objective side-steps the need to bootstrap intermediate value functions V_θ̄(s_{t'}) for t' > t. Instead, it directly uses the non-zero end reward r_T to derive the update for θ. Please see Figure 2 (right) for an illustration. In practice, we combine the single- and multi-step objectives (Eqs. 7 and 9) together for training.
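A corresponding sketch of the multi-step objective (Eq. 9) for a single trajectory suffix, again with assumed shapes and names; note that only the end reward r_T enters the target:

```python
import torch

def pcl_multi_step_loss(q_logits_seq, tgt_logits_t, actions, r_T, gamma=1.0):
    """Multi-step PCL regression (Eq. 9) for one trajectory suffix,
    telescoped from step t to the final step T. No intermediate values
    are bootstrapped. Shapes: q_logits_seq [T-t+1, vocab],
    actions [T-t+1] (long), r_T float."""
    steps = q_logits_seq.size(0)
    log_pi = torch.log_softmax(q_logits_seq, dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    discounts = gamma ** torch.arange(steps, dtype=log_pi_a.dtype)
    with torch.no_grad():  # bootstrapped value only at the starting step t
        v_t = torch.logsumexp(tgt_logits_t, dim=-1)
    residual = -v_t + gamma ** (steps - 1) * r_T - (discounts * log_pi_a).sum()
    return 0.5 * residual ** 2
```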
Joint On- and Off-policy Training. Finally, we highlight that the behavior policy π involved in the objectives of Eqs. (7) and (9) can be an arbitrary policy. For example, π can be a (possibly noisy) text dataset, or a set of text samples produced by other generation models, resulting in off-policy training.
We can also set π to be the current generation model π_θ being learned, resulting in on-policy training. In practice, we can first train the model with only off-policy data for warm-up, and then continue with joint on- and off-policy training to further maximize the reward.
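A hedged sketch of this joint training scheme (cf. Algorithm 1); all helper callables (data_iter, decode_on_policy, pcl_loss) are placeholders for this sketch, not the released implementation:

```python
def train_sql(q_net, target_q_net, optimizer, data_iter, decode_on_policy,
              pcl_loss, num_steps=10_000, warmup=1_000, sync_every=100):
    """Joint off-/on-policy SQL training loop (cf. Algorithm 1)."""
    for step in range(num_steps):
        batches = [next(data_iter)]                   # off-policy: (noisy) data
        if step >= warmup:                            # warm-up: off-policy only,
            batches.append(decode_on_policy(q_net))   # then add on-policy samples
        # Combined single- and multi-step PCL objectives (Eqs. 7 and 9).
        loss = sum(pcl_loss(q_net, target_q_net, b) for b in batches)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % sync_every == 0:                    # slow copy for the target
            target_q_net.load_state_dict(q_net.state_dict())
```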

Applications and Experiments
We show broad applications of the proposed RL text generation framework to a variety of problems where no clean supervision data is available. These include learning with noisy or even negative data (§4.1), generating adversarial text attacks (§4.2), and generating prompts to steer pretrained LMs (§4.3). We also study the performance on standard supervised generation tasks (§A.1.4) and show that our approach is competitive for training text generation models from scratch. We provide detailed configurations in the appendix (§A.2).

Learning from Noisy (Negative) Text
The popular MLE algorithm learns by (blindly) imitating training data. However, it is often expensive to curate clean, high-quality data. It is thus highly desirable to be able to learn from data with noise, or even negative examples. With the guidance of task metrics (rewards), the model can even learn to "outperform" the training data and achieve desired generation behaviors. To this end, we consider the task of entailment generation (Pasunuru and Bansal, 2017). Given a sentence (premise), the goal is to generate a new sentence (hypothesis) that logically follows the premise.
Setup (more in §A.2.1). We sub-sampled 50k training examples from the SNLI dataset (Bowman et al., 2015), a commonly used entailment classification dataset. The hypotheses have an average entailment probability of only 50%, and over 2/5 of them fall below 20% (negative/contradictive examples), a significant challenge for the models to learn from the noise. The rewards include (1) the entailment score of the generation, measured by a robust entailment classifier (Nie et al., 2020), (2) the log-likelihood of the generation as an indicator of language quality, measured by a GPT-2 language model (Radford et al., 2019), and (3) the BLEU score w.r.t. the input premises as another language-quality reward that avoids trivial outputs. We sum together all rewards with weights 1.0.
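A minimal sketch of this reward combination; the three scoring callables stand in for the actual entailment classifier, GPT-2 scorer, and BLEU metric:

```python
def entailment_reward(premise, hypothesis, entail_score, lm_loglik, bleu):
    """Weighted sum of the three task rewards (all weights 1.0)."""
    return (1.0 * entail_score(premise, hypothesis)   # entailment probability
            + 1.0 * lm_loglik(hypothesis)             # language quality
            + 1.0 * bleu(hypothesis, premise))        # avoids trivial outputs
```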
We compare our approach with a broad range of baselines, including the standard MLE training (MLE), joint MLE and PG training (MLE+PG), and one of the latest methods, GOLD-s (Pang and He, 2021), which is a pure off-policy method based on importance-sampling PG. To ablate the effect of multi-step training (§3.2), we additionally compare with a simplified variant of our approach that uses only vanilla single-step PCL training (SQL(single)). We include more baselines, such as MLE weighted by rewards (MLE+reward), in §A.1.1.
We evaluate generation results in terms of entailment rate, language quality (perplexity), and diversity, which is measured by the Shannon entropy over unigrams and bigrams (H_1, H_2) (Gehrmann et al., 2021). Since text generation models intrinsically trade off diversity and quality (Caccia et al., 2019; Hashimoto et al., 2019), we vary the generation diversity by generating samples via top-p sampling (Holtzman et al., 2019) with different p values, and plot the entailment rate and perplexity against diversity, respectively. We also evaluate the samples produced by beam-search decoding.
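For concreteness, a small sketch of the diversity measure, computing the Shannon entropy H_n over n-grams of the generated samples (our own implementation, assuming whitespace tokenization):

```python
import math
from collections import Counter

def ngram_entropy(texts, n=1):
    """Shannon entropy H_n over n-grams (H_1: unigrams, H_2: bigrams),
    used here as the diversity measure."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in (t.split() for t in texts)
        for i in range(len(tokens) - n + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```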
Results. Figure 3 (left) shows the results, and Table A.5 shows samples. First, notice that MLE performs poorly, while MLE+reward improves upon it. This is not surprising, as the training data contain noisy/negative examples. Similarly, since the pure off-policy algorithm GOLD-s relies heavily on the data distribution, we observe that it achieves suboptimal performance. The on-policy MLE+PG with MLE initialization gives a better entailment rate. In comparison, our full SQL framework achieves the best entailment-diversity trade-off. The comparison between SQL and SQL(single) highlights the importance of the multi-step objective, which directly uses the end reward rather than bootstrapping intermediate Q-values for supervision.

Universal Adversarial Attacks
We next study the application to text adversarial attacks, where again no supervised data is available. Adversarial attacks are an increasingly important research topic as they reveal models' vulnerabilities and flaws. This is especially true for universal attacks (Wallace et al., 2019; Atanasova et al., 2020), where we want to generate universal examples that trick the model on all possible inputs. For instance, consider the context of entailment classification. Our goal is to find universal human-readable hypotheses that are classified as "entailment" with as high probability as possible, regardless of the input premises. This is a more challenging setting than previous instance-specific attacks (Morris et al., 2020; Jin et al., 2020; Ebrahimi et al., 2017), where the attack model conditions on a premise and generates an adversarial hypothesis specific to that premise.
Setup (more in §A.2.2). We aim to attack one of the most popular MultiNLI (Williams et al., 2018) entailment classifiers on HuggingFaceHub. The attack generation model generates adversarial text without conditioning on any inputs, so that the generated attacks are universal to all premises. We compare our SQL with MLE+PG. We use all hypotheses in the MultiNLI dataset as the training data for the MLE training in MLE+PG and for the off-policy updates of our SQL. We do not compare with previous specialized adversarial text attack methods, because they either are not applicable to the challenging universal attack setting (Morris et al., 2020; Jin et al., 2020; Ebrahimi et al., 2017), or were not designed to generate human-readable sentences (Wallace et al., 2019). We use similar settings as in §4.1 to explore the diversity-quality trade-off by plotting the entailment rate and perplexity against diversity, respectively. The entailment classifier to be attacked is used as the entailment-score reward function. We also include a token-level repetition penalty reward for readability.
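The exact form of the repetition penalty is not specified here; one plausible token-level variant, shown purely as an assumption-laden sketch, penalizes the fraction of repeated tokens:

```python
def repetition_penalty(tokens):
    """Token-level repetition penalty (readability reward): penalize the
    fraction of repeated tokens in the generated hypothesis. The exact
    form used in the paper may differ; this is an illustrative choice."""
    if not tokens:
        return 0.0
    return -(len(tokens) - len(set(tokens))) / len(tokens)
```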

Prompt Generation for Controlling Pretrained Language Models
A reward function does not have to be a simple metric like the BLEU score; it can also be a complicated pipeline that eventually returns a score. To demonstrate this, we consider the emerging task of prompting a large pretrained LM for controllable generation (Hu et al., 2017; Radford et al., 2019; Brown et al., 2020). The goal is to learn to generate text prompts that steer the LM to generate sentences of certain desired attributes (e.g., topics). The problem of controlling the generation of pretrained LMs was previously approached through specialized algorithms, such as modifying the LM hidden states during decoding (Dathathri et al., 2020; Krause et al., 2020; Qin et al., 2020).
Here we show that prompts offer an easier, faster, more effective way for controlled generation.
Learning to generate/tune prompts has been gaining increasing attention recently. It side-steps the need for expensive LM fine-tuning, and adapts LMs to new scenarios with the prompt as the (compute-friendly) interface. Most existing approaches (Wallace et al., 2019; Li and Liang, 2021; Lester et al., 2021) rely on gradient backpropagation and are applicable only when the whole training pipeline is differentiable. This does not hold for the text generation setting, as illustrated in Figure 5. In contrast, the RL framework is generally applicable to any differentiable or discrete pipeline.
Setup (more in §A.2.3). Following Dathathri et al. (2019), we aim to control the generation to have one of 7 topics (e.g., "science"); the generated prompt is prepended to one of 20 input sentences for the pretrained LM to generate continuation sentences. Figure 5 shows the architecture of prompt-based controllable generation. We compare our SQL method with MLE+PG as before. Since the prompt length could impact the generated sentences, we conducted experiments with maximum prompt lengths of 5, 10, and 15. As an ablation study, we also evaluate the SQL algorithm with only off-policy updates (i.e., without on-policy exploration), denoted SQL(off), and compare it with vanilla MLE training. Finally, we also compare with two specialized controllable generation techniques based on pretrained LMs, namely PPLM (Dathathri et al., 2019) and GeDi (Krause et al., 2020), following similar procedures using their open-sourced code. We use a distilled GPT-2 model as the pretrained LM to be controlled.
For rewards, we use the topic accuracy of the continuation sentences measured by a zero-shot classifier, plus the log-likelihood of the continuation sentences as the language-quality reward, measured by a distilled GPT-2.

Results. Figure 4 shows the topic accuracy of the controlled LM outputs averaged across the 7 topics, and Table 1 shows the respective language quality results. More detailed topic accuracy results and samples are provided in the appendix (§A.1.3) (where GeDi obtained low accuracy on 2 of the 7 topics, possibly because the topic tokens are tokenized into two subwords, for which the model released by the authors was not specifically trained).
We can see that the prompts generated by our SQL cause the LM to generate sentences with high topic accuracy while maintaining low perplexity in most settings. Increasing the prompt length positively impacts the topic accuracy, which makes sense because longer prompts give more flexibility for steering the LM. The comparison between MLE and SQL(off) shows that the off-policy component of SQL is better than standard MLE training, as it incorporates reward signals instead of just blindly following the (noisy) data.
Next, compared with previous steered decoding methods such as PPLM and GeDi, the prompt-based control trained with RL achieves a better trade-off between topic accuracy and language quality. Moreover, once a prompt is produced, we can use the pretrained LM to generate text of desired topics efficiently, with the same time cost as standard non-controlled decoding. In comparison, the dedicated steered decoding is often orders of magnitude slower, as shown in Table 2.
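To make the reward pipeline of this section concrete, here is a hedged sketch of the zero-shot topic reward described in the setup; the facebook/bart-large-mnli checkpoint is the standard HuggingFace zero-shot model and is our assumption, not necessarily the exact model used in the paper:

```python
from transformers import pipeline

# BART-based zero-shot classifier used to score topic accuracy.
topic_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def topic_reward(continuations, topic):
    """Mean probability that the LM continuations match the target topic."""
    scores = [topic_clf(text, candidate_labels=[topic])["scores"][0]
              for text in continuations]
    return sum(scores) / len(scores)
```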

Related Work
Standard RL algorithms can sometimes be over-sensitive to randomness in the environment. Recent works have considered maximum-entropy RL extensions, such as soft Q-learning (SQL) (Haarnoja et al., 2017; Nachum et al., 2017; Schulman et al., 2017), which maximize the entropy of the policy besides the rewards, and have demonstrated substantial improvements in robotic and game control (Ziebart et al., 2008; O'Donoghue et al., 2017; Nachum et al., 2018; Eysenbach and Levine, 2021). Our work is the first to adapt SQL and its advanced variants (in particular, path consistency learning (Nachum et al., 2017)) to the challenging text generation problem and show significant results on diverse applications.
Applying RL to text generation has been discussed for alleviating the exposure bias problem and optimizing task metrics (Guo, 2015; Li et al., 2016; Wu et al., 2016; Rennie et al., 2017; Paulus et al., 2018; Chen and Bansal, 2018; Liu et al., 2020; Pang et al., 2021). For example, Ranzato et al. (2016) used the REINFORCE algorithm (Williams, 1992), and Bahdanau et al. (2016) used the actor-critic algorithm; Guo et al. (2018) and Shi et al. (2018) tried to alleviate the sparsity problem via hierarchical and inverse RL methods, respectively. These are all on-policy RL algorithms that need to pretrain their models using MLE. RAML (Norouzi et al., 2016) implicitly relies on the quality of off-policy data; this does not necessarily apply in our experiments with limited good data. Tan et al. (2018) and Hu and Xing (2022) offer a unified view of RAML, RL, and other training methods. Another line of work focused mostly on using only off-policy data, often for offline training of chatbots (Kandasamy et al., 2017; Zhou et al., 2017; Jaques et al., 2020; Pang and He, 2021). As a result, the opportunity to directly improve the reward (as in on-policy updates) for other rich tasks is missed. Our proposed framework combines on- and off-policy training, and further offers solutions for efficient training from scratch in the presence of the large action space and sparse sequence-level reward in text generation.

Conclusion
We develop a new RL formulation for text generation based on soft Q-learning and path consistency learning. We conduct experiments on learning with noisy and negative data, black-box adversarial attacks, prompting a pretrained language model for controllable generation, and standard supervised tasks. This formulation opens up new opportunities to integrate more advances made in the fertile RL literature to improve text generation problems.

Limitations
A well-documented limitation of RL methods is the importance of the reward function. The proposed methods are no different in this aspect. This is especially relevant as our reward function could involve a learned model itself, which we proactively leveraged in Sec. 4.2. We refer interested readers to Deng et al. (2022) for more algorithmic considerations. We also noticed that adapting the pretraining-finetuning paradigm to the proposed methods requires careful design. One hypothesis points to the discrepancy between MLE objectives (commonly used in the pretraining context) and SQL objectives. As discussed in Sec. 3.1, the SQL formulation re-interprets the "logit" as the Q-value, for many good reasons. However, our preliminary experiments suggest that, as a downside, this makes fine-tuning an MLE-trained model with SQL objectives more challenging. Future work to scale the proposed methods to tasks such as machine translation and language modeling, and with significantly larger and (MLE-)pretrained models, would be exciting.

A.1.4 Supervised Text Generation Tasks
Finally, we conduct experiments on standard generation tasks where clean supervised data is available.
The study is to examine the capability of the proposed RL method to train a text generation model from scratch, which has been considered exceedingly challenging for previous RL algorithms.
Setup. We study two tasks, E2E (Novikova et al., 2017) and CommonGen (Lin et al., 2020), and use the respective datasets pre-processed by Gehrmann et al. (2021), which allow sequence-to-sequence modeling with standard transformers. We run four sets of methods: the standard MLE training (MLE); PG training from scratch (PG); joint MLE and PG training, with MLE initialization (MLE+PG); and our SQL training from scratch with both off-policy and on-policy updates (SQL). We use the standard BLEU as the reward. We additionally investigate the training stability and sensitivity w.r.t. hyperparameters, in particular the scale of the reward. To this end, for MLE+PG and SQL, we vary the reward scale in {1, 10, 50, 100, 500, 1000} and evaluate the respective performance under different scales.
Results. Table A.1 shows the performance on E2E of the different models, whose hyperparameters are picked using the validation set. We can see that the proposed SQL, which trains models from scratch, achieves results competitive with the common MLE and MLE+PG. In contrast, the PG algorithm alone without MLE fails to train. Figure A.2 (left) shows the respective training curves (on the validation set), demonstrating that SQL converges as efficiently and stably as MLE.
We further demonstrate the sensitivity of MLE+PG and SQL w.r.t. the reward scale as a key hyperparameter. Figure A.2 (middle and right) shows the training curves of the two methods with varying reward scales. We can see that SQL is significantly more robust as the reward scale changes, while MLE+PG tends to collapse with improper reward scale configurations.

A.2 Setup Details
Our evaluation follows the GEM Benchmark (Gehrmann et al., 2021) when applicable, and otherwise uses the same reward functions as in training. We use a transformer model (Vaswani et al., 2017) based on Texar-PyTorch (Hu et al., 2019) by default, with 64 hidden dimensions, 3 blocks, and 4 heads. For experiments that involve policy gradient training, we initialize the model with maximum likelihood training by default unless specified otherwise. We train the soft Q-learning model from scratch with both off-policy (using data) and on-policy (using samples) updates by default, except in §4.1 and §4.3, in which we find it beneficial to warm up the model with just off-policy training. We apply similar tuning budgets to both the soft Q-learning and policy-gradient models (mostly the reward scale and top-k), based on performance on the validation dataset and sample quality. Most of the experiments are conducted using Nvidia 1080 or 2080 series GPUs with around 12GB memory. Most of the datasets are in English.
Reward Functions. We use the robust entailment classifier (Nie et al., 2020) in §4.1, one of the most used entailment classifiers on HuggingFaceHub in §4.2, and a zero-shot classifier based on BART (Lewis et al., 2020) to compute the topic score in §4.3. To compute perplexities, we use a GPT-2 model (124M parameters) (Radford et al., 2019) fine-tuned on the corresponding datasets in §4.1 and §4.2, and a distilled GPT-2 model in §4.3 without fine-tuning. We simply set reward weights to 1.0, except in §4.2, where we set the entailment weight to 0.5, and the log-likelihood and repetition penalty weights to 5.0.
A.2.1 Setup Details: §4.1

We study using the SNLI dataset (Bowman et al., 2015), a dataset commonly used in training entailment classifiers. The original dataset contains (premise, hypothesis) sentence pairs, where the hypothesis may or may not entail the premise. We sub-sampled 50,000 training examples from the corpus such that the hypotheses have an average entailment probability of only 50% in terms of the premises, and over 2/5 of the examples have entailment probabilities less than 20%, which can be seen as negative (contradictive) examples. The resulting training set poses a significant challenge for the models to learn from the noise. The RL algorithms (including PG and ours) permit us to plug in arbitrary reward functions to drive learning. Based on the goal of the task, we use the following intuitive rewards to ensure entailment accuracy and language quality: (1) a robust entailment classifier (Nie et al., 2020) that measures the entailment score of a generation in terms of the input premise, (2) a GPT-2 language model (Radford et al., 2019) that measures the log-likelihood of the generation as an indicator of language quality, and (3) the BLEU score w.r.t. the input premises as another language-quality reward that avoids trivial outputs.
We sum together all rewards with weights 1.0.

A.2.2 Setup Details: §4.2
We study the task of attacking an entailment classifier. In particular, we aim to attack one of the most popular entailment classifiers on HuggingFaceHub. The attack generation model generates adversarial text without conditioning on any inputs, so that the generated attacks are universal to all premises. The generation model is trained with mostly the same settings as in §4.1, where the entailment classifier to be attacked is used as the entailment-score reward function. Besides, we additionally include a token-level repetition penalty reward, which empirically benefits readability. Finally, we use the MultiNLI dataset (Williams et al., 2018), which includes more diverse examples than the SNLI used above. We compare our SQL with MLE+PG. We use all hypotheses in the MultiNLI dataset as the training data for the MLE training in MLE+PG and for the off-policy updates of our SQL. We do not compare with previous specialized adversarial text attack methods, because they either are not applicable to the universal attack setting (Morris et al., 2020; Jin et al., 2020; Ebrahimi et al., 2017), or were not designed to generate human-readable sentences (Wallace et al., 2019). Besides, it is worth noting that general RL algorithms have the additional advantage of enabling black-box attacks. That is, the algorithms only require the ability to query the entailment classifier for entailment probability, without needing to know the internal structure of the classifier (e.g., for computing gradients) as in previous attack algorithms (Ebrahimi et al., 2017; Wallace et al., 2019).
For top-p sampling results, we sample one hypothesis for each premise and measure the average attack rate across the dataset, because sampling multiple hypotheses, each against all premises, and measuring their performance would be expensive. Since the hypotheses are sampled input-independently, this should be a good approximation.
A.2.3 Setup Details: §4.3

Following Dathathri et al. (2019), we aim to control the generation to have one of 7 topics (e.g., "science"); the generated prompt is prepended to one of 20 input sentences (Figure 5) for the pretrained LM to generate continuation sentences. There is no direct supervision data available for training the prompt generator. We randomly create some noisy text as the training data for the MLE baselines below and for the off-policy updates of our algorithm. Specifically, the noisy text is created by sampling keywords and topics from the list used in Dathathri et al. (2020) and a paraphrase generation model.
Figure 5 shows the architecture of prompt-based controllable generation. We compare our SQL method with MLE+PG as before. At training time, for each generated prompt sample, the pretrained LM generates 2 continuation sentences for evaluating the average reward. We use a zero-shot classifier to evaluate the topic accuracy of the continuation sentences. That is, we do not assume access to classifiers pretrained on topic-specific sentences, because generating such topic-specific sentences is the goal of the task in the first place. We additionally use an LM to evaluate the log-likelihood of the continuation sentences for measuring language quality. Since the prompt length could impact the generated sentences, we conducted experiments with maximum prompt lengths of 5, 10, and 15. As an ablation study, we also evaluate the SQL algorithm with only off-policy updates (i.e., without on-policy exploration), denoted SQL(off), and compare it with vanilla MLE training. At test time, given a topic, the trained prompt generator produces one prompt using beam-search decoding. For each generated prompt, the pretrained LM generates 100 sentences using top-k decoding (with k = 50) for evaluation. Finally, we also compare with two specialized controllable generation techniques based on pretrained LMs, namely PPLM (Dathathri et al., 2019) and GeDi (Krause et al., 2020), following similar procedures using their open-sourced code. We use a distilled GPT-2 model as the pretrained LM to be controlled. We use the paraphrase generation model based on Zhang et al. (2019).

[Table A.6 content: for each topic (legal, politics, computers, space, religion, science, military), a generated prompt (e.g., "politics: the primary referendum is", "computers: macintoshintoshintoshintosh") followed by the LM's continuations of the "In summary" and "This essay discusses" input sentences; see Table A.6.]

Figure 1: Left: An overview of the proposed SQL algorithm. Text generation is challenging due to sparse reward (i.e., the rewards of all intermediate steps are 0) and large action space (i.e., a large vocabulary). Our SQL formulation enables several key algorithmic features, as highlighted in yellow, including (1) the combined on- and off-policy updates for the best of both, (2) bridging the final non-zero reward to directly supervise the Q-value estimation at intermediate steps for learning stability, and (3) simultaneously updating the Q-values of all candidate actions for efficiency. Right: We explore diverse applications of the text-generation RL algorithm.

Figure 2: Soft Q-learning with path consistency learning (PCL) objectives. Left: Single-step objective (Eq. 7), where for each (s_t, a_t), the computation involves steps t and t+1. Dashed boxes in dark green and gray indicate the regression target, where the intermediate reward r_t is often 0 due to sparsity. The gradient is applied to parameters θ at step t (indicated by orange color). Right: Multi-step objective (Eq. 9), which aggregates from step t all the way to T. In this way, the final-step non-zero reward r_T is used as the regression target.

Figure 3: Left: entailment generation performance plotted against diversity (average of H_1 and H_2). Circles represent results of top-p sample outputs, and triangles represent results of beam-search outputs. Please see Table A.3 for additional results. Right: entailment attack performance against diversity. Only a few MLE+PG dots are visible because the model is not able to generate more diverse samples even with increasing p value in top-p decoding, i.e., the model collapses.

Figure 5: The scheme of prompt generation for controlling the outputs of pretrained LMs.

Figure A.1: Entailment generation performance plotted against diversity (average of H_1 and H_2).

Figure A.2: Training curves on validation sets. Left: Training curves on E2E with the best hyperparameter configurations. Middle: Training curves on E2E with varying reward scale. Right: Training curves on CommonGen with varying reward scale.
Algorithm 1 Efficient Soft Q-Learning for Text Generation
Input: Q_θ (i.e., generation model logit function f_θ in Eq. 1)
...
4: Draw a batch of on-policy samples {τ_on} by decoding with policy π_θ(a_t | s_t) (Eq. 4)
5: Compute Q_θ(s_t, a_t) values (the model logits) and target Q_θ̄(s_t, a_t) for (s_t, a_t) ∈ {τ_off} ∪ {τ_on}
6: Compute the objectives in Eqs. (7) and (9)
...

Table A.2 shows samples. We can see that SQL outperforms MLE+PG consistently across different diversity values. The outputs from MLE+PG are not diverse even with high p's, indicating that the model collapses and can only generate a small set of unique adversarial examples. The model trained by SQL discovers the pattern "saint-pierre-et-saint-paul" (an entity name), and exploits it to generate samples with high universal entailment rate.

Figure 4: Average topic accuracy. Please see Table A.4 for more details.

Table 1: Average perplexity across topics. The lower, the more fluent the generated continuation sentences.

Table 2: Average sentence generation time cost.
Table A.1: BLEU results on the E2E val/test sets.
Table A.3: Beam search results on entailment generation, in the format val/test. ↑/↓ indicates higher/lower is better. † SQL(single) achieves zero H_1/H_2 as it generates a single token.

Table A.4: Prompt generation results. Note that some of the numbers for GeDi are low because the topics are tokenized into two subword tokens, which the model was not trained with.

Table A.6: Prompt samples from SQL.