Implicit Unlikelihood Training: Improving Neural Text Generation with Reinforcement Learning

Likelihood training and maximization-based decoding result in dull and repetitive generated texts even when using powerful language models (Holtzman et al., 2019). Adding a loss function for regularization was shown to improve text generation output by helping avoid unwanted properties, such as contradiction or repetition (Li et al., 2020). In this work, we propose fine-tuning a language model with policy gradient reinforcement learning, directly optimizing for better generation. We apply this approach to minimizing repetition in generated text, and show that, when combined with unlikelihood training (Welleck et al., 2020), our method further reduces repetition without impacting language model quality. We also evaluate other methods for improving generation at training and decoding time, and compare them using a variety of metrics aimed at assessing text generation quality.


Introduction
Language models have become a subject of close attention in the Natural Language Processing field over the past few years. They are widely used not only for unsupervised pre-training, but also for text generation, such as in dialogue systems (Roller et al., 2020). While there are ongoing efforts to develop non-autoregressive models for language modeling, most current state-of-the-art approaches generate text autoregressively (i.e., word by word). Holtzman et al. (2019) showed that even powerful trained models with a high likelihood value for test data can output repetitive results. Schmidt (2019) argues that the reason for this is train-test discrepancy and a lack of generalization under standard maximum likelihood estimation (MLE) training.

* Work done while at VK.
Unwanted repetition can be remedied at decoding and training time. Decoding methods focus on sampling techniques that generate less repetitive or incoherent samples, while other methods aim to improve model training to minimize the effects of degeneration. An effective method for reducing language model degeneration is unlikelihood training (Welleck et al., 2020), where a regularization term forces the model to reduce the probability of generating a token that has already occurred in a sequence. Li et al. (2020) further explored this idea and showed that adding a loss function for regularization to avoid undesirable sequences improves text generation not only by reducing repetition, but also by decreasing contradiction. Roller et al. (2020) reported that adding unlikelihood training also improves the humanness of generated text.
In this paper, we propose Implicit Unlikelihood Training, a method for regularizing output by fine-tuning a language model with policy gradient reinforcement learning to improve generation results. We apply this method for a repetition objective, and show that combining Implicit Unlikelihood Training with minimizing unlikelihood loss results in reduced repetition and perplexity. We also evaluate alternative approaches to improving generated texts in terms of repetitiveness, and compare these methods using a wide variety of metrics.
Related Work

Decoding Strategies

Holtzman et al. (2019) observed that maximization-based decoding methods, such as top-k sampling (Fan et al., 2018) and beam search and its variations, can all lead to degeneration. They addressed this problem with top-p (nucleus) sampling, which samples from the top portion of the probability mass. Paulus et al. (2017) reported that ground-truth sentences for summarization tasks almost never contain the same trigram twice, and proposed the beam-blocking approach, where the decoder is forced to never output the same trigram more than once during testing. Penalized sampling (Keskar et al., 2019) works by discounting the scores of previously generated tokens. Martins et al. (2020) proposed preventing unlikely words from receiving any probability mass by using entmax sampling. Jiang et al. (2020) suggested that some tokens can be more difficult for a model to learn than others; these tokens remain under-learned after training, making their repetition more likely. This issue is addressed by token loss dynamic reweighting (TLDR), which applies differentiable weights to individual token losses. Repetition can also be reduced at training time by adding unlikelihood loss (Welleck et al., 2020; Li et al., 2020) to the regular likelihood loss. Unlikelihood training aims to decrease the probability of previously generated tokens, and it was shown to outperform beam blocking and top-p sampling.
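As an illustration of nucleus sampling, top-p filtering over a categorical distribution can be sketched in a few lines. This is a simplified pure-Python sketch over an explicit probability list; the helper name and representation are ours, not taken from any of the cited implementations:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

Sampling then proceeds from the renormalized distribution, so low-probability tail tokens can never be emitted.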

Training Strategies
Coverage mechanisms (Tu et al., 2016;See et al., 2017) can also be used to reduce repetition. Adding pre-attention and highway connections was shown to decrease repetition for RNNs (Jiang et al., 2020), while the architecture tweaks required for Transformers (Vaswani et al., 2017) are still an open question.

Unlikelihood Training
Unlikelihood Training involves adding an unlikelihood loss term that lowers the probability p_θ(c | x_<t) of each negative candidate in C_t = {c_1, c_2, ..., c_n} at each time step t:

L_UL(p_θ, C_t, x_<t) = −∑_{c ∈ C_t} log(1 − p_θ(c | x_<t))    (1)

Taking the negative candidate set to be the previous tokens, C_t = {x_1, x_2, ..., x_{t−1}} \ {x_t}, improves generation results by reducing repetition. Welleck et al. (2020) also proposed using sequence-level unlikelihood loss on sampled continuations: a continuation (x_{t+1}, x_{t+2}, ..., x_{t+N}) is first sampled from a prefix (x_1, x_2, ..., x_t), and then the loss defined in Eq. 1 is minimized for each sampled token x_{t+i} that is part of an n-gram already occurring before position t + i (see Algorithm 3 for details on sequence-level unlikelihood loss). Fine-tuning a language model is then performed by equally alternating between sequence-level unlikelihood and likelihood updates.
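A minimal sketch of the per-step term in Eq. 1, in pure Python over an explicit probability table. The function name and the dictionary representation of the distribution are ours; a real implementation would operate on model logits in a batched fashion:

```python
import math

def unlikelihood_loss(step_probs, prev_tokens, target):
    """Token-level unlikelihood loss for one time step:
    -sum over negative candidates c of log(1 - p(c | x_<t)),
    where candidates are previously seen tokens other than the target."""
    candidates = set(prev_tokens) - {target}
    return -sum(math.log(1.0 - step_probs[c]) for c in candidates)
```

Summing this term over all time steps and adding it to the likelihood loss gives the token-level unlikelihood objective.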
Perplexity is the metric used to evaluate language model quality. It is defined as ppl(x) = p(x_1, x_2, ..., x_t)^(−1/t), where x_1, x_2, ..., x_t is the sequence of tokens from test data; the lower the perplexity, the better the language model.

Welleck et al. (2020) measured sequence repetition as the portion of duplicate n-grams in a generated sequence:

seq_rep_n(x) = 1 − |unique n-grams of x| / |n-grams of x|    (2)

Higher repetition values mean that a language model tends to produce more repetitive output, which might appear less natural. Note that 0 ≤ seq_rep_n(x) < 1. As in Welleck et al. (2020), we controlled for the number of unique next-token predictions (uniq), as it was shown that generated texts are less diverse than those written by a human. We also used the number of unique tokens in continuations of validation or test prefixes (uniq-seq) as a measure of token distribution in generated text. rep/l is the fraction of next-token (top-1) predictions that occur in the previous l tokens; wrep/l is a variant of rep/l that only counts single-token repetitions that are not equal to the ground-truth next token. We use l ∈ {16, 32, 128, 512} and average the results to compute rep and wrep.

Martins et al. (2020) introduced ε-perplexity for computing the perplexity of sparse distributions: the distribution is smoothed by adding a small value ε to all terms, followed by renormalization. They also introduced the sparsemax score (sp) and Jensen-Shannon divergence (JSD) for evaluating the quality and sparsity of probability distributions. For deterministic models, the sparsemax score becomes word accuracy, and is bounded between 0 and 1. JSD measures the distance between the sparse or truncated distribution and the one-hot encoded ground-truth distribution, and is used as a metric for language models under different decoding strategies. Unlike perplexity, JSD is bounded by log 2. Jiang et al. (2020) evaluate methods with a diversity metric based on n-grams (DIMEN); a high DIMEN score means that a set of generated sequences is diverse.
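The seq_rep_n metric of Eq. 2 is straightforward to compute from a token list; a sketch (the helper name is ours):

```python
def seq_rep_n(tokens, n=4):
    """Fraction of duplicate n-grams in a sequence:
    1 - (#unique n-grams / #n-grams). Returns 0.0 for sequences
    too short to contain any n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

For example, a sequence that repeats a 4-token phrase verbatim scores above zero, while a sequence with all-distinct 4-grams scores exactly zero.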
In this paper, we mainly focused on reducing sequence repetition (seq_rep_n) (Welleck et al., 2020), the portion of duplicate n-grams in a generated sequence. Improving generation results by minimizing repetition should not significantly affect the perplexity of the language model.

Implicit Unlikelihood Training
Li et al. (2020) showed that Unlikelihood Training can be employed as a general framework for reducing the likelihood of undesirable text generation results through training on negative examples. However, we argue that, in some cases, it can be difficult to construct negative samples for specific types of unlikelihood loss. To address this issue, we propose extending Unlikelihood Training with policy gradient reinforcement learning, which does not require explicitly constructed negative samples.
We chose to test this approach on repetition, the most widely studied property of neural text degeneration. To directly minimize repetition (see Equation 2) for a sequence x, we define the reward as R = 1 − seq_rep_n(x) with n = 4. We alternated between maximizing the reward R, optimizing the likelihood of the training data, and minimizing the sequence-level unlikelihood loss (see Algorithms 1, 2, and 3 for details on Implicit Unlikelihood Training and the policy gradient update).
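The policy gradient update can be sketched as a REINFORCE-style surrogate loss: scale the sequence log-likelihood by the (baseline-subtracted) reward, so that minimizing the loss increases the probability of low-repetition samples. This is a simplified sketch under our own naming; a real implementation would backpropagate through the model's per-token log-probabilities:

```python
def policy_gradient_loss(log_probs, sequence, baseline=0.0, n=4):
    """REINFORCE surrogate loss for the reward R = 1 - seq_rep_n(sequence):
    loss = -(R - baseline) * sum of token log-probs of the sampled sequence."""
    ngrams = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    # R = 1 - seq_rep_n = fraction of unique n-grams
    reward = len(set(ngrams)) / len(ngrams) if ngrams else 1.0
    return -(reward - baseline) * sum(log_probs)
```

Subtracting a baseline (e.g., a running mean of recent rewards) reduces the variance of the gradient estimate without changing its expectation.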
Algorithm 1: i-UT: alternating between MLE, UL, and PG updates. Input: update rate r, total number of updates.

Experiment Details

Setup
We fine-tuned small and medium GPT-2 models (175M and 345M parameters, respectively) (Radford et al., 2019). Our experiments consisted of alternating between three types of updates: maximizing likelihood (MLE), minimizing the sequence-level unlikelihood loss (UL), and minimizing repetition with policy gradient reinforcement learning (PG). The first approach is a plain MLE update, for which we do not use any specific methods for reducing repetition in samples. We also experimented with Unlikelihood Training (UT), which involves alternating between MLE and UL updates (see Algorithm 1 for details).
In policy gradient experiments, we trained models in three different scenarios: plain PG, for which we alternated MLE and PG updates; a combined PG + UT approach, where we alternated between maximizing the likelihood and minimizing the sum of the policy gradient and unlikelihood losses; and finally, the proposed Implicit Unlikelihood Training (i-UT), which consists of alternating between MLE, UL, and PG updates (see Algorithms 1, 2, and 3). We used alternating update rates of 0.25 and 0.5 for the small GPT-2 model, and r equal to 0.5 for the medium GPT-2 model.
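The alternation between update types can be sketched as a randomized schedule. This is our own reconstruction of the alternation described above (with probability equal to the update rate, perform a UL or PG update, alternating equally between the two; otherwise perform a plain MLE update); the function and variable names are ours:

```python
import random

def i_ut_schedule(total_updates, rate, rng=None):
    """Sketch of the i-UT update schedule: with probability `rate` do a
    UL or PG update (strictly alternating between the two), otherwise an
    MLE update. Returns the sequence of update types."""
    rng = rng or random.Random(0)
    schedule, use_ul = [], True
    for _ in range(total_updates):
        if rng.random() < rate:
            schedule.append("UL" if use_ul else "PG")
            use_ul = not use_ul
        else:
            schedule.append("MLE")
    return schedule
```

With rate r = 0.5, roughly half the updates are plain MLE and the remainder split evenly between UL and PG.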
Full optimization details are provided in Appendix A.1 (Optimization Details).

Evaluation
We used top-k and top-p sampling with different values of k and p to evaluate sequence repetition (seq_rep_4) for the described approaches. For these experiments, we used validation data to evaluate perplexity and to generate the sampling prefixes for evaluating the uniq and seq_rep_4 metrics. The number of unique tokens (uniq) was evaluated using greedy sampling. We also evaluated the proposed method with the rep, wrep, and JSD metrics using different sampling methods on test data, and compared it with other related approaches (MLE, UT, entmax). We repeated each experiment 5 times and report the mean and standard deviation of the measurements.
Further experiments and their results are described in Appendix A.2 (Experiments).

Results
We showed that Implicit Unlikelihood Training is a competitive approach that outperforms the other methods in sequence repetition when fine-tuning small and medium GPT-2 models (see Table 1) on most variants of top-k and top-p sampling, while maintaining the lowest perplexity and the highest count of unique tokens generated. This approach also achieved better results than training with entmax loss and other related approaches across a range of sampling methods (see Table 2), with the only exception being the rep metric, where entmax performed similarly to i-UT.
Samples of generated outputs are provided in Tables 8 and 9 in the Appendix.

Future Work
The described and evaluated reinforcement learning framework makes it possible to optimize text generation for any objective. In future work, we intend to test the approach not only for repetition, but also for various other metrics, such as the toxicity level or bias of generated text.

A.1 Optimization Details
For the likelihood update, we evaluated the likelihood of token sequences of length 300. For both UL and PG updates, we split each 300-token sequence into 6 prefixes of length 50, and used these prefixes to sample sequences with a maximum length of 100 tokens. For optimization, we used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 6.25 × 10^−5. As in Welleck et al. (2020) and Martins et al. (2020), we used no warm-up steps for UT and α-entmax training. For i-UT, we used 500 linear warm-up steps, after which we linearly decayed the learning rate to zero.
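The warm-up and decay schedule can be written as a simple function of the step index. A sketch under the hyperparameters above (5000 total updates, 500 warm-up steps, base learning rate 6.25 × 10^−5); the helper name is ours:

```python
def lr_at_step(step, total_steps=5000, base_lr=6.25e-5, warmup=500):
    """Learning rate with linear warm-up for `warmup` steps followed by
    linear decay to zero at `total_steps`."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (total_steps - step) / (total_steps - warmup)
```

The rate peaks at the base value exactly when warm-up ends and reaches zero at the final update.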
In all our experiments, we fine-tuned language models for 5000 total updates.
Once training was complete, we selected the checkpoint with the lowest validation perplexity obtained during training. In most of our experiments this was the last checkpoint, which means that the general log-likelihood loss converged.
As shown in Algorithm 1, we alternated equally between UL and PG updates. We also found that reducing the unlikelihood update rate to 0.25 can be similarly effective while taking half the time (see Table 1). The parameters ε for ε-perplexity and α for α-entmax training were taken from Martins et al. (2020) (except for ε, which we set to 1 × 10^−6).
We conducted a coefficient search for our policy gradient loss with c ∈ {3, 9, 15, 30} for the small GPT-2 model and c ∈ {3, 15, 30} for the medium GPT-2 model. We chose the best models based on validation-set results, and also report the metrics on the test set.

A.2 Experiments
We evaluated DIMEN and uniq-seq for the UT and i-UT methods applied to small and medium GPT-2 models, using different sampling methods for DIMEN and greedy sampling for uniq-seq. In this experiment, we observed that Implicit Unlikelihood Training performed better than or on par with Unlikelihood Training across sampling methods as measured by DIMEN, while having a significantly better uniq-seq value (see Table 7).
We also evaluated sequence repetition with beam-search sampling for the MLE, UT, and i-UT methods for both small and medium GPT-2 models, using validation data to form the sampling prefixes. When sampling with beam search, we found that Implicit Unlikelihood Training produced better results than Unlikelihood Training (see Table 3).
For greedy sampling with the small GPT-2 model, we evaluated sequence repetition, wrep, uniq, and perplexity. We used test data to evaluate perplexity and to form sampling prefixes for the other metrics. We observed that the MLE, UT, and i-UT methods had similar repetition performance under greedy sampling, while i-UT still had the best number of unique tokens (see Table 4).
Finally, we evaluated the TLDR method using both the sequence repetition and DIMEN metrics (see Tables 5 and 6). In our experimental setup, TLDR performed on par with the MLE approach.

B Negative Results
Our results showed that all sampling methods other than greedy sampling led to worse convergence of the seq_rep_4 metric.
We experimented with using the Proximal Policy Optimization algorithm (Schulman et al., 2017) for PG update, but faced unstable validation perplexity behavior during training, and did not obtain any comparable results.
Another unsuccessful direction of our experiments was substituting the reward calculated on the full sequence with a reward assigned to each token separately. We tried two variants of a binary reward function: whether the current n-gram appears for the first time in the text, and whether the current n-gram appears in the following part of the text. We experimented with advantage estimation using a value function estimator, and without it, using pure rewards. In the former case, we adjusted different values of λ and γ for the Generalized Advantage Estimation algorithm (Schulman et al., 2015); in the latter, we used a general discounted future reward. We observed that the approach of estimating a single reward for the whole sequence and subtracting a baseline value to reduce the variance of the gradient estimate performed best.

Table 9: Generation Samples