RewardsOfSum: Exploring Reinforcement Learning Rewards for Summarisation

To date, most abstractive summarisation models have relied on variants of the negative log-likelihood (NLL) as their training objective. In some cases, reinforcement learning has been added to train the models with an objective that is closer to their evaluation measures (e.g. ROUGE). However, the reward function used within the reinforcement learning approach can play a key role in performance and is still partially unexplored. For this reason, in this paper we propose two reward functions for the task of abstractive summarisation: the first, referred to as RwB-Hinge, dynamically selects the samples for the gradient update; the second, nicknamed RISK, leverages a small pool of strong candidates to inform the reward. In the experiments, we probe the proposed approach by fine-tuning an NLL pre-trained model over nine summarisation datasets of diverse size and nature. The experimental results show a consistent improvement over the negative log-likelihood baselines.


Introduction
The current state-of-the-art neural text summarisation models have been refined to excel at either the extractive or abstractive styles, or even both (Zhang et al., 2020a; Lewis et al., 2020; Raffel et al., 2020). Along with contemporary summarisation datasets (Narayan et al., 2018a; Grusky et al., 2018; Fabbri et al., 2019), the advent of large pre-trained language models, and their subsequent derivations (Liu and Lapata, 2019; Park, 2020), has made summarisation a more practical task to implement, without compromising, and often improving, the accuracy. However, these models usually employ the standard negative log-likelihood (NLL) as their training objective, which aims to maximise the likelihood of each token in a given ground-truth reference. Despite its efficacy, the NLL fails to account for synonymous tokens and other potentially valid variations, and strongly biases the model towards the ground-truth reference (Ranzato et al., 2016). Furthermore, the NLL operates as a token-level objective during training, which makes for an inconsistent comparison with sequence-level evaluation metrics such as ROUGE (Lin, 2004).
In order to address the inconsistency between token-level training and sequence-level evaluation, reinforcement learning (RL) has been adopted in summarisation and other language generation tasks to allow the optimization of sequence-level metrics during training (Paulus et al., 2018; Pasunuru and Bansal, 2018). Reinforcement learning has proved successful at improving the accuracy of language generation tasks such as summarisation (Paulus et al., 2018; Arumae and Liu, 2018; Pasunuru and Bansal, 2018) and machine translation (Ranzato et al., 2016; Edunov et al., 2018). However, balancing exploration and exploitation remains central to the choice of an effective reward. When standard RL techniques, such as REINFORCE (Williams, 1992), are applied to natural language generation, the required expectation becomes intractable due to large vocabulary sizes. Therefore, the application of REINFORCE is typically reduced to calculating the approximate expectation with respect to only a single predicted sequence. To teach the model the importance of sample variation among synonymous tokens, we instead implement an objective function that includes multiple predicted sequences, allowing for a scenario in which several valid candidate summaries can be considered. Another consideration is that the success of techniques such as REINFORCE strongly depends on the use of an effective and appropriate reward. Designing such a reward, one that enables the model to manipulate multiple sequences and yet provides a positive and informative outcome in the process, is therefore necessary for producing better results. This allows us to modify the reinforcement learning framework so that it assigns a higher weighting only to those predicted sequences which obtain a higher reward.
As such, we apply two techniques to summarisation: RwB-Hinge, which applies a hinge-loss modification to the classical REINFORCE with baseline (Rennie et al., 2017) to selectively apply the model gradients, and Expected Risk Minimization (RISK) (Edunov et al., 2018), which leverages a small pool of strong sampled candidates to smartly inform the reward function. We aptly refer to our framework as RewardsOfSum, to hint at the exploration of suitable reward functions for summarisation. Empirically, we show that the two proposed variants perform better than standard negative log-likelihood baselines over a range of datasets of diverse size and nature.

Related Work
In recent years, there has been some work in summarisation to depart from the traditional negative log-likelihood (NLL) objective function and mitigate its dependency on ground-truth references. Several implementations of reinforcement learning in summarisation have involved optimizing discrete metrics, such as the standard ROUGE (Paulus et al., 2018; Narayan et al., 2018b). Others have introduced novel rewards into the reinforcement learning framework, such as question-focused rewards (Arumae and Liu, 2018), saliency and entailment rewards (Pasunuru and Bansal, 2018), and even distributional semantic rewards. Gao et al. (2020) also present a novel unsupervised metric for summarisation which correlates highly with discrete evaluation metrics when adopted in a reinforcement learning approach.
On the other hand, there has been much work in leveraging large, pre-trained language models (LM) (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020) to improve the quality and performance of summarisation models. Utilizing pre-trained language models requires significantly less engineering effort to continually improve over state-of-the-art baselines. Typically, these approaches include using novel pre-training objectives (Zhang et al., 2020a; Raffel et al., 2020; Zhu et al., 2020) or implementing successful reinforcement learning techniques (Bae et al., 2019). Subsequent work found that optimizing semantic rewards in reinforcement learning, using BERTScore (Zhang et al., 2020b), does not necessarily correlate with the ROUGE score at test time. As such, the choice of reward in a reinforcement learning approach should carefully align with the evaluation metric.
How best to inform the reward via the reward function is critical to the performance of models in an RL framework. In our work, we aim to stray from the sole NLL objective and, by leveraging a pre-trained language model in a reinforcement learning framework, explore different RL-based reward functions for summarisation.

Proposed Reinforcement Learning Training
In order to improve over the negative log-likelihood baseline models, we aim to implement a reinforcement learning framework that adopts the standard evaluation metric, ROUGE, as a reward during training. To keep consistent with previous implementations of reinforcement learning in summarisation, we assume ROUGE-L F1 to be the reward metric in the following work. In Sections 3.1 and 3.2, we use the following standard notations: x denotes an input source document; y*, ŷ, and y^s denote the ground-truth reference, the argmax prediction, and a sampled sequence, respectively; and r(y) denotes the reward of sequence y, computed with respect to the ground-truth reference, y*. By exploiting a combination of sampling and predictions, we aim to enhance training diversity in the vein of the work of Holtzman et al. (2020).
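As a concrete reference, a minimal sketch of a ROUGE-L F1 reward of the kind used as r(y) here could look as follows. This is a simplification assuming plain whitespace tokenisation and ignoring the stemming and multi-sentence handling of the official ROUGE package; the function names are our own, not the paper's.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # Token-level ROUGE-L F1 between a candidate summary and a reference.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An identical candidate and reference score 1.0, and a candidate sharing no tokens with the reference scores 0.0, so the reward is naturally bounded in [0, 1].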

RwB-Hinge
We adopt the standard self-critical policy gradient objective (Rennie et al., 2017), notably applied to summarisation by Paulus et al. (2018):

L_RL = (r(ŷ) − r(y^s)) · L_s   (1)

L_s = Σ_{t=1}^{n} log p(y^s_t | y^s_1, …, y^s_{t−1}, x)   (2)

In (1), y^s and ŷ denote a sampled sequence and the argmax prediction of the current model, respectively. The reward of the argmax, r(ŷ), is used as a "baseline" for the reward of the sample, r(y^s). It is easy to see that if r(y^s) − r(ŷ) > 0, the multiplying factor in (1) is negative: minimising the loss then increases the log-probability L_s, treating y^s as a "good" prediction. Conversely, if r(y^s) − r(ŷ) < 0, y^s is deemed a "bad" prediction and its probability is decreased.
However, in abstractive summarisation it is not trivial to discriminate between a good and a bad summary when the reward score is in an intermediate range. To avoid inappropriately penalising acceptable predictions, we propose incorporating a hinge loss in (1):

L_RwB-Hinge = min(0, r(ŷ) − r(y^s)) · L_s   (3)

The hinge loss limits the gradient updates to only the predictions that are considered good. In this way, we avoid the risk of unstable training updates and hope to afford a clearer trajectory towards a well-trained model.
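A minimal sketch of the self-critical loss in (1) and its RwB-Hinge variant, assuming the per-token log-probabilities of the sampled sequence are already available (the function and argument names are illustrative, not from the paper's codebase):

```python
def self_critical_loss(sample_logprobs, r_sample, r_argmax, hinge=False):
    """Self-critical policy-gradient loss (Rennie et al., 2017).

    sample_logprobs: per-token log-probabilities of the sampled sequence
    r_sample, r_argmax: rewards (e.g. ROUGE-L F1) of the sample and the argmax
    hinge=True applies the RwB-Hinge variant: the advantage is clipped at
    zero, so samples that do not beat the argmax baseline yield no update.
    """
    advantage = r_sample - r_argmax
    if hinge:
        advantage = max(0.0, advantage)
    # Minimising this loss raises the sample's log-probability when the
    # advantage is positive and lowers it when the advantage is negative.
    return -advantage * sum(sample_logprobs)
```

With the hinge enabled, a sample whose reward falls below the argmax baseline contributes a loss of exactly zero, which is the "good predictions only" behaviour described above.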

Expected RISK Minimization
We also utilise a classical structured loss function that has been shown to perform well in sequence-to-sequence learning tasks (Edunov et al., 2018):

L_RISK = Σ_{y ∈ U(x)} cost(y) · p(y|x, θ) / Σ_{y′ ∈ U(x)} p(y′|x, θ), with cost(y) = 1 − r(y)   (4)

In (4), y represents one of multiple candidate summaries, sampled or predicted with the methods defined in Section 4.2 (e.g. argmax, Gumbel-Softmax (Jang et al., 2017)), that together form the candidate summary set U(x). The conditional probability of the predicted summary is noted as p(y|x, θ).
This conditional probability is defined in (5), where m is the number of tokens in the summary:

p(y|x, θ) = exp(s(y) / m)   (5)

s(y) = Σ_{t=1}^{m} log p(y_t | y_1, …, y_{t−1}, x, θ)   (6)

The sum of logarithms in (6) is divided by the total number of tokens in the sequence and scaled back using an exponential function, allowing each candidate summary to be compared fairly in the objective function and avoiding underflow.
By using this objective function, the model is taught to assign higher probability to the candidate summaries that obtain higher rewards. This objective does not require a baseline or hinge loss to select the predictions, since using multiple candidates already exposes the model to different, potentially valid predictions. Edunov et al. (2018) demonstrate the effectiveness of this approach at sentence level for both neural machine translation and summarisation. For the summarisation task, Edunov et al. (2018) compute the reward at sentence level since their dataset has single-sentence references. However, as the reward function is agnostic to single- or multi-sentence predictions, we can easily extend the RISK objective function to summary level.

Overall Training Objective
Similar to previous reinforcement learning implementations (Paulus et al., 2018), we, too, utilise a mixed learning objective function:

L_mixed = γ L_RL + (1 − γ) L_NLL   (8)

This mixed approach helps the model not to deviate too much from the reference summaries, given a γ balancing coefficient chosen with a strict validation criterion (Appendix A). The L_RL term refers to either the RwB-Hinge or RISK training objective function.
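The mixed objective in (8) is a simple convex combination of the two losses (a one-line sketch; the function name is our own):

```python
def mixed_loss(loss_rl, loss_nll, gamma):
    # Mixed training objective, eq. (8): gamma balances the RL term
    # (RwB-Hinge or RISK) against the stabilising NLL term.
    return gamma * loss_rl + (1.0 - gamma) * loss_nll
```

Setting gamma to 0 recovers pure NLL training, while gamma of 1 trains on the RL term alone; the validated values in Appendix A sit close to the RL end.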
Experimental Setup

Datasets
Inspired by the recent work of Zhang et al. (2020a), we utilise nine of the summarisation datasets reported in their paper. The nine datasets have been chosen based on the different lengths of their reference summaries, to provide enough variation to demonstrate the applicability of the presented methods. We split the datasets into three classes: "short", "medium", and "long". Short datasets have reference summaries of ≤ 64 tokens, medium datasets of > 64 and ≤ 128 tokens, and long datasets of > 128 tokens.
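The size classes reduce to a simple rule on the reference-summary token count (an illustrative helper, not from the paper's code):

```python
def length_class(reference_token_count):
    # Dataset classes used in this work: "short" has reference summaries
    # of <= 64 tokens, "medium" 65-128 tokens, "long" > 128 tokens.
    if reference_token_count <= 64:
        return "short"
    if reference_token_count <= 128:
        return "medium"
    return "long"
```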

Sampling Methods
In order to promote exploration across the vocabulary distribution, we employ three simple methodologies to provide candidate sequences for our training objectives.

Argmax: As is standard in the majority of sequence generation tasks, a predicted sequence can easily be provided by allowing the model to make hard decisions (e.g. argmax) over the probability distribution generated by the decoder. This allows us to use it as a baseline for the following experiments. In its simplest form, the argmax is defined as:

y^s_j = argmax p(y|x, y*_{j−1}, θ)   j = 1, …, n   (9)

where we use "teacher forcing" for the predictions.

2nd-Best: Similar to the argmax, we employ a k-best approach to sample the second-best argmax from the same probability distribution generated by the decoder. This allows us to choose different, yet similarly weighted, words from the decoder to introduce variability between the produced summaries:

y^s_j = argmax_{k=2} p(y|x, y*_{j−1}, θ)   j = 1, …, n   (10)

Gumbel-Softmax: We also utilise a recent reparameterization technique known as the Gumbel-Softmax (Jang et al., 2017) that allows sampling soft latent categorical variables by transforming samples from a Gumbel distribution. Compared to standard "hard" predictions, this approach is differentiable and allows controlling the sparsity of the samples with a temperature parameter, τ:

p̃^i_j = softmax((log p^i_j + g_i) / τ)   (11)

In (11), g_i is a sample from the zero-mean, unit-scale Gumbel distribution, p^i_j is the probability for a given token i at slot j, and the temperature parameter, τ, controls the sparsity of the output soft variable, p̃^i_j. In our experiments, we have set τ to 0.1 to enforce sparsity.
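The Gumbel-Softmax transformation in (11) can be sketched over a single decoder slot as follows. This is a plain-Python illustration (names are our own) that assumes strictly positive input probabilities; a practical implementation would operate on logit tensors.

```python
import math
import random

def gumbel_softmax(probs, tau=0.1):
    """Draw a soft sample from a categorical distribution, as in eq. (11).

    probs: next-token probability distribution from the decoder (all > 0)
    tau: temperature; small values (0.1 here, as in the experiments)
    push the soft sample towards a near one-hot vector.
    """
    # Gumbel(0, 1) noise via inverse-CDF sampling of a uniform variate.
    gumbels = [-math.log(-math.log(random.random())) for _ in probs]
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    # Numerically stable softmax over the perturbed logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Unlike a hard argmax, the output is a probability vector and the whole transformation is differentiable with respect to the input probabilities, which is what makes it usable inside the training objective.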

Baseline Model and Training Runs
The abstractive text summarisation model we use for our experiments is PEGASUS, a large pretrained Transformer encoder-decoder architecture that has recently reported state-of-the-art results over a number of datasets. Please refer to Zhang et al. (2020a) for details. All hyperparameters used in our experiments can be found in Appendix B.
We employ two training approaches to test the robustness of the proposed methods. The first is a few-shot learning approach that adopts limited, fixed numbers of training samples (1,000) and training iterations (2,000) for fine-tuning the model. The second is a full-data learning approach that utilises all available training data and exhausts the objective function until convergence over the validation set. In all experiments, we first fine-tune a pre-trained PEGASUS model with the NLL, and then further fine-tune the NLL model with one of the proposed approaches. We train the model in this way to avoid the slow and inefficient training often associated with policy gradient objectives and, as a result, adhere to the standard warm-start NLL training adopted in previous reinforcement learning-based approaches (Paulus et al., 2018).
In the following experiments, we refer to PEGASUS as PEG, and to its NLL-tuned models with the suffixes -few_shot and -full_data. The proposed approaches are in turn noted as RwB-Hinge and RISK.
Results

[Tables 3, 4, and 5 about here: few-shot (top halves) and full-data (bottom halves) results for Argmax, 2nd-Best, G-S, RwB-Hinge, RISK-2 and RISK-3, with scores averaged over three independently-initialised training runs.]

Each fine-tuning method is employed in a mixed-loss framework, as mentioned in (8) in Section 3.3; the value for the γ hyperparameter has been determined over the validation set as described in Appendix A. The results show that all the fine-tuning methods have surpassed the NLL baselines for almost all datasets. Several of these improvements have also passed a bootstrap test for statistical significance, which is regarded as a more appropriate statistical test for summarisation than a t-test (Dror et al., 2018).

Figure 1: Comparing the uni-, bi-, and tri-gram novelty for the medium-sized datasets. These datasets contain generated sequences up to 128 tokens in length. The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. The unique average n-gram novelty (n-grams that do not appear in the source text) is shown to increase across the board compared to the standard NLL baseline.

Figure 1 compares the effect that each fine-tuning method has had on the production of novel n-grams at test time (a property nicknamed n-gram novelty). For the medium-sized datasets in particular, the reinforcement learning approaches appear, on average, to facilitate the production of more distinct uni-, bi-, and tri-grams at test time compared to the NLL baseline. Whilst n-gram novelty is typically used in summarisation to showcase test-time summary abstractiveness, the results in Figure 1 highlight that training with objectives that promote sample variation leads to models capable of producing more novel n-grams (up to 13.8 pp in tri-gram novelty over CNN/DM). This is supported by the qualitative example in Table 6, which shows that the proposed fine-tuning methods can achieve greater diversity of summary predictions whilst still improving over the baseline NLL ROUGE scores. It seems that the proposed fine-tuning methods have allowed the model to effectively weigh the predicted summaries during training, and when combined with the "stable" NLL in a mixed-loss approach, this has been able to produce well-rounded predictions, diverse enough to stray from the original baseline and the reference summaries.
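The n-gram novelty statistic reported in Figure 1 can be sketched as follows (token-level, over distinct n-grams; the helper name is our own):

```python
def ngram_novelty(summary_tokens, source_tokens, n):
    """Fraction of distinct summary n-grams that do not appear in the
    source text, i.e. the 'n-gram novelty' used here as a proxy for
    summary abstractiveness."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    summary_ngrams = ngrams(summary_tokens)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(source_tokens)) / len(summary_ngrams)
```

A purely extractive summary scores 0.0 at every n, while a summary whose n-grams never occur verbatim in the source scores 1.0.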
In addition, Figure 2 shows a performance comparison with respect to the length of the reference summaries for the full-data approach over a medium-size dataset (CNN/DM). We see that our fine-tuning methods have led, on average, to higher ROUGE-L scores for the longer summaries (up to 2.3 ROUGE-L points for summaries between 80-100 tokens, and up to 6.2 points for summaries over 100 tokens). Likely, the proposed methods have been able to amend the reported tendency of the NLL models to curtail the prediction of long summaries.

Figure 2: Comparison of each method for the full-data approach over a medium-size dataset (CNN/DM). The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. We see that the reinforcement learning approaches have led, on average, to higher ROUGE-L scores for the longer summaries compared to the NLL baseline.
Comparing multiple fine-tuning methods is useful for showcasing the improvements that reinforcement learning can bring to a generation task like summarisation. However, no single method has outperformed all others over all the datasets and in both the few-shot and full-data approaches. Whilst all methods have achieved interesting improvements over the baseline figures, we have run a comparison over the validation set to see if their relative rankings could be a reliable indicator of the relative rankings of the test-set scores reported in Tables 3, 4, and 5. Table 7 shows the results for one dataset per class size, showing that for the short and medium size datasets (≤ 128 tokens), either of the RISK methods could be chosen to fine-tune the model. This contrasts with the longer datasets, where the hinge-loss modification has achieved the best results. In both cases, the results are in good agreement with those on the test sets.

Lastly, in Table 8, we further validate our use of the hinge-loss adaptation to the classical REINFORCE with baseline method, a staple in the reinforcement learning literature of language generation tasks (Paulus et al., 2018). Over the same three datasets of Table 7, we see that in the majority of instances the hinge-loss modification has been distinctively better than the standard approach. This confirms our intuition that the adoption of a hinge loss to restrict the gradient updates to "good" predictions only is beneficial to the improvement of ROUGE scores.

Source Document

Dougie Freedman is on the verge of agreeing a new two-year deal to remain at Nottingham Forest. Freedman has stabilised Forest since he replaced cult hero Stuart Pearce and the club's owners are pleased with the job he has done at the City Ground. Dougie Freedman is set to sign a new deal at Nottingham Forest. Freedman has impressed at the City Ground since replacing Stuart Pearce in February. They made an audacious attempt on the play-off places when Freedman replaced Pearce but have tailed off in recent weeks. That has not prevented Forest's ownership making moves to secure Freedman on a contract for the next two seasons.

Table 6: Example of the performance of each method on the CNN/DailyMail dataset for the full-data approach, compared to the reference summary and NLL baseline. Words highlighted in blue indicate that they are not present in the baseline NLL summary. Here we choose a typical method that aligns best with the average NLL baseline score, and compare how the methods pit against it. We see that there is a relative increase in ROUGE scores, whilst diversifying the output.

Table 7: Scores on the validation set for short, medium, and long datasets to determine the best method for each size class. RISK, on average, appears to work best for short/medium-sized datasets (up to 128 tokens), and RwB-Hinge works better for longer datasets (over 128 tokens).

Table 8: Comparisons between REINFORCE with baseline with and without the hinge-loss modification on the validation set for short, medium, and long datasets, to validate the use of the hinge-loss modification in our method. This is run over the full-data baselines, and shows that for the majority of dataset classes, the adopted hinge-loss modification leads to improvements in performance.

Conclusion
In this paper, we have proposed two variants of the reinforcement learning approaches typically used in sequence-to-sequence learning tasks. The two proposed approaches, nicknamed RwB-Hinge and RISK, have been designed to improve the reinforcement learning rewards by selecting and diversifying the predictions used during the fine-tuning of the model. In a set of automated summarisation experiments over nine diverse datasets, the approaches have consistently led to improved performance and also diversified the generated summaries. We note that, despite its commonplace use for summarisation evaluation, utilizing ROUGE as a reinforcement learning reward does not easily translate into improved performance. For this reason, in the near future we plan to explore other contemporary score functions, such as BERTScore (Zhang et al., 2020b), in an attempt to build more effective rewards.

A Validation Scores
To determine an appropriate γ term for our mixed-loss implementation, we have run tests with different values over the validation set for each dataset. To determine the best value, we have utilised the standard REINFORCE (Williams, 1992) approach combined linearly with the negative log-likelihood. We have chosen to optimise REINFORCE here since, being a close relative of, but not the same as, the algorithms we have used during training, it may help to eschew overfitting. In the interest of time, we have utilised the validation scores of a single seed to determine the γ values. For the few-shot implementation in Table A.1, we have fixed the number of examples to fine-tune on (1,000) and the number of training iterations (2,000) exactly as in the standard baseline approach defined in Section 4. For the full-data approach in Table A.3, we have utilised all the training data but, again in the interest of time, we have capped the number of training iterations at either: a) the same training time as the exhausted NLL tests reported in Table B, or b) 10,000 training iterations for the longer datasets (ArXiv, Billsum, Pubmed).

Table A.3: Validation scores of the baseline PEGASUS model, fine-tuned on all training examples provided with the dataset for as many training iterations as either the NLL baseline tests in Section 4, or 10,000 training iterations for longer datasets (ArXiv, Billsum, Pubmed). Best scores are highlighted.

Table A.4: The corresponding gamma weights determined from the above full-data validation tests.

Dataset:  AESLC  ArXiv  Billsum  CNN/DM  Gigaword  Newsroom  Pubmed  Reddit-TIFU  XSum
γ:        0.9    0.7    0.9      0.9     0.9       0.7       0.7     0.9          0.9
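The γ selection described above amounts to a small grid search over the validation set. A sketch, where the validation_score callable is a hypothetical hook standing in for a full fine-tune-then-validate run with the given γ:

```python
def select_gamma(candidates, validation_score):
    """Pick the mixing weight gamma that maximises a validation metric.

    candidates: gamma values to try (e.g. [0.1, 0.3, 0.5, 0.7, 0.9])
    validation_score: callable mapping a gamma value to a validation
    metric (e.g. validation ROUGE-L after fine-tuning with that
    mixed loss); this hook is illustrative, not from the paper.
    """
    return max(candidates, key=validation_score)
```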

B Model Hyperparameters
In our experiments, we have utilised the same hyperparameters used in the original PEGASUS paper (Zhang et al., 2020a). The exception is our use of a smaller batch size, constrained by computational resources. We have used a batch size of 1, which has resulted in a drop in performance compared to that of the original paper. However, our fine-tuning approach is ensured to converge through the use of a convergence criterion. This is defined by a validation run that evaluates the model every 1,000 training iterations and monitors the progression of the validation loss over the entire training run. A model is deemed 'converged' if its validation loss does not decrease over 3,000 training iterations.
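The convergence criterion just described can be sketched as a simple patience rule over the recorded validation losses (one entry per evaluation, i.e. per 1,000 iterations; the function name is our own):

```python
def has_converged(val_losses, eval_every=1000, patience_iters=3000):
    """Convergence rule: the model is evaluated every `eval_every`
    training iterations, and training stops once the validation loss
    has not decreased for `patience_iters` iterations, i.e. for
    `patience_iters // eval_every` consecutive evaluations."""
    patience = patience_iters // eval_every
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Converged if none of the most recent evaluations improved on the
    # best loss seen earlier in the run.
    return min(val_losses[-patience:]) >= best_before
```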