Using Question Answering Rewards to Improve Abstractive Summarization

Neural abstractive summarization models have improved drastically in recent years. However, the summaries generated by these models generally suffer from issues such as failing to capture critical facts in source documents and containing facts that are inconsistent with the source documents. In this work, we present a general framework to train abstractive summarization models to alleviate such issues. We first train a sequence-to-sequence model to summarize documents, and then further train this model in a Reinforcement Learning setting with question-answering based rewards. We evaluate the summaries generated by this framework using multiple automatic measures and human judgements. The experimental results show that question-answering rewards can be used as a general framework to improve neural abstractive summarization. In particular, the results from human evaluations show that the summaries generated by our approach are preferred over the summaries generated by general abstractive summarization models more than 30% of the time.


Introduction
Although neural abstractive summarization has seen drastic improvements over recent years (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018; Shi et al., 2021), these systems still have multiple drawbacks. One common drawback is that the generated summaries frequently fail to capture critical facts in source documents (low recall) (Scialom et al., 2021). On the other hand, neural abstractive summarization models are known to generate content that is inconsistent with the source document (low precision), a problem commonly known as hallucination (Kryscinski et al., 2019, 2020). Some studies (Cao et al., 2018) claim that nearly 30% of the outputs of common abstractive summarization models suffer from this problem. Figure 1 shows a source document, the ground truth summary, and a few summaries generated by neural models. In Generated Summary 1, the model fails to capture some of the crucial facts in the original dialog, such as the fact that the play is translated. In Generated Summary 2, although the model successfully identifies that the play is translated, it incorrectly mentions that both Charlie and Curtis are performing. Due to such common factuality-related issues, neural abstractive summarization models are hardly usable in real-world applications (Scialom et al., 2021).
In this work, we propose a general framework to alleviate factuality-related issues and improve the quality of abstractive summarization by using question-answering (QA) based rewards. First, we train a sequence-to-sequence (seq2seq) summary generation model that takes a document as the input and generates a summary as the output. Next, we improve the precision and recall of the summary generation model using a QA framework as follows. To improve the precision of the model, we first generate questions and corresponding answers from each generated summary. Next, we evaluate the answers we obtain for the same questions from the ground truth summaries. If a generated summary contains factually incorrect information, some of the generated questions will receive answers from the ground truth summary that differ from those obtained from the generated summary. We use the similarity of the answers to calculate a reward to improve precision. Similarly, to improve the recall of the summarization model, we generate questions and corresponding answers from the ground truth summaries and evaluate the answers we obtain for the same questions from the generated summaries. If the generated summary does not contain some key information captured in the ground truth summary, some of the generated questions will receive different answers from the two summaries. We use the similarity of the answers to calculate a reward to improve recall. The calculated rewards are used in a Reinforcement Learning (RL) based framework to improve the summary generation model. In Figure 1 we show an example output from our approach, which does not contain the factuality-related issues shown above. We evaluate the summaries generated by our approach using multiple automatic measures and human judgements, and show that QA can be used as a general framework to improve abstractive summarization.
In summary, our key contributions are: (1) We introduce a Reinforcement Learning framework which uses QA rewards to improve the recall and precision of abstractive summarization. (2) The framework is evaluated on three commonly used transformer based summarization models on two public datasets. (3) The evaluation of the generated summaries on several automatic measures and human judgements shows the effectiveness of our method. In particular, the human judges prefer summaries generated by our approach more than 30% of the time over the summaries generated by general abstractive summarization models.

Related Work
There has been previous work on improving the factual consistency of abstractive summarization models. Cao et al. (2018) used an approach with two encoders, one to encode the source document and another to encode the facts, and a decoder that attends to the outputs of the two encoders when generating the summary. Zhu et al. (2020) used OpenIE to extract facts and used them in the form of knowledge graphs to improve abstractive summarization. Arumae and Liu (2019) used facts obtained from question-answering rewards to improve extractive summarization. Huang et al. (2020) used multi-choice cloze rewards, in addition to knowledge graphs, to improve factual consistency. Other work has incorporated entailment knowledge into abstractive summarization to improve factual correctness.
Several works have been proposed to evaluate the factuality of summarization algorithms, as the more common n-gram based metrics, such as ROUGE (Lin, 2004), are known to perform poorly for this purpose. Most recent approaches proposed for evaluating factuality are based on QA frameworks (Chen et al., 2018; Eyal et al., 2019; Deutsch et al., 2020; Durmus et al., 2020; Scialom et al., 2021). The evaluation metrics proposed by the above studies measure the extent to which a generated summary provides sufficient information to answer questions posed on its ground truth summary, and whether the questions generated on the generated summary can be answered by the ground truth summary.

Improving Summarization with QA Rewards
In general, abstractive summarization models are trained to minimize the cross entropy loss of the reference summary at the word level, which does not necessarily reward models for being factually accurate with high precision and recall (Maynez et al., 2020). Hence, to improve the factual accuracy of abstractive summarization, we propose a general framework which uses QA based rewards and RL based training. Our proposed framework is illustrated in Figure 2, and below we describe its critical components.

Summary Generator
Recent work has leveraged pre-trained Transformer (Vaswani et al., 2017) models for abstractive summarization (Lewis et al., 2019; Zhang et al., 2020). In this work, as the first step of summary generation, we train a transformer-based seq2seq model (S), where the source document is fed as the input and the model is trained to generate the summary token-by-token, optimizing the cross entropy loss. During inference, we use top-p nucleus sampling (Holtzman et al., 2019) as the decoding mechanism, with p=0.95.
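As an illustration, this decoding step maps directly onto the Hugging Face transformers generation API. The sketch below is minimal and the checkpoint name is a placeholder, not the fine-tuned model used in the paper.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; the paper fine-tunes its own seq2seq models.
model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = "..."  # source document text goes here
inputs = tokenizer(document, return_tensors="pt", truncation=True)

# Top-p nucleus sampling with p=0.95, as described above.
summary_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    max_length=128,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```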

Question-Answer Generator
The QA Generator is utilized to generate questions and answers from the original and generated summaries. We generate questions and corresponding answers from the original summary and evaluate the answers obtained for those questions from the generated summary. Similarly, we generate questions and corresponding answers from the generated summary and evaluate the answers obtained for those questions from the original summary. The functionality of the QA framework is explained in Algorithm 1. To generate questions and corresponding answers, we use an answer-aware question generation model (https://huggingface.co/valhalla/t5-base-qg-hl), which is fine-tuned on the t5-base (Raffel et al., 2020) model. To identify the answer for a generated question from a summary, we use an extractive QA model (https://huggingface.co/distilbert-base-cased-distilled-squad), which is trained on the SQuAD task (Rajpurkar et al., 2018).
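A minimal sketch of this step with the transformers library is shown below. The highlight-token prompt format for valhalla/t5-base-qg-hl is our assumption based on its model card; the extractive QA pipeline call is standard.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Answer-aware question generation model named in the paper.
qg_tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-qg-hl")
qg_model = AutoModelForSeq2SeqLM.from_pretrained("valhalla/t5-base-qg-hl")

# Extractive QA model named in the paper.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

summary = "person_1 will be home soon and she will let person_0 know."
answer_span = "person_1"  # a candidate answer span taken from the summary

# Assumed input format (per the model card): the answer span is wrapped in
# <hl> tokens and the input is prefixed with "generate question:".
highlighted = summary.replace(answer_span, f"<hl> {answer_span} <hl>", 1)
inputs = qg_tokenizer(f"generate question: {highlighted}", return_tensors="pt")
question_ids = qg_model.generate(**inputs, max_length=64)
question = qg_tokenizer.decode(question_ids[0], skip_special_tokens=True)

# Ask the generated question against the other summary.
other_summary = "person_1 is away for the evening."
result = qa(question=question, context=other_summary)
print(question, "->", result["answer"])
```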

Reward Model
We use the similarity between the answers obtained from the generated and ground truth summaries as the reward function. A generated summary is considered relevant if the questions posed on the ground truth summary can be answered correctly by the generated summary, as this shows the critical information queried by the question is present in the generated summary. Similarly, a generated summary is considered factual if a question generated on the generated summary can be correctly answered by the ground truth summary, as the questions generated on a hallucinated summary will not be correctly answered by the original summary.
The reward computation (Algorithm 1) proceeds as follows. (1) Generate the question set Q_Ga and the corresponding answer set A_Ga from the generated summary Ga, and similarly Q_Gt and A_Gt from the ground truth summary Gt, using the QA generator. (2) Ask the questions Q_Ga of Gt and obtain the corresponding answer set A'_Ga using the extractive QA model A; similarly, ask Q_Gt of Ga and obtain A'_Gt. (3) Calculate the reward for Ga as the similarity between A_Ga and A'_Ga, together with the similarity between A_Gt and A'_Gt. In this study, we use the Normalized Levenshtein distance (Yujian and Bo, 2007) as the similarity measure. An example of using QA for reward calculation is provided in Section B of the appendix. The reward is used by the RL framework (shown in Figure 2) to further train the summary generation model S.
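The reward thus reduces to an answer-similarity score. Below is a self-contained sketch of a Normalized Levenshtein based reward under our reading of Algorithm 1; averaging the two answer-set similarities with equal weight is our assumption, as the paper does not spell out how the two terms are combined.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1 minus normalized Levenshtein distance; 1.0 means identical answers."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def qa_reward(ans_gen, ans_gen_from_gt, ans_gt, ans_gt_from_gen) -> float:
    """Combine precision-style and recall-style answer similarities.

    ans_gen:          answers to Q_Ga extracted from the generated summary
    ans_gen_from_gt:  answers to Q_Ga extracted from the ground truth summary
    ans_gt:           answers to Q_Gt extracted from the ground truth summary
    ans_gt_from_gen:  answers to Q_Gt extracted from the generated summary
    """
    precision = sum(normalized_similarity(x, y)
                    for x, y in zip(ans_gen, ans_gen_from_gt)) / max(len(ans_gen), 1)
    recall = sum(normalized_similarity(x, y)
                 for x, y in zip(ans_gt, ans_gt_from_gen)) / max(len(ans_gt), 1)
    return 0.5 * (precision + recall)  # assumed equal weighting of the two terms
```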

Policy training
We use proximal policy optimization (PPO) (Schulman et al., 2017) as the optimizer for the policy training, as it prevents the generator from moving too far away from the pretrained language model. We used a publicly available PPO implementation in this study. This approach of QA based optimization following general seq2seq training was used to make this framework applicable across different abstractive summarization models.
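One publicly available implementation fitting this description is the Hugging Face trl library; the paper does not name its implementation, so the loop below is purely illustrative of how such training typically looks. The dataset iterator and compute_qa_reward helper are hypothetical (the latter stands in for the reward sketched in the previous section).

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Assumed setup following trl's documented usage.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

for document, gt_summary in dataset:  # hypothetical (document, summary) pairs
    query = tokenizer(document, return_tensors="pt").input_ids[0]
    # Depending on the trl version, the prompt may need to be stripped
    # from the returned response tensor.
    response = ppo_trainer.generate(query, do_sample=True, top_p=0.95,
                                    max_new_tokens=64)[0]
    gen_summary = tokenizer.decode(response, skip_special_tokens=True)
    # compute_qa_reward is the hypothetical QA-reward helper sketched earlier.
    reward = torch.tensor(compute_qa_reward(gen_summary, gt_summary))
    ppo_trainer.step([query], [response], [reward])
```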

Evaluation and Results
We evaluate our QA based summarization framework on three common neural abstractive summarization models (GPT-2, BART, and PEGASUS) and two public datasets (XSUM and SAMSUM). The documents in the XSUM data are fed to the models unaltered. For the SAMSUM data, we first preprocess the conversations by replacing the personal names (e.g., John) with unique tags (e.g., <person_0>), and then accumulate the utterances in each conversation as follows before feeding them to the models: <person_1>utterance_1 <person_2>utterance_2 <person_1>utterance_3 .... In this implementation, we generate one QA pair per sentence in a summary. In addition, we filter out answers that are long (over 5 words), as we believe such long answers do not correspond to factuality, which is the focus of this study. The average number of QA pairs per summary is 2.5 for SAMSUM and 1.4 for XSUM. The QA based reward process is relatively inexpensive in this study, since the number of QA pairs generated is low compared to studies that generate QA pairs on the source documents (rather than on the summaries). We evaluate each model first with the general method of training (generate the summary given the document), and then with the further RL based training with QA rewards that we propose. The hyper-parameters used in training are available in Section A of the appendix.
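A minimal sketch of the SAMSUM preprocessing described above is shown below; the "Name: utterance" turn format is how SAMSUM dialogues are distributed, but this is our reading of the preprocessing, not the authors' released code.

```python
import re

def preprocess_dialogue(dialogue: str) -> str:
    """Replace speaker names with <person_i> tags and flatten the turns."""
    tags = {}
    parts = []
    for line in dialogue.strip().splitlines():
        # SAMSUM turns look like "John: see you tonight!"
        match = re.match(r"^([^:]+):\s*(.*)$", line)
        if not match:
            continue
        name, utterance = match.group(1), match.group(2)
        if name not in tags:
            tags[name] = f"<person_{len(tags)}>"
        # Also replace in-utterance mentions of known speakers.
        for n, tag in tags.items():
            utterance = utterance.replace(n, tag)
        parts.append(f"{tags[name]}{utterance}")
    return " ".join(parts)

print(preprocess_dialogue("John: are you home?\nAmy: soon, I'll let John know."))
# -> "<person_0>are you home? <person_1>soon, I'll let <person_0> know."
```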
Evaluation with ROUGE scores: We first evaluate the models using ROUGE scores. The obtained results are reported in Tables 1 and 2. Each table contains two sections, where the first section shows the scores before training with QA based rewards, and the second section shows the results after RL based training with QA rewards. The results suggest that for both datasets, each model significantly improves (p < 0.05) its summarization accuracy using our QA framework.
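For reference, ROUGE can be computed with the rouge_score package; the reference and generated strings below are placeholders, not the paper's data.

```python
from rouge_score import rouge_scorer

# Standard ROUGE-1/2/L with stemming.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "person_1 will be home soon and she will let person_0 know."
generated = "person_1 will be home soon."
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```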

Factuality based evaluation:
We evaluate the results obtained from our models using the factuality based evaluation framework proposed by Scialom et al. (2021). This measure provides better correlation with human judgments over four evaluation dimensions (consistency, coherence, fluency, and relevance) (Scialom et al., 2021), and provides precision, recall, and F1 for a generated summary given a reference. The results obtained on the two datasets are shown in Table 3. Similar to the ROUGE based evaluation, the results here clearly indicate that for both datasets, each model improves its accuracy using our QA framework.
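Scialom et al. (2021) released this framework as the questeval package; the call below follows our recollection of its README, so the class and parameter names should be treated as assumptions rather than a verified API.

```python
from questeval.questeval_metric import QuestEval

# QuestEval scores a summary against its source (and optionally references),
# yielding QA-based precision/recall-style signals.
questeval = QuestEval(no_cuda=True)
score = questeval.corpus_questeval(
    hypothesis=["person_1 will be home soon."],
    sources=["<person_1>are you home? <person_0>soon!"],
    list_references=[["person_1 will be home soon and she will let person_0 know."]],
)
print(score["corpus_score"], score["ex_level_scores"])
```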
Human Evaluation: We further conducted human evaluations to study the quality of the models. We focused on the two models that obtained the best scores in our automatic evaluations, PEGASUS and BART, and compared the quality of summaries between each original model and our model optimized with QA rewards. For this assessment we first randomly sampled 30 records from the test sets of SAMSUM and XSUM (60 records overall). Then, we generated 4 types of summaries: PEGASUS, PEGASUS-QA, BART, BART-QA. We followed an evaluation protocol similar to prior work, in which the annotators were presented with a document, a ground truth summary, and a model summary, and were asked to make two decisions: (1) which model summary is more factually consistent with the given document, and (2) which model summary is of a higher quality, taking into account Informativeness, Fluency, and Succinctness. The annotators were instructed to select one summary or indicate that both summaries are equally good or bad. To achieve a high quality standard we recruited 6 NLP experts, and collected 3 human judgments for each summary. To obtain a single score per summary, we took the majority vote of the collected assessments. More details about the human evaluation are available in Section C of the appendix. These results indicate that QA based rewards help to significantly improve the summary generation model, considering both factual consistency and general quality aspects. Examples: In Figure 3 we show some examples of model improvements after RL based training with QA rewards. For each model, we show, as Original, the summary produced by the model before RL training and, as After RL, the summary produced by the model after RL training.

Conclusion
We investigated the problem of low recall and precision of factuality in neural abstractive summarization models, and proposed a framework that uses QA based rewards to alleviate this issue. The proposed framework is evaluated on three commonly used transformer based summarization models and on two publicly available datasets. The automatic evaluations were performed using ROUGE scores as well as a question answering based evaluation framework, and the results suggest that our method improves the summarization accuracy and factuality. The human evaluation on the generated summaries also suggests that our approach produces summaries with significantly higher factual consistency and quality.

Figure 3: Example summaries before and after RL based training with QA rewards.

Ground truth summary: person_1 will be home soon and she will let person_0 know.

GPT-2 Model
Original: person_1 wants to grab something for dinner with person_0. person_0 is not hungry. She will pick up something for dinner when she gets home.
After RL: person_1 is away for the evening. person_0 wants to pick him up and person_1 will let him know when he gets home.

BART Model
Original: person_1 is not hungry tonight. She will be home soon.
After RL: person_1 doesn't want person_0 to cook anything for dinner tonight. She will be home soon and will tell person_0 when she gets home.

Pegasus Model
Original: person_1 will be home soon. person_0 will pick her up.
After RL: person_1 will tell person_0 when he gets home.

Ethics
In this study we used the publicly available SAMSUM (https://huggingface.co/datasets/samsum) and XSUM (https://huggingface.co/datasets/xsum) datasets. For the human evaluation, in order to meet a high quality standard, we recruited 6 NLP researchers who have graduate degrees in NLP and Machine Learning. Before the official evaluation started, we sampled 10 tasks to get an estimate of the duration of the task and to make sure the instructions were clear enough.

A.3 PEGASUS model
Similar to the BART experiments, we use a PEGASUS model (Zhang et al., 2020) provided by the HuggingFace (Wolf et al., 2019) library, which is fine-tuned on the extreme summarization (XSUM) task. For the evaluation with the SAMSUM dataset, we further fine-tune this model on SAMSUM data. This model takes around 7 hours to fine-tune on the SAMSUM data. The code used for the fine-tuning is publicly available. The hyperparameters used for training the PEGASUS model are as follows:

B QA-based Reward Example

The first part of the example shows the questions and answers generated from the GT summary, together with the answers obtained from the GEN summary for the same questions. For example, for the question 'Who will visit person_1's grandma tonight?', the answer from the GT summary is 'person_1 and person_0' while the answer from the GEN summary is only 'person_1'. Since the model failed to capture the fact that both persons will be visiting grandma, the model will receive a lower reward for this case. The next part shows the questions and answers generated from the GEN summary. For example, for the question 'What will person_0 buy for her?', the GEN summary produces the answer 'chocolate and cake' while the GT summary produces 'chocolate' as the answer. This mismatch occurs since the GEN summary has some hallucinated content (cake), and this will be penalized with a lower reward during the RL model training.

C Human Evaluation Details

Figure 5 shows the annotation interface and instructions that were given to the annotators while working on the factual-consistency human evaluation task. Annotators used a drop-down list to select their judgments ([===], [>], [<]). Notice that, following prior work, ground-truth summaries were prepended back onto the source article (within square brackets).