Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards

The growth of online consumer health questions has led to the necessity for reliable and accurate question answering systems. A recent study showed that manual summarization of consumer health questions brings significant improvement in retrieving relevant answers. However, the automatic summarization of long questions is a challenging task due to the lack of training data and the complexity of the related subtasks, such as the question focus and type recognition. In this paper, we introduce a reinforcement learning-based framework for abstractive question summarization. We propose two novel rewards obtained from the downstream tasks of (i) question-type identification and (ii) question-focus recognition to regularize the question generation model. These rewards ensure the generation of semantically valid questions and encourage the inclusion of key medical entities/foci in the question summary. We evaluated our proposed method on two benchmark datasets and achieved higher performance over state-of-the-art models. The manual evaluation of the summaries reveals that the generated questions are more diverse and have fewer factual inconsistencies than the baseline summaries


Introduction
The growing trend in online web forums is to attract more and more consumers to use the Internet for their health information needs. An instinctive way for consumers to query for their health-related content is in the form of natural language questions. These questions are often excessively descriptive and contain more than required peripheral information. However, most of the textual content is not particularly relevant in answering the question * * These authors contributed equally to this work. (Kilicoglu et al., 2013). A recent study showed that manual summarization of consumer health questions (CHQ) has significant improvement (58%) in retrieving relevant answers (Ben Abacha and Demner-Fushman, 2019). However, three major limitations impede higher success in obtaining semantically and factually correct summaries: (1) the complexity of identifying the correct question type/intent, (2) the difficulty of identifying salient medical entities and focus/topic of the question, and (3) the lack of large-scale CHQ summarization datasets. To address these limitations, this work presents a new reinforcement learning based framework for abstractive question summarization. We also propose two novel question-aware semantic reward functions: Question-type Identification Reward (QTR) and Question-focus Recognition Reward (QFR). The QTR measures correctly identified question-type(s) of the summarized question. Similarly, QFR measures correctly recognized key medical concept(s) or focus/foci of the summary.
We use the reinforce-based policy gradient approach, which maximizes the non-differentiable QTR and QFR rewards by learning the optimal policy defined by the Transformer model parameters. Our experiments show that these two rewards can significantly improve the question summarization quality, separately or jointly, achieving the new state-of-the-art performance on the MEQSUM and MATINF benchmark datasets. The main contributions of this paper are as follows: • We propose a novel approach towards question summarization by introducing two question-aware semantic rewards (i) Questiontype Identification Reward and (ii) Questionfocus Recognition Reward, to enforce the generation of semantically valid and factually correct question summaries. • The proposed models achieve the state-ofthe-art performance on two question summa-rization datasets over competitive pre-trained Transformer models. • A manual evaluation of the summarized questions reveals that they achieve higher abstraction levels and are more semantically and factually similar to human-generated summaries.

Related Work
In recent years, reinforcement learning (RL) based models have been explored for the abstractive summarization task. Paulus et al. (2017) introduced RL in neural summarization models by optimizing the ROUGE score as a reward that led to more readable and concise summaries. Subsequently, several studies (Chen and Bansal, 2018;Pasunuru and Bansal, 2018;Zhang and Bansal, 2019;Gupta et al., 2020;Zhang et al., 2019b) have proposed methods to optimize the model losses via RL that enables the model to generate the sentences with the higher ROUGE score. While these methods are primarily supervised, Laban et al. (2020) proposed an unsupervised method that accounts for fluency, brevity, and coverage in generated summaries using multiple RL-based rewards. The majority of these works are focused on document summarization with conventional non-semantics rewards (ROUGE, BLEU).
In contrast, we focus on formulating the semantic rewards that bring a high-level semantic regularization. In particular, we investigate the question's main characteristics, i.e., question focus and type, to define the rewards. Recently, Ben Abacha and Demner-Fushman (2019) defined the CHQ summarization task and introduced a new benchmark (MEQSUM) and a pointer-generator model. Ben Abacha et al. (2021) organized the MEDIQA-21 shared task challenge on CHQ, multi-document answers, and radiology report summarization. Most of the participating team (Yadav et al., 2021b;He et al., 2021;Sänger et al., 2021) utilized transfer learning, knowledgebased, and ensemble methods to solve the question summarization task. Yadav et al. (2021a) proposed question-aware transformer models for question summarization. Xu et al. (2020) automatically created a Chinese dataset (MATINF) for medical question answering, summarization, and classification tasks focusing on maternity and infant categories. Some of the other prominent works in the abstractive summarization of long and short documents include

Proposed Method
Given a question, the goal of the task is to generate a summarized question that contains the salient information of the original question. We propose a RL-based question summarizer model over the Transformer (Vaswani et al., 2017) encoderdecoder architecture. We describe below the proposed reward functions.
3.1 Question-aware Semantic Rewards (a) Question-type Identification Reward: Independent of the pre-training task, most language models use maximum likelihood estimation (MLE)based training for fine-tuning the downstream tasks. MLE has two drawbacks: (1) "exposure bias" (Ranzato et al., 2016) when the model expects gold-standard data at each step during training but does not have such supervision when testing, and (2) "representational collapse" (Aghajanyan et al., 2021), is the degradation of generalizable representations of pre-trained models during the fine-tuning stage. To deal with the exposure bias, previous works used the ROUGE and BLEU rewards to train the generation models (Paulus et al., 2017;Ranzato et al., 2016). These evaluation metrics are based on n-grams matching and might fail to capture the semantics of the generated questions. We, therefore, propose a new question-type identification reward to capture the underlying question semantics.
We fine-tuned a BERT BASE network as a question-type identification model to provide question-type labels. Specifically, we use the [CLS] token representation (h [CLS] ) from the final transformer layer of BERT BASE and add the feed-forward layers on top of the h [CLS] to compute the final logits Finally, the question types are predicted using the sigmoid activation function on each output neuron of logits l. The fine-tuned network is used to compute the reward r QT R (Q p , Q * ) as F-Score of question-types between the generated question summary Q p and the gold question summary Q * .
(b) Question-focus Recognition Reward: A good question summary should contain the key information of the original question to avoid factual inconsistency. In the literature, ROUGE-based rewards have been explored to maximize the coverage of the generated summary, but it does not guarantee to preserve the key information in the question summary. We introduce a novel reward function called question-focus recognition reward, which captures the degree to which the key information from the original question is present in the generated summary question. Similar to QTR, we fine-tuned the BERT BASE network for questionfocus recognition to predict the focus/foci of the question. Specifically, given the representation matrix (H ∈ R n×d ) of n tokens and d dimensional hidden state representation obtained from the final transformer layer of BERT BASE , we performed the token level prediction using a linear layer of the feed-forward network. For each token representation (h i ), we compute the logits l i ∈ R |C| , where (|C|) is the number of classes and predict the question focus as follows: The fine-tuned network is used to compute the reward r QF R (Q p , Q * ) as F-Score of question-focus between the generated question summary Q p and the gold question summary Q * .

Policy Gradient REINFORCE
We cast question summarization as an RL problem, where the "agent" (ProphetNet decoder) interacts with the "environment" (Question-type or focus prediction networks) to take "actions" (next word prediction) based on the learned "policy" p θ defined by ProphetNet parameters (θ) and observe "reward" (QTR and QFR). We utilized ProphetNet (Qi et al., 2020) as the base model because it is specifically designed for sequence-to-sequence training and it has shown near state-of-the-art results on natural language generation task. We use the REINFORCE algorithm (Williams, 1992) to learn the optimal policy which maximizes the expected reward. Toward this, we minimize the loss function where Q s is the question formed by sampling the words q s t from the model's output distribution, i.e. p(q s t |q s 1 , q s 2 , . . . , q s t−1 , S). The derivative of L RL is approximated using a single sample along with baseline estimator b: The Self-critical Sequence Training (SCST) strategy (Rennie et al., 2017) is used to estimate the baseline reward by computing the reward with the question generated by the current model using the greedy decoding technique, i.e., b = r(Q g , Q * ). We compute the final reward as a weighted sum of QTR and QFR as follows: We train the network with the mixed loss as discussed in Paulus et al. (2017). The overall network loss is as follows: where, α is the scaling factor and L M L is the negative log-likelihood loss and equivalent to − t=m t=1 logp(q * t |q * 1 , q * 2 , . . . , q * t−1 , S), where S is the source question.

Datasets
We utilized two CHQ abstractive summarization datasets: MEQSUM and MATINF 1 to evaluate the proposed framework. The MEQSUM 2 training set consists of 5, 155 CHQ-summary pairs and the test set includes 500 pairs. We chose 100 samples from the training set as the validation dataset.
For fine-tuning the question-type identification and question-focus recognition models, we manually labeled the MEQSUM dataset with the question type: ('Dosage', 'Drugs', 'Diagnosis', 'Treatments', 'Duration', 'Testing', 'Symptom', 'Usage', 'Information', 'Causes') and foci. We use the labeled data to train the question-type identification and question-focus recognition networks. For question-focus recognition, we follow the BIO notation and classify each token for the beginning of focus token (B), intermediate of focus token (I), and other token (O) classes. Since, the gold annotations for question-types and question-focus were not available for the MATINF dataset, we used the pre-trained network trained on the MEQSUM dataset to obtain the silver-standard question-types and question-focus information for MATINF 3 . The MATINF dataset has 5, 000 CHQ-summary pairs in the training set and 500 in the test set.

Experimental Setups
We use the pre-trained uncased version 4 of Prophet-Net as the base encoder-decoder model. We use a beam search algorithm with beam size 4 to decode the summary sentence. We train all summarization models on the respective training dataset for 20 epochs. We set the maximum question and summary sentence length to 120 and 20, respectively.   We first fine-train the proposed network by minimizing only the maximum likelihood (ML) loss. Next, we initialize our proposed model with the fine-trained ML weights and train the network with the mixed-objective learning function (Eq. 3). We performed experiments on the validation dataset by varying the α, γ QT R and γ QF R in the range of (0, 1). The scaling factor (α) value 0.95, was found to be optimal (in terms of Rouge-L) for both the datasets. The values of γ QT R = 0.4 and γ QF R = 0.6 were found to be optimal on the validation sets of both datasets. To update the model parameters, we used Adam (Kingma and Ba, 2015) optimization algorithm with the learning rate of 7e − 5 for ML training and 3e − 7 for RL training. We obtained the optimal hyper-parameters values based on the performance of the model on the validation sets of MEQSUM and MATINF in the respective experiments. We used a cosine annealing learning rate (Loshchilov and Hutter, 2017) decay schedule, where the learning rate decreases linearly from the initial learning set in the optimizer to 0. To avoid the gradient explosion issue, the gradient norm was clipped within 1. For all the baseline experiments, we followed the official source code of the approach and trained the model on our datasets. We implemented the approach of Ben Abacha and Demner-Fushman (2019) to evaluate the performance on both datasets. All experiments were performed on a single NVIDIA Tesla V100 GPU having GPU memory of 32GB. The average runtimes (each epoch) for the proposed approaches M 2 , M 3 and M 4 were 2.7, 2.8 and 4.5 hours, respectively. All the proposed models have 391.32 million parameters.

Results
We present the results of the proposed questionaware semantic rewards on the MEQSUM and MATINF datasets in Table-1. We evaluated the generated summaries using the ROUGE (Lin, 2004)  Reference:how can I find a physician (s) or hospital (s) who specialize in rapid methadone withdrawal under anesthesia, and the cost and insurance benefits for the procedure?

Generated Summary
ProphetNet: what is the treatment for rapid withdrawal of methadone under anesthesia? Proposed Approach: where can i find physician (s) who specialize in rapid withdrawal of methadone? outperforming competitive baseline Transformer models. We also compare the proposed model with the joint learning baselines, where we regularize the question summarizer with the additional loss obtained from the question-type (Q-type) identification and question-focus (Q-focus) recognition model. To make a fair comparison with the proposed approach, we train these joint learning-based models with the same weighted strategy shown in Eq. 3. The results reported in Table 1 show the improvement over the ProphetNet on both datasets.
In comparison to the benchmark model on MEQ-SUM, our proposed model obtained an improvement of 9.63%. A similar improvement is also observed on the MATINF dataset. Furthermore, the results show that individual QTR and QFR rewards also improve over ProphetNet and ROUGEbased rewards. These results support two major claims: (1) question-type reward assists the model to capture the underlying question semantics, and (2) awareness of salient entities learned from the question-focus reward enables the generation of fewer incorrect summaries that are unrelated to the question topic. The proposed rewards are model-independent and can be plugged into any pre-trained Seq2Seq model. On the downstream tasks of question-type identification and questionfocus recognition, the pre-trained BERT model achieves the F-Score of 97.10% and 77.24%, respectively, on 10% of the manually labeled MEQ-SUM pairs.
Manual Evaluation: Two annotators, experts in medical informatics, performed an analysis of 50 summaries randomly selected from each test set. In MATINF, nine out of the 50 samples contained translation errors. We thus randomly replaced them. In both datasets, we annotated each summary with two labels 'Semantics Preserved' and 'Factual Consistent' to measure (1) whether the semantics (i.e., question intent) of the source question was preserved in the generated summary and (2) whether the key entities/foci were present in the generated summary. In the manual evaluation of the quality of the generated summaries, we categorize each summary into one of the following categories: 'Incorrect', 'Acceptable', and 'Perfect'. We report the human evaluation results (average of two annotators) on both datasets in Table-2. The results show that our proposed rewards enhance the model by capturing the underlying semantics and facts, which led to higher proportions of perfect and acceptable summaries. The error analysis identified two major causes of errors: (1) Wrong question types (e.g. the original question contained multiple question types or has insufficient type-related training instances) and (2) Wrong/partial focus (e.g. the model fails to capture the key medical entities).

Conclusion
In this work, we present an RL-based framework by introducing novel question-aware semantic rewards to enhance the semantics and factual consistency of the summarized questions. The automatic and human evaluations demonstrated the efficiency of these rewards when integrated with a strong encoder-decoder based ProphetNet transformer model. The proposed methods achieve state-of-the-art results on two-question summarization benchmarks. In the future, we will explore other types of semantic rewards and efficient multi-rewards optimization algorithms for RL.
Our project involves publicly available datasets of consumer health questions. It does not involve any direct interaction with any individuals or their personally identifiable data and does not meet the Federal definition for human subjects research, specifically: "a systematic investigation designed to contribute to generalizable knowledge" and "research involving interaction with the individual or obtains personally identifiable private information about an individual."