Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

Query focused summarization (QFS) models aim to generate summaries from source documents that can answer the given query. Most previous work on QFS considers only the query relevance criterion when producing the summary. However, studying the effect of answer relevance in the summary generation process is also important. In this paper, we propose QFS-BART, a model that incorporates the explicit answer relevance of the source documents given the query via a question answering model, to generate coherent and answer-related summaries. Furthermore, our model can take advantage of large pre-trained models, which improve the summarization performance significantly. Empirical results on the Debatepedia dataset show that the proposed model achieves new state-of-the-art performance.


Introduction
Query focused summarization (QFS) models aim to extract essential information from source document(s) and organize it into a summary that can answer a query (Dang, 2005). The input can be either a single document that has multiple views or multiple documents that contain multiple topics, and the output summary should be focused on the given query. QFS has various applications (e.g., a personalized search engine that provides the user with an overview summary based on their query (Su et al., 2020)).
Early work on the QFS task mainly focused on generating extractive summaries (Davis et al., 2012; Daumé III and Marcu, 2006; Feigenblat et al., 2017; Xu and Lapata, 2020b), which may contain unreadable sentence ordering and lack cohesiveness.*

* The two authors contribute equally.
1 The code is released at: https://github.com/HLTCHKUST/QFS

Document: Interrogator Ali Soufan said in an April op-ed article in the New York Times: "It is inaccurate to say that Abu Zubaydah had been uncooperative [and that enhanced interrogation techniques supplies interrogators with previously unobtainable information]. Along with another F.B.I. agent and with several C.I.A. officers present I questioned him from March to June before the harsh techniques were introduced later in August. Under traditional interrogation methods he provided us with important actionable intelligence."
Query: Are traditional interrogation methods insufficient?
Summary: The same info can be obtained by traditional interrogations.
Table 1: An example of QFS. The input is a document and a corresponding query, and the highlighted sentence is the answer from our QA module. We observe that the summary and the answer are highly correlated.
Other work on abstractive QFS incorporated the query relevance into existing neural summarization models (Nema et al., 2017; Baumel et al., 2018). The closest work to ours was done by Su et al. (2020) and Xu and Lapata (2020a,b), who leveraged an external question answering (QA) module in a pipeline framework to take into consideration the answer relevance of the generated summary. However, they only used QA as distant supervision to retrieve relevant segments for generating the summary, and did not take answer relevance into consideration in the generation model itself. As shown in Table 1, the query focused summary is correlated to the answer extracted from the QA module.
On the other hand, recent neural summarization models (Paulus et al., 2017; Gehrmann et al., 2018) have achieved remarkable performance in generic abstractive summarization by taking advantage of large pre-trained language models (Lewis et al., 2019). Yet, how to leverage these models and adapt them to the QFS task remains unexplored. In this work, we propose QFS-BART, a BART-based (Lewis et al., 2019) framework for abstractive QFS that incorporates explicit answer relevance. We leverage a state-of-the-art QA model (Su et al., 2019) to predict the answer relevance of the given source documents to the query, then further incorporate the answer relevance into the BART-based generation model. We conduct empirical experiments on the Debatepedia dataset, one of the first large-scale QFS datasets (Nema et al., 2017), and achieve new state-of-the-art performance on the ROUGE metrics compared to all previously published work.
Our contributions in this work are threefold:
• Our work demonstrates the effectiveness of the answer relevance score in neural abstractive QFS.
• We propose an effective method to incorporate the answer relevance score into the pretrained language models which can produce more query-relevant summaries.
• Our model reaches state-of-the-art performance on a single-document QFS dataset (Debatepedia), and brings substantial improvements over several strong baselines on two multi-document QFS datasets (DUC 2006, 2007).

Related Work
Abstractive summarization models aim to generate short, concise and readable text that extracts the salient information from a document. In the past few years, significant achievements (See et al., 2017; Liu and Lapata, 2019; Lewis et al., 2019; Dong et al., 2019) have been made in the generic abstractive summarization task, which is attributed to advanced neural architectures and the availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018). QFS is a more complex task that aims to generate a summary according to the query and its relevant document(s). Nema et al. (2017) proposed an encode-attend-decode system with an additional query attention mechanism and a diversity-based attention mechanism to generate more query-relevant summaries. Baumel et al. (2018) incorporated query relevance into a pre-trained abstractive summarizer to make the model aware of the query, while Xu and Lapata (2020a) discovered a new type of connection between generic summaries and QFS queries, and provided a universal representation for them which allows generic summarization data to be further exploited for QFS. Su et al. (2020), meanwhile, built a query model for paragraph selection based on the answer relevance score and iteratively summarized paragraphs to a budget. Although Xu and Lapata (2020a) and Su et al. (2020) utilized QA models for sentence- or paragraph-level answer evidence ranking, they did not make use of answer relevance in query-focused generation.

Figure 1: The framework of QFS-BART. The QA module calculates the answer relevance scores, and we incorporate the scores as explicit answer relevance attention in the encoder-decoder attention.
To the best of our knowledge, we are the first to leverage explicit answer relevance to abstractive QFS. In addition, our approach can be easily combined with pre-trained Transformers (Song et al., 2019;Dong et al., 2019;Lewis et al., 2019;Xiao et al., 2020), which have shown great success for the generic abstractive summarization task.

Methodology
In this section, we present our approach to incorporating the answer relevance into QFS. First, we introduce the method of generating answer relevance scores. Then, we describe our answer relevance attention in the Transformer-based model. Third, we introduce our QFS-BART model in which the decoder is composed of a stack of answer relevance decoding layers, as shown in Figure 1.

Answer Relevance Generation
In recent years, neural models (Yang et al., 2019a; Su et al., 2019) have shown remarkable achievements in QA tasks. In order to apply QA models to the QFS task, we use HLTC-MRQA (Su et al., 2019) to generate the answer relevance score for each word in the context. The reason for choosing HLTC-MRQA is twofold: 1) it shows robust generalization and transfer ability on different datasets, and 2) it shows great performance in QA tasks and outperforms the BERT-large baseline by a large margin. HLTC-MRQA is introduced as follows.
Based on XLNet (Yang et al., 2019b), HLTC-MRQA is fine-tuned on multiple QA datasets with an additional multilayer perceptron (MLP). Given a context that contains n words, the model outputs a distribution s ∈ (0, 1) for each word's probability of being the start word of the answer, and a distribution e ∈ (0, 1) for its probability of being the end word. To generate the answer relevance score r for each word, we sum the two distributions:

r = s + e, (1)

where r ∈ (0, 2).
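Assuming the QA model exposes per-token start and end logits, the per-word score r = s + e described above can be sketched as follows (function names are illustrative, not part of the HLTC-MRQA codebase):

```python
import numpy as np

def answer_relevance_scores(start_logits, end_logits):
    """Per-token answer relevance r = s + e, where s and e are the
    softmax-normalized probabilities of each token being the start /
    end of the answer span."""
    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    s = softmax(np.asarray(start_logits, dtype=float))
    e = softmax(np.asarray(end_logits, dtype=float))
    return s + e  # each r_i lies in (0, 2)

# Toy logits for a 3-token context.
r = answer_relevance_scores([0.1, 2.0, 0.3], [0.2, 0.1, 2.5])
```

Since s and e each sum to 1, the scores r always sum to 2 over the context, and every individual score stays in (0, 2).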

Answer Relevance Attention
Scaled dot-product attention (Vaswani et al., 2017) is the core component of the Transformer-based model:

Attention(Q, K, V) = softmax(QK^T / √d) V, (2)

where d is the dimension of the query matrix Q, key matrix K and value matrix V. The Transformer encoder is constructed from self-attention layers, where all of the keys, values and queries come from the input sequence. This makes each token in the input attend to all other tokens. The Transformer decoder layer is a combination of a self-attention layer and an encoder-decoder attention layer. In the encoder-decoder attention layer, the query comes from the decoder's self-attention layer, and the key and value come from the output of the encoder. This allows every generated token to attend to all tokens in the input sequence.
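The scaled dot-product attention in Equation 2 can be illustrated with a minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 4))   # 3 values
out = scaled_dot_product_attention(Q, K, V)
```

Because each row of the attention weights is a probability distribution, every output row is a convex combination of the value rows.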
In this work, we propose to incorporate the word-level answer relevance score as additional explicit encoder-decoder attention in the Transformer decoder. Given a document with n tokens, we generate a summary with a maximum length of m tokens. Let x^l ∈ R^{n×d} denote the output of the l-th Transformer encoder layer and y^l ∈ R^{m×d} denote the output of the self-attention layer in the l-th Transformer decoder layer. The encoder-decoder attention α^l ∈ R^{m×n} can be computed as:

α^l = softmax(y^l W_Q (x^l W_K)^T / √d_k + A_ar), (3)

where W_Q and W_K ∈ R^{d_k×d_k} are parameter weights and A_ar ∈ R^{m×n} is our explicit answer relevance score.
Since the original answer relevance score is an n-dimensional vector, we repeat it m times to generate an m-by-n attention matrix, which means the answer relevance attention is the same for all generated tokens.

QFS-BART

We choose BART as the base model for two reasons: 1) it achieves remarkable performance on generic abstractive summarization datasets (e.g., XSum (Narayan et al., 2018)), and 2) BART follows the standard Transformer encoder-decoder architecture, so we can easily combine the answer relevance as explicit attention in the encoder-decoder attention layers. In detail, we incorporate the same answer relevance attention in all Transformer decoder layers.

Domain adaptation for natural language processing tasks has been widely studied (Blitzer et al., 2007; Daumé III, 2009; Liu et al., 2020; Yu et al., 2021). Hua and Wang (2017) first studied the adaptation of neural summarization models and showed that the models were able to select salient information, even when trained on out-of-domain data. Inspired by this, we leverage a two-stage fine-tuning method for QFS-BART. In the first stage, we directly fine-tune the original BART model on the XSum dataset, and in the second stage, we fine-tune our QFS-BART model on QFS datasets. All the parameters in the model are initialized from the first stage. In order to make the model capture both query relevance and answer relevance, the input text is formatted in the following way: [CLS] document [SEP] query. The answer relevance attention score for the document is generated by the QA model, and we take the maximum score in the document as the attention score for all the words in the query.

Table 2: ROUGE-F1 scores on the Debatepedia QFS dataset. Results with a * mark are taken from the corresponding papers. The previous work can be divided into two categories: 1) training the models from scratch, and 2) using pre-trained models and fine-tuning on a QFS dataset.
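The input formatting and query-score heuristic described in this section can be sketched as follows (a simplification: [CLS] and [SEP] are treated as plain strings rather than BART's actual special tokens, and the scores assigned to them are an assumption):

```python
def build_qfs_input(document_tokens, query_tokens, doc_scores):
    """Format the input as [CLS] document [SEP] query and extend the
    answer relevance scores to the query tokens, using the maximum
    document score for every query word."""
    tokens = ["[CLS]"] + document_tokens + ["[SEP]"] + query_tokens
    q_score = max(doc_scores)
    # Assumption: special tokens also receive the maximum document score.
    scores = ([q_score] + list(doc_scores) + [q_score]
              + [q_score] * len(query_tokens))
    return tokens, scores
```

The returned score list is aligned one-to-one with the token list, so it can be tiled into the m-by-n attention bias described earlier.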
Training Details For all the experiments, we use the BART-large version to implement our models. We use a mini-batch size of 32 and train all the models on one 16GB V100 GPU. During decoding, we use beam search with a beam size of 4. We decode until an end-of-sequence token is emitted, and early stop when the generated summary reaches 48 tokens.

Results & Analysis
We compare our proposed QFS-BART model with the following models: 1) Transformer does not consider the queries in the Debatepedia dataset. 2) Transformer (CONCAT) concatenates the query and the document. 3) Transformer (ADD) adds the query encoded vector to the document encoder. 4) SD2 adds a query attention model and a new diversity-based attention model to the encode-attend-decode paradigm. 5) CSA Transformer combines conditional self-attention (CSA) with the Transformer. 6) RSA Word Count incorporates query relevance into a pre-trained abstractive summarization model. 7) QR-BERTSUM-TL presents a transfer learning technique with the Transformer-based BERTSUM model (Liu and Lapata, 2019). 8) BART-FT concatenates the document and query, and directly fine-tunes on the Debatepedia dataset.

Table 3: An example comparing the outputs of BART-FT and QFS-BART with the gold summary.
Document: Interrogator Ali Soufan said in an April op-ed article in the New York Times: "It is inaccurate to say that Abu Zubaydah had been uncooperative [and that enhanced interrogation techniques supplies interrogators with previously unobtainable information]. Along with another F.B.I. agent and with several C.I.A. officers present I questioned him from March to June before the harsh techniques were introduced later in August. Under traditional interrogation methods he provided us with important actionable intelligence."
Query: Are traditional interrogation methods insufficient?
BART-FT: Al Qaeda detainee Abu Zubaydah has been cooperative under traditional interrogation.
QFS-BART: The same info can be obtained by traditional interrogation.
Gold: The same info can be obtained by traditional interrogations.
We adopt ROUGE score (Lin, 2004) as the evaluation metric. As shown in Table 2, QFS-BART significantly outperforms the models without pre-training. Compared with the models utilizing pre-training, ours improves the ROUGE-1 and ROUGE-L scores by a large margin.

Case Study
We present a case study comparing the strong baseline BART-FT model, our QFS-BART model, and the gold summary, as shown in Table 3. It is clear that the baseline model tends to copy spans from the document that are not directly related to the query, while the QFS-BART model produces a more query- and answer-related summary.

Conclusions
In this work, we propose QFS-BART, an abstractive summarization model for query focused summarization. We use a QA model with strong generalization ability to produce explicit answer relevance scores for all words in the document and incorporate them into the encoder-decoder attention. We also leverage a pre-trained model (e.g., BART) and a two-stage fine-tuning method, which further improve the summarization performance significantly. Experimental results show that the proposed model achieves state-of-the-art performance on the Debatepedia dataset and outperforms several comparable baselines on the DUC 2006 and 2007 datasets.

For multi-document QFS, we introduce a two-step architecture: 1) retrieve answer-related sentences given the query, rank them by the confidence score (generated from Equation 4) and concatenate them; 2) use our QFS-BART to produce an abstractive summary.

A.1 Answer Retrieving
We split documents into paragraphs and feed each paragraph to the QA model to get answer-related sentences. Then the sentences are ranked by the confidence score.

Document Segmentation
The QA model selects one answer span given an input document, and the sentences that contain the span are chosen as the answer-related sentences. Since we only retain the answer-related sentences as input to the next step, we set the maximum paragraph length to 300 words to avoid missing too much information in this step. Specifically, we feed sentences into the current paragraph one by one until it reaches the maximum length.
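The greedy sentence-by-sentence packing described above can be sketched as:

```python
def segment_document(sentences, max_words=300):
    """Greedily pack sentences into paragraphs of at most max_words
    words, adding one sentence at a time and starting a new paragraph
    whenever the limit would be exceeded."""
    paragraphs, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            paragraphs.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each resulting paragraph is then fed independently to the QA model; a single sentence longer than the limit still forms its own paragraph rather than being dropped.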
Answer Relevance Ranking The paragraphs are fed to the QA model to generate answer-related sentences and the corresponding answer relevance scores. We align each sentence with a confidence score from the corresponding answer span. The confidence score is defined as:

c = P_start(i) + P_end(j), (4)

where P_start and P_end are two probability distributions over the tokens in the context, P_start(i)/P_end(i) is the probability that the i-th token is the start/end of the answer span, and (i, j) is the predicted span.
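Assuming each candidate sentence carries the start and end probabilities of its predicted answer span, the ranking step might look like the following sketch (the additive form of the confidence score mirrors Equation 1; names are illustrative):

```python
def rank_answer_sentences(candidates):
    """Rank answer-related sentences by confidence c = P_start(i) + P_end(j),
    where (i, j) is the predicted answer span. Each candidate is a tuple
    (sentence, p_start_i, p_end_j)."""
    scored = [(sent, p_s + p_e) for sent, p_s, p_e in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

ranked = rank_answer_sentences([("sent A", 0.2, 0.3),
                                ("sent B", 0.6, 0.5)])
```

The top-ranked sentences are concatenated (up to the summarizer's input budget) and passed to QFS-BART in the second step.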

A.2 Summary Generation
We use the answer-related sentences and their answer relevance scores as the input to the QFS-BART model. The DUC 2005 dataset is used as a development set to optimize the model, and we evaluate the performance on the DUC 2006 and 2007 datasets. We compare our QFS-BART with the following models. LEAD (Xu and Lapata, 2020c) returns all leading sentences of the most recent document up to 250 words.
TEXTRANK (Mihalcea and Tarau, 2004) is a graph-based ranking model that incorporates two unsupervised methods for keyword and sentence extraction.
HLTC-MRQA truncates the ranked answer-related sentences from our first step as the extractive summary.
BART-CQA (Su et al., 2020) uses QA models for paragraph selection and iteratively summarizes paragraphs to 250 words.

We adopt the ROUGE-F1 score (Lin, 2004) as the evaluation metric. As shown in Table 5, HLTC-MRQA significantly outperforms the LEAD and TEXTRANK baselines, which indicates the effectiveness of our answer retrieval step. However, QFS-BART does not perform well on the DUC 2006 and 2007 datasets. We conjecture that the model cannot converge well on the task with limited training samples (DUC 2005 contains only 300 samples).