LMGQS: A Large-scale Dataset for Query-focused Summarization

Query-focused summarization (QFS) aims to extract or generate a summary of an input document that directly answers or is relevant to a given query. The lack of large-scale datasets in the form of documents, queries, and summaries has hindered model development in this area. In contrast, multiple large-scale high-quality datasets for generic summarization exist. We hypothesize that there is a hidden query for each summary sentence in a generic summarization annotation, and we utilize a large-scale pretrained language model to recover it. In this way, we convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS, which consists of over 1 million document-query-summary samples. We thoroughly investigate the properties of our proposed dataset and establish baselines with state-of-the-art summarization models. By fine-tuning a language model on LMGQS, we achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks, demonstrating the high quality and diversity of LMGQS.


Introduction
The field of generic summarization (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019) has made significant progress in recent years, thanks to the development of generative deep neural models (Sutskever et al., 2014; Vaswani et al., 2017) and the availability of large-scale training data (Nallapati et al., 2016; Narayan et al., 2018; Zhu et al., 2021). However, query-focused summarization (QFS) presents a significant challenge due to the lack of data. Most of the available QFS corpora (Dang, 2006a,b; Nema et al., 2017; Baumel et al., 2016; Zhong et al., 2021) contain only a few thousand documents or less, which is insufficient for training a robust neural model.
We propose a Language Model Generated Query-focused Summarization dataset (LMGQS) to address the lack of a large-scale QFS dataset. Human annotation for QFS typically involves generating suitable queries and then writing corresponding summaries, which is both time-consuming and expensive. Furthermore, it may necessitate a meticulous definition of the query scheme based on the domain of documents (Zhong et al., 2021). We hypothesize that, for a pair of document and summary in a generic summarization dataset, hidden queries exist that represent the information needs associated with the summary. Therefore, to efficiently scale up annotation, we prompt the large-scale language model InstructGPT (Ouyang et al., 2022) with documents and summaries from four generic summarization datasets to generate the hidden queries. This approach results in the LMGQS dataset, which contains over 1.1 million triplets of document, query, and summary, encompassing a wide range of document and question types.
To investigate the utility of our proposed LMGQS, we finetune a pretrained language model on it. The model accepts the concatenation of the original document and generated query as input and is trained to produce the original summary. We then compare the finetuned model with various query-focused summarization models on several existing QFS benchmarks that have no overlap with LMGQS under the zero-shot setting. Empirical results demonstrate that the model finetuned on LMGQS achieves promising performance on both single-document and multi-document QFS benchmarks, surpassing strong baselines. Similarly, when utilizing LMGQS for pre-finetuning, the model achieves state-of-the-art performance in the supervised setting.
In summary, our contributions are three-fold: (1) We introduce a novel framework for constructing a QFS dataset by converting existing generic summarization datasets using language models as annotators. (2) We present LMGQS, a large-scale QFS benchmark, to foster future research on QFS (the dataset will be released after the anonymity period). (3) The model finetuned on LMGQS exhibits robust generalization capability and achieves remarkable zero-shot and supervised performance on other unseen QFS test sets.

Dataset Creation
We choose 4 generic datasets to build LMGQS: CNN/DailyMail (Nallapati et al., 2016), XSUM (Narayan et al., 2018), SAMSum (Gliwa et al., 2019), and DialogSum (Chen et al., 2021). Among them, CNN/DailyMail and XSUM are news summarization datasets, where both the documents and summaries are in formal written English. SAMSum and DialogSum are two recently proposed dialogue summarization datasets, whose inputs are the transcripts of multi-speaker conversations.

Prompt-based Query Generation
Given a document and its corresponding summary, we take advantage of the robust few-shot capabilities of InstructGPT (Ouyang et al., 2022) to generate a query that encapsulates the information required by the annotator when crafting the summary. More specifically, we construct a prompt for each document-summary pair and input it into the InstructGPT model, which generates the query by completing the prompt. An example prompt is illustrated in Figure 1. Since InstructGPT excels at adhering to human-readable instructions and can even generalize to unseen instructions (Ouyang et al., 2022), we begin our prompt with a clear directive for the query generation task. Following the instruction, we incorporate a one-shot example of the task into the prompt, which includes a human-written query derived from a document-summary pair. We set the number of examples to 1 as a balance between effectiveness and efficiency: during our preliminary exploration, we noticed more failure cases for zero-shot query generation, while incorporating additional examples in the prompt would increase both the time and cost of generation.
In the one-shot example, we restrict the number of queries to be equal to the number of summary sentences. In other words, there is a one-to-one correspondence between the sentences in the summary and the query. This constraint is imposed by prepending an index and appending a newline character to each summary/query sentence, as illustrated in Figure 1.
Due to the domain difference between news and dialogue summarization, we choose different one-shot examples for the two domains. The queries for the two example pairs were annotated by the authors of this paper and are attached in the appendix.
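The prompt structure described above (instruction, one-shot example with numbered summary and query sentences, then the target document-summary pair) can be sketched as follows. The exact instruction wording and field labels are assumptions for illustration; only the structure is specified here:

```python
def build_query_generation_prompt(instruction, example_doc, example_summary_sents,
                                  example_query_sents, target_doc, target_summary_sents):
    """Assemble a one-shot query-generation prompt for InstructGPT.

    Prepending indices and appending newlines to summary/query sentences
    enforces the one-to-one sentence correspondence described in the text.
    Field labels ("Document:", "Summary:", "Query:") are illustrative.
    """
    def numbered(sents):
        return "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sents))

    example = (f"Document: {example_doc}\n"
               f"Summary:\n{numbered(example_summary_sents)}\n"
               f"Query:\n{numbered(example_query_sents)}\n")
    # The target pair ends with an open "Query:" field for the model to complete.
    target = (f"Document: {target_doc}\n"
              f"Summary:\n{numbered(target_summary_sents)}\n"
              f"Query:")
    return instruction + "\n\n" + example + "\n" + target
```

The completion returned by the model is then parsed back into one query per summary sentence using the same numbering.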

Prompt Query Types
Given a document and a summary sentence, multiple valid queries can be formulated. For instance, consider the summary sentence: She has released a book to encourage people to find their passion at work. One possible query is: What is her book about? Alternatively, another valid query could be: Has she released a book? To address this variety, we utilize two sets of annotated queries: yes/no queries and wh-queries. Yes/no queries correspond to questions that can be answered with a simple "yes" or "no". However, in the context of QFS, the summary (i.e., the answer to the yes/no query) is never a mere "yes" or "no". For example, for a yes/no query like Is he still alive?, we expect the answer to be: He was killed in an attack on a guerrilla encampment rather than a simple no. Detailed annotated queries are presented in Table 11.
The type of queries in the one-shot prompt significantly influences the generated queries. We provide a breakdown of query types in Table 1. It is evident that when the prompt includes only wh-queries, over 99% of the generated queries are also wh-queries, with the most frequent ones beginning with "What". The same pattern applies when the prompt contains only yes/no queries: the most common queries generated by InstructGPT start with "do/does/did" or "is/are/was/were".

Statistics of LMGQS
Using the aforementioned prompting method, we collected 1,138,077 document-query-summary triples covering 13 different query types. Detailed statistics of the generated LMGQS dataset are shown in Table 2. First, the length of the generated queries has a strong Pearson correlation (0.95) with the length of summaries, which is expected due to our one-to-one mapping between the summary and query sentences. Second, the length of queries is consistently shorter than the summary, with wh-queries slightly shorter than yes/no queries.
We introduce the novel token percentage NTP(string1, string2), defined as the percentage of tokens in string1 that are absent from string2. This statistic quantifies the amount of unique information contained in string1 with respect to string2. First, NTP(doc, query) is always lower than NTP(doc, sum), indicating that the generated query always contains less information about the document than the summary. Subsequently, we observe that NTP(query, doc) is in general higher than NTP(sum, doc), because queries are shorter and contain more unique question words like "what" and "did". Finally, NTP(query, sum) being considerably lower than NTP(sum, query) shows that the summary contains more unique information than the query. Furthermore, the query includes a subset of information present in the summary. For instance, a query might inquire about a specific entity in the document, while the summary addresses the query with detailed contexts and facts extracted from the document.
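The novel token percentage is simple to compute. Below is a minimal sketch using whitespace tokenization and lowercasing; the paper's exact tokenizer is not specified, so these choices are assumptions:

```python
def novel_token_percentage(string1, string2):
    """NTP(string1, string2): percentage of tokens in string1 absent from string2.

    Tokenization here is plain whitespace splitting with lowercasing,
    chosen for illustration only.
    """
    tokens1 = string1.lower().split()
    vocab2 = set(string2.lower().split())
    if not tokens1:
        return 0.0
    novel = sum(1 for t in tokens1 if t not in vocab2)
    return 100.0 * novel / len(tokens1)
```

For example, `novel_token_percentage("what did she release", "she released a book")` is 75.0, since only "she" appears verbatim in the second string; this illustrates why NTP(query, doc) tends to be high when queries contain question words.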
In conclusion, LMGQS encompasses documents in both written and spoken languages, covering a wide range of document/summary lengths, abstraction levels, and compression ratios.

LMGQS for QFS
In this section, we demonstrate that by finetuning pretrained language models on LMGQS, one can obtain a QFS model that generalizes effectively to unseen tasks and domains. In particular, we finetuned a BART model (Lewis et al., 2020), and the resulting model, LMGQS BART, exhibits promising performance on various QFS datasets when directly applied to the unseen test set. Moreover, when extending the fine-tuning process with several thousand in-domain QFS data points, the resulting supervised model surpasses other strong supervised baselines.

Implementation Details
We fine-tuned BART-Large (Lewis et al., 2020) on LMGQS, using a maximum input length of 1024 and a maximum output length of 256. The input string consists of a document and a query, formatted as question:\n<query>\ncontext:\n<document>, where "\n" represents a newline character. We employed 8 NVIDIA Tesla V100 GPUs for training, with a batch size of 4 per GPU and an accumulation step of 8, yielding an effective batch size of 256.
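The input format can be reproduced with a small helper; the literal template follows the description above, and the function name is ours:

```python
def build_qfs_input(query, document):
    """Format a (query, document) pair in the fine-tuning input layout
    described in the text: question:\n<query>\ncontext:\n<document>."""
    return f"question:\n{query}\ncontext:\n{document}"
```

The resulting string is then tokenized and truncated to 1024 tokens before being fed to the encoder, with the original summary as the decoder target.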
The BART model was fine-tuned using a learning rate of 3 × 10^-5 for 50,000 steps, and the learning rate was scheduled by a polynomial scheduler with 2,000 warmup steps. We set a weight decay of 0.001 and a label smoothing factor of 0.1. For supervised finetuning, we continued to finetune the LMGQS BART model for 2,000 total steps with 200 warmup steps. The implementation from Huggingface (Wolf et al., 2020) was utilized.

Datasets
We conduct evaluation of the finetuned BART-Large model (LMGQS BART) on several existing QFS benchmark datasets.
• MultiOpEd (Liu et al., 2021) presents an open-domain news editorial dataset specifically designed to support automatic perspective discovery in news articles. Given a query that explicitly addresses a controversial topic, a system is expected to generate a single-sentence thesis statement that summarizes the arguments presented. Along with ROUGE scores as evaluation metrics, the paper also proposes trained classifiers to assess the correctness and relevance of the generated summary. More specifically, a stance classifier is utilized to predict whether a summary shares the same stance as the news article. For example, a summary that presents an opposing argument to the article might still achieve a high ROUGE score due to n-gram overlap but would receive a low stance accuracy. Similarly, a relevance classifier is employed to evaluate whether the summarized perspective is pertinent to the query.
• Debatepedia (Nema et al., 2017) was built on Debatepedia, an encyclopedia of pro and con arguments and quotes on critical debate topics. The summaries are highly abstractive rather than extractive, in the sense that a summary does not necessarily consist of sentences simply copied or shortened from the original document.
• Document Understanding Conferences (DUC) 2006/2007 (https://www-nlpir.nist.gov/projects/duc) set up the task to simulate real-world complex question answering. The query in this dataset cannot be answered by simply stating a name, date, quantity, etc. Given a topic and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement.

Baselines
We compare LMGQS BART with the following baseline models: CNN/DM BART is a large BART model that has been fine-tuned on the query-agnostic CNN/DailyMail dataset (See et al., 2017).This serves as a baseline model for summarization based solely on the input document, without considering the query in the QFS setting.
InstructGPT 002 is an InstructGPT model that can be accessed through the OpenAI API using the text-davinci-002 model.A simple template, "Summarize by answering the following questions:", is used to link the document with the query and generate content with the temperature set to 1.0, top-p set to 0.9, and maximum length set to 512.
LaQSUM (Xu and Lapata, 2022) is a recent model that learns latent queries from documents for abstractive summarization. Unlike other approaches, LaQSUM models the query as hidden binary variables that indicate whether a token in the document contributes to the information sought in the summary. This model does not require QFS annotation and is trained on the CNN/DM dataset.
MARGESUM (Xu and Lapata, 2021) is a stateof-the-art few-shot method for QFS that requires a small QFS development set.
GSUM+Query is adapted from GSUM (Dou et al., 2021), which is a guided summarization system. An unsupervised query-focused extractive system is used to pre-extract the top-ranked sentences for each test document as guidance. The GSUM model is trained on the CNN/DM dataset.
QuerySum (Xu and Lapata, 2020) is an extractive method that uses QA datasets as distant supervision to train an evidence estimator for identifying segments that are likely to answer the query and should be included in the summary.
ProphetNet (Qi et al., 2020) is a supervised abstractive summarization model that predicts the next n tokens simultaneously.The results for ProphetNet are taken from the NEWTS paper (Bahrainian et al., 2022).
Unsupervised extractive baselines are taken from Xu and Lapata (2022). Lead simply extracts the leading sentences, while LexRank estimates sentence-level centrality using a Markov random walk on sentence graphs.
QMDSCNN (Pasunuru et al., 2021) converts the CNN/DailyMail dataset into a query-focused multi-document summarization dataset and builds abstractive end-to-end neural models to obtain zero-shot results on the DUC 2006 and 2007 datasets.

Query Unification
Different QFS datasets have different query formats. For instance, Debatepedia uses natural questions as queries, the same format as LMGQS, while the majority of queries in the DUC datasets are instructions such as "Discuss conditions on American Indian reservations or among Native American communities." and "Include the benefits and drawbacks of the reservation system." For NEWTS, the query is a "topic" from a topic model, described in words, phrases, or a sentence.
To use LMGQS in the zero-shot setting, it is necessary to convert queries of diverse formats into natural questions. Without an off-the-shelf tool for this task, we propose to further utilize LMGQS for query unification. Specifically, we finetune a BART model to generate queries with the document and summary as input. This finetuned BART model shares the same input and output as the InstructGPT model used in Section 2.1 to generate queries from generic summarization datasets.
We denote this finetuned query-generation model as G_{d,s→q} and the finetuned summarization model described in Section 3.1 as G_{d,q→s}. Given an original query q and document d, we first use q as a pseudo "summary" and ask G_{d,s→q} to produce a query q′ in the desired format, i.e., q′ = G_{d,s→q}(d, q). We then use the generated query q′ as the input query in the follow-up zero-shot inference to predict the summary s = G_{d,q→s}(d, q′).
Query unification is used to generate queries for the NEWTS, DUC 2006, and DUC 2007 datasets. We quantitatively and qualitatively verify its effectiveness in Section 4.3.
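The two-step inference described above can be sketched with the two finetuned models abstracted as callables. This is a hypothetical interface for illustration; in practice each model is a seq2seq generator fed the formatted input string:

```python
def query_unification_inference(query_gen_model, summarizer_model, document, original_query):
    """Two-step zero-shot QFS inference with query unification.

    query_gen_model:  callable (document, pseudo_summary) -> natural question,
                      standing in for the finetuned G_{d,s->q} model.
    summarizer_model: callable (document, query) -> summary,
                      standing in for the finetuned G_{d,q->s} model.
    """
    # Step 1: treat the original (possibly non-question) query as a pseudo
    # "summary" and rewrite it into a natural question in the LMGQS format.
    unified_query = query_gen_model(document, original_query)
    # Step 2: run the query-focused summarizer with the unified query.
    return summarizer_model(document, unified_query)
```

This keeps the summarizer's input distribution close to what it saw during finetuning, which is the motivation for unifying query formats in the first place.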

Multi-document Query-focused Summarization
Since LMGQS contains only single-document QFS data, the fine-tuned model G_{d,q→s} can generate summaries based on individual document-query pairs. To evaluate zero-shot multi-document QFS, we adopt a straightforward iterative approach from previous works by Baumel et al. (2018) and Xu and Lapata (2022). Given a cluster of documents and a query, we first rank the documents using term frequency-inverse document frequency, then generate a summary for each ranked document. The final summary is assembled from the top-ranked list: following the list order, we successively concatenate a summary if its token overlap percentage with the already selected summaries is below a threshold, e.g., 50%, until the total length of chosen summaries reaches a predefined token budget (e.g., 250 tokens).
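The iterative assembly step can be sketched as follows, assuming the per-document summaries have already been generated and ordered by the TF-IDF rank of their source documents (the ranking itself is omitted); threshold and budget defaults follow the text:

```python
def assemble_multidoc_summary(ranked_summaries, overlap_threshold=0.5, budget=250):
    """Concatenate per-document summaries in rank order, skipping any summary
    whose token overlap with the already selected text meets the threshold,
    and truncating once the token budget is reached.

    Whitespace tokenization and lowercased overlap matching are illustrative
    simplifications.
    """
    selected_parts = []
    selected_tokens = set()
    total = 0
    for summary in ranked_summaries:
        tokens = summary.split()
        if not tokens:
            continue
        # Fraction of this summary's tokens already covered by selected text.
        overlap = sum(t.lower() in selected_tokens for t in tokens) / len(tokens)
        if selected_parts and overlap >= overlap_threshold:
            continue  # too redundant with what was already chosen
        kept = tokens[: budget - total]
        selected_parts.append(" ".join(kept))
        selected_tokens.update(t.lower() for t in kept)
        total += len(kept)
        if total >= budget:
            break
    return " ".join(selected_parts)
```

The redundancy check is what distinguishes this from naive concatenation: near-duplicate summaries from different documents in the cluster are dropped rather than repeated.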

Results on Single-document QFS
Table 4 presents the ROUGE scores and the stance and relevance accuracies of various models on the MultiOpEd dataset. It can be observed that LMGQS BART outperforms other models in both the supervised and unsupervised cases, achieving the highest ROUGE scores and stance accuracies in both settings. For relevance accuracy, it also achieves the best result in the zero-shot setting and the second best in the supervised setting. This demonstrates the robust performance of LMGQS BART across different settings. Interestingly, in the supervised setting, pre-finetuning on the CNN/DailyMail dataset (CNN/DM BART) actually diminishes performance compared to vanilla BART without pre-finetuning. This result indicates that a generic summarization dataset may not always be beneficial for QFS and highlights the necessity for high-quality, large-scale QFS datasets like LMGQS.
Similarly, Table 5 presents the ROUGE scores (R-1, R-2, and R-L) and topic scores on the NEWTS dataset for different models under two categories: Supervised and Zero-shot/Transfer Learning. We use "w/ {query_granularity}" to denote the results using three different granularities for the query: words, phrases, and sentences. For instance, "ProphetNet supervised w/ topic words" refers to the result ProphetNet achieved using a query of topic words. Overall, the LMGQS BART models outperform the other baselines in terms of ROUGE scores, with the LMGQS BART w/ topic words model achieving the highest scores in the zero-shot setting and the LMGQS BART w/ topic phrases model obtaining the best results in the supervised setting. Additionally, the LMGQS BART w/ topic sentences model achieves the highest topic score among all models in both the zero-shot and supervised settings, closely approaching the topic scores of the ground truth. Without fine-tuning on any supervised data, LMGQS BART exhibits a significant advantage over the supervised ProphetNet models in terms of ROUGE scores and topic scores. The supervised results also reveal that LMGQS remains beneficial even when some in-domain supervised data (2,400 training samples from NEWTS) is accessible.
Table 6 presents the ROUGE scores on the single-document QFS dataset Debatepedia for various models, classified into unsupervised, supervised, and zero-shot/transfer learning categories. LMGQS BART achieves the highest ROUGE scores, surpassing all other models in the zero-shot/transfer learning category.
It is worth mentioning that our model distilled from InstructGPT outperforms the teacher model on all single-document QFS datasets.

Human Study
A recent study by Laskar et al. (2022) discovered that some Debatepedia queries have no relation to the input documents. To investigate this, we conducted a human study comparing the LMGQS BART model's output with the Debatepedia reference. Human annotators were instructed to choose the better summary from two candidates, given the query and the context document. If both summaries were of equal quality or the query was unanswerable from the document, they would mark it as a "Tie". In the blind test, annotators preferred the LMGQS BART model's output 18 times, the reference 15 times, and selected "Tie" 17 times. This indicates that LMGQS has higher quality compared to existing benchmark datasets like Debatepedia. Additionally, we observe that a model finetuned on LMGQS does not merely summarize the document: it also tends to answer the question directly and to take a stance supporting or opposing the statement in the query. We hypothesize that the primary reason for this is the prevalence of queries in the DUC datasets that are presented in a human-readable instruction format, which inherently favors the instruction-following nature of InstructGPT. Despite being a considerably smaller model, LMGQS BART still demonstrates promising instruction-following capabilities by leveraging our query unification method.

Ablation Study of Query Unification
To evaluate the efficacy of query unification, we perform an ablation study on the quality of automatically generated queries q′ = G_{d,s→q}(d, q). For comparison, we manually create query templates to transform each query into a natural language question. The templates are selected separately for the NEWTS and DUC datasets, and the authors used generations on the development sets of these datasets to carefully refine them, as shown in Table 8. In Figure 2, we present a comparison of ROUGE-2 scores for LMGQS BART when employing 1) manually crafted query templates or 2) automatically generated queries from query unification. The other ROUGE scores are presented in Tables 9 and 10 in the appendix. It is evident that query unification holds an advantage over the handcrafted templates, despite the latter necessitating access to a validation set and meticulous tuning by human experts.

Conclusion & Future Works
We introduce a novel large-scale dataset, LMGQS, for query-focused summarization (QFS), addressing the scarcity of large-scale benchmarks in this domain.In order to reduce the laborious human effort required for QFS, we utilize a large-scale language model to generate hidden queries from existing query-agnostic summarization datasets.By performing standard finetuning on LMGQS, we attain state-of-the-art zero-shot performance across multiple QFS benchmarks.

Ethics Statement
In this study, we acknowledge the ethical concerns related to the use of InstructGPT, a pre-trained language model that has the potential to generate toxic or biased outputs, make up facts, and produce sexual and violent content without explicit prompting (Ouyang et al., 2022). To mitigate these risks, we employed the content filtering function of the OpenAI API calls (https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter). This filter runs both the prompt and completion through an ensemble of classification models designed to detect and prevent the output of harmful content. A small fraction of our prepared prompts were filtered by the OpenAI API of InstructGPT, and we discarded the corresponding samples in our proposed LMGQS dataset. We acknowledge the possibility that the InstructGPT model might be discontinued by OpenAI in the future, rendering parts of our research irreproducible. To address this issue, we plan to release the full results returned by the API calls, as well as the prompts for generating the hidden queries. This will enable the research community to construct a higher-quality dataset with more advanced models in the future, reproduce our zero-shot and supervised finetuning results, and further build upon our work.
Regarding the human study, the annotators involved are the paper's two authors, who possess proficiency in English and are well-acquainted with the query-focused summarization task. The annotators were tasked with choosing the better candidate summary between two options in a comparison task. To minimize confirmation bias, the order of the candidates was randomized and hidden from the annotators. No payment was involved in this human study, and the authors exercised their best efforts to minimize any inadvertent biases in the annotation process.

Limitations
One limitation of LMGQS is that it is solely in English, lacking extensive multilingual coverage of other languages. Additionally, the data creation procedure is cost-inefficient due to the necessity for numerous API calls to InstructGPT. Our human study involves the authors of this paper and might be subject to confirmation bias. Lastly, the generated queries are primarily divided into two categories: yes/no queries and wh-queries. A more fine-grained approach to control the types of queries based on the properties of the document and summary is currently lacking. For instance, if a summary emphasizes location over time, a query beginning with "where" would be more appropriate than one starting with "when".


Figure 1: Example prompt for query generation. The top part is the instruction, followed by the one-shot example consisting of document, summary, and query. The query for the input document (highlighted in yellow) and summary is generated by InstructGPT.

Figure 2: Ablation study on the effect of query unification. For simplicity, we only present the ROUGE-2 score in this figure.

Table 1: Breakdown of query types in LMGQS. The upper part of the table shows the query-type percentages when using wh-queries in the one-shot prompt example; the lower part shows the percentages when prompting with yes/no queries. Blue indicates a higher percentage, red a lower one.

Table 3: Size and example queries of the QFS datasets used in evaluation.

Table 4: ROUGE scores and accuracies of stance and relevance on the MultiOpEd dataset (Liu et al., 2021). All baseline results except for InstructGPT are from Liu et al. (2021).

Table 8: Manual templates for queries that are not in question format.

Table 9: Effect of query unification in ROUGE-1 on NEWTS (w/ topic words, w/ topic phrases, w/ topic sentences), DUC 2006, and DUC 2007.