Improving Query-Focused Meeting Summarization with Query-Relevant Knowledge

Query-Focused Meeting Summarization (QFMS) aims to generate a summary of a given meeting transcript conditioned upon a query. The main challenges for QFMS are the long input text and the sparsity of query-relevant information in the meeting transcript. In this paper, we propose a knowledge-enhanced two-stage framework called Knowledge-Aware Summarizer (KAS) to tackle these challenges. In the first stage, we introduce knowledge-aware scores to improve query-relevant segment extraction. In the second stage, we incorporate query-relevant knowledge into summary generation. Experimental results on the QMSum dataset show that our approach achieves state-of-the-art performance. Further analysis demonstrates the ability of our method to generate relevant and faithful summaries.


Introduction
Meetings are an essential part of human collaboration and communication. Especially in recent years, the outbreak of COVID-19 has led people to meet online, where most meetings are automatically recorded and transcribed. Query-Focused Meeting Summarization (QFMS) (Zhong et al., 2021) aims to summarize a given meeting transcript conditioned upon a query, which helps people efficiently catch up on the specific part of the meeting they want to know about.
There are two main challenges for QFMS. First, meeting transcripts can be so long that current deep learning models cannot encode them at once. Even for models (Beltagy et al., 2020; Xiong et al., 2022) that accept long text input, the computational cost is enormous. Second, query-relevant content is sparsely scattered across the meeting transcript, meaning a significant part of the transcript is noisy information with respect to a particular query. Therefore, models need to effectively reduce the impact of this noisy information.

Figure 1: An example of a query-relevant knowledge triple extracted from meeting transcripts. The knowledge triple can be used in query-relevant segment extraction as well as summary generation.
In this paper, we focus on the two-stage framework: extracting query-relevant segments from the meeting transcript and then generating the summary from the selected content. Compared to end-to-end approaches (Zhu et al., 2020; Pagnoni et al., 2022) that directly encode the entire meeting transcript, the two-stage framework maintains computational efficiency and scales more easily to longer inputs. Specifically, we propose the Knowledge-Aware Summarizer (KAS), which incorporates query-relevant knowledge in both stages. In the first stage, we extract knowledge triples from the text segments and introduce knowledge-aware scores to improve segment ranking. In the second stage, the extracted query-relevant knowledge triples are used as extra input for summary generation. We conduct experiments on the QFMS dataset QMSum (Zhong et al., 2021) and achieve state-of-the-art performance. We further investigate how the number of extracted segments affects the final performance. In addition, we manually evaluate generation quality with respect to fluency, relevance and factual correctness.
Our contributions in this work are threefold: (1) We demonstrate the effectiveness of leveraging query-relevant knowledge in QFMS. (2) We propose KAS, a two-stage framework incorporating query-relevant knowledge for QFMS. (3) Experimental results show that our approach achieves state-of-the-art performance on a QFMS dataset (QMSum). Further analysis and human evaluation indicate the advantages of our method.

Related Works
Existing QFMS methods can be divided into two categories (Vig et al., 2022): two-stage approaches and end-to-end approaches. Two-stage approaches (Zhong et al., 2021; Vig et al., 2022; Zhang et al., 2022) first extract query-relevant snippets and then generate the summary from the extracted snippets, while end-to-end approaches (Zhu et al., 2020; Zhong et al., 2022; Pagnoni et al., 2022) directly generate summaries from the whole meeting transcript. However, both types of methods have disadvantages. Most two-stage approaches select query-relevant content at the level of single utterances, which ignores the contextual information spanning multiple utterances. As for end-to-end approaches, computational and memory requirements increase rapidly as the input grows longer, making it challenging for models to handle long meeting transcripts.
To our knowledge, we are the first to incorporate query-relevant knowledge in QFMS and demonstrate its effectiveness. Besides, we include multiple utterances in each segment to preserve contextual information, and our approach extends easily to long input text.

Methodology
This section presents our two-stage framework. First, we introduce the extractor, which extracts query-relevant segments and knowledge from the source document. Then, a generative model synthesizes the query, extracted segments, and knowledge into the final summary.

Knowledge-aware Extractor
Meeting Transcript Segmentation To preserve the contextual information between utterances, we split the meeting transcript into segments, where each segment can contain multiple utterances. Given an input meeting transcript T, we separate it into n segments S = {S_1, S_2, ..., S_n}, where each S_i contains fewer than l tokens. Specifically, we add utterances to a segment one by one until its length reaches l.
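This greedy packing can be sketched as follows. It is a minimal illustration assuming whitespace tokenization; the paper does not specify the exact tokenizer, and `segment_transcript` is a hypothetical helper name.

```python
def segment_transcript(utterances, max_tokens):
    """Greedily pack consecutive utterances into segments of at most
    max_tokens tokens, preserving local context between utterances.
    Token counting here is a simple whitespace split (an assumption)."""
    segments, current, current_len = [], [], 0
    for utt in utterances:
        n = len(utt.split())
        # start a new segment when adding this utterance would exceed the limit
        if current and current_len + n > max_tokens:
            segments.append(" ".join(current))
            current, current_len = [], 0
        current.append(utt)
        current_len += n
    if current:
        segments.append(" ".join(current))
    return segments
```

Because utterances are never split, each segment keeps whole conversational turns, at the cost of segments being slightly shorter than the l-token budget.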
Knowledge-aware Ranking The knowledge-aware ranking approach selects the top-k segments according to a combination of semantic search scores and knowledge-aware scores. We apply Multi-QA MPNet (Song et al., 2020) to calculate the semantic search scores. Multi-QA MPNet is trained on 215M question-answer pairs from various sources and domains, including Stack Exchange, MS MARCO (Nguyen et al., 2016), WikiAnswers (Fader et al., 2014) and many more.
Given the query Q and the segments, the model outputs 768-dimensional vectors v_Q and v_{S_i} to represent them. The semantic search score for each segment is computed as the cosine similarity:

score_i^{sem} = cos(v_Q, v_{S_i}).

To compute the knowledge-aware score for each segment, we first use OpenIE (Angeli et al., 2015) to extract knowledge triples. Then, to filter out triples irrelevant to the query, we only keep triples that contain words overlapping with the query. The knowledge-aware score is obtained by L2-normalizing the number of remaining triples m_i in each segment:

score_i^{ka} = m_i / sqrt(sum_j m_j^2).

Finally, we calculate the ranking score by summing the semantic search score and the knowledge-aware score:

score_i = score_i^{sem} + score_i^{ka}.

The segments with the top-k ranking scores are selected for the second-stage summary generation. We denote the selected segments as S' = {S'_1, S'_2, ..., S'_k}.
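The combination of the two scores can be sketched as below. The vectors here are small placeholder embeddings rather than Multi-QA MPNet outputs, and `rank_segments` is a hypothetical helper name; only the scoring arithmetic follows the description above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_segments(query_vec, seg_vecs, triple_counts, k):
    """Rank segments by semantic-search score plus knowledge-aware score.
    triple_counts holds m_i, the number of query-relevant OpenIE triples
    kept per segment; the counts are L2-normalized into a score."""
    sem = [cosine(query_vec, v) for v in seg_vecs]
    norm = math.sqrt(sum(m * m for m in triple_counts))
    ka = [m / norm if norm else 0.0 for m in triple_counts]
    scores = [s + a for s, a in zip(sem, ka)]
    # return indices of the top-k segments by combined score
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

Note that the knowledge-aware term lets a segment with many query-relevant triples outrank one that is merely semantically close to the query wording.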

Generator
We choose BART-large (Lewis et al., 2020), a transformer-based (Vaswani et al., 2017) generative pre-trained language model, as the backbone of our generator because of its remarkable performance on text summarization benchmarks. Following the idea of Fusion-in-Decoder (FiD) and its applications in generation tasks (Izacard and Grave, 2021; Su et al., 2022; Vig et al., 2022), we employ FiD-BART, which encodes multiple segments independently in the encoder and fuses information from all segments jointly in the decoder through encoder-decoder attention.
To incorporate the extracted knowledge into summary generation, we use the knowledge as extra input alongside the query and segments. In detail, we remove the stop words in the knowledge triples and then merge the remaining words of the triples in each segment into a set of knowledge phrases. The concatenation of each segment S_i, its knowledge phrases K_i and the query Q is processed by the FiD-BART encoder:

h_i = Encoder([S_i; K_i; Q]).

Finally, the decoder performs encoder-decoder attention over the concatenation of all segments' encoder outputs. In this way, the computational complexity grows linearly with the number of segments rather than quadratically, while jointly processing all segments in the decoder enables the model to aggregate information across segments.
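The construction of one encoder input can be sketched as follows. The stop-word list is an illustrative subset, the `</s>` separator is a placeholder rather than the paper's exact input format, and both function names are hypothetical.

```python
# illustrative subset only; a real system would use a full stop-word list
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and"}

def knowledge_phrases(triples):
    """Flatten a segment's (subject, relation, object) triples into an
    ordered, deduplicated list of content words with stop words removed."""
    words = []
    for triple in triples:
        for part in triple:
            for w in part.lower().split():
                if w not in STOP_WORDS and w not in words:
                    words.append(w)
    return words

def fid_input(segment, triples, query):
    """Build one FiD encoder input: segment, knowledge phrases, and query
    concatenated; each of the k inputs is encoded independently."""
    phrases = " ".join(knowledge_phrases(triples))
    return f"{segment} </s> {phrases} </s> {query}"
```

Each of the k strings produced this way is encoded separately, so encoder cost stays linear in k while the decoder attends over all of them jointly.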

Experimental Setup
We choose the top 12 text segments in segment selection, each containing fewer than 512 tokens. For the summary generator, inspired by the effectiveness of pre-finetuning models on relevant datasets to transfer task-related knowledge for abstractive summarization (Yu et al., 2021; Vig et al., 2022), we initialize our BART-large model from a checkpoint pre-finetuned on WikiSum (Liu et al., 2018). See Appendix A and B for more details of the experimental setup and baselines.
We evaluate the models on the QMSum (Zhong et al., 2021) dataset, which consists of 1,808 query-summary pairs over 232 meetings from the product design, academic, and political committee domains. We report ROUGE (Lin, 2004) scores as automatic evaluation results.

Main Results
We compare our proposed model with strong baselines and previous state-of-the-art models. As shown in Table 1, our approach outperforms both two-stage and end-to-end methods by a large margin on all evaluation metrics. We also investigate how the number of segments selected in the first stage affects the final performance. As shown in Table 2, as the number of selected segments grows, the quality of the final summary also improves. Previous end-to-end approaches usually outperform two-stage approaches because they encode the entire meeting transcript, which consumes substantial computing resources. Our approach, encoding 12 segments (6,144 tokens) in the second-stage generation, outperforms the state-of-the-art end-to-end approach that directly encodes 16,384 tokens. We thus strike a balance between performance and efficiency, which is essential in real-world applications. Besides, our approach extends easily to longer input text since the ranking method is unsupervised.

Ablation Study
We conduct an ablation study to investigate the contribution of the knowledge-aware modules by removing the knowledge-based scoring in the ranking (w/o KA in ranking) and the knowledge input in the summary generation (w/o KA in generator), respectively. Besides, we evaluate the entity-level factual consistency (Nan et al., 2021) of the summaries to test the effectiveness of our knowledge-aware modules in preserving knowledge entities in the generated summaries. Specifically, we report the F-1 score of the entity overlap (Entity F-1) between the source and the generated summary. Table 3 shows that without the knowledge-aware modules, the model's performance decreases on both metrics, especially Entity F-1, which indicates the effectiveness of our method.
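The entity-overlap metric can be sketched as a set-overlap F-1; this is a simplified reading of Nan et al. (2021), whose exact formulation may differ, and it assumes entity extraction (e.g. with an NER tagger) has already been done upstream.

```python
def entity_f1(source_entities, summary_entities):
    """F-1 of the overlap between the entity set of the source and that of
    the generated summary (a sketch of entity-level factual consistency)."""
    src, summ = set(source_entities), set(summary_entities)
    if not src or not summ:
        return 0.0
    overlap = len(src & summ)
    if overlap == 0:
        return 0.0
    precision = overlap / len(summ)  # summary entities found in the source
    recall = overlap / len(src)      # source entities covered by the summary
    return 2 * precision * recall / (precision + recall)
```

Under this reading, dropping the knowledge input would lower recall whenever the generator omits source entities that the knowledge triples would otherwise surface.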

Human Evaluation
We further conduct a human evaluation to assess the models on fluency, relevance, and factual correctness. We randomly select 50 samples from the QMSum test set and ask three annotators to score each summary from one to five, with higher scores being better. We compare against SegEnc-W because it is the best publicly available model.

Conclusion
In this paper, we propose a two-stage framework named KAS that incorporates query-relevant knowledge in both stages. Extensive experimental results on the QMSum dataset show the effectiveness of our method. Detailed analysis and human evaluation further demonstrate our method's capacity to generate fluent, relevant and faithful summaries.

Limitations
Our method is trained and tested on the only publicly available QFMS dataset, QMSum. QMSum covers three different domains (Academic, Committee and Product), which makes the evaluation robust. However, QMSum only contains 1,808 samples, which is relatively small. We hope larger QFMS datasets will be proposed to accelerate the development of this field.
In the first stage of our approach, we extract a fixed length (6,144 tokens) of the meeting transcript as the input to the second stage. Therefore, the model's performance could be affected when some query-relevant information is cut off in the first stage.

Ethics Statement
Although our approach can generate query-focused meeting summaries and achieves a much better factual correctness score than other models, we cannot fully prevent the generative model from producing hallucinated content. We recommend that when our approach is deployed in real-world applications, additional post-processing be carried out to remove unreliable summaries. In addition, we will indicate that the summaries we provide are for reference only, and users need to check the original meeting transcripts to obtain accurate information.

A Experimental Details
We use the off-the-shelf Multi-QA MPNet model for semantic search. We initialize the generator from the off-the-shelf BART-large model through Hugging Face. We use a learning rate of 6e-5 following Lewis et al. (2020) and the Adam optimizer (Kingma and Ba, 2014) to fine-tune KAS. We use a batch size of one and train all models on one RTX 3090 for ten epochs. During decoding, we use beam search with a beam size of five and decode until an end-of-sequence token is emitted. The final results are the average test-set performance of the best three checkpoints on the validation set. In addition, the QMSum dataset is released under the MIT License, so we can freely use it.

B Baselines Details
We include both two-stage and end-to-end baselines for comparison. MaRGE (Xu and Lapata, 2021) uses a masked ROUGE extractor in the first stage. DYLE (Mao et al., 2022) is a dynamic latent extraction approach for abstractive long-input summarization. SUMM^N (Zhang et al., 2022) is a multi-stage summarization framework for long input dialogues and documents. RelReg (RELevance REGression) (Vig et al., 2022) is similar to MaRGE but trains a relevance prediction model directly on QFS data using the original, non-masked query. RelReg-W (Vig et al., 2022) uses the same framework as RelReg but initializes the generator from the WikiSum pre-trained checkpoint. LED (Beltagy et al., 2020) is an encoder-decoder transformer-based model that employs efficient attention to process long input text. DialogLM (Zhong et al., 2022) is a pre-trained neural encoder-decoder model for long dialogue understanding and summarization; it uses a hybrid attention approach combining sparse local attention with dense global attention. SegEnc-W (Vig et al., 2022) is a fusion-in-decoder method initialized from a WikiSum (Liu et al., 2018) pre-trained checkpoint. BART-LS (Xiong et al., 2022) uses pooling-augmented blockwise attention to improve efficiency and pre-trains the model with a masked-span prediction task with spans of varying lengths. SOCRATIC (Pagnoni et al., 2022) uses a question-driven, unsupervised pretraining objective to improve controllability in summarization tasks.

Table 1 :
Results on the QMSum test set. Previous works are divided into two-stage and end-to-end approaches.

Table 3 :
Ablation results on the QMSum test set without the knowledge-aware (KA) modules. Our KA modules improve the performance of KAS in both stages.

Table 4 :
Human evaluation results on QMSum test set.