An Exploratory Study on Long Dialogue Summarization: What Works and What’s Next

Dialogue summarization helps readers capture salient information from long conversations in meetings, interviews, and TV series. However, real-world dialogues pose a great challenge to current summarization models, as the dialogue length typically exceeds the input limits imposed by recent transformer-based pre-trained models, and the interactive nature of dialogues makes relevant information more context-dependent and sparsely distributed than news articles. In this work, we perform a comprehensive study on long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information: (1) extended transformer models such as Longformer, (2) retrieve-then-summarize pipeline models with several dialogue utterance retrieval methods, and (3) hierarchical dialogue encoding models such as HMNet. Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance. We also demonstrate that the summary quality can be further improved with a stronger retrieval model and pretraining on proper external summarization datasets.


Introduction
Large amounts of dialogue data have been produced in meetings, TV series, and interviews (Zhong et al., 2021; Zhu et al., 2021). Dialogue summarization aims to generate a short summary for long dialogues to help readers capture important information more efficiently.
A number of existing works on dialogue summarization focus on extracting the main events of a short conversation (Gliwa et al., 2019; Rohde et al., 2021). However, unlike short dialogues, which contain fewer than 20 utterances, several tasks for summarizing much longer dialogues have been proposed recently (Zhong et al., 2021). These datasets are usually derived from meetings and interviews, with hundreds of turns in each dialogue. The length of such dialogues typically exceeds the input limits imposed by recent transformer-based models (Lewis et al., 2020), making it difficult to train an end-to-end summarization model for such tasks. This poses the challenge: how can we effectively use current neural summarization models on dialogues that greatly exceed their length limits? (* Equal contribution. ‡ The work was done when Asli was at MSR.)
Additionally, compared with document summarization, dialogues are interactive in nature, which makes them more context-dependent, and the information in dialogues is more sparsely distributed. Besides, the informal language used in dialogues makes it difficult to model relevance and salience. To address these issues, hierarchical methods have been proposed to model dialogues at the turn level (Zhu et al., 2020a; Rohde et al., 2021). However, generating a short summary that contains all the salient information remains challenging.
In this paper, we systematically investigate these issues in dialogue summarization: we first explore various solutions to the lengthy input problem. Then, we analyze and compare methods to improve generic summarization models on challenging dialogue datasets. To address the long input issue, we investigate extended transformer models such as Longformer (Beltagy et al., 2020), several dialogue utterance retrieval methods for a retrieve-then-summarize pipeline model, as well as hierarchical dialogue encoding models. For the specific challenges in dialogues, we explore different datasets for pretraining to test the transferability between similar summarization tasks. We evaluate these models on three recent long dialogue summarization datasets: QMSum for meetings (Zhong et al., 2021), MediaSum for interviews (Zhu et al., 2021), and SummScreen for TV series transcripts. In our experiments, we find that the pipeline method with a dialogue utterance retrieval model yields the best performance, and that it can be further improved with a stronger retrieval model. Our experimental results also suggest that pretraining on proper external summarization datasets can effectively improve the performance of dialogue summarization models.

Related Work
Long Sequence Summarization Recent summarization models are based on the Transformer (Vaswani et al., 2017), which has quadratic time and memory complexity with respect to the input length, preventing it from being used on longer sequences. To address this issue, Beltagy et al. (2020) used sliding-window and global attention, while Zaheer et al. (2020) used a combination of random, sliding-window, and global attention mechanisms to reduce the quadratic complexity to near-linear. Previous benchmarks for long sequence summarization mostly focus on documents instead of dialogues, e.g., PUBMED and ARXIV (Cohan et al., 2018).

Dialogue Summarization Dialogue summarization aims to generate concise summaries for dialogues, such as meetings (McCowan et al., 2005; Janin et al., 2003; Zhong et al., 2021; Shang et al., 2018; Zhu et al., 2020a), TV series, interviews (Zhu et al., 2021), and chitchat (Gliwa et al., 2019; Zhao et al., 2020; Chen and Yang, 2021). Some summarization datasets (not limited to dialogues) contain queries asking for summaries of specific parts of dialogues (Zhong et al., 2021; Nema et al., 2017), while others only require summarizing whole dialogues (Gliwa et al., 2019; Hermann et al., 2015). As for dialogue summarization models, Zhu et al. (2020a) proposed HMNet, a hierarchical network that models dialogue structure.

Methodology
In this section, we introduce the datasets used to evaluate and pretrain the models, the two types of summarization models, and the details of the experimental setup.

Datasets
To explore the problems in long dialogue summarization, we leverage three different long dialogue summarization tasks as main datasets: QMSum (Zhong et al., 2021) is a query-based, multi-domain meeting summarization dataset annotated by humans. It contains 1,808 queries together with 232 long meeting transcripts, covering software product, academic, and committee meetings. QMSum also contains annotated gold spans, which can be used as gold labels for training the retrievers; MediaSum (Zhu et al., 2021) is an interview summarization dataset.

Retrieve-then-summarize Pipeline
Dialogues tend to be relatively long, and most existing summarization models cannot process such long inputs. The two-stage retrieve-then-summarize pipeline first retrieves the most relevant subtext in the dialogue and then feeds it to a summarizer. We experiment with the following retrievers:
• TF-IDF (Jones, 1972): based on a bag-of-words representation, TF-IDF measures term frequency (TF) and normalizes it with inverse document frequency (IDF);
• BM25 (Robertson and Zaragoza, 2009): similar to TF-IDF, but accounts for document length and term saturation;
• Locator: the utterance locator model proposed by Zhong et al. (2021), which uses convolutional neural networks with BERT (Devlin et al., 2019).
For TF-IDF and BM25, we limit the number of retrieved utterances to at most 10% of the whole dialogue, while for Locator we directly use the utterances it predicts in its original setting. After retrieval, we fine-tune a BART-large model on the output of each retriever to produce the summary.
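The retrieval stage can be sketched as follows. This is a minimal illustration of TF-IDF utterance retrieval with a 10% budget; the whitespace tokenization and function name are our simplifying assumptions, not the implementation used in the experiments:

```python
import math
from collections import Counter

def tfidf_retrieve(query, utterances, ratio=0.1):
    """Score each utterance against the query with TF-IDF weighted
    overlap and keep the top `ratio` fraction, preserving dialogue order."""
    docs = [u.lower().split() for u in utterances]
    n = len(docs)
    # Document frequency of each term, for the IDF weights.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def score(doc, q_tokens):
        tf = Counter(doc)
        return sum(tf[t] * idf.get(t, 0.0) for t in q_tokens)

    q = query.lower().split()
    k = max(1, int(n * ratio))
    ranked = sorted(range(n), key=lambda i: score(docs[i], q), reverse=True)[:k]
    # Re-sort indices so the retained utterances keep their dialogue order.
    return [utterances[i] for i in sorted(ranked)]
```

The retained utterances are then concatenated and passed to the fine-tuned summarizer.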

End-to-end Summarization Models
To study how current state-of-the-art neural summarizers perform on long dialogue summarization, we choose the following three models: BART (Lewis et al., 2020) is a transformer-based encoder-decoder model which obtains a number of state-of-the-art results on various text generation tasks. We use this model as our baseline summarization model and study its ablations under different settings. Its maximum number of input tokens is 1,024, so we truncate the input when it exceeds this limit. HMNet (Zhu et al., 2020a) is a hierarchical network for dialogue summarization. It models the structure of the dialogue, using a token-level encoder to encode each sentence and a turn-level encoder to aggregate each turn. We use HMNet as a representative of hierarchical models and compare it with the other baselines. Due to memory constraints, we limit the maximum number of input tokens to 8,192 for HMNet, 8x the BART limit mentioned above.
Longformer (Beltagy et al., 2020) replaces the full self-attention matrix with sliding-window attention plus global attention, which is more memory efficient. Longformer can accept up to 16K tokens and has shown improvements on long document summarization using its Longformer-Encoder-Decoder (LED) variant. We allow a maximum input of 4,096 tokens for Longformer and cut off the rest, as we found that further increasing this limit yields no improvement.
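The sliding-window-plus-global attention pattern can be illustrated with a boolean mask: token i may attend to token j only if they are within the window of each other, or if either is a designated global token. This is a conceptual sketch, not Longformer's actual implementation:

```python
def attention_mask(seq_len, window, global_idx=()):
    """Boolean mask: mask[i][j] is True iff token i may attend to token j
    under sliding-window attention of width `window` plus global tokens."""
    g = set(global_idx)
    half = window // 2
    return [[abs(i - j) <= half or i in g or j in g
             for j in range(seq_len)]
            for i in range(seq_len)]
```

Each row has O(window + |global|) allowed positions instead of O(seq_len), which is what reduces the quadratic cost of full attention to near-linear.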
To incorporate queries in QMSum for these end-to-end models, we simply prepend the queries to the meeting transcripts, as this is standard practice for query-based summarization and question answering (Devlin et al., 2019).
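The query-prepending format can be sketched as below. The separator token and the whitespace "tokenizer" are illustrative assumptions; the actual models use their own subword tokenizers and special tokens:

```python
def build_input(query, transcript_turns, sep="</s>", max_tokens=1024):
    """Prepend the query to the flattened transcript, then truncate
    to the model's input limit (1,024 tokens for BART)."""
    text = f"{query} {sep} " + " ".join(transcript_turns)
    tokens = text.split()  # stand-in for a real subword tokenizer
    return " ".join(tokens[:max_tokens])
```

Because the query sits at the front, it always survives truncation, no matter how long the transcript is.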

Experiment Setup
For a fair comparison, we fit all models onto the same RTX 8000 GPU with 48 GiB of memory. We adopt the fairseq implementation for BART, and the original code bases for Longformer and HMNet. We inherit the hyperparameters of these models for fine-tuning in our experiments. Our most expensive experiments are fine-tuning HMNet and Longformer, which take around 8 hours, while the BART model runs in less than one hour. We use ROUGE (Lin, 2004) as our main evaluation metric and the pyrouge library as the ROUGE implementation throughout all experiments.

Result and Analysis
Here we present our findings in four corresponding subsections. We also show some concrete examples and perform qualitative analysis in § 4.5.

Dealing with Long Dialogues
We compare several methods for addressing the long input issue in dialogue summarization, including the different utterance retrieval methods described in § 3.2.1 for the retrieve-then-summarize framework, heuristics for shortening the dialogue, as well as baseline methods to establish reasonable bounds. From Tab. 2, we can see that even in query-based dialogue summarization with QMSum, randomly selecting utterances still presents a strong baseline. Across different modeling choices, the retrieve-then-summarize framework generally works better than end-to-end learning with the dialogue cut off at the maximum input length. We do not observe an advantage of Longformer over the BART model. This raises the question of whether all utterances in the dialogue are needed to produce a good summary, or whether irrelevant utterances add more noise. Moreover, we notice that all these methods show a non-trivial gap with the summarization performance on the gold span, which uses relevant utterances annotated by humans. This suggests that there is plenty of room for improvement if a better utterance retrieval method is developed.
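The random-selection baseline above can be sketched as a hypothetical helper that mirrors the 10% retrieval budget used for TF-IDF and BM25:

```python
import random

def random_retrieve(utterances, ratio=0.1, seed=0):
    """Random-selection baseline: keep a random `ratio` fraction of
    utterances, preserving their original dialogue order."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = max(1, int(len(utterances) * ratio))
    idx = sorted(rng.sample(range(len(utterances)), k))
    return [utterances[i] for i in idx]
```

That such a baseline remains competitive suggests salient content is spread widely enough through the dialogue that even random subsets capture some of it.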

Robustness to Input Length
As we discussed, some dialogues (e.g., in QMSum) contain more than 20k tokens. They exceed the input limits of most existing summarization models. In this section, we further analyze the performance of summarization models as the input length changes. To compare the robustness of the two types of models (mainly BART and HMNet), we divide the test dialogues by the number of tokens. As shown in Fig. 1, the performance of the BART model decreases sharply as the dialogue input becomes longer, while HMNet shows the opposite trend. This could result from their distinct properties: BART is pretrained on datasets with a limited length (i.e., 1,024 tokens) and the input has to be truncated to fit this limit, while HMNet obtains more information when the input is longer. However, the overall performance of HMNet is worse than that of BART.

Incorporating Queries
Certain dialogue summarization tasks, such as QMSum, require generating a summary based on a specific question about the dialogue (e.g., the opinion of a speaker or the conclusion to a topic). In this section, we study the influence of incorporating queries in dialogue summarization. Tab. 4 shows the performance of two models, BART and HMNet, on QMSum with and without queries at the beginning of the input. As input to the two models, we use the gold relevant text spans given a query in QMSum to avoid the influence of retrieval models. The results show that encoding queries has a large impact on both types of models, especially BART, even when the gold utterances are given.

Transfer Ability between Different Tasks
Pretraining has been shown effective for document summarization by introducing external knowledge from other similar tasks (Hermann et al., 2015; Fabbri et al., 2019). We hypothesize that it is especially important for dialogue summarization because the dataset sizes are usually small. Therefore, we study transfer learning between different dialogue summarization tasks via pretraining. Tab. 3 shows the performance of BART-large models that are pretrained on different datasets and later fine-tuned on QMSum and SummScreen-FD. The results show that BART-large pretrained on the CNN/DailyMail dataset (BART-CNN) yields the best performance after fine-tuning, even though CNN/DailyMail consists of news articles and is not in dialogue format. We also note that pretraining on external datasets can hurt performance, and thus such pretraining datasets need to be carefully chosen.
We also analyze whether BART-large can be further improved by pretraining it on more than one dataset. We use the BART-large model pretrained on CNN/DM (BART-CNN) as the baseline, since BART-CNN yields the best performance among the alternatives, and then further pretrain the same BART-CNN model on SAMSum and MediaSum separately. However, Tab. 3 shows that after pretraining BART-CNN on these two datasets, ROUGE scores decrease sharply on the QMSum dataset, and slightly on the SummScreen-FD dataset except for ROUGE-L. This result suggests that pretraining on multiple datasets may not further improve the performance of the pretrained models.

Case Study
We examine several summaries generated by the BART-large model pretrained on three different datasets. We find that the BART-CNN model yields the best output, with the fewest syntax errors and the content closest to the desired summary, while the output of the BART-MediaSum model is usually shorter than the gold summary, resulting in incomplete generation, and the BART-XSum model usually produces summaries with errors and duplication. This could be the result of data bias in the pretraining datasets: summaries in MediaSum and XSum are shorter than those in CNN/DM. However, despite the better performance of the BART-CNN model, these cutoff models fail to produce parts of the gold summary when the number of tokens in the input dialogue exceeds the model's maximum input length. For concrete examples, please refer to Appendix A.

Conclusion and Future Work
We first explore the lengthy input problem of dialogue summarization through experiments on transformers and retrieval models. We conclude that the retrieve-then-summarize pipeline yields the best performance. Then, our experiments demonstrate the important role of queries and the robustness to input length for different types of models. We found that adding a single query sentence to the input greatly improves ROUGE scores on QMSum. Additionally, BART performs worse when the input goes beyond 512 tokens, even with extended positional embeddings; in contrast, the hierarchical model performs better on longer inputs. We also test the transferability of summarization datasets by pretraining the language model on similar tasks. We conclude that the BART-large model pretrained on CNN/DM yields the best performance on both QMSum and SummScreen-FD.
For future work on the long input problem, we found that using an utterance retrieval model for summarization is a promising direction, yet modeling the relevance between the query and dialogue utterances remains a challenging task. As for the summarization models, it is worth exploring methods to 1) pretrain on valuable datasets for dialogue summarization, 2) better fuse the queries into the neural models, and 3) make the model robust to input length (like HMNet) while maintaining high performance (like BART).

A Generated Cases
Tab. 5 shows concrete sample summaries generated by the BART-large model pretrained on various datasets.

MediaSum
Transcribers are working on transcribing the data from the corpus . The next step is to insure that the data are clean first , and then channelized . The transcribers are also working on ensuring that the mark-up is consistent all the way throughout .
Sydney and Will are sent to a secret CIA project to find out if their father is alive or dead . Meanwhile , Sydney and Vaughn are sent to a secret CIA facility to find out what Irina is up to .

XSum
The transcribers have transcribed about thirty-five hours of transcripts from the corpus . The next step is to insure that the data is clean first , and then channelized . The transcribers are working on is to insure that the data is clean first , and then channelized . The transcribers are also incorporating additional conventions that Liz requested in terms of having a systematic handling of numbers , acronyms and acronyms which I had n't been specific about .
Sydney and Will are shocked to learn that Sydney's father , who was killed in Madagascar , is alive and working for the CIA . Will is also shocked to learn that Sydney 's mother , who was killed in the Rambaldi experiment , is alive . Will is also shocked to learn that Sydney 's father is a scientist . Will is also shocked to learn that Sydney 's mother is a scientist . Will is also shocked to learn that Sydney 's mother is a scientist . · · ·

CNN
The team was working on transcribing the data , and the next step was to ensure that the data was clean first , and then channelized . The team was working on ensuring that the data was spell-checked , that the mark-up was consistent all the way throughout , and that they incorporated additional conventions that Liz requested in terms of having a systematic handling of numbers , acronyms , and acronyms which they had n't been specific about .
Sydney and Will investigate the death of her father , who was killed in a Russian KGB operation in 1982 . They discover that the Rambaldi device was a Russian spy device , which was used to test the IQ of children . Sydney 's father was a KGB agent , and she is now a KGB agent . She is also a double agent , and she is working for the CIA . She is also working for the CIA to find out who is behind the death of her father . Meanwhile , Irina is worried about her father 's death , and she is worried about her relationship with Vaughn .

Gold
Efforts by speaker fe008 are in progress to ensure that transcripts are clean ( i.e . spell checked ) , channelized , and conform to set conventions regarding the coding of numbers , acronyms , and explicit comments ( e.g . door slams , coughs , and laughter ) . Subsequent efforts by speaker fe008 will be to tighten up boundaries on the time bins . Inter-annotator agreement was reported to be very good .Speaker mn014 's multi-channel speech/non-speech segmenter is in use .
Sydney races to find a cure for Vaughn , but in order to find the antidote , Sydney must make a deal with Sark that could endanger Sloane 's life . Meanwhile , Will continues his research for Vaughn and discovers some disturbing inconsistencies involving 20-year -old standardized IQ tests . Sydney finds out that Vaughn has a girlfriend .

Table 5: Sample output summaries of various pretrained models on QMSum and SummScreen. The summary S in row X, column Y indicates that the BART-large model pretrained on dataset X generated summary S from the test set of Y. Errors and duplication are marked in red. Out-of-boundary contents are marked in grey. Tokens marked in brown indicate keywords that appear in the Gold summary.