Controllable Abstractive Dialogue Summarization with Sketch Supervision

In this paper, we aim to improve abstractive dialogue summarization quality and, at the same time, enable granularity control. Our model has two primary components and stages: 1) a two-stage generation strategy that generates a preliminary summary sketch serving as the basis for the final summary. This summary sketch provides a weakly supervised signal in the form of pseudo-labeled interrogative pronoun categories and key phrases extracted using a constituency parser. 2) A simple strategy to control the granularity of the final summary, in that our model can automatically determine or control the number of generated summary sentences for a given dialogue by predicting and highlighting different text spans from the source text. Our model achieves state-of-the-art performance on the largest dialogue summarization corpus SAMSum, with as high as 50.79 in ROUGE-L score. In addition, we conduct a case study and show competitive human evaluation results and controllability to human-annotated summaries.


Introduction
Text summarization aims to produce an abridged version of the input text by distilling its most critical information. In particular, abstractive (as opposed to extractive) summarization requires generative models with a high level of semantic understanding, as the output words do not necessarily appear in the source text. While it is more challenging, it gives more flexibility to a summary compared to extractive summarization models (Zhang et al., 2018). Significant research efforts have been focused on summarization of single-speaker documents such as text documents (Liao et al., 2018), news (Hermann et al., 2015; Nallapati et al., 2016; See et al., 2017), or scientific publications (Qazvinian and Radev, 2008; Nikolov et al., 2018). However, dialogue summarization has not received much attention despite the prevalence of dialogues (text messages, email, social media, etc.) and the vast application potential of dialogue summarization systems.
* Equal contribution. Work mainly done when Linqing Liu was an intern at Salesforce Research.
Since dialogue language is inherently different from written text, it poses a unique set of challenges (Zechner, 2001): 1) Distributed information across multiple speakers. The most important information is usually scattered across several conversation turns from different speakers, while in articles it mostly appears in titles or the first few sentences. 2) Boundary detection. Within a turn, pauses do not always match linguistically sensible segments; it is difficult to identify critical information across turns due to surrounding non-content noise and disfluency. 3) Modeling interactions between speakers. Speaker interaction plays an important role, as it implies the current dialogue state and the status of the next speaker. If we directly apply neural abstractive summarization models, which mostly encode the whole input only as a source sequence, the flow of the dialogue would be overlooked (Pan et al., 2018). Previous methods (Goo and Chen, 2018; Liu et al., 2019) rely on explicit annotations to capture the logic of the dialogue. However, such annotations are not always available in datasets, and additional labeling is cumbersome.
To solve these challenges, we propose CODS, a COntrollable abstractive Dialogue Summarization model equipped with sketch generation. We first automatically create a summary sketch that contains user intent information and essential key phrases that may appear in the summary. It identifies the interaction between speakers and salient information in each turn. This summary sketch is prefixed to the human-annotated summary while fine-tuning a generator, which provides weak supervision as the final summary is conditioned on the generated summary sketch. In addition, we propose a length-controllable generation method specifically for dialogue summarization. Desired lengths of summaries strongly depend on the amount of information contained in the source dialogue and the granularity of information the user wants to understand (Kikuchi et al., 2016). We first split the dialogue into segments by matching each summary sentence linearly to its corresponding dialogue context. Then we train our model to generate only one sentence for each dialogue segment. This strategy makes use of the distributed information of the dialogue and makes the generated summaries more trackable. We base our model on BART-xsum (Lewis et al., 2019), which is first pre-trained with unsupervised denoising objectives, and further fine-tuned on the news summarization corpus XSUM (Narayan et al., 2018). We evaluate our approach on SAMSum (Gliwa et al., 2019), the largest dialogue summarization dataset. Experimental results show that CODS achieves state-of-the-art dialogue summarization performance on several automatic metrics.
Figure 1: An input and output example. Given the dialogue, we first construct a summary sketch with intent and key phrase information for each turn, and then split the dialogue into several segments (marked with dashed lines on the left hand side) for model controllability and interpretability.
The main contributions of this work are: 1) we propose a two-stage strategy that uses an artificial summary sketch as weak supervision, 2) we introduce a text-span based conditional generation approach to control the granularity of generated dialogue summaries without human-written summaries at different detail levels, and 3) we conduct a comprehensive case study and human evaluation to show that CODS can produce consistent and informative summaries, especially controllable summaries, which existing models either cannot produce or produce poorly.

Methodology
Our model is based on pre-trained generative language models (Section 2.1). Given an input dialogue history, our model first generates a summary sketch that serves as an additional weakly supervised signal for the final summary (Section 2.2). Then it predicts the text span cutoffs over the entire dialogue and generates summaries accordingly (Section 2.3). We define the conversational history input as $D = \{X_1, X_2, \ldots, X_N\}$, where each $X_i$ is a sequence of words, $N$ is the total number of dialogue turns, and the input may contain more than two speakers. We intend to generate an $M$-sentence dialogue summary $Y = \{Y_1, \ldots, Y_M\}$ that is supposed to be briefer than the overall dialogue history.

Generative Pre-trained Language Models
As a first step, our model needs to transform a conversational history input into a dialogue summary. Recently, self-supervised pretrained language models (Radford et al., 2019; Dong et al., 2019) have been employed as encoders and decoders since they have achieved remarkable success across many NLP tasks. For general text summarization, this has also been the case with models such as BART (Lewis et al., 2019) and PEGASUS (Zhang et al., 2019a). However, no results have been reported for self-supervised pretrained language models applied to dialogue summarization, and it has been argued that there is an intrinsic difference in linguistic patterns between human conversations and written text (Wolf et al., 2019b; Wu et al., 2020a; Wu and Xiong, 2020). We would like to answer the question of which generative language model is the best base model for dialogue summarization tasks.

Sketch Construction
Conversational data, unlike news or scientific publications, includes many non-factual sentences such as chit-chat and greetings. Removing this less critical information from the dialogues could potentially help the model better focus on the main content. Based on this hypothesis, we combine a syntax-driven sentence compression method (Xu and Durrett, 2019) with neural content selection.
Another potentially useful attribute of conversational data is that each dialogue turn inherently encodes user intent. However, unlike task-oriented dialogue systems, which have explicitly annotated intents (e.g., book flight and check account), dialogue summarization data rarely have such labels. Thus we use a few heuristics with Snorkel (Ratner et al., 2019) to programmatically label each turn with a predefined interrogative pronoun category. The generated intents and the compressed dialogues together constitute the summary sketch, which serves as a weakly supervised signal.
To the best of our knowledge, there is no established label set for non-task-oriented dialogues. Thus we draw upon the Five Ws principle, which is often mentioned in journalism and research investigation: a passage can only be considered complete if it answers questions starting with these interrogative words (Hart). We adapt this principle to the dialogue scenario and identify a set of interrogative pronouns that supports sufficiently diverse user intents across all utterances, serving as the dialogue's logic. For example, in Figure 1, Morgan asked Suzanne "Do you feel like going to a concert next week?" One can expect that Suzanne will confirm her willingness in the next utterance. We define dialogue intent categories including why, what, where, confirm, and abstain. More information for each category is shown in the Appendix (A.1).
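The rule-based labeling idea can be illustrated with a minimal sketch. Plain Python stands in here for Snorkel labeling functions, and the keyword patterns and the `label_intent` helper are hypothetical: the rules actually used are not given in the text.

```python
import re

# Hypothetical keyword patterns per category (illustrative assumptions only).
# Order matters: the first pattern that fires decides the label.
INTENT_PATTERNS = [
    ("why", re.compile(r"\b(why|how come|because)\b")),
    ("where", re.compile(r"\b(where|which place)\b")),
    ("what", re.compile(r"\b(what|which|who|when|how)\b")),
    ("confirm", re.compile(r"\b(yes|yeah|sure|ok|okay|no|nope)\b")),
]


def label_intent(utterance: str) -> str:
    """Return the first matching category, or 'abstain' if no rule fires."""
    text = utterance.lower()
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return label
    return "abstain"
```

A turn that matches none of the rules falls back to abstain, which mirrors how a weak-supervision labeler declines to vote.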
To compress and remove noisy sub-sentences in the dialogue, we first use a trained constituency parser (Kitaev and Klein, 2018) to parse each utterance. Then we compare the parsed phrases with the ground-truth summary to find their longest common sub-sequence (LCS), and we set a threshold to filter and remove non-meaningful words (e.g., stop words) in the LCS. Note that there are circumstances where the whole utterance is noisy and removable. Overall, we construct a summary sketch by concatenating the utterance index, user intent label, and compressed utterance for every turn in the dialogue history into a string, ending with a special token, "TL;DR". Taking Figure 1 as an example, the summary sketch is "1 what 2 abstain 's just one of ... square garden 8 why 9 abstain TL;DR". We train our model first to generate this summary sketch and then to generate the final summary in an autoregressive way. We use the TL;DR token to distinguish the sketch from the final summary at inference time.
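The sketch construction can be illustrated as follows, under stated assumptions: tokens are whitespace-split words, the stop-word list is a toy one, and the threshold is a count of non-stop words in the LCS. The constituency parser is not included; `keep_phrase` takes an already parsed phrase, and all function names are hypothetical.

```python
from typing import List, Tuple

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}  # illustrative only


def lcs_tokens(a: List[str], b: List[str]) -> List[str]:
    """Longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def keep_phrase(phrase: str, summary: str, min_content_words: int = 1) -> bool:
    """Keep a phrase if its LCS with the summary has enough non-stop words."""
    common = lcs_tokens(phrase.lower().split(), summary.lower().split())
    content = [w for w in common if w not in STOP_WORDS]
    return len(content) >= min_content_words


def build_sketch(turns: List[Tuple[str, str]]) -> str:
    """turns holds (intent, compressed_utterance) per turn; use '' for a dropped turn."""
    parts = [f"{idx} {intent} {phrase}".strip()
             for idx, (intent, phrase) in enumerate(turns, start=1)]
    return " ".join(parts) + " TL;DR"
```

For instance, `build_sketch([("what", ""), ("abstain", "at work")])` yields a string in the same "index intent phrase ... TL;DR" layout as the Figure 1 sketch.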

Controllability
Due to the success of controllable language modeling (Keskar et al., 2019), the ability to control text summarization in the news domain has gradually been attracting attention (Fan et al., 2018; Liu et al., 2018). The high-level intuition for our solution is that if we can control a generative model to generate only one sentence as output for a partially highlighted input, we can control the number of output sentences by choosing how to highlight the input. We highlight each dialogue split using the special token <hl>. For example, in Figure 1, we generate the first summary sentence for the first segment from turn one to four, and the second and third from turn five to seven and turn eight to nine, respectively (separated by the dashed lines). This way, we not only gain summary controllability but also make the generation more interpretable.
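A minimal sketch of the highlighting step, assuming turns are plain strings joined by spaces and that one highlight token is placed before the first turn and one after the last turn of the chosen segment. The `highlight_segment` name and the 0-based inclusive indices are illustrative choices.

```python
from typing import List

HL = "<hl>"  # highlighting token named in the text


def highlight_segment(turns: List[str], start: int, end: int) -> str:
    """Return the full dialogue with turns[start..end] wrapped in highlight tokens.

    start and end are 0-based, inclusive turn indices.
    """
    if not 0 <= start <= end < len(turns):
        raise ValueError("segment indices out of range")
    pieces = []
    for i, turn in enumerate(turns):
        if i == start:
            pieces.append(HL)
        pieces.append(turn)
        if i == end:
            pieces.append(HL)
    return " ".join(pieces)
```

Calling the function once per segment gives one model input per summary sentence, each containing the whole dialogue with a different highlighted span.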
The next challenge is that, during training, we have to find a mapping from each sentence in a reference summary to its corresponding dialogue split. In other words, how do we know where to insert the highlighting tokens? We do so by training a dialogue-turn-level binary classifier (detailed below) that predicts whether each turn is a cutting point (i.e., dialogue segmentation). Our observation is that sentences within a reference summary usually have a strong temporal dependency, that is, people summarize the dialogue almost linearly. We use a simple approach to find the cutting points: the highest similarity score between conversations and each summary sentence. The cutting point $t_m$ for the $m$-th summary sentence is
$$t_m = \arg\max_{t} \mathrm{SIM}(X_{c_m:t}, Y_m), \quad (1)$$
where SIM could be any similarity function (we use ROUGE-1), and $c_m$ is the accumulated turn index ($c_1 = 1$ and $c_m = t_{m-1}$) that indicates which part of a dialogue has been covered. Note that for a summary with $M$ sentences, we only need to decide $M - 1$ cutting points. With the pseudo labels ($t_m$) provided by this heuristic, we formulate the dialogue segmentation problem as a binary classification problem. Specifically, we train a classifier $C$, which takes the dialogue history as input and predicts whether each dialogue turn is a cutting point. We prefix each dialogue turn with a separation token $x_{\mathrm{sep}}$ as input to the classifier:
$$H = C([x_{\mathrm{sep}}, X_1, x_{\mathrm{sep}}, X_2, \ldots, x_{\mathrm{sep}}, X_N]) \in \mathbb{R}^{N \times d_{emb}}, \qquad \hat{P} = \mathrm{Sigmoid}(W_1(H)) \in \mathbb{R}^{N}. \quad (2)$$
The classifier output $H$ is the representations of those separation tokens, and each of them is a $d_{emb}$-dimensional vector. $W_1 \in \mathbb{R}^{d_{emb} \times 1}$ is a trainable linear mapping. $\hat{P}$ is the predicted segment probability, which is trained with binary cross-entropy loss. We use a BERT-base model (Devlin et al., 2018) as the classifier, and the $i$-th cutting point is triggered if $\hat{P}_i > 0.5$. This prediction means that our model can automatically determine how many sentences should be generated in the final summary. If no cutting point is triggered, we generate a one-sentence summary. If one cutting point is triggered, we will have a two-sentence summary, and so forth.
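The similarity-based heuristic for pseudo cutting points can be sketched as below. Several details are assumptions made for illustration: ROUGE-1 is simplified to a whitespace-token unigram F1, ties go to the earliest turn, every remaining summary sentence is guaranteed at least one turn, and cut points are reported as 0-based exclusive end indices of segments.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 with whitespace tokenization (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def find_cut_points(turns: List[str], summary_sents: List[str]) -> List[int]:
    """Return M-1 cut points, each the exclusive end index of a segment."""
    if len(turns) < len(summary_sents):
        raise ValueError("need at least one turn per summary sentence")
    cuts, start = [], 0
    for m, sent in enumerate(summary_sents[:-1]):
        remaining = len(summary_sents) - 1 - m  # later sentences need >= 1 turn each
        last_end = len(turns) - remaining
        best_end, best_score = start + 1, -1.0
        for end in range(start + 1, last_end + 1):
            score = rouge1_f1(" ".join(turns[start:end]), sent)
            if score > best_score:  # strict '>' keeps the earliest best end
                best_end, best_score = end, score
        cuts.append(best_end)
        start = best_end
    return cuts
```

The last summary sentence needs no search because it takes all turns after the final cut, which is why only M-1 cut points are returned.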
Finally, we can control the number of output summary sentences by controlling the dialogue split. Specifically, we first decide the expected number of output sentences (e.g., $K$), and then we choose the top $K - 1$ indexes with the highest probabilities in the segmentation probability $\hat{P}$. We use these $K - 1$ indexes as cutting points. We can also generate a one-sentence summary by enclosing the whole dialogue in one pair of highlighting tokens, placed at the beginning and the end of the dialogue (we call this setting CODS-1).
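Both the automatic mode (threshold 0.5) and the controlled mode (top K-1 probabilities) reduce to a small selection step over the per-turn probabilities. The sketch below assumes the probabilities are already available as a list; `select_cut_points` is a hypothetical helper name.

```python
from typing import List, Optional


def select_cut_points(probs: List[float], k: Optional[int] = None,
                      threshold: float = 0.5) -> List[int]:
    """Pick cutting-point turn indices from per-turn segmentation probabilities.

    k=None: automatic mode, keep every index whose probability exceeds threshold.
    k>=1:   controlled mode, keep the k-1 most probable indices.
    """
    if k is None:
        return [i for i, p in enumerate(probs) if p > threshold]
    if not 1 <= k <= len(probs) + 1:
        raise ValueError("k must be between 1 and len(probs) + 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[: k - 1])  # return cut points in dialogue order
```

With `k=1` the function returns no cut points, which corresponds to highlighting the whole dialogue as a single segment.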

Overall Generation
The overall training and inference block diagrams are shown in Figure 2. CODS follows a standard encoder-decoder framework. During training, we use dialogue segmentation to add highlighting tokens for each summary sentence. We take the highlighted dialogue history as input and train our model to generate its corresponding summary sketch and summary sentence. For example, for the first summary sentence in Figure 1, we input the whole dialogue with highlighting tokens added at the beginning of the first turn and at the end of the fourth turn, and generate output that contains the corresponding summary sketch "1 what 2 abstain ... well-deserved break" and the first summary sentence "Suzanne is at work and is having a break now." The entire model is trained using cross-entropy loss for the generated tokens. During inference, we first use the trained binary classifier to predict cutting points. Then, we use the predicted segmentation to add highlighting tokens into a dialogue. Finally, after generating multiple summary sentences separately, we concatenate them to form the final summary.

Experiments

Dataset
We perform experiments on the recently released SAMSum dataset (Gliwa et al., 2019), which is the most comprehensive resource for abstractive dialogue summarization tasks. It contains 16K natural messenger-like dialogues created by linguists fluent in English with manually annotated summaries. This dataset is more challenging than the previous corpus (McCowan et al., 2005) in the following aspects: 1) unlike previous datasets consisting of only hundreds of dialogue-summary pairs, it has a larger data size (16369 samples); 2) 75% of the conversations are between two interlocutors, and the rest are between three or more people; 3) the conversations cover diverse real-life topics, and the summaries are annotated with information about the speakers. We preprocess the data with the following steps: 1) concatenate adjacent utterances of the same speaker into one utterance; 2) clean the dialogue text by removing hashtags, URLs, and emojis; 3) label each utterance with its corresponding interrogative pronoun category with a weak supervision approach (Ratner et al., 2019); 4) parse each utterance with a constituency parser and take the longest common sub-sequence between the phrases and the summary as the key phrases.
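Preprocessing steps 1 and 2 can be sketched as follows. The regular expressions are assumptions chosen for illustration (in particular, the emoji range is a rough approximation), and the dialogue is assumed to be a list of (speaker, utterance) pairs.

```python
import re
from typing import List, Tuple

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# A rough emoji range; an approximation, not a complete emoji matcher.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_text(text: str) -> str:
    """Remove URLs, hashtags, and emojis, then normalize whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, EMOJI_RE):  # URLs first: they may contain '#'
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def merge_adjacent(turns: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Clean each utterance and merge consecutive utterances by the same speaker."""
    merged: List[Tuple[str, str]] = []
    for speaker, utterance in turns:
        utterance = clean_text(utterance)
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, f"{merged[-1][1]} {utterance}".strip())
        else:
            merged.append((speaker, utterance))
    return merged
```

Cleaning before merging gives the same result as the order listed in the text, since both operations act on each utterance independently.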

Evaluation Metrics and Baselines
We use the standard ROUGE metric (Lin, 2004) for automatic evaluation, including ROUGE-1, ROUGE-2, and ROUGE-L. Following previous work (Gliwa et al., 2019), we use the py-ROUGE library with stemming. We compare our model with baselines reported in Gliwa et al., 2019: Longest-3 is a commonly used extractive summarization baseline which takes the three longest sentences as the summary. The pointer generator and Fast abs are RNN-based methods with a copy-attention mechanism or policy gradient. The Transformer is a randomly initialized self-attention architecture with multi-head attention. DynamicConv is a lightweight convolutional model that can perform competitively with self-attention. None of these models is pre-trained. In addition, we investigate four pre-trained generative language models to see which works best for the dialogue summarization task. DialoGPT is a GPT model pre-trained on open-domain Reddit data. UniLM is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction on English Wikipedia and BookCorpus. PEGASUS masks important sentences from the input and is trained to generate the missing parts, similar to an extractive summary approach. BART is trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. We use the default parameters listed in the respective open-source repositories to fine-tune on the dialogue summarization task. We show the training details in the Appendix.
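The Longest-3 baseline is simple enough to sketch directly. The version below makes assumptions the text does not fix: it operates on utterances rather than parsed sentences, measures length in whitespace-separated words, and returns the selected utterances in their original dialogue order.

```python
from typing import List


def longest_k(utterances: List[str], k: int = 3) -> List[str]:
    """Return the k longest utterances (by word count), in dialogue order."""
    ranked = sorted(range(len(utterances)),
                    key=lambda i: len(utterances[i].split()), reverse=True)
    return [utterances[i] for i in sorted(ranked[:k])]
```

Setting `k=1` gives the single longest utterance, the same idea as a one-sentence extractive baseline.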

Results
In the ROUGE results in Table 1, we find that the methods that are pre-trained or use pre-trained embeddings perform better than those that do not. For instance, DynamicConv achieves a 3-4% improvement by adding GPT-2 embeddings. This further confirms the impact of language model pre-training on downstream tasks. Among the pre-trained generative language models examined, PEGASUS and BART are the two top-performing models, with ROUGE-1 higher than 50. DialoGPT, the model pre-trained on conversational data, does not achieve satisfactory results, implying that Reddit data has limited knowledge to be transferred to dialogue summarization tasks. CODS achieves the highest ROUGE score compared with other models, notably 50.79% ROUGE-L.
To understand the individual contribution of each component in our model, we also conduct an ablation study by removing summary sketch generation (BART+Ctrl) or controllability (BART+Sketch). In both cases we observe a performance drop, except a slight improvement on ROUGE-1 for BART+Ctrl. This suggests that the sketching step helps generate a more fluent summary even with lower unigram matching. Furthermore, recognizing that ROUGE scores cannot fully capture the resemblance between the generated summary and the reference, in Table 2 we follow Fabbri et al. (2020) and compare model performance with additional metrics, including BLEU (Papineni et al., 2002) and CIDEr (Vedantam et al., 2015). As shown in Table 2, CODS consistently outperforms PEGASUS and BART. More information about these evaluation metrics is given in the Appendix.

Human Evaluation by Crowdsourcing
We leverage human judgement to evaluate the generated summaries via crowdsourcing, especially for granularity-controlled generation, since we do not have human-written reference summaries of various lengths (number of sentences). We ask workers to rate the summaries in two aspects on a scale from -1 (worst) to 1 (best): factual consistency and informativeness. Factual consistency acts as a precision measure, assessing whether the information provided in a summary contains factual errors that contradict the source dialogue. Informativeness is a recall-oriented measure, examining whether critical information in a dialogue is mentioned in the summary. We also show the length ratio between a summary and a dialogue, where a lower ratio means a higher compression rate. For the crowdsourcing evaluation, we randomly select 6% of the dialogues from the test set, each of which is annotated by three workers. More details about the human evaluation process are in the Appendix.
To show the strengths and quality of the proposed controllable generation, we provide two additional baselines, Longest-1 and BART-1. Longest-1 is an extractive baseline that outputs the longest dialogue turn as the final summary. BART-1 is a strong abstractive baseline where we train a BART-based summarization model with the number of summary sentences in the training set as its start-of-sentence token during decoding. Similar to the approach from Liu et al., 2018, we can use different start-of-sentence tokens to control the BART output.

Table 5: Sample summaries for the three problematic cases (Reference vs. CODS).

Associate names with actions
Reference: Lilly will be late. Gabriel will order pasta with salmon and basil for her.
CODS: Lilly will be late for the meeting with Gabriel. Gabriel will order something for Lilly.
Reference: Ann doesn't know what she should give to her dad as a birthday gift. He's turning 50. Fiona tries to help her and suggests a paintball match.
CODS: It's Ann's dad's 50th birthday. He's turning 50. Ann and Fiona are planning a surprise birthday party for her dad.

Extract information after the discussion
Reference: Paul will buy red roses following Cindy's advice.
CODS: Paul wants to buy red roses.

Decide important information
Reference: Rachel's aunt had an accident and she's in hospital now. She's only bruised. The perpetrator of the accident is going to pay for the rehabilitation.
CODS: Rachel is at the hospital with her aunt, who had an accident. She's bruised but fine. She will give her a hug.
Reference: Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
CODS: Amanda can't find Betty's number. Amanda suggests to text him.
In general, it is preferable to have a factually consistent and informative summary that is succinct (low length ratio, high compression rate) at the same time. As shown in the first row of Table 3, CODS-1 achieves the highest informativeness score among all generated one-sentence summaries, indicating the strength of the proposed controllable method in producing succinct yet informative dialogue summaries. The Longest-1 method has a higher consistency score because its summary is directly copied from the original dialogue, preventing any factual mistakes. The second row of Table 3 shows that CODS, when automatically determining the granularity of the summary, produces summaries that are more succinct (lower length ratio), more factually consistent, and more informative, compared to the BART model.

Case Study
CODS outperforms the baseline models in both ROUGE scores and human evaluation metrics. We now further inspect its textual quality. In Table 4, we show an example from the SAMSum test set with summaries generated by different models. In this example, CODS and CODS-1 can both produce a near-perfect summary even compared to the human-written reference summary. On the other hand, the summary generated by BART includes overly detailed information (e.g., bank account). We show some more examples in the Appendix and all the predictions (including CODS-1 and CODS-2) in the supplementary file.
We also manually examine 100 summaries generated by CODS against the reference summaries in the test set. Specifically, we analyze each of the following three problematic cases, in which summarization models frequently make mistakes, as reported by Gliwa et al., 2019, and provide sample summaries in Table 5. 1) Associating names with actions: CODS performs well in dealing with speakers' names. It accurately associates "her dad" with "Ann's dad," and also "Fiona tries to help her" with "Ann and Fiona." 2) Extract information about the arrangement after discussion: Even though the speakers hesitate over whether the flowers should be yellow, pink, or red in the middle of the discussion, CODS still correctly determines the right color after several turns. 3) Decide important information in dialogues: CODS fails to capture some of the important facts (marked as red) mentioned in the reference summary. We conjecture that the reason could be that 1) some of the important facts are located in the same part of the highlighted turns, and 2) that information is missed by the key phrase extraction. At the same time, we force the model to generate only the most important one under the constraint of controllability. The improvement of CODS on the first two summarization difficulties can be partially attributed to the clear logic in the sketch when input to the model.

Related Work
Neural Text Summarization There are two main paradigms for text summarization: extractive and abstractive. Inspired by the success of applying seq2seq models to neural machine translation, Rush et al., 2015 and Nallapati et al., 2016 introduce the neural seq2seq model for abstractive text summarization, with an attention-based encoder and a neural language model decoder. To solve the problem of out-of-vocabulary words and to capture salient information in source documents, See et al., 2017 propose a pointer-generator network that copies words from source to target. Many subsequent works (Gehrmann et al., 2018; Paulus et al., 2018) further demonstrate its effectiveness with reinforcement learning. Recently, Liu and Lapata, 2019 apply BERT to text summarization and propose a general framework for both extractive and abstractive models. Zhang et al., 2019c pre-train a hierarchical document encoder for extractive summarization. Lewis et al., 2019 introduce BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART significantly outperforms the best previous work in terms of ROUGE metrics.
Dialogue Summarization Regarding datasets for dialogue summarization, initial abstractive dialogue summarization work (Oya et al., 2014; Mehdad et al., 2014; Banerjee et al., 2015) is conducted on the AMI meeting corpus (McCowan et al., 2005), with only 141 summaries. Goo and Chen, 2018 propose to use the topic descriptions (high-level goals of meetings) in AMI as reference summaries and use dialogue acts as training signals. Pan et al., 2018 build the Dial2Desc dataset by reversing a visual dialogue task, aligning image dialogues with the image caption as a summary. Liu et al., 2019 collect their dataset from the logs of the DiDi customer service center. It is restricted to task-oriented scenarios, where one speaker is the user and the other is the customer agent, with limited topics, and it is also connected to the goal of the dialogue state tracking task (Wu et al., 2019a, 2020b). Recently, Gliwa et al., 2019 introduce the SAMSum corpus, with 16k chat dialogues with manually annotated summaries. It is the first comprehensive abstractive dialogue summarization dataset spanning various lengths and topics. Chen and Yang, 2020 propose a multi-view sequence-to-sequence model by extracting different views of structures from conversations. Both their method and ours leverage rich conversation structure information. Evaluated on SAMSum, our model CODS outperforms theirs by 3 points in terms of ROUGE scores, indicating that the dialogue features we use are more effective.
Length-controllable Generation The most prevalent method for length-controlled generation is using a special length embedding. Kikuchi et al., 2016 first propose length control for abstractive summarization by using a length embedding as an additional input for the LSTM. Fan et al., 2018 train embeddings that correspond to each different output length and prepend that length marker at the beginning of the decoder. Liu et al., 2018 incorporate the length embedding into the initial state of a CNN-based decoder. Takase and Okazaki, 2019 extend the positional encoding in the Transformer model by considering the remaining length explicitly at each decoding step. Saito et al., 2020 propose to control the summary length with a prototype extractor. However, the retrieve-and-rewrite process is restricted by the extraction quality, leaving its performance limited by the capabilities of extractive solutions. The aforementioned works all focus on structured text summarization (e.g., news documents). We are the first to propose generating length-controllable summaries of dialogues by highlighting arbitrary numbers of dialogue spans.

Conclusion
The dialogue summarization task is challenging but with vast application potential. We propose CODS, a state-of-the-art dialogue summarization model with granularity controllability. CODS uses a weakly-labeled summary sketch for its two-stage generation, and text-span conditional generation for a controllable summary. Our model surpasses existing models on the largest dialogue summarization dataset. We show with human evaluation that our model can generate factually consistent and informative summaries. We also point out several error cases to shed light on future research direction of controllable dialogue summarization.

References
Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Abstractive meeting summarization using dependency graph fusion. In Proceedings of the 24th International Conference on World Wide Web, pages 5-6.
Jiaao Chen and Diyi Yang. 2020. Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4106-4118.

A.3 Sketch Construction
Previous methods (Goo and Chen, 2018;Pan et al., 2018) heavily rely on explicit intent annotations in datasets. We label user intent automatically for each utterance with the Snorkel library in a weak supervision approach. For each interrogative pronoun category, we first manually identify its most frequent key words and patterns (can be found in our source code). Then we use the labeling functions in Snorkel to label all the utterances.
For the utterance compression, we compute the LCS on the phrases generated by the constituency parser. In the example in Figure 1, the parsed constituent "'s just one of many boring days at work" overlaps with "at work" in the summary, so we keep this phrase. However, in other examples, not all overlapping words are meaningful (e.g., stop words). We thus filter the LCS results and only keep important key phrases. Then we train our model to predict these key phrase spans in each turn. We show three examples of our generated key phrases in summary sketches on the evaluation set (see Tables 6, 7, and 8).

A.4 Training Details
We use huggingface (Wolf et al., 2019a) implementation to fine-tune a BART model. We use the large version fine-tuned on the XSUM (Narayan et al., 2018) dataset with 12 self-attention encoder and decoder layers. We truncate input dialogue to a maximal length 512 with training batch size 4. We train the model with Adam optimizer (Kingma and Ba, 2014) with 0.1 proportion for linear learning rate warmup. We early stop on validation set ROUGE-1 score, and it is trained for around 40,000 steps on one NVIDIA V100 GPU. During inference, we do beam search decoding with beam size 4.

Dialogue
Phil: is brandon in ?
Clara: not yet .
Phil: has he called to say he'd be late ?
Clara: no , he hasn't .
Phil: it's not the first time , ist it ?
Clara: no , it isn't .
Phil: when he arrives , tell him to come to me .
Clara: no , it isn't .
Phil: please prepare a report on the absenteeism and lateness . i expect it by friday on my desk .
Clara: it will be ready .

Summary
Brandon is late again. Clara will prepare a report on the absenteeism and lateness for Phil by Friday.

A.5 Evaluation Metrics
The following descriptions are taken from Fabbri et al. (2020):
• ROUGE measures the number of overlapping textual units between the generated summary and a set of reference summaries.
• ROUGE-WE extends ROUGE by taking cosine similarity of Word2Vec embeddings into account.
• BERTScore computes similarity scores by aligning generated and reference summaries on a token-level based on the output of the BERT-based model. Token alignments are computed greedily with the objective of maximizing the cosine similarity between contextualized token embeddings. We report the F1 score.
• MoverScore measures semantic distance between a summary and reference text by making use of the Word Mover's Distance operating over n-gram embeddings pooled from BERT representations.
• Sentence Mover's Similarity (SMS) extends Word Mover's Distance to view documents as a bag of sentence embeddings as well as a variation which represents documents as both a bag of sentences and a bag of words.
• BLEU is a corpus-level precision-focused metric which calculates n-gram overlap between a candidate and reference utterance and includes a brevity penalty. It is the primary evaluation metric for machine translation.
• CIDEr computes 1-4-gram co-occurrences between the candidate and reference texts, down-weighting common n-grams and calculating cosine similarity between the n-grams of the candidate and reference texts.
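To make the greedy alignment described for BERTScore concrete, here is a minimal sketch that takes precomputed token embeddings as plain lists of floats. It is an illustration of the matching step only: the BERT encoder is not included, no importance weighting is applied, and `greedy_match_f1` is a hypothetical name rather than the library's API.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand: List[Vector], ref: List[Vector]) -> float:
    """F1 from greedy token alignment over precomputed token embeddings."""
    if not cand or not ref:
        return 0.0
    # Each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Identical embedding sets score 1, and sets with no similar tokens score 0, which matches the intuition behind the reported F1.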

A.6 Human Evaluation
We use roughly 6% of the test set data in SAMSum for human evaluation, and we do some filtering based on the annotation of the "gold summary". Specifically, we filter out annotations in which a "gold summary" has been annotated as "-1" (the meaning of each score is shown below), which implies that the annotators may not have paid attention to the scoring. The final results reported in Table 3 are the mean over three different annotators. The "gold summary" is actually not perfect and might contain some noisy annotation, which is why some workers may give 0 even though it is collected from humans. Below is the scoring instruction we sent to our workers:
• Factual Consistency (Precision): The rating measures whether the information provided in a summary is correct. Score -1 if a summary contains a serious factual error. Score 0 if a summary has some minor factual errors. Score 1 if everything in a summary is factually correct.
• Informative (Recall): The rating measures whether all the important information in a dialogue is included in a summary. Score -1 if a summary misses serious key points. Score 0 if a summary misses a few key points. Score 1 if a summary covers all key points.
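The filtering and averaging described above can be sketched as follows. The record layout is a hypothetical one chosen for illustration: each annotation is a dict mapping a system name to its score in {-1, 0, 1}, with the reference summary stored under the key "gold".

```python
from statistics import mean
from typing import Dict, List

# Hypothetical layout: one dict per (worker, dialogue) annotation.
Annotation = Dict[str, int]


def aggregate_scores(annotations: List[Annotation],
                     gold_key: str = "gold") -> Dict[str, float]:
    """Drop annotations that rate the gold summary -1, then average per system."""
    kept = [a for a in annotations if a.get(gold_key) != -1]
    systems = sorted({name for a in kept for name in a})
    return {s: mean(a[s] for a in kept if s in a) for s in systems}
```

The same function can be applied once per rating aspect (factual consistency and informativeness) to produce the two score columns.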