Towards Understanding Omission in Dialogue Summarization

Dialogue summarization aims to condense lengthy dialogues into concise summaries, and has recently achieved significant progress. However, the results of existing methods are still far from satisfactory. Previous works indicated that omission is a major factor affecting the quality of summarization, but few of them have further explored the omission problem, such as how omission affects summarization results and how to detect omission, which is critical for reducing omission and improving summarization quality. Moreover, analyzing and detecting omission relies on summarization datasets with omission labels (i.e., which dialogue utterances are omitted in the summarization), which are not available in the current literature. In this paper, we propose the OLDS dataset, which provides high-quality omission labels for dialogue summarization. By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omitted information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization. Therefore, we formulate an omission detection task and demonstrate that our proposed dataset can well support the training and evaluation of this task. We also call for research action on omission detection based on our proposed dataset. Our dataset and codes are publicly available.


Introduction
With the exponential increase in the volume of conversational messages in daily life, there is a growing demand for dialogue summarization (Murray et al.). However, previous works have observed that existing models frequently omit salient information from the dialogue, and omission remains a major factor degrading summary quality.

To reduce the omission rate and improve summarization quality, a comprehensive analysis of the omission problem (e.g., how omission affects summary results) and precise omission detection (i.e., locating which dialogue utterances are omitted in the summary) are important. However, there are no omission-related datasets in the dialogue summarization literature to support such analysis and detection. Hence, in this work, we construct the OLDS dataset, which provides high-quality Omission Labels for Dialogue Summarization. Our dataset is built upon five existing benchmarks covering different domains. For each dialogue, we use different abstractive models to generate diverse candidates and propose a reference-based strategy to automatically label omissions for these candidates. Human evaluation indicates that the OLDS dataset provides omission labels of high quality.
Based on the curated OLDS dataset, we comprehensively investigate the omission problem in dialogue summarization from multiple aspects. First, we analyze the proportion of candidates with omission errors and the position distribution of omitted information in dialogues. The results reveal that omission is a severe problem that frequently occurs in dialogue summarization. Second, we measure the correlation between the omission rate and multiple reference-based metrics (e.g., ROUGE and BERTScore), discovering that omission is one of the decisive factors influencing the summary evaluation results. Third, we explore the potential performance improvement brought by utilizing the omission information in a post-editing manner. The analyses show that candidate summaries can be effectively improved as long as the model is provided with the omitted dialogue utterances. Hence, how to accurately locate omitted information in a dialogue naturally becomes a critical question.
To pave the way to omission mitigation and summary improvement, we formulate the task of omission detection, which aims to identify the omitted utterances given the full dialogue and a generated summary with potential omissions. In addition, we present three different frameworks as baselines for the omission detection task: pair-wise classification, sequence labeling, and pointer network extraction. Experimental analyses on the OLDS dataset reveal that omission detection, as a promising direction for assessing and improving dialogue summarization, offers significant value and poses notable challenges.
The contributions of our paper are as follows:
• We propose OLDS, a dataset with high-quality omission labels for dialogue summarization, to facilitate research on the omission problem.
• Based on OLDS, we systematically analyze the omission problem and demonstrate the significance of omission in dialogue summarization.
• We introduce the omission detection task that paves the way to omission mitigation and summary improvement. We design three frameworks as baselines and conduct comprehensive analyses to provide possible directions for solving this task.

The OLDS Dataset
In this section, we first define what omission is. Then, we introduce OLDS, a dataset that contains Omission Labels for Dialogue Summarization, facilitating both the analysis of the omission problem and the exploration of how to identify omitted content. Finally, we conduct a human assessment that demonstrates the high quality of OLDS.

Reference summary:
Adam and Karen are worried that May suffers from depression. Karen will call her friend who is a psychologist and ask for advice.

Candidate summary:
May is depressed. Karen suggested she should see a specialist, but she doesn't want to. Karen will call her friend for advice.

Omission utterances (Labels):
(03) (15)

Table 1: An example of the OLDS dataset. The dialogue is from SAMSum and the candidate summary is generated by BART large. The salient words are underlined, and the omission information is highlighted in red.

The Definition of Omission
In summarization tasks, omission is one of the most common factual errors in abstractive summaries. It usually refers to content that is missing from the candidate but present in the gold reference. The definition of omission content is flexible: it could refer to omitted keywords, text spans, or utterances. In dialogues, an utterance has a clearer boundary than a text span and can be viewed as a basic unit for identification and evaluation. Therefore, in this paper, we mainly focus on utterance-level omission and provide utterance-level labels. Table 1 shows an example from our OLDS dataset, which contains the original dialogue, reference summary, candidate summary, and omission labels. In this example, the candidate summary omits three key messages: the person "Adam", the attitude "worried", and the persona "psychologist"; the corresponding utterance-level omission labels are thus the 3rd and 15th utterances of the original dialogue.

Finally, based on the collected candidate summaries, we need to identify which salient information is omitted in these candidates. Therefore, we elaborately design a strategy to label omissions automatically; the details are described in the next subsection. As a result, OLDS contains multiple candidates and their corresponding omission labels for each dialogue. More details about the dataset creation can be found in Appendix A.

The Automatic Labeling Strategy
It is generally non-trivial to identify the missing critical content in a candidate summary. Fortunately, the existing datasets provide reference summaries as ground truths, so we can locate the omitted information in the dialogue by directly comparing candidates with references. Thus, we design a pipeline strategy for automatic omission labeling, which is composed of three steps: oracle extraction, omission identification, and redundancy removal. Appendix A.1 shows an example of the complete process of automatic omission labeling.
Oracle Extraction The first step is to match summaries to the corresponding utterances in the dialogue. Following Nallapati et al. (2017), we use a greedy algorithm to select utterances from the dialogue that maximize the ROUGE score (Lin, 2004) with respect to the summary. We return this subset of utterances as oracle labels, representing their membership in the summary. We define the extracted oracle labels for the reference summary and the candidate summary as the Gold Oracle and the Candidate Oracle, denoted G and C respectively.
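To make the greedy selection concrete, here is a minimal sketch of oracle extraction, assuming the `rouge-score` package and using the mean of ROUGE-1/ROUGE-2 F1 as the selection objective; the exact ROUGE variant of our pipeline may differ, and the function names are illustrative.

```python
# Greedy oracle extraction (in the spirit of Nallapati et al., 2017).
# Assumes: pip install rouge-score. Names like `extract_oracle` are ours.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def _rouge(summary: str, utterances: list[str]) -> float:
    # Score the concatenated utterances against the summary (mean R1/R2 F1).
    scores = scorer.score(summary, " ".join(utterances))
    return (scores["rouge1"].fmeasure + scores["rouge2"].fmeasure) / 2

def extract_oracle(dialogue: list[str], summary: str) -> list[int]:
    """Greedily add the utterance that most improves ROUGE; stop at no gain."""
    selected: list[int] = []
    best = 0.0
    while True:
        candidates = [(i, _rouge(summary, [dialogue[j] for j in sorted(selected + [i])]))
                      for i in range(len(dialogue)) if i not in selected]
        if not candidates:
            break
        i, score = max(candidates, key=lambda t: t[1])
        if score <= best:  # no remaining utterance improves the score
            break
        best, selected = score, selected + [i]
    return sorted(selected)
```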
Omission Identification The goal of this step is to find the omission set O. An intuitive solution is to take the complement of the candidate oracle in the gold oracle, G − C = {u | u ∈ G, u ∉ C}. Nevertheless, this solution is imperfect because the utterances in C might still contain omitted words or phrases. For instance, in Table 1, the 15th utterance, with the phrase "I have a friend who's a psychologist", matches the key information "friend" in both the reference and the candidate, so this utterance would be included in both G and C. However, the keyword "psychologist" is actually omitted in the candidate, so the 15th utterance should be labeled as an omission. In other words, some utterances in the intersection of G and C may also be omissions.
To further discover potential omission utterances within G ∩ C = {u | u ∈ G, u ∈ C}, we empirically adopt a word-level comparison approach. Specifically, for each utterance u in G ∩ C, we extract the overlapping words W_G^u / W_C^u between u and the reference/candidate summary (words are compared case-insensitively after stemming, with stop words removed). If W_G^u ⊈ W_C^u, we deem that the corresponding utterance includes key messages that are omitted in the candidate, and it should thus be labeled as an omission. During this process, we obtain the omission words of utterance u, denoted as W^u = {w | w ∈ W_G^u, w ∉ W_C^u}.

Redundancy Removal After omission identification, we obtain the omission set O. However, some utterances in O can be redundant since they may share identical missing content. For example, for utterances u_1 and u_2, their omission words W^{u_1} and W^{u_2} can be equal, in which case the two utterances carry the same omission information. To reduce this redundancy, when multiple utterances have the same omission words, we keep only the one that appears earliest in the dialogue.
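The word-level comparison and redundancy removal can be sketched as follows. This is a simplified approximation: the toy stopword list and suffix stripping stand in for the real stopword removal and stemming, and the helper names are ours.

```python
# Omission identification + redundancy removal over the gold oracle set G.
# `gold_oracle` is assumed to be sorted in dialogue order, so keeping the
# first occurrence of a W^u implements "keep the earliest utterance".
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "i", "you"}

def keywords(text: str) -> set[str]:
    # Case-insensitive tokens, stop words removed; rstrip("s") is a crude
    # stand-in for stemming.
    return {w.lower().rstrip("s") for w in text.split()
            if w.lower() not in STOPWORDS}

def label_omissions(dialogue, reference, candidate, gold_oracle):
    omissions, seen = [], []
    for i in gold_oracle:
        w_g = keywords(dialogue[i]) & keywords(reference)   # W_G^u
        w_c = keywords(dialogue[i]) & keywords(candidate)   # W_C^u
        w_u = w_g - w_c                                     # omission words W^u
        if w_u and w_u not in seen:  # redundancy removal
            omissions.append(i)
            seen.append(w_u)
    return omissions
```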

Quality Assessment
To assess the quality of the extracted omission labels in the OLDS dataset, we conducted a human evaluation to validate the correctness of the labeled utterances. We recruited three annotators with NLP backgrounds, and each annotator was required to judge whether the set of labeled omission utterances is Accept or Reject.
The set should be marked as Reject if it misses any critical utterance (recall of labeled omissions) or includes any redundant or uninformative utterance (precision of labeled omissions). Otherwise, it should be marked as Accept. To this end, we randomly sampled 200 dialogue-candidate pairs from each domain for assessment. Table 3 reports the results of the human evaluation. The acceptance rate ranges between 91.2% and 98.5%, which validates the effectiveness of our omission extraction strategy. Furthermore, to evaluate the reliability of this assessment, we measure inter-annotator agreement by reporting Fleiss' Kappa values (Fleiss, 1971) among the annotators, as shown in Table 3. The overall Kappa score is 0.653, indicating substantial agreement between annotators. Overall, the results of the human evaluation demonstrate that our omission extraction strategy produces high-quality omission labels automatically. More details about the human evaluation can be found in Appendix A.4.
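For reference, Fleiss' Kappa over binary Accept/Reject judgments can be computed with the self-contained sketch below; the example ratings are purely illustrative, not the paper's data.

```python
# Fleiss' Kappa for n items each rated by the same number of raters.
def fleiss_kappa(ratings: list[list[int]], n_categories: int = 2) -> float:
    """ratings[i][j] = category chosen by rater j for item i."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # Per-item agreement P_i and overall category proportions p_c.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_c = [sum(row[c] for row in counts) / (n_items * n_raters)
           for c in range(n_categories)]
    p_bar, p_e = sum(p_i) / n_items, sum(p * p for p in p_c)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items, 3 raters, 1 = Accept, 0 = Reject.
print(fleiss_kappa([[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]))
```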

Dataset Format and Statistics
An example of our OLDS dataset is shown in Table 1, which contains the basic information: dialogue, reference, candidate, and omission labels. In the released version of OLDS, we further provide some auxiliary information; the detailed dataset format and a complete example are given in Appendix A.5. Table 2 shows the statistics of the OLDS dataset. The dialogues come from different domains, with different lengths and numbers of turns. Besides, the summary lengths also differ across domains, and the employed abstractive models produce candidates of varying quality. We expect our dataset to pave the way for analyzing the omission problem across different domains and diverse candidate summaries.

Understanding the Omission Problem
In this section, we explore the omission problem from multiple aspects and analyze why omission deserves attention in dialogue summarization.

Distribution of Omission Information
To explain the importance of the omission problem, we answer the following two questions.

Figure 1: The percentage of candidate summaries with omission errors. We report the results of the six adopted models on the test set of each dialogue domain.

Q1: How often do candidate summaries contain omission errors? To answer this question, we count the percentage of candidates that contain omitted information (i.e., whose omission set O ≠ ∅). Generally, a lower percentage means the model is more capable of identifying the salient information in the dialogue. Figure 1 shows the statistical results of each model on the different dialogue domains. We find that the pre-trained models always produce a lower ratio than the vanilla Transformer. Nevertheless, even with pre-trained models, the omission ratio remains high, at least 70%. The omission phenomenon is even worse on QMSum and TweetSumm, where almost 90% of the candidates have omission errors. From this perspective, we conclude that omission is a general and severe problem in dialogue summarization, and alleviating it remains intractable.
Q2: How is the omitted information distributed in the dialogue? To answer this question, we investigate the position distribution of omissions in dialogues. As shown in Figure 2, the omitted utterances are scattered across all positions of the dialogue, regardless of its length and domain. This position distribution also indicates that dialogues are unstructured, and precisely identifying the dispersed key information remains difficult for current models.
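Both statistics in this subsection are straightforward to compute; the sketch below assumes each example records its omission labels and utterance count under hypothetical field names.

```python
# Percentage of candidates with omission errors, plus the relative-position
# histogram (deciles) of omitted utterances. Field names are illustrative.
def omission_stats(examples):
    with_omission = sum(bool(ex["omission_labels"]) for ex in examples) / len(examples)
    buckets = [0] * 10
    for ex in examples:
        for i in ex["omission_labels"]:
            buckets[min(9, int(10 * i / ex["n_utterances"]))] += 1
    total = sum(buckets) or 1
    return with_omission, [b / total for b in buckets]
```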

Correlation with Reference-Based Metrics
Since omission is defined by the difference between references and candidates, we investigate the correlation between the amount of omitted content and a variety of reference-based metrics, to verify whether the omission rate of a candidate summary affects these metrics. Here, we calculate the omission rate as

Omission Rate = |⋃_{u∈O} W^u| / |⋃_{u∈G} W_G^u|,

where W^u and W_G^u denote the set of omitted words and the set of gold oracle words shared between u and the reference, respectively. The omission rate directly measures the amount of key information omitted by a summary, and a lower rate indicates a candidate of higher quality. Most of the metrics correlate strongly with the amount of omitted content; by contrast, BLEU shows the least correlation because it is a precision-oriented metric. These empirical analyses indicate that the omission rate is strongly correlated with a wide range of evaluation metrics, so mitigating the omission problem is one of the most important priorities for improving the quality of dialogue summaries.
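Under these definitions, the omission rate reduces to a ratio of word-set sizes; the sketch below assumes the per-utterance word sets are available from the labeling pipeline, keyed by utterance index.

```python
# Omission rate = |union of W^u| / |union of W_G^u|; the union-based
# aggregation mirrors the formula above (a reconstruction, see text).
def omission_rate(omitted_words: dict, gold_words: dict) -> float:
    omitted = set().union(*omitted_words.values()) if omitted_words else set()
    gold = set().union(*gold_words.values()) if gold_words else set()
    return len(omitted) / len(gold) if gold else 0.0
```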

Omission-based Summary Refinement
The above analyses demonstrate the importance of omission information, so we raise another question: what happens if we utilize the omissions to refine the summary? We adopt a post-editing method to investigate the potential of using omissions. Specifically, we formulate summary refinement as a seq2seq task that predicts the gold summary. Instead of inputting the raw dialogue, we use the concatenation of the candidate summary, the omission utterances, and the non-omission utterances as the input: "Candidate <sep> Omission <sep> Non-Omission". By dividing the dialogue utterances into omission and non-omission groups, the model can distinguish the omitted information while still perceiving the whole dialogue. If the omission group is empty, this reduces to using the candidate and the raw dialogue for refinement, which we consider as the baseline for comparison.

We use BART large and T5 small as the backbone models, and the results are shown in Figure 3. Performance is significantly enhanced by refinement with omissions compared to refinement with raw dialogues. Notably, on some datasets such as SAMSum and DialogSum, T5 small with supplementary omission information even outperforms the raw BART large, which indicates that omission-based refinement is a promising direction for quality improvement in dialogue summarization.

Figure 3: Raw means the results of raw candidates. +Dial. and +Omit. mean using the raw dialogue or the omissions as supplementary information for refinement.

In addition, Figure 3 shows an upper bound on the performance boost from post-editing, because we directly employ the gold omission utterances. In real situations, however, we may identify some incorrect omissions. To further explore the impact of wrong omissions on the post-editing results, we investigate three perturbations that gradually inject errors into the omission group: 1) we keep the precision at 1 and decrease the recall by moving utterances from the omission group to the non-omission group; 2) we keep the recall at 1 and decrease the precision by moving utterances from the non-omission group to the omission group; 3) we gradually exchange utterances between the two groups until they are fully swapped, so that both precision and recall decrease from 1 to 0. Figure 4 depicts the trend of performance degradation as the error rate increases. From the curves, we find that precision is relatively more important: the refinement model is robust to the first type of perturbation but sensitive to the addition of wrong omissions.
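A minimal sketch of the refinement input construction and the three perturbations follows; `<sep>` stands in for the backbone's actual separator token, and the function names are ours.

```python
import random

def build_refine_input(candidate: str, dialogue: list[str], omission_ids: set[int]) -> str:
    # "Candidate <sep> Omission <sep> Non-Omission"
    omitted = " ".join(dialogue[i] for i in sorted(omission_ids))
    kept = " ".join(u for i, u in enumerate(dialogue) if i not in omission_ids)
    return f"{candidate} <sep> {omitted} <sep> {kept}"

def perturb(omission: list[int], non_omission: list[int], mode: str, k: int):
    """Inject k errors: 'drop' lowers recall, 'add' lowers precision,
    'swap' exchanges utterances between the groups, lowering both."""
    om, no = list(omission), list(non_omission)
    for _ in range(k):
        if mode == "drop" and om:
            no.append(om.pop(random.randrange(len(om))))
        elif mode == "add" and no:
            om.append(no.pop(random.randrange(len(no))))
        elif mode == "swap" and om and no:
            i, j = random.randrange(len(om)), random.randrange(len(no))
            om[i], no[j] = no[j], om[i]
    return om, no
```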

The Omission Detection Task
Since candidate summaries can be effectively improved given the gold omission information, accurately detecting omission utterances in the dialogue naturally becomes a critical question. In this section, we formulate the omission detection task in a reference-agnostic setting. Formally, given a dialogue D = {u_1, u_2, ..., u_N} along with a candidate summary c, a detection model is required to extract a set of omission utterances O from D without knowing the reference summary. We then introduce three typical frameworks as baselines and conduct evaluations to see how well this task can be addressed.

Model Settings
To build a foundation for the omission detection task and explore which model architectures benefit the task, we investigate three frameworks as baselines, which differ in input format and structure. Their implementation and training details can be found in Appendix B.1.
Pair-wise Classification A straightforward way is to model this task as an utterance-level classification problem. The input pattern for this paradigm is <s> c </s> u_i </s>, where <s> and </s> denote the classification token and the separation token, respectively; c is the candidate summary and u_i is the i-th utterance in the dialogue. The model performs binary classification for each candidate-utterance pair, predicting y ∈ {0, 1}, where y = 1 indicates that the utterance is identified as an omission.
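A sketch of this formulation with a Hugging Face encoder is shown below; the checkpoint choice is illustrative and the classification head is untrained here, so in practice the model would first be fine-tuned on the pair-wise labels.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

def detect_pairwise(candidate: str, dialogue: list[str]) -> list[int]:
    # Encode each (candidate, utterance) pair as <s> c </s> u_i </s>.
    batch = tok([candidate] * len(dialogue), dialogue,
                truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**batch).logits.argmax(dim=-1)
    return [i for i, y in enumerate(preds.tolist()) if y == 1]
```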
Sequence Labeling Inspired by BERTSum (Liu and Lapata, 2019), which formulates extractive summarization as a sentence-level sequence labeling problem, we employ a similar strategy that assigns each utterance a label y_i ∈ {0, 1} indicating whether the utterance is an omission. We prepend the candidate summary to the dialogue, as <s> c </s> <s> u_1 </s> <s> u_2 </s> ... <s> u_N </s>. The last hidden layer of each <s> token is used as the utterance representation for classification.
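The input construction and the gathering of <s> representations can be sketched as follows, assuming a RoBERTa backbone; the linear head here is an untrained stand-in for the classifier learned during training.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
head = torch.nn.Linear(encoder.config.hidden_size, 2)  # stand-in classifier

def label_utterances(candidate: str, dialogue: list[str]) -> torch.Tensor:
    text = f"<s> {candidate} </s> " + " ".join(f"<s> {u} </s>" for u in dialogue)
    enc = tok(text, add_special_tokens=False, truncation=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]
    # Positions of <s> tokens; skip the first, which precedes the candidate.
    s_pos = (enc["input_ids"][0] == tok.convert_tokens_to_ids("<s>")).nonzero(as_tuple=True)[0][1:]
    return head(hidden[s_pos]).argmax(dim=-1)  # one {0,1} label per utterance
```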
Pointer Network The pointer network selects omission utterances recurrently with a glimpse operation over the utterance representations; our implementation follows the Transformer-based extractive decoder of Zou et al. (2021b) (see Appendix B.1 for details).

Evaluation Metrics
We use the standard Precision (P), Recall (R), and F1-score (F1) at the utterance level to evaluate omission detection models. Furthermore, we measure word-level omission recall (WR) as the percentage of gold omission words that are covered by the detected utterances:

WR = #(gold omission words appearing in the detected utterances) / #(gold omission words),

where # denotes the counted number. The closer the word-level omission recall is to 1, the more omission information is collected by the detection model.

Table 5 presents the experimental results on OLDS. All detection models are trained separately on the five domains. For each omission detection framework, we employ BERT base and RoBERTa base as the backbone models to extract text features. Among the three frameworks, pair-wise classification performs worst in most cases since it does not consider the contextual information of the dialogue. Meanwhile, sequence labeling is on par with the pointer network, which indicates that dialogue context is a crucial factor for detecting the omitted content. However, although omission detection models only need to decide whether a given utterance is an omission, the task is still very challenging. In Table 5, the best F1-score is around 50% in all five domains, while the recall of omission words in the extracted utterances (WR) is around 60%. Besides, models on QMSum achieve at most an F1-score of 41.35, which we attribute to the longer dialogues in QMSum (over 1K tokens; see Table 2). Intuitively, summarizers produce candidates that have already picked the low-hanging fruit, and the remaining omitted information is a tough nut to crack. In other words, some salient information omitted by the summarizer is still difficult for detection models to capture.
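The utterance-level metrics and WR can be computed as in the sketch below; the word matching here is a simplified lowercase substring check rather than the stemmed, stopword-filtered matching used during labeling.

```python
def detection_metrics(pred_ids, gold_ids, dialogue, gold_omission_words):
    pred, gold = set(pred_ids), set(gold_ids)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    # WR: fraction of gold omission words covered by the detected utterances.
    detected = " ".join(dialogue[i].lower() for i in pred)
    hits = sum(w.lower() in detected for w in gold_omission_words)
    wr = hits / len(gold_omission_words) if gold_omission_words else 0.0
    return p, r, f1, wr
```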

Analysis and Discussion
To understand what factors may affect the performance of the detection model, we conduct the following explanatory experiments.
Label Imbalance We first calculate the percentage of omission utterances against non-omission utterances in the five domains to investigate whether a label imbalance problem exists in the dataset. Figure 5 shows that the proportion of positive labels is always smaller than 25%, which indicates that label imbalance is a common problem in omission datasets. Besides, we observe that the degree of label imbalance is consistent with the performance of the detection models in Table 5. For example, the models achieve nearly 50% F1-score on EmailSum and TweetSumm, which have ratios of 25% and 23% omission utterances, whereas on QMSum, where the omission proportion is only 8%, the detection models achieve only a 40% F1-score. Hence, alleviating label imbalance is critical for omission detection, and we leave it as future work.
Candidate Quality Furthermore, we evaluate the performance of detection models on the candidates produced by different abstractive summarizers to investigate whether candidate quality influences detection. The results are shown in Table 6: the performance of omission detection is negatively correlated with the performance of the summarizers. For instance, BART L and Pegasus L produce candidates of higher quality, yet the detection model has difficulty finding their omissions. On the contrary, Transformer produces relatively low-quality candidates, on which the detection model obtains better results (i.e., a 55.94% F1-score). This indicates that capturing the remaining omissions of high-quality candidates is difficult, and addressing this issue is also valuable.

Cross-Domain Results
In addition, we conduct a cross-domain evaluation to investigate domain gaps and the generalizability of detection models. From Table 7, we conclude that there are obvious differences between the five domains. For example, models trained on the other domains perform poorly when tested directly on QMSum. Among the five domains, the gap between SAMSum and DialogSum is relatively small, given their similar performances across domains. We also find that the model trained on the large SAMSum dataset generalizes better to other domains, even achieving the best results on the small datasets DialogSum and EmailSum.

Future Research Opportunities
From the results in Table 5, we observe that omission detection is a challenging task. Hence, we summarize some research directions as follows:
• One direction is to develop more advanced models for omission detection. Based on the analysis in Section 3.3, the precision of the omission detection results deserves particular attention, because high precision of the detected omissions brings the most benefit to the refinement model. An ideal detection model could also serve as a model-based metric for reference-free summary evaluation. Besides, the detected omissions could be used to improve the results of summarization.
• Another research direction is to develop a refinement model for summary improvement using the detected omissions. In this paper, we briefly touch on this by introducing a post-editing approach in Section 3.3. The approach is straightforward, and the whole summarization procedure becomes a summarize-then-refine pipeline. However, the results show that the model is sensitive to wrong omissions. Hence, how to design a robust refinement model is also noteworthy.

Omission in Text Generation Tasks
Omission is a common error in machine translation (MT) (Russell, 1999; Sharma, 2015; Yang et al., 2019) and automatic speech recognition (ASR) tasks (Weng et al., 2020), which usually denotes the missing source information in the generated sequences. Although both summarization and MT/ASR belong to generation tasks, the definitions of omission error are different among these tasks.
In MT/ASR tasks, the tokens between source and target sequences are usually well aligned, which means each token in the target sequence can locate its corresponding content in the source sequence. Due to such characteristics, previous works (Tu et al., 2016) in MT/ASR tasks usually adopted coverage mechanisms to eliminate the influence of omission error. Nevertheless, the source sequences in summarization tasks usually include abundant redundant and useless information, especially in dialogue scenarios, which makes omission a more serious problem in summarization-like tasks.

Conclusion
In this work, we systematically study the omission problem in dialogue summarization based on the curated OLDS dataset, which collects candidate summaries from multiple models and domains and provides high-quality omission labels for them. We discover that omission is a significant problem that directly affects the results of dialogue summarization, and the defective candidate summary could be largely improved by leveraging the omission information properly. We further introduce an omission detection task to identify omission content, which is a challenging and valuable task that paves the way to omission mitigation and summary improvement in dialogue summarization.

Limitations
The omission problem is critical in dialogue summarization, but even if it were solved, we still could not guarantee that a candidate is appropriate, because it might contain hallucinated content that is not present in the source dialogue. Previous works (Tang et al., 2022; Maynez et al., 2020) also concluded that factual inconsistency is a critical problem in dialogue summarization and is not easy to detect. How to mitigate the omission problem while avoiding the introduction of new errors is not discussed in this paper, and we hope to address this issue in future work.

A Details of the OLDS Dataset

A.1 Example of Automatic Omission Labeling

Figure 6 shows an example of the complete process of automatic omission labeling, which consists of three steps: oracle extraction, omission identification, and redundancy removal. For oracle extraction, we greedily select utterances from the dialogue to maximize the ROUGE score with respect to the summary, and take this subset of utterances as oracle labels, representing their membership in the summary. In this example, the oracle labels for the reference (Gold Oracles) are the utterance set {0, 2, 5, 6, 9, 12, 13, 14, 16, 19}, and the oracle labels for the candidate (Candidate Oracles) are {0, 7, 12, 14, 16, 19}. In the process of omission identification, we traverse the utterances in the Gold Oracles and extract W_G^u, the set of overlapping words between u and the reference. For instance, in the 14th utterance, "soon, Hector, Ashley" are the keywords appearing in the reference. Similarly, we extract W_C^u, which contains the overlapping words between u and the candidate summary, for u ∈ Gold Oracles. Then, by comparing W_G^u and W_C^u, we obtain the omission words W^u = {w | w ∈ W_G^u, w ∉ W_C^u}. For any utterance u with W^u ≠ ∅, we label it as an omission utterance. In the example of Figure 6, the 14th utterance contains the keywords "soon, Ashley", which are omitted by the candidate, so it is labeled as an omission.

Finally, we conduct redundancy removal to discard redundant omission utterances. In Figure 6, the 2nd, 5th, and 19th utterances have omission words W^u identical to those of other omission utterances. Hence, we remove these utterances, and the final omission labels are {0, 9, 14, 16}.

A.2 Dialogue Domains
We build the OLDS dataset upon five existing dialogue summarization datasets that cover different domains, which are described as follows: SAMSum It is the first high-quality online chat summarization corpus (Gliwa et al., 2019), which contains about 16k simulated conversations created by linguists with corresponding summaries.
DialogSum It is a summarization dataset (Chen et al., 2021) with 13.5k real-life scenario dialogues, which are face-to-face spoken dialogues that cover a wide range of daily-life topics.
EmailSum It is an email thread summarization dataset (Zhang et al., 2021) that consists of 2,549 email threads along with annotated summaries. The dataset has two types of summaries, short summary (<30 words) and long summary (<100 words). Here, we use the short version as references because they are more abstractive and challenging.
QMSum It is a query-based multi-domain meeting summarization benchmark (Zhong et al., 2021) that contains 1,808 query-summary pairs over 232 meetings. We concatenate queries with their corresponding text spans as the input dialogues.
TweetSumm It is a dataset focused on customer service conversations (Feigenblat et al., 2021), which contains 1,100 dialogues, each accompanied by 3 extractive and 3 abstractive summaries. We use the longest abstractive summary as the gold reference.

A.3 Candidate Generation
We use 6 abstractive models to generate candidates for the dialogues in OLDS, including BART large/base , T5 base/small , vanilla Transformer, and Pegasus large . Pegasus large is only used to generate candidates for dialogues in evaluation sets.
To obtain the candidate summaries in training sets, we train the summarization models by adopting a 10-fold cross-validation approach, and each model generates 10 candidates for each dialogue in the validation fold via different configurations of beam search and sampling. As a result, we can obtain 50 candidates (5 models × 10 inferences) for each dialogue in the training set. To ensure the diversity of the generated candidates, we further calculate the average Levenshtein distance (Levenshtein, 1965) for each candidate and pick out 10 candidates with the largest scores. Specifically, we combine these candidates in pairs (a total of 50 × 50 = 2,500 pairs) and calculate the Levenshtein distance between them. Then, for each candidate, we average the distance results against the other 49 candidates to obtain the average Levenshtein distance. Finally, we rank these candidates based on the scores in descending order and pick out the top 10 candidates. As a result, we have 10 diverse candidates for each dialogue in the training sets.
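The diversity filter can be sketched as follows; the Levenshtein implementation is the standard dynamic program, and the function names are ours.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) edit-distance DP with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pick_diverse(candidates: list[str], k: int = 10) -> list[str]:
    # Rank candidates by average distance to all others; keep the top k.
    n = len(candidates)
    avg = [sum(levenshtein(c, o) for o in candidates) / (n - 1) for c in candidates]
    order = sorted(range(n), key=lambda i: avg[i], reverse=True)
    return [candidates[i] for i in order[:k]]
```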
For the evaluation set of OLDS, we train the aforementioned 6 models on the training set of each domain to produce candidate summaries. Each summarization model produces 2 candidates, which are decoded by beam search (beam size = 5) and sampling, respectively. Hence, we totally have 12 candidates for each dialogue in evaluation sets.
The training and inference process was conducted with the official code of the pre-trained language models. All experiments were run on one node with 4 32GB V100 GPUs. The learning rate is set to 5e-5 for the pre-trained models and 1e-4 for the Transformer. The pre-trained models are fine-tuned for 3 epochs, while the vanilla Transformer is trained for 20 epochs. For SAMSum, the maximum source and target lengths are 512 and 90; for DialogSum, EmailSum, QMSum, and TweetSumm, these settings are 512/150, 1,024/65, 2,048/200, and 1,024/120, respectively. The other hyper-parameters are set to their defaults.

A.4 Details of Quality Assessment
Time Budget We recruited three annotators to conduct the quality assessment for OLDS. The total number of judgments is 3,000 (5 domains × 200 samples × 3 annotators). The annotation speed is 25 samples per hour, so the total workload is 120 hours (1,000 / 25 × 3 = 120).
Instructions Each annotator was presented with a sample containing the dialogue, reference summary, candidate summary, gold oracles, candidate oracles, and the labeled omission utterances along with their corresponding omitted words. We instructed the annotators to make a binary choice of whether the set of labeled omission utterances is Accept or Reject. Annotators should compare the candidate with the reference and find the omissions. Then, they should locate the omissions in the original dialogue and record the corresponding utterances. Finally, they should compare the automatically labeled utterances with the recorded ones and make a judgment. The set of labeled omission utterances should be marked as Reject as long as it misses any critical utterance, or includes any redundant or uninformative utterance. Otherwise, it should be marked as Accept. To ensure that each choice is justified, we additionally asked annotators to perform corrections and provide the corrected omitted words when the choice is Reject, so that we could verify why a labeled omission was rejected.

A.5 Data Format
To facilitate the community to explore the effect of possible elements on the omission problem, in the released version of OLDS, we additionally provide some auxiliary information. Specifically, apart from the basic information of dialogue, reference summary, candidate summary, and omission labels, we further provide the intermediate information during labeling, including Gold Oracles, Candidate Oracles, omission words, and the source model and decoding strategy for each candidate summary, e.g., 'bart_base, beam', which represents that the candidate is generated by BART base using beam search. A complete example is shown in Table 9.
A.6 More Results of Candidate Summaries

B.1 Implementation Details
We use BERT base and RoBERTa base as the backbone pre-trained encoders for the three frameworks. All experiments were conducted on one node with a single A100 80GB GPU. For all three frameworks, the learning rate is set to 5e-5 and the number of training epochs to 5. The batch size is 128 for pair-wise classification and 16 for sequence labeling and the pointer network. We saved checkpoints after each epoch, and the best-performing checkpoint on the validation set was evaluated on the test set to report the final results.
Pair-wise Classification For the framework of pair-wise classification, we use the official classification code of the pre-trained language models. The input format is <s> c </s> u_i </s>, where <s> and </s> are the classification token and separation token, respectively. c and u_i represent the candidate and the i-th utterance in the dialogue.
Sequence Labeling We use the same implementation as the extractive summarization model of Liu and Lapata (2019). The only difference is that we prepend the candidate summary to the dialogue, denoted as <s> c </s> <s> u_1 </s> <s> u_2 </s> ... <s> u_N </s>. The <s> token before the candidate summary is not involved in the calculation. For SAMSum, we truncate each input to a maximum length of 512, while for DialogSum, EmailSum, QMSum, and TweetSumm, this setting is 512, 1,024, 2,048, and 1,024, respectively.
Pointer Network The autoregressive decoder of our pointer network is implemented by a Transformer decoder, which is proposed by Zou et al. (2021b) and was previously used for extractive summarization. Here, we also append the candidate summary in front of the dialogue, which has the same input format as in sequence labeling. The <s> token before the candidate summary is not involved in the calculation. We also set the same maximum length as in sequence labeling for input sequences in different domains.
Figure 6: An example of the complete process of automatic omission labeling, sampled from the training set of SAMSum. Step 1 (Oracle Extraction) produces the Gold Oracles {0, 2, 5, 6, 9, 12, 13, 14, 16, 19} and the Candidate Oracles {0, 7, 12, 14, 16, 19}; Step 2 performs Omission Identification and Step 3 performs Redundancy Removal. W_G^u is the word set containing all overlapping words between u and the reference summary; similarly, W_C^u contains the overlapping words between u and the candidate summary. W^u is the set of omission words.