CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during human evaluation. In this work, we first devise a typology of factual errors to better understand the types of hallucinations generated by current models and conduct a human evaluation on a popular dialogue summarization dataset. We further propose a training strategy, called CONFIT, that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning. To tackle the top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics (ROUGE and BARTScore) and human evaluation.


Introduction
Text summarization is used to generate a concise and accurate summary of a long text while focusing on the sections that convey the most useful information (Gurevych and Strube, 2004). In recent years, dialogue summarization has attracted significant research attention (McCowan et al., 2005; Gliwa et al., 2019; Koay et al., 2020; Zhong et al., 2021; Zhu et al., 2021; Chen et al., 2021c; Fabbri et al., 2021; Chen et al., 2021d). The goal of dialogue summarization is to condense the conversational input into a few brief sentences that cover the salient information (McCowan et al., 2005; Yuan and Yu, 2020). Significant progress has been made recently on abstractive dialogue summarization with various pre-trained models. However, such pre-trained models are susceptible to generating hallucinated content that is not supported by the source documents (Cao et al., 2018; Maynez et al., 2020; Kryscinski et al., 2020). To tackle the issue of factual inconsistency in dialogue summarization, recent works correctly encode the names of speakers (Zhu et al., 2020), explicitly incorporate coreference information, and order the personal named entities (Liu and Chen, 2021). However, it remains challenging to improve the quality of summaries generated by different models and reduce hallucination at the same time.
To better understand the types of hallucinations generated by the pre-trained models, we devised a linguistically motivated taxonomy of factual errors for dialogue summarization, instead of simply classifying the summary as faithful or not. Based on our typology, we defined an annotation protocol for factuality evaluation of dialogue summarization. We then conducted a human evaluation of several pre-trained abstractive summarizers, including BART, Pegasus (Zhang et al., 2020), and T5 (Raffel et al., 2020), aiming at identifying the proportion of different types of factual errors and studying the weaknesses of the pre-trained models. Our typology and annotation help us gain deeper insights into the causes of factual inconsistency. Unlike news summarization (Pagnoni et al., 2021), we found that the challenges posed by dialogue summarization are more related to dialogue flow modeling, informal interactions between speakers, and complex coreference resolution. Figure 1 shows a dialogue-summary pair with three specific errors.
In order to tackle the top factual errors produced by existing models, we propose to replace the most commonly used fine-tuning with a linguistically-informed contrastive fine-tuning approach. For example, the reason for producing wrong reference errors is that models cannot understand the roles in the dialogue, which goes beyond the events. Our goal is to drive the model to pay attention to the grounds of specific errors during fine-tuning, and learn how to reduce the generation of such errors. To be more specific, CONFIT learns to distinguish whether there are factual errors in the summaries and capture the key information in the dialogue content, such as numbers and person names. Experiments on SAMSum (Gliwa et al., 2019) and AMI (McCowan et al., 2005) show the generalizability of CONFIT when it is applied to different pre-trained models and datasets. Furthermore, we employ both automatic evaluation and human evaluation on faithfulness and show that CONFIT significantly reduces all different factual errors and generates summaries that are more factually consistent. Moreover, we analytically find that optimizing the contrastive fine-tuning is quite beneficial for improving the robustness of models, which brings further benefits.

[Figure 1: Sample summary of a SAMSum dialogue (Gliwa et al., 2019). The summary is generated by BART. Errors are highlighted: (a) Coreference Error, (b) Modality & Tense Error, (c) Missing Information.
Reference: Hannah needs Betty's number but Amanda doesn't have it. Amanda needs to contact Larry.
Generated Summary: Amanda can't find Betty's number. Larry called her last time they were at the park. Amanda will text Larry.]
Our contributions are as follows:
• We introduce the first typology of factual errors for dialogue summarization and use it to conduct comprehensive annotation and focused analysis.
• Targeting different categories of factual errors in the annotations, we reduce the occurrence of such errors generated by various pre-trained models with CONFIT, a novel linguistically-informed contrastive fine-tuning approach.
• We validate our method on a widely used dialogue summarization corpus, SAMSum, and extend it to a meeting summarization corpus, AMI. Evaluations of output summaries with automatic metrics such as ROUGE and BARTScore, as well as human evaluations, show that CONFIT outperforms baseline pre-trained models.

New Taxonomy of Factuality Errors for Abstractive Dialogue Summarization
In order to gain deeper insights into the types of factuality errors introduced by different abstractive dialogue summarization systems, we propose a new taxonomy of factuality errors for abstractive dialogue summarization. The taxonomy is based on our empirical experiments with, and annotations of, a set of representative baseline summarization models on the SAMSum dataset, which is a widely used large-scale dialogue summarization dataset of chat message dialogues in English (see Section 4.1). Specifically, we generate summaries of SAMSum dialogues using state-of-the-art abstractive dialogue summarization models, including models fine-tuned from T5 (Raffel et al., 2020), Pegasus (Zhang et al., 2020), BART, D-HGN (Xiachong et al., 2021), and S-BART (Chen and Yang, 2021b). We then manually annotate all the different types of errors in these generated summaries that are inconsistent with the source dialogue, compute detailed statistics of these factuality errors, and classify them into different categories. Based on our annotation and analysis, we propose a new taxonomy of errors, the majority of which are factuality errors, comprising the following 8 error types:

Category 1 - Missing Information: The content of the generated summary is incomplete compared to the reference. Example:
[Reference Summary] Williams invites Ms. Blair for a coffee. They will go to her favourite coffee place near the square in a side alley at 2 p.m.
[Model-Generated Summary] Ms. Blair is going to a coffee place near the square in a side alley.

Category 2 - Redundant Information: There is redundant content in the generated summary compared to the reference. Example:
[Reference Summary] Paula helped Charlotte with correct pronunciation of "Natal Lily."
[Model-Generated Summary] Charlotte asks Paula how to pronounce the name of the plant "Natal Lily." Paula confirms that the stress on the second syllable is 2nd.

Category 3 - Circumstantial Error: Circumstantial information (e.g., date, time, location) about the predicate doesn't match the reference. Example:
[Reference Summary] The USA was founded in 1776.
[Model-Generated Summary] The USA was founded in 1767.

Category 4 - Wrong Reference Error: A pronoun has an incorrect or nonexistent antecedent, or a personal named entity in the generated summary takes the place of a different personal entity in the reference. Example:
[Model-Generated Summary] Tara raised her spoon.

Category 7 - Tense Error: This encompasses factual errors resulting from discrepancies in grammatical tense between the generated summary and the reference. Example:
[Reference Summary] The children will go to the library.
[Model-Generated Summary] The children went to the library.

Category 8 - Modality Error: This includes factual errors resulting from modal discrepancies between the generated summary and the reference, such as getting words like "may", "should", or "could" wrong. Example:
[Reference Summary] School may be cancelled today.
[Model-Generated Summary] School is cancelled today.

Annotation and Analysis
Using our proposed taxonomy of factuality errors, we compute the proportion of each type of factuality error across different summarization models. We then investigate the model generation behavior that is indicative of errors, which guides the design of our proposed model. We performed a human evaluation of four model outputs from 19 SAMSum dialogues in order to identify the limitations of abstractive summarization models in dialogue summarization tasks. The four models used in this human evaluation are two BART models with different random seeds, S-BART, and D-HGN. BART and S-BART are pre-trained models (PLM), and D-HGN is trained from scratch. Since we are focusing on the dialogue domain, most of the factual errors in the model summaries are related to coreference, anaphora, and other dialogue-specific characteristics. In fact, approximately 45% of all errors fall into the categories of Missing Information and Wrong Reference. The distribution of these errors across the existing models reveals the limitations of each model. Our proposed CONFIT model targets the top errors generated by the current state-of-the-art models to reduce factual inconsistency.

CONFIT Model
Standard fine-tuning optimizes the probability $p_\alpha$ of the generator on a task-specific labeled dataset by minimizing the cross-entropy loss.
However, the cross-entropy loss has several shortcomings that can lead to factual inconsistency in dialogue summarization due to its sub-optimal generalization and instability. We propose a more efficient fine-tuning method for factual consistency, CONFIT, driven by the intuition that good generalization requires capturing the similarity among examples in one class and contrasting them with examples in other classes. In CONFIT, we introduce two additional losses: a contrastive loss $L_{con}$ and a self-supervised loss $L_{self}$. We use two weighting coefficients to adjust the ratio of $L_{con}$ and $L_{self}$ in the total loss of CONFIT.
The final training objective $J(\theta)$ of the proposed framework combines the cross-entropy loss with the contrastive and self-supervised losses, each scaled by its coefficient. Our linguistically-informed typology and annotation help us gain deeper insights into the causes of different factual errors. To help our models generate more faithful summaries, the proposed CONFIT learns to concentrate on the essential elements of the dialogue and capture the dynamic role information, as illustrated in Figure 3.
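As a sketch of how the three losses described above might be combined, the objective can be written as a weighted sum. The coefficient names $\lambda_1$ and $\lambda_2$ and the symbol $L_{ce}$ for the cross-entropy term are assumptions introduced here for illustration; the text only states that two weights adjust the ratio of $L_{con}$ and $L_{self}$ in the total loss.

```latex
J(\theta) = L_{ce} + \lambda_1 \, L_{con} + \lambda_2 \, L_{self}
```

Under this reading, setting $\lambda_1 = \lambda_2 = 0$ would recover standard cross-entropy fine-tuning.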

[Figure 3: Illustration of CONFIT. Panel labels: Dialogue, Classifier.
Reference Summary: Emma is about to take a nap in bus to New York. Ben and Emma will be there around 4:30. Ben will wake Emma up 15 minutes prior to their arrival.
Perturbed summaries: "Ben and Emma will be there around 1:23." / "Emma is about to take lunch in bus to Paris." / "Emma will wake Ben up 12 minutes prior to their arrival."]

Contrastive Loss
In order to reduce the occurrence of factual errors, we propose a contrastive loss that uses the following negative sample generation techniques to target each error type in our proposed taxonomy:
• Swap the nouns in the reference summary with each other randomly. This aims to reduce wrong reference and object errors by providing negative samples.
• Swap the verbs in the reference summary with each other randomly. This aims to help the model reduce circumstance (and, to a lesser extent, tense and modality) errors.
• Mask numbers and years in the dialogue and then pass it into the model to generate a negative sample summary. This aims to reduce circumstance errors.
• Randomly delete 30% of the sentences in the dialogue and then pass it into the model to generate a negative sample summary. This aims to reduce missing information errors.
• Mask-and-fill coreferent entities with BART in the dialogue and then pass it into the model to generate a negative sample summary. This aims to reduce wrong reference errors.
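Three of the perturbations listed above can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' implementation: the function names, the `<mask>` token, the number regex, and the idea of passing in pre-computed noun positions (which would come from an external part-of-speech tagger) are all assumptions. The swap is implemented as a random rotation so that the result always differs from the input when the selected tokens are distinct.

```python
import random
import re

# Assumed pattern for "numbers and years": digit runs, optionally joined by : . ,
NUMBER_PATTERN = re.compile(r"\d+(?:[:.,]\d+)*")


def mask_numbers(dialogue, mask_token="<mask>"):
    """Replace every number-like span (e.g. 4:30, 2019, 15) with a mask token."""
    return NUMBER_PATTERN.sub(mask_token, dialogue)


def drop_sentences(sentences, ratio=0.3, rng=None):
    """Randomly delete about `ratio` of the sentences, keeping the original order."""
    rng = rng or random.Random()
    n_drop = round(len(sentences) * ratio)
    dropped = set(rng.sample(range(len(sentences)), n_drop))
    return [s for i, s in enumerate(sentences) if i not in dropped]


def swap_tokens(tokens, positions, rng=None):
    """Move the tokens at `positions` (e.g. all nouns) into each other's slots."""
    rng = rng or random.Random()
    if len(positions) < 2:
        return list(tokens)
    shift = rng.randrange(1, len(positions))  # non-zero rotation
    picked = [tokens[p] for p in positions]
    rotated = picked[shift:] + picked[:shift]
    swapped = list(tokens)
    for p, tok in zip(positions, rotated):
        swapped[p] = tok
    return swapped
```

For example, swapping the two names in "Ben will wake Emma up" gives "Emma will wake Ben up". The masked or shortened dialogues would then be passed to the summarizer to produce the negative summaries; that generation step is not shown here.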
Equation 3 gives our contrastive loss function. During fine-tuning, we have the positive samples, which are the reference summaries, and another set of incorrect summaries, which are the negative samples. Contrastive objectives learn representations that are invariant to different views of positive pairs while maximizing the distance between negative pairs (Gunel et al., 2020). Our goal is to maximize the likelihoods of the positive samples and minimize the likelihoods of the negative samples. In the contrastive learning objective, $y_i$ and $y_j$ are positive summary pairs generated by back-translation, $y_k$ is drawn from the set of negative examples, and $c_i$, $c_j$, $c_k$ are their BART decoder representations.
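To make the roles of $c_i$, $c_j$, and $c_k$ concrete, the following sketch computes one common form of contrastive loss (an InfoNCE-style ratio over cosine similarities). It is an assumption for illustration and may differ from the paper's Equation 3; the temperature parameter and the use of cosine similarity are not taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(c_i, c_j, negatives, temperature=0.1):
    """-log of the positive pair's share of the total exponentiated similarity.

    c_i, c_j: representations of a positive summary pair.
    negatives: list of representations c_k of negative summaries.
    """
    pos = math.exp(cosine(c_i, c_j) / temperature)
    neg = sum(math.exp(cosine(c_i, c_k) / temperature) for c_k in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is small when the positive pair is closer than every negative and grows when a negative representation sits nearer to $c_i$ than its positive partner does.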

Self-supervised Loss
One unique challenge in abstractive dialogue summarization is the use of first-person pronouns (such as "I" or "we") in speaker utterances, which the model has to correctly identify as referring to the speaker. This can lead to wrong reference errors in the summary, as a model that cannot tell which participant is speaking cannot accurately resolve first-person references. To address this problem, we design a self-supervised loss that aims to determine whether two tokens belong to the same speaker, enabling CONFIT to capture the dynamic roles in the dialogue.
After the BART encoder, the input dialogue is encoded into hidden vectors $C$. Here, we first randomly select $k$ pairs of tokens $t_m$ and $t_n$ from the input dialogue, with labels $s_m$ and $s_n$ denoting which speaker they come from. We also do the same for utterances. Given the concatenation of the encoder representation of the dialogue, $t_m$, and $t_n$, we use the following loss function to classify whether the two tokens or two utterances are from the same speaker:

$$\log P(s_m = s_n \mid t_m, t_n, C) \qquad (4)$$

This supplementary loss function helps CONFIT keep track of speaker information, thus improving the faithfulness of its summaries for dialogues that contain several first-person references.
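The pair sampling and the same-speaker classification loss can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the classifier that produces the probabilities is not shown, and treating the objective as a mean binary cross-entropy (so that different-speaker pairs also contribute) is an assumption beyond the log-probability term in Equation 4.

```python
import math
import random


def sample_token_pairs(speakers, k, rng=None):
    """Sample k pairs of distinct token positions with a same-speaker label.

    speakers[i] is the speaker of token i; each result is (m, n, label),
    where label is 1 if both tokens come from the same speaker, else 0.
    """
    rng = rng or random.Random()
    pairs = []
    for _ in range(k):
        m, n = rng.sample(range(len(speakers)), 2)
        pairs.append((m, n, int(speakers[m] == speakers[n])))
    return pairs


def same_speaker_loss(probs_same, labels):
    """Mean negative log-likelihood of the same/different-speaker labels.

    probs_same[i] is a classifier's estimate of P(s_m = s_n | t_m, t_n, C).
    """
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs_same, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

In a full model, the probabilities would come from a small classifier applied to the encoder states of the two sampled tokens (or utterances) together with the dialogue representation $C$.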

Dataset
We evaluate our new model on the popular SAMSum dialogue summarization dataset. Then, we extend our model to meeting summarization with the AMI Meeting Corpus. SAMSum (Gliwa et al., 2019) is a recently proposed large-scale dialogue summarization dataset consisting of 16,369 chat message dialogues in English written by linguists, and each message dialogue is annotated with a multi-sentence summary written by language experts. 75% of the dialogues in the SAMSum dataset are between two interlocutors, and the other 25% are among three or more interlocutors. The AMI Corpus is another well-known dialogue summarization dataset consisting of 137 multiparty meeting transcripts extracted from 100 hours of meeting recordings. Each meeting transcript in the dataset is also annotated with a generic abstractive summary. We use these two representative dialogue summarization datasets to empirically test our new model's abstractive summarization performance in the settings of both short conversation-style dialogues and long meeting-style dialogues. See Table 2 for detailed statistics of the two datasets.

Experiment Settings
In our experiment using SAMSum, we trained BART for 3 epochs with a learning rate of 1e-05, Pegasus for 20 epochs with a learning rate of 1e-04, and T5 for 20 epochs with a learning rate of 1e-05. In our experiment using AMI, we trained BART for 6,000 steps with a learning rate of 1e-05, Pegasus for 24,000 steps with a learning rate of 1e-05, and T5 for 20,000 steps with a learning rate of 1e-05.
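For reference, the schedules above can be collected into a single lookup table. The dictionary layout and the names `FINE_TUNING_SCHEDULES` and `schedule_for` are hypothetical; only the numeric values come from the text. Settings not stated there (batch size, optimizer, loss coefficients) are deliberately left out.

```python
# Training length and learning rate per dataset and model, as stated in the text.
FINE_TUNING_SCHEDULES = {
    "SAMSum": {
        "BART": {"epochs": 3, "learning_rate": 1e-5},
        "Pegasus": {"epochs": 20, "learning_rate": 1e-4},
        "T5": {"epochs": 20, "learning_rate": 1e-5},
    },
    "AMI": {
        "BART": {"steps": 6000, "learning_rate": 1e-5},
        "Pegasus": {"steps": 24000, "learning_rate": 1e-5},
        "T5": {"steps": 20000, "learning_rate": 1e-5},
    },
}


def schedule_for(dataset, model):
    """Return the schedule dictionary for one dataset/model combination."""
    return FINE_TUNING_SCHEDULES[dataset][model]
```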

Evaluation Metrics
To evaluate our model, we use three metrics.
ROUGE (Lin, 2004): ROUGE measures n-gram overlap between the reference and the automatically generated summaries.
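The n-gram overlap that ROUGE-N measures can be illustrated with a few lines of Python. This is a simplified sketch, not the official ROUGE toolkit: it assumes lowercased whitespace tokenization and omits stemming, ROUGE-L, and multi-reference handling. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Precision, recall and F1 of clipped n-gram overlap."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter & takes per-n-gram minimum
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Applied to the Tense Error example from the taxonomy with punctuation removed ("the children will go to the library" versus "the children went to the library"), unigram recall is 5/7 even though the summary is factually wrong, which suggests why overlap metrics alone are a weak signal of faithfulness.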
BARTScore (Yuan et al., 2021): Model-based metrics have been proposed to evaluate faithfulness more precisely. BARTScore is a transformer-based measure that scores a dialogue and the corresponding automatically generated summary and has been shown to be strongly correlated with human evaluations of faithfulness (Yuan et al., 2021).

[Table 1: ROUGE scores on the AMI (McCowan et al., 2005) and SAMSum (Gliwa et al., 2019) datasets. We adopt some results reported from the literature (Feng et al., 2021a) and implement the pre-trained models for a fair comparison. All results marked with an asterisk (*) are from Feng et al. (2021b).]

[Table 2: Statistics of the two datasets. Columns: Dialogue, Speakers, Turns, Length.]
Human Evaluation: Finally, we conduct human evaluations on 100 SAMSum (Gliwa et al., 2019) and 20 AMI (McCowan et al., 2005) dialogues. Tang et al. (2021) found that Likert scales are a more consistent measure of factuality for abstractive dialogue summarization than Best-Worst Scaling. We have human evaluators directly rate the faithfulness of the summaries on a scale from 1 to 10. In addition, using the error taxonomy proposed in Section 2, we have them mark whether each error type appeared in the given summary. We do this in a blinded fashion, so that the annotators do not see which model produced each summary. Additionally, in order to prevent model information from leaking to the annotators, we randomly shuffle outputs within each dialogue before assigning them to annotators. Table 1 shows the ROUGE scores of our models, the baseline models they were fine-tuned from, and a number of other abstractive summarization models on the SAMSum and AMI datasets. Tables 5 and 6 show the average human faithfulness scores and BARTScore values, respectively, for each model's outputs on 100 SAMSum and 20 AMI dialogues.

Results
We observe that for all three pre-trained models, CONFIT significantly outperforms the baselines on ROUGE-1, ROUGE-L, and human faithfulness score on both datasets. For BARTScore, we note that, while performance increases on SAMSum for all models, it decreases on AMI. However, given that human evaluators rated the outputs of all three CONFIT models as more faithful than those of their corresponding baselines on both datasets, the decreases in BARTScore on AMI can likely be attributed to the imperfection of automated metrics at capturing faithfulness in text.

Error Analysis
Tables 3 and 4 show the percentage of summaries that were labeled with each error type in our taxonomy of factual errors (discussed in Section 2) for both the baseline and CONFIT models on the SAMSum and AMI datasets, respectively.
We observe that on SAMSum, our fine-tuning method greatly reduces missing information, redundant information, wrong reference, and circumstance errors for all models. The largest reduction is on the "wrong reference" error type (20%, 7%, and 33% for BART, Pegasus, and T5 respectively), likely owing to the self-supervised loss function introduced in Section 3.2 that was designed to help the model more effectively capture speaker information. For AMI, however, our fine-tuning method is not as consistent at reducing the frequency of each error type across models. It is possible that this is due to sample size (20 AMI dialogues vs. 100 SAMSum dialogues).
Figure 4 shows the results of human annotation on the model outputs of a selected SAMSum dialogue. Note that all of the auto-generated summaries, both baseline and CONFIT, were marked as having missing information errors by the annotator, likely due to the omission of Ernest's relief upon hearing that the car that was crashed into did not belong to Mike. As a result, none of the models achieved a perfect factuality score on this dialogue; however, the scores for each CONFIT model were higher than those of their corresponding baselines. Baseline BART outputs a summary with a circumstance error, mistakenly asserting that Mike parked his car on Ernest's street, whereas BART+CONFIT fixes this error, correctly asserting that Mike took his car to the garage today; as a result, the human annotator gave this summary a higher score than the predicted summary from baseline BART. Baseline T5 outputs a summary with two coreference errors; specifically, it contains a missing subject in the first sentence and incorrectly implies that the car that got crashed into belonged to Mike in the second sentence. T5+CONFIT is able to fix both of these errors, adding "Mike" to the beginning of the first sentence and changing "his red Honda" to "a red Honda just like Mike's" in the second sentence.
Similarly, the output of baseline Pegasus contains a coreference error in the first sentence, implying that Mike owns the car that was crashed into, while the output of Pegasus+CONFIT does not.

Related Work
Multi-party dialogues are especially challenging to summarize using automated models, given that they often contain pauses, false starts, reconfirmations, hesitations, and speaker interruptions (Sacks et al., 1978; Feng et al., 2021a; Chen and Yang, 2021a). Previous work in the field has addressed these challenges by incorporating semantic features, including keywords (Zhu et al., 2020), domain terminologies (Koay et al., 2020), topics (Liu et al., 2021a), entailment knowledge, and background knowledge (Feng et al., 2021c). Other works exploit personal named entities (Liu and Chen, 2021) and coreference information to learn to distinguish complex coreferent relationships expressed through personal pronouns (including the first person "I") in the conversation (Lei et al., 2021). Researchers have also explored conversational structure (Zhao et al., 2021), utterance flow modelling, syntactic structure (Lee et al., 2021), and granularity control (Wu et al., 2021), but they have not yet converged on a simple and practical solution.
Our proposed taxonomy of factual errors and annotations help us gain deeper insights into the causes of factual inconsistency in abstractive dialogue summarization outputs.

Conclusion
We presented CONFIT, a novel method to improve the faithfulness of abstractive dialogue summarization models via contrastive and self-supervised fine-tuning. By adapting the objective function during fine-tuning to incorporate a contrastive loss that learns to distinguish positives from examples with factual errors, and a self-supervised dialogue-specific loss that captures important dialogue information flow between multiple interlocutors, CONFIT can significantly improve the faithfulness of the abstractive summaries generated by transformer-based sequence-to-sequence language models, and reduce multiple categories of factuality errors in the abstractive summaries by large margins. In our experiments on SAMSum and AMI, we demonstrated that CONFIT achieves better empirical performance than the baseline models fine-tuned with the traditional cross-entropy loss, based on both automatic evaluation metrics and human evaluation. Our work provides new insights into improving the faithfulness of abstractive summarization systems using carefully designed novel objective functions for fine-tuning that capture important structures and features of the text to summarize.

Human Evaluation
We recruited seven volunteer participants who speak English for our error annotation. The internal annotators are Xiangru Tang, Arjun Nair, Borui Wang, Jai Desai, Aaron Wade, Anushka Nijhawan, and Dragomir Radev. These annotators are participating voluntarily and are free to opt out of the study at any point in time. We have written four scripts for use in the annotation process: (1) The first script generates an annotation spreadsheet and a key spreadsheet from the model outputs. The annotation spreadsheet does not contain the model names; however, it contains an id that can be used to recover the model name from the key spreadsheet. For ease of annotation, summaries from the same dialogue are grouped together; however, they are randomly shuffled within each dialogue so that the annotators cannot guess from the ordering which model is which. (2) The second script splits an annotation spreadsheet into multiple spreadsheets so that the work can be distributed amongst annotators. (3) The third script merges these spreadsheets back together after the annotation process is finished. (4) The last script recovers the model names from the key spreadsheet and inserts them into the annotation spreadsheet. Each evaluator is asked to examine the full context (dialogue, generated summaries, and reference), mark whether there is an error, and give a score on a scale of 1 to 10 for each of the criteria. We only consider faithfulness, rather than general quality. For example, 1: very poor; 3: poor; 5: neutral; 7: good; 10: very good. We asked each internal annotator to evaluate 300 samples.
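The blinding performed by the first and last scripts can be sketched as follows. This is an illustrative reconstruction, not the authors' scripts: the function names, the nested-dictionary input format, and the row fields are assumptions, and spreadsheets are represented as lists of dictionaries rather than files.

```python
import random


def build_blinded_sheets(outputs, seed=0):
    """Create annotation rows (no model names) and key rows (id -> model).

    outputs maps dialogue_id -> {model_name: summary}. Summaries stay grouped
    by dialogue but are shuffled within each dialogue, so row order and ids
    reveal nothing about which model produced which summary.
    """
    rng = random.Random(seed)
    annotation_rows, key_rows = [], []
    next_id = 0
    for dialogue_id, by_model in outputs.items():
        items = list(by_model.items())
        rng.shuffle(items)
        for model_name, summary in items:
            annotation_rows.append(
                {"id": next_id, "dialogue_id": dialogue_id, "summary": summary}
            )
            key_rows.append({"id": next_id, "model": model_name})
            next_id += 1
    return annotation_rows, key_rows


def recover_model_names(annotation_rows, key_rows):
    """Join the key back onto the finished annotations via the shared id."""
    id_to_model = {row["id"]: row["model"] for row in key_rows}
    return [dict(row, model=id_to_model[row["id"]]) for row in annotation_rows]
```

Keeping the key in a separate structure means the annotation rows can be split across annotators and merged again without ever exposing the model names.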
Other Ethical Issues
(1) We did not use any personally identifiable information in the experiments.
(2) The goal of the project, improving the faithfulness of automatically generated summaries, is to make the output of the summarization system more reliable and minimize confusion for the readers of the summaries.
(3) We used existing summarization datasets that do not contain any sensitive information and are unlikely to cause any harm to the annotators.