Analyzing and Evaluating Faithfulness in Dialogue Summarization

Dialogue summarization is abstractive in nature, which makes it prone to factual errors. The factual correctness of summaries has the highest priority before practical applications. Many efforts have been made to improve faithfulness in text summarization. However, there is a lack of systematic study on dialogue summarization systems. In this work, we first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with respect to the source dialogues. Furthermore, we present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations. Experimental results show that our evaluation schema is a strong proxy for the factual correctness of summarization models. The human-annotated faithfulness samples and the evaluation toolkit are released to facilitate future research toward faithful dialogue summarization.


Dialogue:
Freddie: Nanna, are you coming to visit us soon?
Winnie: Oh darling, Nanna has broken her leg, you'll have to visit me instead.
Freddie: I forgott. Well come soon.
Winnie: Good, ask Mummy and Daddy and they will come when they can.
Freddie: Yes love you. Leg better soon?
Winnie: Yes, quite soon. Tell mummy to ring me. Bye darling xxxxx

Summaries:
Human: Winnie has broken her leg and will not visit any time soon. Freddie will ask mummy to call Winnie up. ✓
BART: Nanna has broken her leg, so Freddie will have to visit her instead. Nanna will get better soon. ✓
MV-BART: Nanna has broken her leg and Freddie will have to visit Winnie instead. Mummy and Daddy will come to visit them soon. ✗
Coref-BART: Freddie wants to visit Winnie, but Nanna has broken her leg, so he will have to visit her instead. Mummy and Daddy will come when they can. ✗
CondigSum-BART: Winnie's Nanna has broken her leg and Freddie will have to visit her instead. ✗

Table 1: A real example from the SAMSum dataset. Spans of factual errors are marked with underline.
As an essential way of exchanging information, conversations usually involve multiple participants, informal language usage, repetition, and negations (Sacks et al., 1978; Chen and Yang, 2020). Therefore, dialogue summarization is vulnerable to factual issues due to its abstractive nature. Table 1 gives an example of factually incorrect dialogue summaries. The problem of factual correctness is broadly studied for text summarization in news and article domains (Nallapati et al., 2016; Narayan et al., 2018). The progress is primarily because of the availability of factually annotated data at both summary and token levels (Kryscinski et al., 2020; Wang et al., 2020; Pagnoni et al., 2021; Cao et al., 2022). Many methods have been proposed to evaluate and reduce factual errors in the generated summaries. However, due to the interactive nature of dialogues, we cannot simply transfer these methods to dialogue summarization.
In this work, we first categorize the most frequently occurring factual errors for dialogue summarization into six types. Then, we collect fine-grained factual annotations for human references and the outputs of four recent dialogue summarization systems (§3). At least two annotators are involved, and a verification process is incorporated to ensure the annotation quality. Our study on human-annotated data suggests that over 35% of the generated dialogue summaries contain at least one factual error. Similar observations have been made in the news summarization domain, where 30%-80% of generated text is factually inconsistent (Cao et al., 2018; Pagnoni et al., 2021). More research attention should be paid to faithful dialogue summarization.
The unavailability of faithfulness evaluation methods hinders the development of effective dialogue summarization models. In this work, we present a model-level evaluation schema, FacEval, targeting the faithfulness of dialogue summarization models (§4). First, we synthesize a set of positive and negative summaries for each dialogue with back-translation or rule-based transformations. Then, a summarization model is asked to distinguish positive and negative summaries based on conditional generation probabilities. More correct judgements indicate that the model is more factually competent.
To compare the model-level performance of evaluation methods, we leverage two ad-hoc training schemas to synthesize a series of models with different capability ranks. Then, the evaluation methods are used to predict the ranking of the trained models. Seven non-factual and factual evaluation methods are examined, followed by a detailed discussion of their properties. The effectiveness of FacEval is also demonstrated by a strong correlation with the factual correctness of summarization models.

Summarization Methods
Text summarization is one of the most important tasks in natural language generation (NLG). With the development of pre-trained language models, a lot of progress has been made in abstractive text summarization (See et al., 2017; Zhang et al., 2020; Liu et al., 2022), especially in the news domain (Hermann et al., 2015; Narayan et al., 2018). With the availability of datasets (Carletta et al., 2005; Gliwa et al., 2019; Zhu et al., 2021b), dialogue summarization research has attracted a lot of attention. For dialogue summarization, fine-tuning pre-trained generation models including T5 (Raffel et al., 2020), PEGASUS (Zhang et al., 2020) and BART (Lewis et al., 2020) serves as a strong baseline, where BART achieves the SOTA performance on ROUGE scores. Some recent works consider the dialogue properties for more advanced summarization models. Chen and Yang (2020) and Liu et al. (2021a) incorporate the conversational structures into the semantic encoding process of dialogue. Conversations involve lots of co-references. Therefore, Liu et al. (2021b) propose injecting co-reference information into the transformer layers by adapting attention maps or through graph convolutional networks (GCN). We include the outputs of recent dialogue summarization models in our analysis.

Faithfulness Analysis
Previous works identify factual consistency as one key aspect of improving text summarization (Kryscinski et al., 2020; Cao et al., 2020). The analysis of factual errors in summaries is mainly performed in the news domain. Kryscinski et al. (2019) and Falke et al. (2019) conducted the initial crowdsourcing of binary factual annotations and found that nearly 30% of the generated summaries are factually inconsistent. Recent extensions focus on more fine-grained analysis (Cao and Wang, 2021; Pagnoni et al., 2021) and on discovering factual evidence at the entity level (Cao et al., 2022) or span level (Huang et al., 2020; Maynez et al., 2020; Goyal and Durrett, 2021).
Recently, CONFIT presented the first study on the faithfulness of dialogue summaries (Tang et al., 2022b). Similar to our work, they also define a taxonomy of factual errors and conduct fine-grained annotations. However, they focus on comparing reference summaries and generated summaries without referring to the whole dialogue. This is suboptimal because the reference summary cannot fully represent the entire dialogue and can also be incorrect according to our analysis in Section 3. Besides, missing and redundant information is categorized as factual errors, which we consider less appropriate. More recent advanced dialogue summarization models are also not included in their analysis.

Faithfulness Evaluation
The default evaluation metric for summarization, ROUGE, is based on n-gram overlaps between a generated summary and the corresponding references, rendering it less sensitive for capturing factual errors. Therefore, several new metrics are proposed to evaluate faithfulness in the news domain (Kryscinski et al., 2019; Fabbri et al., 2021; Tang et al., 2022a). There are two major groups: one is based on natural language inference, and the other is based on question answering. Kryscinski et al. (2020) and Goyal and Durrett (2020) propose to leverage the entailment relationship. Scialom et al. (2021) and Wang et al. (2020) use question generation, answer generation and answer overlap as the factual consistency measure. Zhao et al. (2021) propose to evaluate the faithfulness of task-oriented dialogue summarization by calculating the amount of overlapping dialogue states, which requires additional human annotations.

Fine-grained Faithfulness Analysis
Previous studies of factuality analysis in summarization mainly focus on the news domain. The typology of factual errors for dialogues can be very different. Therefore, we first define a taxonomy of frequently occurring factual errors for dialogue summaries. A fine-grained analysis is then performed by measuring the factual consistency within dialogue-summary pairs.

Taxonomy of Factual Errors
We collect the generated summaries using four SOTA dialogue summarization models on the popular dialogue summarization dataset, SAMSum (Gliwa et al., 2019). The selected models are BART (Lewis et al., 2020), MV-BART (Chen and Yang, 2020), Coref-BART (Liu et al., 2021b) and CondigSum-BART (Liu et al., 2021a). We define the five most frequently occurring error types in dialogue summaries as below. An example for each error type is shown in Table 2.

Annotation Process
We randomly sample 150 dialogues from the test set of SAMSum. Five summaries are listed for each dialogue, including the human-written one and four model-generated summaries. Falke et al. (2019) found that at least 12 annotations are needed to reach an inter-annotator agreement coefficient of κ = 0.75, which can lead to high annotation costs, or to unreliable results when fewer annotators are used (Kryscinski et al., 2020). Therefore, we perform a two-step verification process to ensure the annotation quality. First, each sample is annotated by two distinct annotators. If there is a disagreement about whether a summary contains factual errors, a third annotator is involved in making the final decision while considering inputs from the previous two annotators. As a result, we have collected 750 fine-grained faithfulness annotations (150 dialogues × 5 summaries) from 30 participants.
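The two-step verification described above can be sketched as a small decision rule. This is only an illustration of the described protocol, not the authors' annotation tooling; the function name `final_label` and the boolean encoding (True means the summary contains at least one factual error) are assumptions made for this example.

```python
from typing import Optional


def final_label(first: bool, second: bool, third: Optional[bool] = None) -> bool:
    """Return the final 'contains a factual error' label for one summary.

    Hypothetical helper: two annotators label the summary independently, and
    a third judgement is consulted only when the first two disagree.
    """
    if first == second:
        return first
    if third is None:
        raise ValueError("annotators disagree; a third judgement is required")
    return third


# 150 sampled dialogues, each with 1 human-written and 4 model-generated summaries.
num_annotated_summaries = 150 * (1 + 4)
```

Under this encoding, the third annotator is consulted only for the summaries on which the first two annotators disagree.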

Results and Analysis
The detailed annotation results are shown in Figure 1. There are several notable findings: 1) the human-written reference summaries contain a non-negligible share of factual errors, at around 17%; 2) 36% to 50% of generated summaries from dialogue summarization models contain at least one factual error; 3) three advanced dialogue summarization models perform worse than their baseline on factual consistency.
First, the popular SAMSum dataset (Gliwa et al., 2019) associates each dialogue with one human-written reference summary. However, we found that 17% of reference summaries have factual errors. This is because the annotation process for SAMSum involved only one annotator per sample, and no further verification process was executed. Therefore, we encourage people to be aware of the issue, especially for evaluation. We notice that the source of factual errors in human summaries is also different from that in machine-generated ones. Some factual errors in human-written summaries are caused by typos, which rarely occur in machine-generated summaries.
For dialogue summarization models, we found that 35%-50% of generated summaries contain factual errors. The most frequent error types are SubObjE and ParE. Because dialogue often involves scattered information exchange with multiple speakers in multiple turns, it is very challenging to accurately locate the who and whom in who-did-what-to-whom. That is the leading cause of SubObjE. ParE is the second most frequent error type, indicating that the generated summaries express the same topic but do not accurately capture the details. OthE occurs less frequently, which shows that our taxonomy of factual errors covers the most frequent error types for dialogue summarization.
Surprisingly, we found that MV-BART, Coref-BART and CondigSum-BART perform even worse than the baseline model, with an increase of around 10% in the overall factual error rate. They are accepted as more advanced summarization models and perform better on ROUGE scores. This indicates that enhancing topical information does not necessarily contribute much to factuality (Chen and Yang, 2020; Liu et al., 2021a). Coref-BART aims to improve BART with co-reference information (Liu et al., 2021b). However, our result shows that it does not bring obvious benefits. In conclusion, we encourage the future development of summarization models to pay more attention to the factuality perspective, and a more diverse evaluation schema beyond ROUGE scores should be incorporated.

Model-level Faithfulness Evaluation
Some efforts have been made toward sample-level factual error evaluation. An example is shown in Figure 3. The sample-level evaluation methods are model-agnostic and examine a model solely based on its output sequences. Most existing evaluation methods, including ROUGE score, human evaluation and recent factual evaluation methods, belong to this type. One ultimate goal of factuality evaluation is to identify better summarization models. We propose directly probing a model's generation probability with a constrained search space. First, FacEval generates a set of positive and negative samples with various factual errors by rule-based transformations. Then, the generation probabilities of positive and negative summaries are compared for each dialogue. A better summarization model should be more likely to generate positive summaries than negative ones.

Dialogue-summary Pair Generation
We design transformations to synthesize negative samples with factual errors. Given the source and target text, one or more modifications are performed on the target text while referring to the information of the source text. This is because the frequently occurring errors are conceptual confusions with content from the source. Our designed transformations are listed as follows:
• Speaker Swap (SS): We first spot the names of speakers in the source text by the colon symbol and then swap the names in the target text.
• Entity / Pronoun / Date / Number Swap (ES / PS / DS / NS): An NER system is first applied to both source and target text. The entities from the target text are randomly swapped with entities from the source text if they share the same entity type.
• Negation (NG): Negation is performed using a set of hand-crafted rules. Auxiliary verbs are first scanned. Then, positive verbs are negated by adding not or n't. Similarly, negative sentences are inverted by negation removal.
First, we paraphrase the summary to create more positive samples through back-translation (BT). The Google Cloud API is leveraged for this task. Then, we generate new summaries with factual errors by corrupting positive summaries, which means the summaries are treated as the target text, and the dialogue is the source text.
Ideally, the negative summaries should be prone to errors generated in real-world scenarios. Therefore, our designed transformations try to mimic that. In the context of the analysis presented in Section 3, we have the following list of correspondences: 1) SS-SubObjE; 2) PS-ProE; 3) NG-NegE; 4) ES/DS/NS-ParE.
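To make the Speaker Swap and Negation rules more concrete, the following is a minimal sketch under simplifying assumptions; it is not the released toolkit. The function names, the short auxiliary-verb list, and the choice to leave contractions such as "doesn't" untouched are all assumptions for illustration. The entity, pronoun, date and number swaps are omitted because they depend on an NER system.

```python
import re
from typing import List, Optional

# Assumed, deliberately small list of auxiliary verbs (illustrative only).
AUXILIARIES = ("is", "are", "was", "were", "will", "can", "does", "do", "did", "has", "have")


def speaker_names(dialogue: str) -> List[str]:
    """Collect speaker names as the text before the first colon of each turn."""
    names: List[str] = []
    for line in dialogue.splitlines():
        match = re.match(r"\s*([^:\n]+):", line)
        if match and match.group(1).strip() not in names:
            names.append(match.group(1).strip())
    return names


def speaker_swap(summary: str, names: List[str]) -> Optional[str]:
    """Swap the first two speakers that both occur in the summary."""
    present = [n for n in names if re.search(rf"\b{re.escape(n)}\b", summary)]
    if len(present) < 2:
        return None  # nothing to swap, so no negative sample is produced
    a, b = present[0], present[1]
    pattern = re.compile(rf"\b({re.escape(a)}|{re.escape(b)})\b")
    return pattern.sub(lambda m: b if m.group(1) == a else a, summary)


def negate(summary: str) -> Optional[str]:
    """Toggle negation on one auxiliary verb; contractions are left untouched."""
    aux = "|".join(AUXILIARIES)
    # Negative to positive: drop the first "not" that follows an auxiliary.
    removed, count = re.subn(rf"\b({aux}) not\b", r"\1", summary, count=1)
    if count:
        return removed
    # Positive to negative: add "not" after the first auxiliary.
    added, count = re.subn(rf"\b({aux})\b", r"\1 not", summary, count=1)
    return added if count else None
```

For example, with the speakers of Table 1, `speaker_swap` turns "Freddie will ask mummy to call Winnie up." into a summary in which the two names are exchanged, and `negate` inserts or removes a single "not". A rule set for real use would need broader coverage of verb forms than this sketch provides.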

Comparison of Generation Probabilities
An illustration of probability comparison is shown in Figure 4. Given a dialogue D, a summary S = [y_1, ..., y_L] and a summarization model f_s(·), we compute a generation score (GS) for the D-S pair from the generation probability, that is, from the probability the model assigns to each token y_l given D and the preceding tokens. We take the generation score from the decision process of the beam search algorithm (Graves, 2012), where the sequence length is taken into consideration. By default, we set the length penalty parameter α to 1.0. For dialogue D_i, there is a positive summary set S = [S_1, ..., S_M] and a negative summary set Ŝ = [Ŝ_1, ..., Ŝ_N]. We count the number of times the positive samples have higher scores than the negative samples for the same dialogue. The factuality score (FS) of model f_s(·) is then computed from these comparisons and normalized over the |D| dialogues.
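The following sketch shows one way the two scores could be computed from per-token log-probabilities. The exact formulas are not reproduced in this text, so the forms used here are assumptions: GS is taken as the sum of token log-probabilities divided by L^α (a common beam-search length normalization), and FS as the fraction of positive-negative pairs in which the positive summary scores higher, averaged over dialogues. The function names are hypothetical, and obtaining the token log-probabilities from a model is left out.

```python
from typing import List, Sequence, Tuple


def generation_score(token_log_probs: Sequence[float], alpha: float = 1.0) -> float:
    """Length-normalized log-probability of one summary.

    Assumed form: GS = (1 / L**alpha) * sum_l log p(y_l | y_<l, D).
    """
    length = len(token_log_probs)
    if length == 0:
        raise ValueError("summary must contain at least one token")
    return sum(token_log_probs) / (length ** alpha)


def pairwise_win_rate(positive: Sequence[float], negative: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive score is higher."""
    wins = sum(1 for p in positive for n in negative if p > n)
    return wins / (len(positive) * len(negative))


def factuality_score(per_dialogue: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average the per-dialogue win rates over all dialogues (assumed normalization)."""
    rates = [pairwise_win_rate(pos, neg) for pos, neg in per_dialogue]
    return sum(rates) / len(rates)
```

With α = 1.0 the generation score is simply the mean token log-probability, so summaries of different lengths are compared on an equal footing.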

Evaluation Preparation
A series of models with different faithfulness capabilities needs to be prepared to evaluate the effectiveness of model-level evaluation methods. One option is to collect as many well-trained models as possible and refer to human annotations to rank models based on factuality. However, it is hard to reach a high agreement, and the ranking may not be trustworthy with limited annotators, as indicated by Falke et al. (2019) and Kryscinski et al. (2020). Therefore, we instead construct a series of models using the following two ad-hoc methods:

Limited data training (LDT).
One common assumption is that more training data leads to better model performance. Therefore, we train 20 models using different proportions of the training data from 5% to 100%.
Table 3 column headers: # Diag, # Spk, # Turn, Sum. Len. The header refers to the number of dialogues, the average number of speakers, the average number of dialogue turns, and the average summary length.

Mixed data training (MDT).
In this setting, we randomly replace the human-labelled training samples with noisy ones. The noisy samples are created by corrupting only the dialogue using transformations introduced in Section 4.1. Here, the source and target are both the dialogue. The trained model is more likely to be confused and generate more factual errors with noisy data. Here, we obtain 21 models with different replacement ratios from 100% to 0%.
LDT will cause a model to be less competent for generation in all aspects. In comparison, MDT will lead the model to generate summaries with more factuality errors while affecting other properties like fluency less. Therefore, we expect a better factuality evaluator to correlate more with MDT models. All correlations are computed on model-level instead of sample-level judgements.
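The two model series can be pictured as data schedules. The sketch below assumes evenly spaced 5% steps, which is inferred from the stated counts (20 LDT models from 5% to 100%, 21 MDT models from 100% to 0%) rather than stated explicitly. It also assumes that each clean training sample has one noisy counterpart at the same index. All function names are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def ldt_proportions() -> List[float]:
    """Assumed LDT schedule: 5%, 10%, ..., 100% of the training data."""
    return [step / 20 for step in range(1, 21)]


def mdt_ratios() -> List[float]:
    """Assumed MDT schedule: replacement ratios 100%, 95%, ..., 0%."""
    return [step / 20 for step in range(20, -1, -1)]


def ldt_subset(train: Sequence[T], proportion: float, seed: int = 0) -> List[T]:
    """Randomly keep a fixed proportion of the training samples."""
    rng = random.Random(seed)
    return rng.sample(list(train), round(len(train) * proportion))


def mdt_mixture(clean: Sequence[T], noisy: Sequence[T], ratio: float, seed: int = 0) -> List[T]:
    """Replace a `ratio` fraction of clean samples with their noisy counterparts."""
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy sets must be aligned by index")
    rng = random.Random(seed)
    replaced = set(rng.sample(range(len(clean)), round(len(clean) * ratio)))
    return [noisy[i] if i in replaced else clean[i] for i in range(len(clean))]
```

Training one model per schedule entry would then yield a series whose expected capability increases with the LDT proportion and decreases with the MDT replacement ratio.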

Experimental Settings
The SAMSum dataset (Gliwa et al., 2019) is used for all experiments. It consists of 16,369 dialogue-summary pairs written by expert linguists. One human-written reference summary is provided for each dialogue. The detailed dataset statistics are listed in Table 3. The samples from the test set are used for all evaluation methods.
For backbone models, we experiment with BART-Large, BART-Base (Lewis et al., 2020), T5-Base and T5-Small (Raffel et al., 2020), which are SOTA summarization models. Each model is trained with both LDT and MDT methods. As a result, we obtained 164 trained models (4 backbones × (20 LDT + 21 MDT models)), divided into eight groups, one per backbone and training method. The models in each group are associated with increasing levels of capability. Spearman's rank correlation coefficient (ρ) between these model ranks and the evaluation scores is reported. For sample-based evaluation methods, the scores on all test set samples are averaged as the model-level performance. We ensure all models are appropriately trained and avoid training collapses by examining their ROUGE scores. The best hyper-parameters are used and kept the same for models from the same group.
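As a reminder of how the reported correlation works, here is a small self-contained sketch of Spearman's ρ, computed as the Pearson correlation of average ranks so that ties are handled. The function names and the toy numbers in the usage note are illustrative assumptions and are not results from this work.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

For instance, if five models with intended capability ranks 1 to 5 received hypothetical metric scores 0.2, 0.1, 0.4, 0.3 and 0.5, two adjacent pairs would be ordered incorrectly and ρ would be 0.8.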

Results and Analysis
Table 4 shows the fine-grained results of FacEval. First, we found that FacEval has a higher correlation with MDT models than with LDT models. The LDT models are less competent in all aspects, as fewer data are involved in training. The generated summaries are weaker in multiple respects, including factuality, fluency, coherence, and granularity. In contrast, the MDT models mainly deteriorate in factuality because of factually corrupted training data. Therefore, it is desirable that FacEval shows a higher correlation with MDT models. Second, when considering each negative sample type, a relatively higher correlation is shown with negation (NG), pronoun swap (PS) and speaker swap (SS). This is because more comparison pairs are created with these methods. Also, for chit-chat dialogues, almost all summaries contain reasoning concerning speakers and persons in the dialogue, and the confirmation of an action happens across multiple utterances. As a result, these error types are more commonly witnessed in dialogue summarization, as illustrated in Figure 1. In contrast, the negative pairs generated by entity swap (ES), date swap (DS) and number swap (NS) show a lower correlation. This is because these samples are more related to particular errors which appear in various formats and are more challenging to simulate. Even though solely considering these samples shows a lower correlation, we still include them in the overall comparison process to have a more comprehensive evaluation.

Comparison with Other Metrics
We include a list of popular evaluation methods for summarization to compare our evaluation schema with existing ones.It contains three generic evaluation methods and four dedicated faithfulness evaluation methods.

Baseline Metrics
Three generic evaluation methods are as follows:
ROUGE (Lin, 2004) score is the default evaluation metric for summarization. We experiment with the F-measure of ROUGE-1, ROUGE-2 and ROUGE-L, which are derived from the unigram overlap, the bigram overlap and the longest common subsequence (LCS) between generated and reference summaries, respectively.
BLEU (Papineni et al., 2002) score is the primary evaluation metric for machine translation. It is mainly designed for corpus-level similarity computation derived from n-gram overlaps. In the following experiments, we report the most commonly used BLEU-4 score.
BERTScore (Zhang et al., 2020) leverages the pre-trained contextual embeddings from BERT and computes the similarity between text sequences by matching words in the candidate and reference by cosine similarity.
Four faithfulness evaluation methods are as follows:
FactCC v1 (Kryscinski et al., 2020) first augments summaries by applying rule-based transformations from the document sentences and fine-tunes a pre-trained language model, BERT, to classify whether the summary is consistent or inconsistent with the documents. It is initially trained in the news summarization domain.
FactCC v2 is our adaptation of FactCC v1 to the dialogue domain. The negative summaries are generated using our transformations discussed in Section 4.1. We train a T5-Small model as the classifier and take the dialogue and summary as input to predict their consistency.
FEQA (Durmus et al., 2020) is a question generation and answering method for faithfulness evaluation. It first extracts question-answer pairs from summaries with pre-trained models. Then, a QA model pulls answers from the document with the same questions. A matching method is deployed to measure the similarity between the answers from the summary and the document as the factuality score. Note that the model is designed for documents.
NLI (Falke et al., 2019) is an entailment-based method which takes the maximal entailment probability between summary and document sentence as the factual consistency score. As no dialogue-based entailment model is available, we compute the entailment probability between reference and generated summaries with a BERT-based entailment model trained on SNLI and MultiNLI datasets.
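To illustrate what the ROUGE variants listed above measure, the following is a simplified sketch of the ROUGE-N and ROUGE-L F-measures. It assumes lowercased whitespace tokenization with no stemming, so it will not reproduce the scores of the official ROUGE implementation. The function names are chosen for this example only.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall, or 0.0 if there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F-measure from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return f_measure(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure from the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_measure(lcs_length(cand, ref), len(cand), len(ref))
```

Taking the NegE example from Table 2, comparing "He hates military." against the reference "He likes military." keeps two of three unigrams but no bigram, so under this sketch ROUGE-1 stays at about 0.67 while ROUGE-2 drops to 0.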

Results and Analysis
The experimental results are shown in Table 5.
Non-factual Evaluator: The non-factual evaluation methods measure the similarity between reference and generated summaries. ROUGE and BLEU are derived from n-gram overlaps, which indicate the overall generation quality. These evaluators are expected to have a reasonable correlation with LDT models, as training with fewer data results in quality degradation of the summary in all aspects. They also show a good correlation with MDT models. We observe that R-2 and R-L are better indicators than R-1 for factuality evaluation. This is because simply replacing isolated tokens can easily change the factual correctness of a summary without much influence on the R-1 score.
Factual Evaluator: As FactCC v1 is trained for news summarization, we found that the released model is incapable of making predictions for dialogues. Similarly, FEQA is not a good indicator of model performance because the question and answer generation models are optimized for documents, which limits its transferability to the dialogue domain. In comparison, FactCC v2 and NLI are better evaluation methods for factuality and can make good predictions on MDT models.
FacEval Properties: FacEval is the only model-level evaluation schema. The examined model must make reasonable predictions on single sentences and differentiate between positive and negative pairs. Therefore, FacEval shows a strong correlation with LDT and MDT models. The exceptional performance on MDT models indicates that FacEval can effectively reflect a model's capability on factuality.
It is beneficial to provide benchmarking performance and analysis on popular dialogue summarization models. As discussed in Section 3, dedicated dialogue summarization models do not outperform their baseline models in terms of faithfulness. Therefore, we evaluate T5 and BART models instead.
The benchmarking results are shown in Table 6. There are several interesting findings. First, BART-Large has the largest model size as well as the overall best performance. Based on our evaluation, larger pre-trained models are also more faithful. Second, BART is generally better than T5 in factuality when model size is taken into consideration. This may be because BART is designed for generation with various denoising objectives, while T5 is a sequence-to-sequence model for different tasks including but not limited to generation. Third, from the fine-grained analysis, we can see that speaker information (from SS) is a major challenge for dialogue summarization. This is because dialogue involves multiple speakers, and their roles are tightly involved in the ideal summarization. Therefore, how to improve the model's understanding of speaker roles is an interesting direction to explore (Liu et al., 2021b). Meanwhile, some faithfulness errors stem from a lack of commonsense in existing models (Wang et al., 2021), so how to effectively combine hidden semantics (Wang and Kuo, 2020) and well-structured knowledge (Ge et al., 2022) is also worth exploring.

Conclusion
We believe our faithfulness analysis and evaluation method can facilitate the development of dialogue summarization systems. Instead of measuring faithfulness on generated summaries, we directly assess the model's capability by multi-choice questions. We expect FacEval to be effectively extended to other generation scenarios.

Limitations
The testing samples used in our method are obtained by rule-based transformations of the reference and back-translated summaries. The method is therefore limited to the types of transformations designed. More transformation methods need to be proposed to have a comprehensive evaluation. To obtain more natural summaries, we can gather generated summaries and have them annotated by humans. With more available samples, the model can be evaluated in more aspects and closer to real-world scenarios.
Verifying the effectiveness of the model-level evaluation schema requires various models and their corresponding rankings. However, such model rankings are currently unavailable because 1) there are not enough varieties of dialogue summarization models, as it is still a developing field; 2) the annotations on the faithfulness of dialogue summaries are not adequate. Therefore, in this work, we resort to heuristic methods to manually create a series of models with desired capability levels. When new evaluators are proposed, the best practice is to leverage model-level human rankings for performance benchmarking.

Figure 1: The proportion of summaries with different types of factual errors. Note that one summary can contain multiple error types.

Figure 2: The proportion of summaries with at least one factual error.

Figure 3: An illustration of two types of evaluation paradigms.

Figure 4: An illustration of comparing the generation probability of positive and negative samples. Solid and dashed lines refer to probability comparison and sample construction, respectively.
Ref. Summary: Fiona doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
SubObjE: Jonathan doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
ProE: Fiona doesn't know what he should give to her dad as a birthday gift. He likes military. Jonathan suggests a paintball match.
NegE: Fiona doesn't know what she should give to her dad as a birthday gift. He hates military. Jonathan suggests a paintball match.
ParE: Fiona doesn't know what she should give to her dad as a Christmas gift. He likes military. Jonathan suggests a paintball match.
HalE: Fiona doesn't know what she should give to her dad as a birthday gift. He likes military. Jonathan invites Fiona to watch a military movie.

Table 2: An illustration of the taxonomy on factual error types.

• Subject Object Error (SubObjE): The subject(s) or object(s) involved in an event is (partially) wrong. It includes substitution, addition and deletion of any related subject(s) or object(s).
• Pronoun Error (ProE): Pronoun references occur frequently in dialogue summarization. This error includes wrong references and ambiguous ones that cannot be fully understood relying on the summary alone.
• Negation Error (NegE): Dialogues can contain confirmation utterances. This error means that the generated summary makes wrong conclusions when contradictory or unconfirmed events are presented in the dialogue.
• Particulars Error (ParE): The summary presents related events, but some details are inaccurate or faulty. It can include incorrect information like date, time and location.
• Hallucination Error (HalE): Generation models can produce imagined content, which can be triggered by certain prompt words in the dialogue. The hallucination error refers to the cases where the summary contains events not presented in the dialogue.
• Other Error (OthE): It is used to classify factual errors that do not belong to any of the above types.
Note that the above-mentioned error types are not exclusive to each other. That is, one summary may contain multiple error types.

Table 3: The detailed statistics of the SAMSum dataset.

Table 4: Detailed correlation analysis between model series and negative sample types. For each column, one negative type is involved. 'all' indicates the usage of all negative types.

Table 5: Comparison of a series of automatic evaluation metrics. The result shown is Spearman's rank correlation between model ranks and predicted scores.