Give the Truth: Incorporate Semantic Slot into Abstractive Dialogue Summarization

Abstractive dialogue summarization suffers from many factual errors, which are due to the salient elements scattered across the multi-speaker information interaction process. In this work, we design a heterogeneous semantic slot graph with a slot-level mask cross-attention to enhance the slot features for more factually correct summarization. We also propose a slot-driven beam search algorithm for the decoding process, which gives priority to generating salient elements within a limited length by "filling in the blanks". Besides, an adversarial contrastive learning strategy is introduced to assist the training process and alleviate the exposure bias. Experimental results on different types of factual errors show the effectiveness of our methods, and human evaluation further verifies the results.


Introduction
Current state-of-the-art conditional text generation models achieve a high level of fluency and informativeness, mostly thanks to advances in seq2seq architectures with attention and copy mechanisms (See et al., 2017) and pre-trained Transformer-based models for natural language understanding (Lewis et al., 2019). Despite this progress, neural text summarization still faces many limitations (Kryscinski et al., 2019), the most serious of which is the tendency to generate summaries with a substantial number of factual errors. Besides, the ROUGE scores (Lin, 2004), the most commonly used evaluation metrics, are inadequate for quantifying factual correctness: they only capture information coverage at the token level, i.e., n-gram overlap, which does not always convey the desired semantics or reach consensus with human judgment.
Recently, as people increasingly exchange information through dialogue, producing high-quality summaries of dialogues has become particularly necessary: such summaries help people quickly grasp the core information of a long dialogue history without reviewing the complex context, and can significantly improve the efficiency of social contact. However, as a special kind of text, dialogue is usually informal and dynamic. Utterances are produced by different speakers alternately, in different language styles, so the description of one event is often fragmented and scattered across multiple utterances. These inherent differences between dialogues and documents make it easier to produce various factual errors, i.e., factual inconsistency and incompleteness, in the generated summaries, as shown in Fig.1. Therefore, it is urgent to develop a neural model that focuses on factual correctness to generate overall high-quality summaries for the multi-party dialogue scene.
There has been some recent research on abstractive dialogue summarization, such as deploying document summarization methods in conversational settings (Gliwa et al., 2019), utilizing dialogue acts (Goo and Chen, 2018), key point sequences (Liu et al., 2019a), and topic word information (Zhao et al., 2020), and analyzing conversational structures (Chen and Yang, 2020, 2021). Other work has pushed the frontier of guaranteeing factual consistency in abstractive document summarization systems by proposing related evaluation metrics (Kryscinski et al., 2020; Maynez et al., 2020) and designing models (Cao et al., 2020). However, current methods (1) fail to utilize the unique semantic and structural information of dialogues to identify the salient elements and guide the decoding process, so as to deal with factual errors in dialogue summarization, and (2) lack overall evaluation metrics for factual correctness. We argue that a slot-aware structure is important for improving the factual correctness of dialogue summarization. Besides, alongside a factual consistency metric, a factual completeness metric is also indispensable for evaluating factual correctness.
In this paper, we propose a Semantic Slot guided Adversarial sequence-to-sequence network (SSAnet). The SSAnet contains a heterogeneous semantic slot (HSS) graph, where different types of nodes represent different slot labels and the edges are the dependencies between slot values. Attentions of three different granularities, i.e., tokens, utterances, and slots, are unified into one architecture to promote the learning of the relationships among all granularities. Crucially, the slot-level attention mechanism lets the model directly select the appropriate slot features from the HSS graph to fill the corresponding slot in the summary sequence, which ensures the correctness and completeness of the salient information in the generated content. In the decoding process, we propose a slot-driven beam search algorithm based on Song et al. (2021) to give priority to generating salient elements within a limited length by "filling in the blanks". Besides, to alleviate the exposure bias, we also use a contrastive learning strategy with adversarial perturbations (Lee et al., 2020) that actively exposes some wrong tokens during training. Finally, we propose a new evaluation metric to quantify factual completeness at the slot level.
Our contributions can be summarized as follows: (1) To the best of our knowledge, we are the first to design a novel slot-level attention operation by copying features from an HSS graph to the corresponding slots. (2) We propose a slot-driven beam search algorithm to give priority to generating salient elements in a controlled way, which ensures the fluency and factuality of summaries.
(3) Contrastive learning with adversarial perturbations is introduced to alleviate the exposure bias for dialogue summarization. (4) We perform experiments on two large-scale datasets to verify the effectiveness of our proposed methods and propose a new metric to evaluate factual completeness.
2 Related Work

Abstractive Dialogue Summarization
Recently, abstractive dialogue summarization has attracted increasing attention. Some early studies adopted dialogue acts (Goo and Chen, 2018), key point sequences (Liu et al., 2019a), and topic segmentation (Liu et al., 2019b; Li et al., 2019) for dialogue summarization. However, the datasets used were either very small or non-public. Later, Gliwa et al. (2019) proposed a large-scale dataset about daily chats, named SAMSum. On this basis, some studies attempted to leverage topic word information (Zhao et al., 2020) and conversational structures (Chen and Yang, 2020, 2021) to improve performance. Besides, Zhu et al. (2021b) recently proposed another large-scale media interview dataset (namely MediaSum) and evaluated several benchmark summarization models on it. However, current methods focus only on modeling the dialogue context in different ways to raise the ROUGE scores, ignoring whether the generated summaries are factually correct. To this end, we utilize the semantic slot information to guide the model toward generating the salient elements. While maintaining high ROUGE scores, our approach also improves factual consistency and completeness.

Fact-aware Summarization
When it comes to factual errors, some work focuses on designing evaluation metrics for factual consistency, as many human evaluations have shown that ROUGE scores correlate poorly with faithfulness (Maynez et al., 2020). These metrics range from fact triples (Goodrich et al., 2019; Zhang et al., 2020) and textual entailment predictions (Falke et al., 2019) to adversarially pre-trained classifiers (Kryscinski et al., 2020) and question answering (QA) systems (Durmus et al., 2020). Another line of related work focuses on enforcing factuality in summarization models. Cao et al. (2017) and Zhu et al. (2021a) proposed RNN-based and Transformer-based decoders, respectively, that attend to both source texts and extracted knowledge triples. Li et al. (2018) proposed an entailment-reward augmented maximum-likelihood training objective. Cao et al. (2020) designed post-editing correctors to boost factual consistency in generated summaries. Our model is inherently different from these models: we boost factuality by incorporating the semantic slot information while generating the summary, instead of correcting after generation, which significantly improves multiple factual correctness metrics without a large drop in ROUGE scores.

Heterogeneous Semantic Slot Graph
The semantic slot information is a specific concept from dialogue systems; slots can be understood as the defined attributes of an event, that is, the backbone of the dialogue content (Yuan and Yu, 2019). Although current neural models are supposed to, or might implicitly, recognize some salient contents in dialogue, they often struggle to describe events consistently and completely. Therefore, we extract the slot values from the dialogue context and construct a heterogeneous semantic slot graph.
Specifically, we first define the slot labels via Stanford CoreNLP and obtain the slot values by fine-tuning the model of Chen et al. (2019) (see Sec 4.1). Then, we use a dependency parser (Manning et al., 2014) to extract the dependencies between slot values, which are formed as triples (slot_1, dependency, slot_2). By integrating the triples, we obtain the graph G = (V, E), where the slot values v_i are nodes in V, and nodes belonging to the same slot label are regarded as the same type of node. E is an adjacency matrix, where e_ij = 1 indicates that there exists some dependency between slot values v_i and v_j.
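To make the construction concrete, the following sketch builds the adjacency matrix E and the node-type list from extracted dependency triples. The sample triples, slot labels, and the `build_hss_graph` helper are illustrative stand-ins; the paper's actual pipeline works from CoreNLP output.

```python
import numpy as np

def build_hss_graph(triples, slot_labels):
    """Build a heterogeneous semantic slot graph from dependency triples.

    triples: list of (slot_value_1, dependency, slot_value_2) tuples.
    slot_labels: dict mapping each slot value to its slot label (node type).
    Returns the node list V, node types, adjacency matrix E, and edge texts.
    """
    nodes = sorted({v for s1, _, s2 in triples for v in (s1, s2)})
    index = {v: i for i, v in enumerate(nodes)}
    E = np.zeros((len(nodes), len(nodes)), dtype=int)
    deps = {}
    for s1, dep, s2 in triples:
        i, j = index[s1], index[s2]
        E[i, j] = E[j, i] = 1          # e_ij = 1: some dependency exists
        deps[(i, j)] = dep             # keep the dependency text for edge init
    types = [slot_labels.get(v, "O") for v in nodes]
    return nodes, types, E, deps
```

Nodes sharing a slot label (e.g., all PERSON values) form one node type, matching the heterogeneous structure described above.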

Dual-Encoder
To model the dialogue context, we utilize a dual-encoder, i.e., a sequence encoder and a graph encoder, which obtains the hidden representations at the token level, utterance level, and slot level.

Sequence Encoder
A pre-trained encoder, i.e., BART (Lewis et al., 2019), is adopted as the feature encoder to extract token representations and utterance representations due to its effectiveness in representation learning. Given a dialogue D with m utterances, the i-th utterance u_i = (w_i1, ..., w_in) is encoded as:

(x_i0, x_i1, ..., x_in) = BART(w_i0, w_i1, ..., w_in)    (1)

Here we add a special token w_i0 = [CLS] (i ∈ {1, ..., m}) at the beginning of each utterance and regard x_i0 as the utterance-level representation.
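The [CLS]-prefixing step can be sketched as follows. `flatten_with_cls` and the whitespace tokenizer are illustrative stand-ins for the BART tokenization pipeline; the recorded positions are where the utterance-level representations x_i0 would later be read off the encoder output.

```python
def flatten_with_cls(utterances, tokenize, cls_token="[CLS]"):
    """Flatten a dialogue into one token sequence, prepending a [CLS]
    marker w_i0 to each utterance. The encoder output at each recorded
    position is taken as that utterance's representation x_i0."""
    tokens, cls_positions = [], []
    for utt in utterances:
        cls_positions.append(len(tokens))
        tokens.append(cls_token)
        tokens.extend(tokenize(utt))
    return tokens, cls_positions
```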

Graph Encoder
Initializers. For node initialization, we employ the token-level output embeddings from the sequence encoder to initialize each token in v_i and then average all token embeddings as the initial representation s_i of the node. For edge initialization, BART(·) is used to encode the dependency e_ij into the initial representation r_ij.

Relational Graph Attention Layer
Based on the constructed HSS graph, we apply a graph attention network (Veličković et al., 2018) with the dependency information to aggregate the slot-level features. This layer, followed by a residual connection, is designed as:

α_ij = softmax_j( LeakyReLU( a^T [W_q s_i ; W_k s_j ; W_r r_ij] ) ),
s'_i = σ( Σ_{j∈N_i} α_ij W_v s_j ) + s_i    (2)

where the W_* are weight matrices, a is a learnable attention vector, σ is the activation function, and N_i is the neighborhood of v_i in G.

Figure 3: An illustration of the decoding process (beam_size=1). Our decoder selects the most probable token of the same slot label from all positions and gives priority to non-"O" slot labels, i.e., PERSON, ACT, etc.
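A minimal PyTorch sketch of one such relational graph attention layer follows. How exactly the edge feature r_ij enters the attention score is an assumption on our part; the version below, which concatenates the projected node pair with the projected edge feature, is one standard parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalGATLayer(nn.Module):
    """One relational graph-attention layer over the HSS graph, with a
    residual connection. A plausible sketch, not the paper's exact layer."""

    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.wr = nn.Linear(d, d, bias=False)   # projects edge features r_ij
        self.att = nn.Linear(3 * d, 1, bias=False)

    def forward(self, s, r, adj):
        # s: (N, d) node features, r: (N, N, d) edge features, adj: (N, N) 0/1
        N, d = s.shape
        q = self.wq(s).unsqueeze(1).expand(N, N, d)   # q[i, j] = W_q s_i
        k = self.wk(s).unsqueeze(0).expand(N, N, d)   # k[i, j] = W_k s_j
        e = self.att(torch.cat([q, k, self.wr(r)], dim=-1)).squeeze(-1)
        e = F.leaky_relu(e).masked_fill(adj == 0, float("-inf"))
        a = torch.nan_to_num(torch.softmax(e, dim=-1))  # isolated nodes -> 0
        return torch.relu(a @ self.wv(s)) + s           # residual connection
```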

Slot-aware Decoder
To aggregate multi-granularity representations, we improve the Transformer-based BART decoder with two extra cross-attentions added to each decoder layer, which attend to the representations at the utterance level and the slot level. It is worth noting that the slot-level mask cross-attention is realized with a novel mask, which represents the corresponding relationship between the slot values in the HSS graph and the tokens in the target summary, that is, whether they belong to the same slot label. In each decoder layer, after performing the token-level cross-attention and the utterance-level cross-attention, the slot-level mask cross-attention is performed over the slot nodes {v_0:|V|} of the HSS graph encoded by the graph encoder to obtain the slot-attended representations. Concretely, the summary tokens are regarded as a query matrix and the slot node representations act as a key matrix, so that every summary token simultaneously assesses how much information it should obtain from every representation of the same-type slot node. In this way, the target summary sequence representation Y, which is the sum of the token-level cross-attention representation T and the utterance-level cross-attention representation U, is projected to a query matrix Q ∈ R^{n×d}. The slot node representations H_g are projected to a key matrix K ∈ R^{|V|×d} and a value matrix V ∈ R^{|V|×d} by linear projections without bias:

Q = Y W_Q,  K = H_g W_K,  V = H_g W_V

The slot-level mask cross-attention is calculated by:

A = ( softmax( Q K^T / √d ) ∗ M ) V    (3)

where ∗ denotes element-wise multiplication, and M ∈ R^{n×|V|} is the mask, defined as M_ij = 1 if the i-th summary token and the slot node v_j belong to the same slot label, and M_ij = 0 otherwise. As in the Transformer (Vaswani et al., 2017), the output vectors are then fed into a feed-forward network for forward passing in the decoder.
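The slot-level mask cross-attention can be sketched as follows. Whether the mask multiplies the attention weights before or after the softmax is our reading of Eq. (3), so treat that placement, and the helper's signature, as assumptions.

```python
import torch

def slot_mask_cross_attention(Y, Hg, M, wq, wk, wv):
    """Slot-level mask cross-attention (schematic).

    Y:  (n, d)   target summary representations (query side)
    Hg: (|V|, d) slot-node representations from the graph encoder
    M:  (n, |V|) 0/1 mask; M[i, j] = 1 iff summary token i and slot
                 node j share the same slot label
    wq/wk/wv: (d, d) projection matrices (no bias)
    """
    d = Y.shape[-1]
    Q, K, V = Y @ wq, Hg @ wk, Hg @ wv
    # Element-wise mask zeroes out attention to nodes of other slot labels.
    weights = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) * M
    return weights @ V
```

With an all-zero mask a token attends to nothing, so the slot-attended contribution vanishes entirely, which is the intended behavior for "O"-labeled tokens.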
Algorithm 1 Slot-driven beam search
procedure SlotDrivenBS
  for i = 1, ..., n do
    Cand ← {}
    for hyp ∈ H do
      if i ≤ L then
        calculate the probabilities of non-O slot values
      else
        calculate the probabilities of O slot values
      end if
      for (s_k, w_k, p_k) ∈ P do
        record the tokens and positions
        score ← score + s_k
        S ← replace(S, p_k, w_k)
      end for
    end for
  end for
  return H   ▷ the best summary sequence
end procedure

Slot-driven Beam Search. The general beam search algorithm is a form of pruned breadth-first search that seeks the K best candidate summaries with the highest log-likelihood, generating the next token one by one. Although beam search is one of the few NLP algorithms that has stood the test of time and is widely used in many text generation tasks, it produces a high search error rate due to long-distance probability transitions (Meister et al., 2020).
Inspired by Song et al. (2021), our slot-driven beam search simultaneously predicts the most probable tokens for all positions and decodes the summary tokens in order of priority under the guidance of semantic slot information, rather than in a left-to-right order, as shown in Fig.3. Besides, a position matrix M_p ∈ {0, 1}^{n×|V|} is maintained to record which positions have been filled by summary tokens and which positions remain available. The detailed process is shown in Algorithm 1.
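A greedy (beam_size = 1) sketch of this "fill-in-the-blanks" decoding order is shown below. The real Algorithm 1 maintains K hypotheses, hypothesis scores, and a position matrix; the probability tables and all helper names here are illustrative only.

```python
def slot_driven_fill(position_probs, slot_of_token, max_salient):
    """Greedy sketch of slot-driven decoding.

    position_probs: one dict per summary position mapping candidate
        tokens to probabilities (all positions predicted simultaneously).
    slot_of_token: maps a token to its slot label ("O" if none).
    Non-"O" (salient) tokens are committed first, up to `max_salient`,
    then the remaining positions are filled with "O"-slot tokens.
    """
    n = len(position_probs)
    summary = [None] * n
    # Phase 1: across ALL open positions, repeatedly commit the single
    # most probable salient (non-"O") token.
    for _ in range(max_salient):
        best = max(
            ((p, w, probs[w])
             for p, probs in enumerate(position_probs) if summary[p] is None
             for w in probs if slot_of_token.get(w, "O") != "O"),
            key=lambda t: t[2], default=None)
        if best is None:
            break
        pos, tok, _ = best
        summary[pos] = tok
    # Phase 2: fill each remaining position with its best "O"-slot token.
    for p, probs in enumerate(position_probs):
        if summary[p] is None:
            summary[p] = max(
                (w for w in probs if slot_of_token.get(w, "O") == "O"),
                key=lambda w: probs[w])
    return summary
```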

Learning Objective
Following Lee et al. (2020), we introduce a contrastive learning strategy during training to improve generalization. It is realized by adding a small perturbation to the hidden representations of the target summary sequence to generate negative examples, and large perturbations to generate positive examples. In this way, the conditional likelihood of a negative example, which remains very close to the source sentence in the embedding space, is minimized, while the conditional likelihood of a positive example is enforced to remain high. The overall objective is as follows:

L = L_mle + α L_cl + β L_kl

where α and β are hyperparameters, searched through cross-validation, that control the importance of the contrastive learning and KL divergence terms.
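A rough sketch of the contrastive term is given below. Lee et al. (2020) derive the perturbation directions from gradients of the likelihood and combine this with the full conditional-likelihood objectives; here the perturbations are random and only an InfoNCE-style pull/push term is shown, so everything beyond the overall shape is an assumption.

```python
import torch
import torch.nn.functional as F

def adversarial_contrastive_loss(h_src, h_tgt, tau=0.1, eps_neg=1.0, eps_pos=5.0):
    """Schematic contrastive term with adversarial-style perturbations.

    h_src: (B, S, d) encoder hidden states of the source dialogue
    h_tgt: (B, T, d) decoder hidden states of the target summary
    A small perturbation builds a hard negative (close in embedding
    space); a large one builds a positive that should stay attracted
    to the source representation.
    """
    h_neg = h_tgt + eps_neg * F.normalize(torch.randn_like(h_tgt), dim=-1)
    h_pos = h_tgt + eps_pos * F.normalize(torch.randn_like(h_tgt), dim=-1)
    anchor = h_src.mean(dim=1)          # mean-pooled source states
    pos, neg = h_pos.mean(dim=1), h_neg.mean(dim=1)
    sim_pos = F.cosine_similarity(anchor, pos) / tau
    sim_neg = F.cosine_similarity(anchor, neg) / tau
    # InfoNCE: pull the positive toward the source, push the negative away.
    return -torch.log(sim_pos.exp() / (sim_pos.exp() + sim_neg.exp())).mean()
```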

Datasets
We experiment with two large-scale abstractive dialogue summarization datasets: the SAMSum dataset (Gliwa et al., 2019), which is about natural conversations in various scenes of the real-life, and the MediaSum dataset (Zhu et al., 2021b), which is about interview transcripts from NPR and CNN.

Data Preprocessing
We obtain the semantic slot information through the following steps: (1) We first use Stanford CoreNLP (Manning et al., 2014) to perform NER and derive the nominal slots, integrating high-frequency entity types with similar concepts into one slot label, e.g., (COUNTRY, CITY, STATE_OR_PROVINCE) → LOCATION, and retaining low-frequency entity types with special significance, e.g., MONEY → PRICE.
(2) We then use Stanford CoreNLP to perform POS tagging and derive the verbal and adjectival slots. Tokens tagged VB, VBP, VBZ, VBN, or VBG are assigned the slot label "ACT"; tokens tagged JJ, JJR, or JJS are assigned the slot label "STATE". (3) By manually integrating and modifying the results of NER and POS tagging, 15 and 17 types of slot labels are defined for the SAMSum and MediaSum datasets, respectively. (4) Finally, we fine-tune the pre-trained slot filling model (Chen et al., 2019) to obtain the complete semantic slot information.
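Steps (1) and (2) amount to a tag-to-slot mapping. The fragment below illustrates the idea with the examples named above; the full 15/17-label inventories were curated manually, so the tables here are deliberately incomplete.

```python
# Illustrative mapping from CoreNLP NER/POS tags to slot labels.
NER_TO_SLOT = {
    "COUNTRY": "LOCATION", "CITY": "LOCATION", "STATE_OR_PROVINCE": "LOCATION",
    "MONEY": "PRICE", "PERSON": "PERSON", "DATE": "DATE", "TIME": "TIME",
}
POS_TO_SLOT = {
    "VB": "ACT", "VBP": "ACT", "VBZ": "ACT", "VBN": "ACT", "VBG": "ACT",
    "JJ": "STATE", "JJR": "STATE", "JJS": "STATE",
}

def slot_label(ner_tag, pos_tag):
    """Resolve a token's slot label, preferring the nominal (NER) slot;
    tokens matching neither table fall back to the "O" label."""
    if ner_tag in NER_TO_SLOT:
        return NER_TO_SLOT[ner_tag]
    return POS_TO_SLOT.get(pos_tag, "O")
```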

Baselines
The following models are adopted as baselines (see Table 2): the pointer-generator network (PG), the Transformer (TRAN), the commonsense-enhanced D-HGN, BART (Lewis et al., 2019), the multi-view model (M-BART) (Chen and Yang, 2020), and the structure-aware sequence-to-sequence model (S-BART) (Chen and Yang, 2021).

Evaluation Metrics
We use three automatic evaluation metrics. The first is the ROUGE score (Lin, 2004), the standard summarization quality metric, which compares the word-level unigram, bigram, and longest common subsequence overlap with the gold summary. Since ROUGE scores have been criticized for their poor correlation with factual consistency, we also use a QA-based metric, QGQA, which has a high correlation with human judgments on factuality. Beyond factual consistency, factual completeness is also crucial for evaluating factual correctness. To fill this gap, we propose a new evaluation metric: Slot Information Completeness (SIC). Formally, SIC is the recall of semantic slot information between a candidate summary and a gold summary, defined as follows:

SIC = Σ_{s∈S} Count_match(s) / |S|

where S stands for the set of slot values in the gold summary, Count_match(s) is the number of values co-occurring in the candidate summary and gold summary, and |S| is the number of values in the set.
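SIC is straightforward to compute once slot values have been extracted from both summaries. The sketch below treats the gold slot values as a multiset (repeated values count separately), which is one possible reading of the definition.

```python
from collections import Counter

def sic(candidate_slots, gold_slots):
    """Slot Information Completeness: recall of gold slot values that
    also occur in the candidate summary (multiset-aware sketch)."""
    gold = Counter(gold_slots)
    cand = Counter(candidate_slots)
    # A gold value is "matched" up to the number of times the candidate
    # also contains it.
    matched = sum(min(cand[s], n) for s, n in gold.items())
    return matched / len(gold_slots) if gold_slots else 0.0
```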

Main Results
Results on SAMSum. As reported in Table 2, all baselines and our model are evaluated automatically with ROUGE scores, QGQA, and SIC. We observe that, compared to simple sequence-to-sequence models (PG and TRAN), incorporating extra information such as commonsense knowledge (D-HGN) and topic word information increases all scores. However, the factual correctness metrics (QGQA and SIC) remain very poor. Besides, although the use of pre-trained models, i.e., BART, M-BART, and S-BART, achieves high ROUGE scores, QGQA and SIC do not improve significantly, especially SIC. This suggests that previous models focus only on improving ROUGE scores while ignoring factual consistency and completeness, which can yield generated summaries with high ROUGE scores that are nonetheless incorrect and low-quality. It is worth noting that our SSAnet significantly boosts the factual consistency and factual completeness measures (QGQA and SIC) by large margins, while also improving ROUGE scores. This shows our model can improve the correctness of system-generated summaries via semantic slot information without sacrificing informativeness.
Results on MediaSum. As shown in Table 2, all results for the abstractive summarization models are markedly lower than those on the SAMSum dataset, because of the increase in the number of speakers and turns and the higher compression ratio required. However, it is encouraging that SSAnet surpasses the best-performing baseline S-BART by 6.41 points on QGQA and 7.18 points on SIC, which shows that the semantic slot information guides the model to generate salient elements and plays an important role in reducing factual errors.

Analysis of Error Types
To examine the performance of the models, we qualitatively analyze the factual errors they produce. We identify the factual errors through manual inspection and define two broad categories: factual inconsistency and factual incompleteness. The errors of factual inconsistency occur at the slot level and the event level, each of which is further divided into intrinsic and extrinsic.

Factual Inconsistency
(1) Slot-Int: Tokens for slot values are incorrectly replaced by other slot values of the same slot label that also appear in the original dialogue, e.g., "at (TIME) 6 (NUMBER) pm (TIME)" → "at (TIME) 8 (NUMBER) pm (TIME)".
(2) Slot-Ext: Tokens for slot values do not appear in the original dialogue, i.e., hallucination.
(3) Event-Int: Due to misinterpretation and wrong integration of salient elements, the semantics of the summary contradict the original dialogue. For example, "Sara baked cookies and Sally ate some" → "Sally baked cookies and ate".
(4) Event-Ext: The pragmatic meanings described in the summary are not mentioned in the original dialogue, e.g., "Bob buys an apple" → "Bob plays basketball".

Factual Incompleteness
Salient elements (slot values) present in the original dialogue are lost in the summary, e.g., "Mary is going to a bar on Green Street for the birthday party at 10 p.m." → "Mary will go to a bar for the birthday party".
We use the above taxonomy to annotate examples from SAMSum and MediaSum. For each dataset, we use the state-of-the-art model M-BART (Chen and Yang, 2020) to generate summaries, followed by manual annotation (100 examples). Additionally, our model SSAnet is annotated for error analysis in the same way. Fig.4 shows the distribution of factual errors for these settings. We first analyze the performance of M-BART on the two dialogue summarization datasets. For SAMSum, we can see that 75% of the generated summaries contain factual errors. Of these, the bulk of the errors are intrinsic, because this dataset contains human-written gold summaries and is generally more reliable. Besides, the factual inconsistency (35%) and factual incompleteness (19%) errors are primarily event-related, caused by sentence compression or fusion. For MediaSum, more of the summaries (90%) generated by M-BART are factually incorrect, and most of the errors (63%) are extrinsic. One reason is that the MediaSum data is automatically constructed from topic descriptions and does not contain fact-related overviews. We then observe the results of our SSAnet on the two datasets: most error types are reduced. In particular, the Event-Int errors of factual inconsistency and the errors of factual incompleteness drop to 18% and 9% on SAMSum, and the Slot-Ext and Event-Ext errors of factual inconsistency decrease by 3 and 5 points on MediaSum. This demonstrates that our methods effectively alleviate many kinds of factual errors in dialogue summarization.

Ablation Study
As shown in Table 3, we first explore the contributions of the slot-level mask cross-attention module (SCA) and the slot-driven beam search algorithm (SBS) on the SAMSum dataset. We can see that removing any component degrades performance. The removal of SCA has almost the same effect on R-1, R-2, R-L, QGQA, and SIC, which indicates that SCA comprehensively improves n-gram overlap, factual consistency, and factual completeness. However, deleting SBS, i.e., using the traditional beam search algorithm as the decoding strategy, has little impact on ROUGE scores but decreases QGQA and SIC by 2.63% and 4.68%. This large impact on the factual correctness metrics shows that SBS effectively reduces factual errors in the generated summaries by controlling the decoding process. When SCA and SBS are removed at the same time, the structure of the model is similar to BART and the performance on all metrics is also similar. We then examine the contrastive learning framework. The adversarial perturbations, i.e., positive and negative pairs, improve the performance to some extent. The visualization of this process is shown in Fig.5. Concretely, we apply average pooling to the encoder outputs corresponding to the source dialogue sequences H_s, the decoder outputs corresponding to the target sequences H_t, the additional positive examples H̃, and the negative examples Ĥ. All of them are projected onto a two-dimensional space with t-SNE. As shown in Fig.5(b), the model pushes Ĥ away from H_t and pulls H̃ toward the embedding of H_s. However, for the model without contrastive learning, H_t and H̃ are far away from H_s, and Ĥ is very close to them, as shown in Fig.5(a).

Dialogue two: Lilly: sorry, I'm gonna be late. Lilly: don't wait for me and order the food. Gabriel: no problem, shall we also order something for you? Gabriel: so that you get it as soon as you get to us? Lilly: good idea! Lilly: pasta with salmon and basil is always very tasty there.
Ground Truth: Lilly will be late. Gabriel will order pasta with salmon and basil for her.
M-BART: Lilly is going to be late, so Gabriel will order the food for her.
SSAnet: Lilly is going to be late. Gabriel will order pasta with salmon and basil.
Figure 6: Sample summaries for dialogues from the SAMSum dataset. The underlined numbers indicate the order in which the summary tokens are generated. "there's" stands for "there is"; it maps to two tokens according to Byte Pair Encoding (BPE). Each sentence has an ending period, so the last word also maps to two tokens.

Human Evaluation
We run a human evaluation to investigate the quality of the summaries. 100 samples are randomly selected from the SAMSum test set, and five annotators are hired from Amazon Mechanical Turk to rate the readability and factualness of the ground truth and of the summaries generated by M-BART, S-BART, and our models. Each annotator uses a Likert scale to score summaries from 1 (worst) to 5 (best) on readability, i.e., how fluent and grammatical the summaries are, and on factualness, i.e., whether the summaries are consistent with the original dialogue and the events described in the summary are complete.
As shown in Table 4, the generated summaries perform poorly on grammaticality, factual consistency, and completeness. Compared with S-BART, SSAnet and its variants have lower scores on readability but higher scores on factualness, due to the strategy of giving priority to generating salient elements by "filling in the blanks" in the decoding process. Accordingly, the fluency and grammaticality scores of SSAnet without SBS increase to 4.32 and 3.79. However, SSAnet greatly improves the factual consistency and completeness scores, coming close to the performance of the ground truths, which indicates that both SCA and SBS play important roles in factual correctness. The outputs for three samples from the SAMSum dataset can be found in Fig.6. Dialogues one and two show the ability of our model to solve factual incompleteness issues, and Dialogue three shows how it mitigates inconsistent facts in the generated summaries.

Conclusion
In this work, we propose a semantic slot guided adversarial sequence-to-sequence network for abstractive dialogue summarization, which utilizes the semantic slot information to improve both the model architecture and the decoding algorithm via a slot-level mask cross-attention mechanism and slot-driven beam search. Contrastive learning with adversarial perturbations is also introduced to assist the training process. Experiments demonstrate the effectiveness of our proposed models in terms of both readability and factualness.

B Training Details
Our methods are implemented with PyTorch (Paszke et al., 2019) and HuggingFace. We fine-tune BART-large (Lewis et al., 2019) for all experiments. For parameters in the original BART encoder/decoder, we follow the default settings and set the learning rate to 5e-5 with 120 warm-up steps. For the graph encoder, we set the hidden dimension to 1024, the number of layers to 2, and the dropout rate to 0.1. For the two extra cross-attentions added to the BART decoder layers, we set the number of attention heads to 4. The learning rate for parameters in the newly added modules is 3e-4 with 60 warm-up steps. The model is fine-tuned for 20 epochs with a batch size of 128. At test time, the minimum lengths of generated summaries for the two datasets are 35 and 20, and the beam size is 10.

C Human Evaluation Guidelines
In this subsection, we give details of the human evaluation guidelines.

C.1 Readability Annotation Guidelines
For readability, we ask the annotators to focus on how fluent and grammatical the summary is, and we provide them with the following guidelines: 1. First, the annotators judge whether the given sentence is complete. If the sentence is incomplete, they rate both fluency and grammaticality as 1.
2. The annotators can understand the meaning of a complete sentence through their analysis, but there are many grammatical problems in the sentence. The annotators rate the scores as 2 or 3 both for fluency and grammaticality.
3. The annotator can easily understand the meaning of the sentence, and there are only minor grammatical problems in it. The annotators rate the scores as 4 or 5 both for fluency and grammaticality.

C.2 Factualness Annotation Guidelines
For factualness, we make the annotators focus on two types of unfaithful errors: (a) factual inconsistency, and (b) factual incompleteness. The guidelines are as follows: 1. We ask the annotators to check whether the given sentence is consistent with the source texts and whether the given sentence contains the complete fact descriptions.
2. If the matching degree is less than 30%, the annotators rate the scores as 1 or 2; if the matching degree is more than 30% and less than 60%, the annotators rate the scores as 3; if the matching degree is more than 60% and less than 100%, the annotators rate the scores as 4 or 5.