Topic-Informed Dialogue Summarization using Topic Distribution and Prompt-based Modeling



Introduction
In general, text summarization aims to generate a summary by capturing the core meaning of an original document consistently written by one participant on a single topic, such as news or scientific publications (Rush et al., 2015; Nallapati et al., 2016). In contrast, since a dialogue involves multiple speakers, its topic may drift according to the speakers' intentions (Zhao et al., 2020; Feng et al., 2021a). Therefore, dialogue summarization should take into account the distribution of multiple topics in a dialogue and reflect this distribution when generating the summary (Zou et al., 2021a). Figure 1 shows example summaries generated by BART (Lewis et al., 2019) and TIDSum on the SAMSum dataset (Gliwa et al., 2019); BART is a baseline that has been widely used for its strong performance on summarization tasks. The example dialogue in Figure 1 can be divided into three parts: (1) Tina and Steve are having pasta for dinner, (2) they will do the shopping together, and (3) they agree to meet in the car park after Steve finishes work. That is, the dialogue contains three topics. However, BART failed to capture the most important point: the purpose of Tina and Steve's appointment is to have pasta for dinner. They arranged to go shopping for pasta ingredients for dinner, so this should be included in the summary. Therefore, we focus on generating a more comprehensive summary that captures all the topics in the dialogue without missing an important one.
In this paper, we propose a novel model, the Topic-Informed Dialogue Summarizer (TIDSum), which generates a summary by considering the distribution of topics in the dialogue. To estimate this distribution, we exploit the TopClus model, which was developed for automatic topic discovery from text corpora (Meng et al., 2022), to obtain the topic distribution of the input dialogue. Moreover, a task-specific soft prompt, namely the topic-informed prompt, is added so that the encoder context vectors capture the dialogue topic information in both the encoding and decoding phases. The topic-informed prompt is created from the latent embedding of the auto-encoder in the TopClus model. Since this latent embedding sufficiently contains the topic information of the dialogue, the topic-informed prompt can influence the context vectors of the other tokens in the summarizer's encoder and decoder through encoder self-attention and decoder cross-attention. The output context vectors from the decoder are fed to a topic extractor that estimates the topic distribution of the generated summary. The topic extractor then provides an auxiliary loss that reduces the difference between the dialogue topic distribution and the summary topic distribution during training. This learning approach yields summaries that better reflect the topics of the dialogue.
In the experiments, two daily dialogue summarization datasets, SAMSum and DialogSum, were used to evaluate our model. Compared to the previous state of the art, the proposed model improved Rouge-1 by 1.19%p and 1.94%p on SAMSum and DialogSum, respectively.

Related Work
Recently, there has been increasing attention on neural summarization for dialogues. Current studies mainly apply transformer-based models (e.g., BART (Lewis et al., 2019)) to abstractively summarize dialogues. However, these models are pre-trained on generic text corpora, so it is essential to fine-tune them for dialogue data. Many studies have investigated how to find topics in dialogues. Zhao et al. (2020) modeled the dialogue as an interactive graph based on topic-word information extracted with LDA (Blei et al., 2003). Feng et al. (2021b) used DialoGPT (Zhang et al., 2019) as an annotator to perform three dialogue annotation tasks: keyword extraction, redundancy detection, and topic segmentation. Liu et al. (2021) proposed two topic-aware contrastive learning objectives that implicitly model topic change and handle the information-scattering challenge of dialogue summarization. Since summarizing dialogues is essential in customer service, Zou et al. (2021b) proposed a topic-augmented two-stage summarizer with a multi-role-specific topic modeling mechanism. Li et al. (2022) presented a novel curriculum-based prompt learning method and applied a topic-aware prompt, from which we drew the idea of the topic-informed prompt.

Topic-informed Summary Generation Framework
To perform dialogue summarization effectively, it is necessary to identify the distribution of topics scattered across the multiple utterances of the dialogue. Therefore, we propose a model that generates a topic-informed summary by reflecting the dialogue topic distribution in the summary topic distribution.
The base architecture of TIDSum is a Transformer-based auto-regressive language model, BART.

Topic-Informed Prompt
In Figure 2-(1), we input a dialogue into TopClus to obtain its latent topic embedding and dialogue topic distribution. Specifically, the latent topic embedding lte is derived from the auto-encoder structure of TopClus, which discards extraneous elements and keeps only the salient information from the input. Each topic t_k is associated with the dialogue topic distribution p(t_k | lte), where k indexes the topics. This distribution not only represents all the topical information present in the dialogue but also distinguishes important topics from unimportant ones. As shown in Figure 2-(1), the topic-informed prompt tip is created by concatenating two copies of lte to match the input dimension of BART. Since lte is a hidden state of TopClus, an auto-encoder, its dimension must be smaller than the input dimension of TopClus.
The TopClus input is fixed at 768 dimensions because it originates from the CLS token of BERT, which encodes the input dialogue. We therefore set lte to 512 dimensions so that concatenating two copies (1024 dims) matches the input dimension of BART-large (1024 dims). The prompt tip is then placed as the first token of the encoder input, preceding the dialogue D = {w_1, w_2, ..., w_n} of n tokens; the final input is X = {tip, w_1, w_2, ..., w_n}. During training, the topic information of tip is infused into each token through encoder self-attention. In this manner, the encoder context vectors are topically enhanced, and their topic information is also propagated to the decoder tokens via encoder-decoder cross-attention.
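The prompt construction above can be sketched in PyTorch; this is a minimal illustration, with the function names, toy shapes, and tensor layout all our own assumptions rather than the authors' code:

```python
import torch


def build_topic_informed_prompt(lte: torch.Tensor) -> torch.Tensor:
    """Concatenate two copies of the 512-dim TopClus latent topic
    embedding lte so the prompt matches BART-large's 1024-dim input."""
    assert lte.shape[-1] == 512
    return torch.cat([lte, lte], dim=-1)  # (batch, 1024)


def prepend_prompt(token_embeds: torch.Tensor, tip: torch.Tensor) -> torch.Tensor:
    """Place tip as the first 'token' of the encoder input,
    i.e. X = {tip, w_1, ..., w_n}."""
    # token_embeds: (batch, n, 1024); tip: (batch, 1024)
    return torch.cat([tip.unsqueeze(1), token_embeds], dim=1)


# toy shapes: batch of 2 dialogues, 10 tokens each
lte = torch.randn(2, 512)
embeds = torch.randn(2, 10, 1024)
x = prepend_prompt(embeds, build_topic_informed_prompt(lte))
print(x.shape)  # torch.Size([2, 11, 1024])
```

In a real setup, `x` would be passed to the encoder via something like BART's `inputs_embeds` argument so that self-attention can spread the prompt's topic information to every token.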

Topic Extractor
In Figure 2-(2), the topic extractor, an MLP, extracts the topic distribution of the summary. Through encoder-decoder attention, the topic information sourced from tip is reflected in the decoder tokens. Therefore, we perform mean pooling over all decoder tokens to obtain the decoder topic-informed vector dti:

dti = (1/M) Σ_{j=1}^{M} h_j,

where M is the length of the summary Y = {y_1, y_2, ..., y_M} and h_j is the decoder context vector of token y_j. The topic extractor estimates a summary topic distribution p(t_k | dti) over the K topics from dti. It is trained with a cross-entropy loss that reduces the difference between the dialogue topic distribution and the summary topic distribution:

L_top = -Σ_{k=1}^{K} p(t_k | lte) log p(t_k | dti),

where K is the number of topics. This encourages the summary Y to reflect the dialogue topics.
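A minimal sketch of the topic extractor and its loss follows. The two-layer MLP and the hidden sizes are our assumptions (the exact architecture is not specified), and the dialogue distribution p(t_k | lte) is treated as a fixed target:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopicExtractor(nn.Module):
    """MLP that maps the mean-pooled decoder states to a K-way
    summary topic distribution p(t_k | dti)."""

    def __init__(self, hidden_dim: int = 1024, num_topics: int = 100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, num_topics),
        )

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, M, hidden_dim)
        dti = decoder_states.mean(dim=1)         # mean pooling over M tokens
        return F.softmax(self.mlp(dti), dim=-1)  # p(t_k | dti)


def topic_distribution_loss(p_dialogue: torch.Tensor,
                            p_summary: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the dialogue topic distribution (target)
    and the predicted summary topic distribution: L_top."""
    return -(p_dialogue * torch.log(p_summary + 1e-12)).sum(dim=-1).mean()
```

Note that the target here is a full distribution rather than a hard class label, so the loss is written out explicitly instead of using a label-based cross-entropy helper.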

Topic-informed Summary Generation
The generation loss L_gen is the standard negative log-likelihood of the target summary given the input dialogue:

L_gen = -Σ_{j=1}^{M} log p(y_j | y_{<j}, X).
The final loss L_final is a weighted sum of the generation loss L_gen and the topic distribution loss L_top:

L_final = λ L_gen + (1 - λ) L_top,

where λ is a hyperparameter that controls the relative importance of the two losses. Experimentally, λ = 0.75 gave the best performance. By minimizing L_final during training, the model is encouraged to generate summaries that are faithful to the input dialogue and that reflect its topic distribution.
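The training objective can be sketched as below. The weighting λ·L_gen + (1 − λ)·L_top is our reading of "weighted sum", and the token-level loss setup (ignore index for padding) is illustrative:

```python
import torch
import torch.nn.functional as F


def generation_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                    pad_id: int = -100) -> torch.Tensor:
    """Token-level negative log-likelihood of the target summary.
    logits: (batch, M, vocab); target_ids: (batch, M)."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
    )


def final_loss(l_gen: torch.Tensor, l_top: torch.Tensor,
               lam: float = 0.75) -> torch.Tensor:
    """Weighted sum of generation and topic losses; with lam = 0.75
    the generation loss dominates, as in the reported best setting."""
    return lam * l_gen + (1 - lam) * l_top
```

For example, with L_gen = 2.0 and L_top = 1.0, `final_loss` returns 0.75 * 2.0 + 0.25 * 1.0 = 1.75.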

Experimental Settings
We loaded the pre-trained "facebook/bart-large" checkpoint (https://huggingface.co/facebook/bart-large) for initialization. The learning rates for SAMSum and DialogSum were set to 1e-5 and 3e-5, and the training batch sizes to 2 and 4, respectively. Training was conducted on an Nvidia Quadro RTX 8000 (48GB). We employed the Py-rouge package to evaluate the models, following Feng et al. (2021b) and Liu and Chen (2021).

Comparison Models
TGDGA uses topic information and interactive graph structures. BART (D_ALL) performs three dialogue annotation tasks using a PLM. ReWriteSum uses an utterance rewriting mechanism to complete omitted content. ConFiT is trained via a novel contrastive fine-tuning. SICK++ summarizes the dialogue by utilizing commonsense knowledge. Li et al. (2022) apply curriculum-based prompt learning. Ctrl-DiaSumm+Coref+DA generates controllable summaries with personal named entities.

Main Results
To evaluate our model, we employed ROUGE scores, which are widely used in summarization tasks. In detail, the Rouge-1, Rouge-2, and Rouge-L variants, which measure unigram, bigram, and longest-common-subsequence overlap between generated and reference summaries, were used in our experiments (Lin, 2004).
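For intuition, a unigram ROUGE-1 F1 score can be computed as below. This is only a simplified sketch of what the Py-rouge package computes; it omits stemming, stopword handling, and multi-reference aggregation:

```python
from collections import Counter


def rouge_1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate summary
    and a single reference summary (simplified: plain whitespace
    tokenization, no stemming)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 4 of 5 unigrams match in both directions -> P = R = F1 = 0.8
print(rouge_1_f("tina and steve have pasta",
                "tina and steve eat pasta"))  # 0.8
```

Rouge-2 replaces unigrams with bigrams, and Rouge-L replaces the overlap count with the length of the longest common subsequence.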
Table 2 compares our model with previous approaches on SAMSum and DialogSum. As shown in Table 2, TIDSum achieves state-of-the-art performance on both datasets. TIDSum obtains improvements of 1.19%p in Rouge-1, 1.71%p in Rouge-2, and 1.03%p in Rouge-L over the previous SOTA model on SAMSum, and 1.94%p, 0.85%p, and 3.07%p on DialogSum. These results demonstrate the effectiveness of our technique of generating a summary by distinguishing and reflecting the topics that appear in the dialogue.
In the ablation study, the topic extractor had the largest impact on performance, as it is responsible for creating the summary topic distribution that makes the framework reflect the dialogue topics. For qualitative evaluation of the generated summaries, we conducted human evaluations on three metrics, following Feng et al. (2021b). Informativeness evaluates how well the generated summary captures salient information. Conciseness measures how well the summary discards redundant information. Coverage measures how well the summary covers each part of the dialogue. We randomly sampled 60 dialogues with their generated summaries from the SAMSum dataset and asked three expert evaluators to rate each metric on a scale of 1 to 5, with higher scores being better. The results are shown in Table 3.

Human Evaluation
For informativeness, the golden summary has the highest score because it was written by a person. However, TIDSum scored higher than BART. For conciseness, shorter is generally better, so TIDSum with only the topic extractor scored slightly higher than TIDSum infused with more topic information, but the scores were almost identical. Finally, coverage measures whether a summary covers the entire content of the dialogue, and TIDSum scored higher than the golden summary on this metric. This result shows that TIDSum effectively covers all the content of the dialogue.

We further verify that our model works better on multi-topic dialogues than on single-topic ones and can generate comprehensive summaries well. The SAMSum test set was divided into single-topic and multi-topic dialogues: dialogues whose topic distribution has an entropy below 0.5 were regarded as single-topic. This yields 178 single-topic and 641 multi-topic dialogues. As shown in Figure 3, TIDSum shows larger improvements over the BART-large baseline on the multi-topic dialogues (b). This result demonstrates that TIDSum is effective at summarizing multi-topic dialogues. In real-world scenarios, TIDSum can be applied not only to simple two-speaker dialogues but also to multi-party dialogues, discussion summarization, and other settings with more speakers and various topics. Figure 3 shows that the performance gap between TIDSum and BART-large is much larger in the multi-topic case than in the single-topic case, suggesting that the model is applicable to dialogues with more diverse topics.
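The entropy-based split described above can be sketched as follows; the choice of natural log and the toy distributions are our assumptions (the paper does not state the log base):

```python
import math


def topic_entropy(p: list) -> float:
    """Shannon entropy of a topic distribution (natural log)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


def split_by_topicality(distributions: list, threshold: float = 0.5):
    """Label each dialogue single- or multi-topic by the entropy
    of its topic distribution; below threshold -> single-topic."""
    single, multi = [], []
    for i, p in enumerate(distributions):
        (single if topic_entropy(p) < threshold else multi).append(i)
    return single, multi


# a near-one-hot distribution is single-topic; a flat one is multi-topic
single, multi = split_by_topicality([[0.97, 0.01, 0.01, 0.01],
                                     [0.25, 0.25, 0.25, 0.25]])
print(single, multi)  # [0] [1]
```

A peaked distribution (entropy near 0) indicates one dominant topic, while a flat distribution over K topics approaches the maximum entropy ln K, which is why a single threshold can separate the two regimes.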

Conclusion
We propose TIDSum, a novel model for dialogue summarization. By reflecting the distribution of topics in the dialogue, TIDSum generates comprehensive summaries. We utilize the TopClus model to estimate the topic distribution of the dialogue and introduce a task-specific soft prompt, the topic-informed prompt, to capture and infuse topic information through the encoding and decoding phases. The generated summaries were evaluated on the SAMSum and DialogSum datasets, and our model outperformed previous approaches with significant improvements in ROUGE scores and human evaluation results. Overall, TIDSum effectively captures and summarizes the details of each topic in the dialogue, resulting in high-quality summaries.

Limitations
The proposed method needs to train the TopClus model on the dialogue data to obtain the topic distribution and latent topic embedding of each dialogue before fine-tuning the BART-based summarization model. Since TopClus is an auto-encoder with high-dimensional layers, it takes a long time to train. With the obtained topic distribution and latent topic embedding, the BART-based summarization model is trained and generates summaries in the inference phase. This two-stage process is complicated and time-consuming. Therefore, to simplify it, our future work is to incorporate only the essential parts of TopClus into the main learning process.

Figure 1: Dialogue summary examples of SAMSum generated by BART and TIDSum.

Figure 3: ROUGE score differences for single-topic and multi-topic dialogues.

Acknowledgments
This work was supported in part by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00369, (Part 4) Development of AI Technology to support Expert Decision-making that can Explain the Reasons/Grounds for Judgment Results based on Expert Knowledge), in part by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques), and in part by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)).

Table 1: Data description for SAMSum and DialogSum.

SAMSum consists of messenger-like conversations, such as those on WhatsApp and WeChat. DialogSum is a real-life dialogue dataset containing diverse task-oriented scenarios and topics; its dialogues are written in a formal style. Table 1 shows additional details for each dataset.

Table 3: Human evaluation on SAMSum. "Info.", "Conc.", and "Cov." stand for Informativeness, Conciseness, and Coverage, respectively. "w/o tip" means that the topic-informed prompt is not used.