Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization

Current dialogue summarization systems usually encode the text with a number of general semantic features (e.g., keywords and topics) to gain more powerful dialogue modeling capabilities. However, these features are obtained via open-domain toolkits that are dialogue-agnostic, or rely heavily on human annotations. In this paper, we show how DialoGPT, a pre-trained model for conversational response generation, can be developed as an unsupervised dialogue annotator, which takes advantage of the dialogue background knowledge encoded in DialoGPT. We apply DialoGPT to label three types of features on two dialogue summarization datasets, SAMSum and AMI, and employ pre-trained and non-pre-trained models as our summarizers. Experimental results show that our proposed method obtains remarkable improvements on both datasets and achieves new state-of-the-art performance on the SAMSum dataset.


Introduction
Dialogue summarization aims to generate a succinct summary while retaining the essential information of the dialogue (Gurevych and Strube, 2004; Chen and Yang, 2020). Theoretically, Peyrard (2019) points out that a good summary is intuitively related to three aspects: Informativeness, Redundancy and Relevance.
To this end, previous works have taken the above three aspects into account by incorporating auxiliary annotations into the dialogue. To improve informativeness, some works annotated linguistically specific words (e.g., nouns and verbs), domain terminologies and topic words in the dialogue (Riedhammer et al., 2008; Koay et al., 2020; Zhao et al., 2020). To reduce redundancy, some works used sentence similarity-based methods to annotate redundant utterances (Zechner, 2002; Murray et al., 2005). To improve relevance, some works annotated topics for the dialogue (Chen and Yang, 2020). However, these annotations are usually obtained via open-domain toolkits, which are not suitable for dialogues, or require manual annotation, which is labor-intensive.
To alleviate the above problem, we explore a pre-trained language model as an unsupervised annotator that automatically provides annotations for the dialogue. Recently, some works have investigated the use of pre-trained language models in an unsupervised manner. For example, Sainz and Rigau (2021) exploited pre-trained models for assigning domain labels to WordNet synsets. The successful recipe is that a model acquires extensive knowledge via pre-training on a huge volume of data. When it comes to the dialogue domain, DialoGPT (Zhang et al., 2020b) is a state-of-the-art conversational response generation model pre-trained on massive dialogue data. Therefore, we draw support from DialoGPT and present our DialoGPT annotator, which performs three dialogue annotation tasks, keywords extraction, redundancy detection and topic segmentation, to measure the informativeness, redundancy and relevance of the input dialogue, respectively.
Keywords Extraction aims to automatically identify important words in the dialogue (shown in Figure 1(a)). Our DialoGPT annotator extracts unpredictable words as keywords. We assume that keywords carry high information and are difficult to predict given both the background knowledge encoded in DialoGPT and the contextual information of the dialogue context. Redundancy Detection aims to detect redundant utterances that make no core contribution to the overall meaning of the dialogue (shown in Figure 1(b)). Our DialoGPT annotator detects utterances that are useless for the dialogue context representation as redundant. We assume that if adding a new utterance does not change the dialogue context representation, then this utterance has no effect on predicting the response, so it is redundant. Topic Segmentation aims to divide a dialogue into topically coherent segments (shown in Figure 1(c)). Our DialoGPT annotator inserts a topic segmentation point before an utterance if it is unpredictable. We assume that if an utterance is difficult to infer from the dialogue context based on DialoGPT, this utterance may belong to a new topic.

Figure 1: An example dialogue from SAMSum (Gliwa et al., 2019) with the human-annotated summary. (a) Keywords extraction aims to extract the words that are most important to the dialogue. (b) Redundancy detection aims to detect non-significant utterances in the dialogue. (c) Topic segmentation aims to divide the whole dialogue into several fine-grained topics. All three kinds of auxiliary information benefit the final summary generation.
We use our DialoGPT annotator to annotate the SAMSum (Gliwa et al., 2019) and AMI datasets. Each annotation is converted into a specific identifier and inserted into the dialogue text. Then, we employ the pre-trained BART (Lewis et al., 2020) and the non-pre-trained PGN (See et al., 2017) as our summarizers. Extensive experimental results show that our method obtains consistent and remarkable improvements over strong baselines on both datasets and achieves new state-of-the-art performance on the SAMSum dataset.

Preliminaries
In this section, we will describe the task definition as well as the background of DialoGPT.

Task Definition
Given an input dialogue $D$, a dialogue summarizer aims to produce a condensed summary $S$, where $D$ consists of $|D|$ utterances $[u_1, u_2, \ldots, u_{|D|}]$ and $S$ consists of $|S|$ words $[s_1, s_2, \ldots, s_{|S|}]$. Each utterance $u_i$ is composed of a sequence of words $[u_{i,1}, u_{i,2}, \ldots]$, and $EOS_i$ indicates the end of the utterance. Besides, each utterance $u_i$ is associated with a speaker $p_i$. Thus, the task can be formalized as producing the summary $S$ given the dialogue sequence $D = [p_1, u_{1,1}, \ldots, EOS_1, \ldots, p_{|D|}, u_{|D|,1}, \ldots, EOS_{|D|}]$.

DialoGPT
DialoGPT (Zhang et al., 2020b) is a neural conversational response generation model, which inherits from GPT-2 (Radford et al., 2019) and is trained on 147M conversation-like exchanges extracted from Reddit comment chains. The model comes in three sizes, with 117M, 345M and 762M total parameters respectively. It achieves state-of-the-art results over various dialogue generation benchmarks. Given the dialogue context $u_{i-1} = [u_{i-1,1}, \ldots, u_{i-1,|u_{i-1}|}, EOS_{i-1}]$, DialoGPT aims to produce the response $u_i = [u_{i,1}, \ldots, u_{i,|u_i|}, EOS_i]$, which can be formalized as the conditional probability $P(u_i \mid u_{i-1})$. It first takes the context word sequence (no more than 1024 tokens) and outputs a representation for each position,

$[h_{i-1,1}, \ldots, h_{i-1,EOS_{i-1}}] = \mathrm{DialoGPT}(u_{i-1}),$

where $h_{i-1,EOS_{i-1}}$ can be viewed as the representation of the dialogue context $u_{i-1}$. Then, DialoGPT starts decoding the response by attending to the context token representations and the partially decoded response tokens until reaching $EOS$. The loss function is the negative log-likelihood of the response word sequence:

$\mathcal{L}_{DialoGPT} = -\sum_{t=1}^{|u_i|} \log p(u_{i,t} \mid u_{i,1}, \ldots, u_{i,t-1}, u_{i-1}).$

It is worth noting that DialoGPT tokenizes texts with the same byte-pair encoding as GPT-2, so both context and response tokens are tokenized into subwords.
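To make this concrete, here is a minimal sketch of scoring a response with DialoGPT through the HuggingFace transformers library (an assumption of ours; the paper does not specify its implementation). The helper name response_token_losses is ours, and it is reused by the annotator sketches later in this section.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def response_token_losses(context: str, response: str) -> torch.Tensor:
    """Negative log-likelihood of each response token given the context."""
    # DialoGPT uses the GPT-2 EOS token as the turn separator.
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token)
    rsp_ids = tokenizer.encode(response + tokenizer.eos_token)
    input_ids = torch.tensor([ctx_ids + rsp_ids])
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # The token at position t is predicted from the logits at position t - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    nll = -log_probs[torch.arange(targets.size(0)), targets]
    return nll[len(ctx_ids) - 1:]  # keep only the response tokens' losses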

Method
In this section, we first introduce our DialoGPT annotator, whose workflow consists of three steps: (1) dialogue preprocessing; (2) DialoGPT forward passing; (3) annotation. The overall framework is shown in Figure 2. We then describe our dialogue summarizers, BART and PGN.
Dialogue Preprocessing

Specifically, we transform the input dialogue into two formats. The first is context-response pairs (shown in Figure 2(a)). Given a dialogue $D$, every two adjacent utterances $(u_{i-1}, u_i)$ are combined into a context-response pair, where $i \in [2, |D|]$. The second is the dialogue sequence (shown in Figure 2(b)). All the utterances in the dialogue $D$ are serialized into one sequence $[u_{1,1}, \ldots, EOS_1, \ldots, u_{|D|,1}, \ldots, EOS_{|D|}]$, with $EOS$ separating the utterances.
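As a rough illustration (helper names are ours; tokenizer is the one loaded in the sketch above), the two formats can be built as follows:

def to_pairs(dialogue):
    """Adjacent (context, response) utterance pairs; speakers are dropped."""
    utts = [u for _, u in dialogue]  # dialogue: list of (speaker, utterance)
    return list(zip(utts, utts[1:]))  # (u_{i-1}, u_i) for i in [2, |D|]

def to_sequence(dialogue):
    """One flat sequence with EOS separating every utterance."""
    return tokenizer.eos_token.join(u for _, u in dialogue) + tokenizer.eos_token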
Note that for both the context-response pairs and the dialogue sequence, we do not take speaker information $p$ into consideration. The reason is that DialoGPT is trained on a huge volume of conversational data without speaker information. Even so, Zhang et al. (2020b) showed that DialoGPT can simulate real-world dialogues in various scenarios and has learned diverse response generation patterns between the same or different speakers according to the given context.

DialoGPT Forward Passing
DialoGPT forward passing has two purposes. (1) For each context-response pair, we aim to get the word-level and utterance-level predicted losses for the response (shown in Figure 2(c)).
(2) For the dialogue sequence, we aim to get the representations for each EOS (shown in Figure 2(d)).
For the first purpose, given a context-response pair $(u_{i-1}, u_i)$, we feed the context words $u_{i-1} = [u_{i-1,1}, u_{i-1,2}, \ldots, u_{i-1,|u_{i-1}|}, EOS_{i-1}]$ into DialoGPT and start to decode the response. At each decoding step $t$, we calculate the negative log-likelihood between the predicted distribution and the golden target from the given response:

$loss_{i,t} = -\log p(u_{i,t} \mid u_{i,1}, \ldots, u_{i,t-1}, u_{i-1}), \qquad loss_i = \frac{1}{|u_i|} \sum_{t=1}^{|u_i|} loss_{i,t},$

where $loss_{i,t}$ and $loss_i$ are the predicted losses for each word and each utterance respectively.
For the second purpose, after a single forward pass of DialoGPT over the dialogue sequence, we obtain representations $H$ for each token on top of DialoGPT. Afterward, we extract the representation of each $EOS$:

$[h_{EOS_1}, \ldots, h_{EOS_{|D|}}] \leftarrow H = \mathrm{DialoGPT}(D),$

where each $h_{EOS_i}$ can be viewed as the representation of the dialogue context $[u_1, \ldots, u_i]$.

Figure 3: An example of redundancy detection. The initial redundant utterance set is $\emptyset$. $h_{EOS_i}$ is the representation of the dialogue context covering the first $i$ utterances. We detect redundant utterances based on the cosine similarity between dialogue context representations. For example, the similarity score between $h_{EOS_4}$ and $h_{EOS_5}$ exceeds the pre-defined threshold ($t_{RD}$ is 0.99), which means adding utterance $u_5$ into the dialogue context brings little information; thus utterance $u_5$ is detected as redundant.

Keywords Extraction: DialoGPT_KE
Motivation Considering both the background knowledge encoded in DialoGPT and the contextual information of the dialogue context, if a word in the golden response is difficult for DialoGPT to infer, we assume that it carries high information and can be viewed as a keyword.
Given a dialogue $D$, we have a loss $loss_{i,j}$ for each word $u_{i,j}$, where $i \in [2, |D|]$. We extract the $r_{KE}$ percent of words with the highest loss as keywords, where $r_{KE}$ is a hyper-parameter. Moreover, the names of all speakers $P$ mentioned in the dialogue are also added to the keywords set. Finally, we append a specific tag #KEY# and the keywords to the end of the original dialogue $D$.
The new dialogue with keywords annotation is $D_{KE} = [p_1, u_{1,1}, \ldots, EOS_{|D|}, \#KEY\#, \ldots]$, with the extracted keywords following the #KEY# tag. We use a heuristic rule to predetermine a candidate value of $r_{KE}$: the average length of the summaries (with stopwords removed) divided by the average length of the dialogues in the training set; we then search for the best $r_{KE}$ around this value. In experiments, we find that the predicted loss for the first word of each utterance is extremely high, probably because the first word of a response is the most uncertain and hardest to predict. Thus, we ignore the first word of each utterance.
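A hedged sketch of DialoGPT_KE built on the response_token_losses helper above: the paper extracts whole words, while for brevity this sketch scores BPE subword tokens directly; r_ke and the function name are illustrative.

def extract_keywords(dialogue, r_ke=0.15):
    """dialogue: list of (speaker, utterance) pairs; returns a keywords set."""
    scored = []
    for (_, ctx), (_, rsp) in zip(dialogue, dialogue[1:]):
        losses = response_token_losses(ctx, rsp)
        tokens = tokenizer.tokenize(rsp + tokenizer.eos_token)
        # Skip the first token of each utterance: its loss is unreliably high.
        for tok, nll in list(zip(tokens, losses.tolist()))[1:]:
            if tok != tokenizer.eos_token:
                scored.append((nll, tok))
    scored.sort(reverse=True)  # highest loss = least predictable = keyword
    top_k = max(1, int(len(scored) * r_ke))
    keywords = {tok.lstrip("Ġ") for _, tok in scored[:top_k]}  # strip the BPE space marker
    keywords |= {speaker for speaker, _ in dialogue}  # always keep speaker names
    return keywords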

Redundancy Detection: DialoGPT_RD
Motivation DialoGPT inherits a decoder architecture, where each token attends to all previous tokens to aggregate information. Thus, the representation $h_{EOS_i}$ for each $EOS_i$ can be viewed as the representation of the dialogue context $[u_1, u_2, \ldots, u_i]$. When adding a new utterance $u_{i+1}$, if the new context representation $h_{EOS_{i+1}}$ is similar to the previous $h_{EOS_i}$, we assume that the new utterance $u_{i+1}$ brings little information and has a small effect on predicting the response; thus $u_{i+1}$ is a redundant utterance.
We start with the last two dialogue context representations $h_{EOS_{|D|-1}}$ and $h_{EOS_{|D|}}$ and calculate the cosine similarity between them. If the similarity score exceeds the threshold $t_{RD}$ (a hyper-parameter), the utterance $u_{|D|}$ is detected as redundant. Otherwise, we move forward one step and calculate the similarity between $h_{EOS_{|D|-2}}$ and $h_{EOS_{|D|-1}}$, repeating the process until reaching $h_{EOS_1}$. An example is shown in Figure 3.
We insert a specific tag [RD] before each redundant utterance.
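A hedged sketch of DialoGPT_RD: one forward pass over the serialized dialogue, then a backward scan comparing consecutive EOS representations (truncation to DialoGPT's 1024-token limit is omitted for brevity; t_rd matches the threshold above).

import torch.nn.functional as F

def detect_redundant(utterances, t_rd=0.99):
    ids, eos_positions = [], []
    for utt in utterances:
        ids += tokenizer.encode(utt + tokenizer.eos_token)
        eos_positions.append(len(ids) - 1)  # index of this utterance's EOS
    with torch.no_grad():
        out = model(torch.tensor([ids]), output_hidden_states=True)
    eos_reps = out.hidden_states[-1][0][eos_positions]  # (|D|, hidden_size)
    redundant = []
    # Scan from the end of the dialogue, as described above.
    for i in range(len(utterances) - 1, 0, -1):
        sim = F.cosine_similarity(eos_reps[i], eos_reps[i - 1], dim=0)
        if sim.item() > t_rd:
            redundant.append(i)  # 0-based index of an utterance to tag with [RD]
    return redundant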

Topic Segmentation: DialoGPT_TS
Motivation DialoGPT is skilled at generating context-consistent responses. Therefore, if a response is difficult to predict given the context based on DialoGPT, we assume the response may belong to another topic and that there is a topic segmentation point between the context and the response.
Given a dialogue $D$, we have a loss $loss_i$ for each utterance $u_i$, where $i \in [2, |D|]$. We select the $r_{TS}$ percent of utterances with the highest loss as topic segmentation points, where $r_{TS}$ is a hyper-parameter. Before each selected utterance, we insert a specific tag [TS]. For example, if there is a segmentation point between utterance $u_1$ and utterance $u_2$, the new dialogue with topic annotation is $D_{TS} = [p_1, u_{1,1}, \ldots, EOS_1, [TS], p_2, u_{2,1}, \ldots, EOS_2, \ldots]$.
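A hedged sketch of DialoGPT_TS, reusing response_token_losses; following the utterance-level loss above, each utterance is scored by its mean per-token loss.

def topic_segments(dialogue, r_ts=0.15):
    """Returns 0-based indices of the utterances to prefix with [TS]."""
    utt_losses = []
    for i, ((_, ctx), (_, rsp)) in enumerate(zip(dialogue, dialogue[1:]), start=1):
        utt_losses.append((response_token_losses(ctx, rsp).mean().item(), i))
    utt_losses.sort(reverse=True)  # least predictable utterances open new topics
    top_k = max(1, int(len(utt_losses) * r_ts))
    return sorted(i for _, i in utt_losses[:top_k])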

Summarizer
We employ two kinds of summarizers. One is BART (Lewis et al., 2020), a Transformer-based model pre-trained on a huge volume of data. The other is PGN (See et al., 2017), an LSTM-based model. Both models follow the typical sequence-to-sequence framework, which first encodes the source dialogue $D$ into distributed representations and then generates the target summary $S$ with a decoder.
BART BART adopts the Transformer (Vaswani et al., 2017) as its backbone architecture. It first maps the source dialogue into distributed representations, based on which a decoder generates the target sequence:

$X^{n} = \mathrm{FFN}(\mathrm{ATT}(X^{n-1})), \qquad Y^{m} = \mathrm{FFN}(\mathrm{ATT}(\mathrm{ATT}(Y^{m-1}), X^{N})), \qquad (3)$

where $n \in [1, N]$ indexes the $N$ identical encoding layers, $m \in [1, M]$ indexes the $M$ identical decoding layers, $X^0$ denotes the sum of the word embeddings $X_{emb}$ and position embeddings $X_{pos}$ of $D$, $Y^0$ denotes that of the shifted-right $S$, FFN(·) denotes a position-wise feed-forward network, and ATT(·) denotes multi-head attention. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are used in each sub-layer and are suppressed in Equation 3 for clarity. Finally, the output representation $Y^M$ of the decoder is projected into the vocabulary space, and the decoder outputs the highest-probability token.
PGN PGN is a hybrid of the typical Seq2Seq attention model (Nallapati et al., 2016) and the Pointer Network (Vinyals et al., 2015). The input dialogue is fed into an LSTM encoder token by token, producing encoder hidden states. At each step, the decoder receives the word embedding of the previous word and produces a distribution to decide the target token, maintaining decoder hidden states. PGN can not only generate words from the fixed vocabulary but also copy words from the input tokens.
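For reference, PGN's final output distribution (notation from See et al. (2017), not introduced in this paper) mixes generation and copying through a generation probability $p_{gen}$:

$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t,$

where $a^t$ is the attention distribution over the input tokens at decoding step $t$, so out-of-vocabulary input words can still be copied into the summary.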
Training Objective Model parameters $\theta$ are trained to maximize the conditional likelihood of the target summary given the source dialogue: $\theta^{*} = \arg\max_{\theta} \sum_{(D, S)} \log p(S \mid D; \theta)$.

Experiments

Datasets
We experiment on two datasets (statistics in Table 1). SAMSum (Gliwa et al., 2019) is a human-generated dialogue summary dataset containing dialogues from various real-life scenarios. AMI is a meeting summary dataset; each meeting has four participants and concerns a remote control design project.

Implementation Details
DialoGPT We initialize DialoGPT with DialoGPT-large. For SAMSum, we set the keywords extraction ratio $r_{KE}$ to 15, the similarity threshold $t_{RD}$ to 0.99 and the topic segmentation ratio $r_{TS}$ to 15. For AMI, $r_{KE}$ is 4, $t_{RD}$ is 0.95 and $r_{TS}$ is 5.
BART We initialize BART with bart.large. For fine-tuning on SAMSum, the learning rate is set to 3e-05, the dropout rate is 0.1, and the warmup is set to 400 steps. At test time, the beam size is 5, the minimum decoded length is 5 and the maximum length is 100.

PGN The word embedding size is set to 300 and initialized with pre-trained GloVe vectors. The dimension of the encoder and the pointer decoder is set to 200. The dropout is set to 0.5. The learning rate is 0.001. At test time, the beam size is 10, the minimum decoded length is 280 and the maximum length is 450.

Table 2: Test set results on the SAMSum dataset, where "R" is short for "ROUGE". BART means fine-tuning BART on the original SAMSum. BART($D_{KE}$), BART($D_{RD}$) and BART($D_{TS}$) represent fine-tuning BART on SAMSum with keywords, redundancy and topic annotation respectively. $D_{ALL}$ means SAMSum with all three annotations. † and †† indicate the first-ranked and second-ranked results respectively.

Baselines and Metrics
For SAMSum, LONGEST-3 views the first three utterances as the summary, and TextRank (Mihalcea and Tarau, 2004) is an unsupervised graph-based extractive method. We adopt ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020a) to evaluate our models.

Automatic Evaluation
The results on SAMSum and AMI are shown in Tables 2 and 3 respectively. We can see that using our annotated datasets $D_{KE}$, $D_{RD}$ and $D_{TS}$, both BART and PGN obtain improvements. Furthermore, our BART($D_{ALL}$) achieves state-of-the-art performance.
For SAMSum, it is worth noting that BART($D_{KE}$) performs better than BART($D_{RD}$) and BART($D_{TS}$). We attribute this to the fact that keywords retain the essential information of shorter dialogues. For AMI, PGN($D_{RD}$) contributes the most, which shows the importance of detecting redundancy in verbose meeting transcripts. Although HMNet and TopicSeg achieve better scores, HMNet needs a news summarization dataset to pre-train the model and TopicSeg designs a complex attention mechanism to incorporate topic information.
In terms of the embedding-based metric BERTScore (shown in Table 4), our BART($D_{ALL}$) and PGN($D_{ALL}$) consistently outperform the baseline models.

Human Evaluation
We conduct a human evaluation of the generated dialogue summaries to assess their informativeness, conciseness and coverage. Informativeness measures how well a summary includes key information. Conciseness measures how well a summary discards redundant information. Coverage measures how well a summary covers each part of the dialogue.
We randomly sample 100 dialogues (SAMSum) and 10 meetings (AMI) with the corresponding generated summaries to conduct the evaluation. To reduce variance caused by human judges, we employ 4 evaluators, each asked to rate every summary on a scale of 1 to 5 (higher is better) for each metric. The results are shown in Table 5.
We can see that our method achieves higher scores on all three metrics. In particular, combined with $D_{RD}$, our model gets the best score in conciseness; combined with $D_{TS}$, our model performs better in coverage. However, HMNet gets the best scores in informativeness and coverage. We argue this is because HMNet forces a minimum summary length of 400; for the same reason, it scores the worst in conciseness. For AMI, we also find that there is still a gap between the scores of the generated summaries and those of the golden summaries, indicating that AMI is more difficult.

Table 6: Test set results of fine-tuning BART on the SAMSum dataset annotated with keywords using various methods. Entities, nouns and verbs are obtained with the toolkit of Qi et al. (2020). Topic words are obtained by a pre-trained LDA model (Narayan et al., 2018). KeyBERT (Grootendorst, 2020) is a BERT-based keywords extraction method.

Analysis
Effect of DialoGPT_KE. To verify the effectiveness of our DialoGPT_KE method, we fine-tune BART on SAMSum annotated by various keywords extraction methods. The results are shown in Table 6. We can see that our method achieves higher scores. The results also show that entities play an important role in summary generation. Besides, combined with DialoGPT embeddings, KeyBERT gets better results.
To give a quantitative evaluation, we view the reference summary words as golden keywords and calculate the precision, recall and $F_1$ scores of the extracted keywords. The results are shown in Table 7. Directly using entities as keywords gets the best precision score. However, both TextRank and Entities perform poorly in recall. Our method gets the best $F_1$ score, and its advantage is mainly reflected in recall, which shows that our method extracts more diverse keywords.

Table 8: Test set results on the SAMSum and AMI datasets annotated with redundant utterances. "Rule-based" indicates annotating utterances that contain no noun, verb or adjective as redundant.
Effect of DialoGPT_RD. To verify the effectiveness of our DialoGPT_RD method, we compare it with a rule-based method (Dinarelli et al., 2009), which annotates utterances containing no noun, verb or adjective as redundant. The results are shown in Table 8. We can see that our method performs better; in particular, it shows a larger advantage on the long and verbose meeting transcripts of AMI.
Effect of DialoGPT_TS. To verify the effectiveness of our DialoGPT_TS method, we compare it with the C99 algorithm (Choi, 2000), a sentence similarity-based segmentation method. Chen and Yang (2020) enhance it with BERT (Devlin et al., 2019) embeddings; we further combine the algorithm with DialoGPT embeddings. The results are shown in Table 9. Our method achieves comparable results with the strong baseline C99 (w/ DialoGPT emb). For AMI, combined with the golden topic annotation, PGN achieves the best result, which shows that modeling topics is an essential task for dialogue summarization.

Case Study. Figure 4 shows summaries generated by different models for an example dialogue in the SAMSum dataset. We can see that BART (Lewis et al., 2020) tends to generate long and redundant summaries. By incorporating topic and stage information, MV-BART (Chen and Yang, 2020) can generate summaries that cover the main topics of the dialogue; however, it still suffers from the redundancy problem. Our BART($D_{ALL}$) gets higher ROUGE scores while generating better summaries: the generated summary includes the extracted keywords and corresponds to each topic of the dialogue. We also find that even when some redundant utterances have been detected, our model still generates summaries containing some redundant information. We attribute this to the fact that the small dataset leads to insufficient training of the model.

Related Work
Dialogue Summarization Current works mainly incorporate auxiliary information to better model dialogues. Some works used various types of keywords to identify the core part of the dialogue, including entities (Zhu et al., 2020), domain terminologies (Koay et al., 2020) and topic words (Zhao et al., 2020). Some works aimed to reduce redundancy: Zechner (2002) and Murray et al. (2005) used sentence-level similarity-based methods. Some works incorporated topics as a coarse-grained dialogue structure (Chen and Yang, 2020). Other works also explored dialogue acts (Goo and Chen, 2018), dialogue discourse (Feng et al., 2020b) and commonsense knowledge (Feng et al., 2020a). In this paper, we combine three types of auxiliary information, keywords, redundant utterances and topics, to better model dialogues.

Pre-trained Language Models Pre-trained models such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) have advanced various NLP tasks. On one hand, some works utilized the knowledge contained in pre-trained models by fine-tuning on supervised data of downstream tasks (Qin et al., 2019; Liu and Lapata, 2019; Qin et al., 2020). On the other hand, some works examined this knowledge in an unsupervised manner (Jiang et al., 2020; Lin et al., 2020). Kumar et al. (2020) explored pre-trained models for conditional data augmentation, and other work used the knowledge in pre-trained models to construct knowledge graphs. Our paper belongs to the second paradigm: we propose a DialoGPT annotator that performs three annotation tasks in an unsupervised manner.

Figure 4 (golden summary of the example dialogue): Rob and Bob are watching the game. Bob will run some errands on the weekend. Jim's birthday is next Wednesday. He might organize a meetup this weekend. Bob will see Rob on the weekend.

Conclusion
We investigate using DialoGPT as an unsupervised annotator for dialogue summarization, covering keywords extraction, redundancy detection and topic segmentation. We apply our DialoGPT annotator to two datasets, SAMSum and AMI. Experimental results show that our method consistently obtains improvements with both a pre-trained summarizer (BART) and a non-pre-trained summarizer (PGN) on both datasets. Besides, combining all three annotations, our summarizer achieves new state-of-the-art performance on the SAMSum dataset.

B Ablation Studies for Annotations
To further verify the effectiveness of our method, we conduct ablation studies for each annotation.
The results are shown in Table 10 and Table 11. We find that: (1) For both datasets, training summarizers on datasets with two of the three annotations obtains improvements. (2) For both datasets, summarizers trained on datasets with two of the three annotations surpass the corresponding summarizers trained with a single annotation (e.g., BART($D_{KE+RD}$) is better than BART($D_{KE}$) and BART($D_{RD}$)). (3) Compared with summarizers trained on $D_{RD+TS}$ and $D_{KE+RD}$, summarizers trained on $D_{KE+TS}$ obtain relatively small improvements on both datasets. Nevertheless, this indicates that DialoGPT_KE and DialoGPT_TS still have non-overlapping parts. (4) Combining all three annotations, both summarizers achieve the best results on all ROUGE scores.

C Hyper-parameter Search Results
Tables 12 to 17 show the hyper-parameter search results. Finally, for SAMSum (Gliwa et al., 2019), we set the keywords extraction ratio $r_{KE}$ to 15, the similarity threshold $t_{RD}$ to 0.99 and the topic segmentation ratio $r_{TS}$ to 15. For AMI, $r_{KE}$ is 4, $t_{RD}$ is 0.95 and $r_{TS}$ is 5.