Multi-Stage Pre-training Enhanced by ChatGPT for Multi-Scenario Multi-Domain Dialogue Summarization

Dialogue summarization involves a wide range of scenarios and domains. However, existing methods generally only apply to specific scenarios or domains. In this study, we propose a new pre-trained model specifically designed for multi-scenario multi-domain dialogue summarization. It adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and fine-tuning objective. Specifically, we first conduct domain-aware pre-training using large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of our pre-trained model. Then, we conduct task-oriented pre-training using large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance the dialogue summarization ability of our pre-trained model. Experimental results on three dialogue summarization datasets from different scenarios and domains indicate that our pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.

Recently, general-purpose pre-trained models have achieved significant success in dialogue summarization tasks (Lewis et al., 2020; Bao et al., 2020; Beltagy et al., 2020). Furthermore, several task-specific pre-trained models (Zhang et al., 2020a; Zhong et al., 2022) have further improved dialogue summarization. The existing dialogue summarization pre-trained model (Zhong et al., 2022) achieves good performance on long dialogue summarization. However, it still has the following limitations: (1) It is only pre-trained on dialogue corpora that cover two domains (i.e., Interview and TV show), making it difficult to apply to dialogue summarization in a wide range of scenarios and domains. (2) It utilizes a window-based denoising task as the pre-training objective, which presents a significant gap with the fine-tuning objective. Meanwhile, existing state-of-the-art (SOTA) models generally improve dialogue summarization by modeling dialogue interactions (Lin et al., 2022; Tang et al., 2022), incorporating extra information (e.g., topics and roles) (Wang et al., 2022c; Kim et al., 2022), and rewriting dialogues (Xu et al., 2022; Fang et al., 2022). Although these methods are somewhat effective, they still have limited applicability to downstream datasets in different scenarios and domains, and they are often difficult to apply within the current pre-training paradigm due to their complex model architectures.
To address the limitations of previous works, our goal in this study is to propose a task-specific pre-trained model for dialogue summarization with an extremely small gap between the pre-training objective and the fine-tuning objective, enabling it to adapt well to downstream datasets from a wide range of scenarios and domains in full fine-tuning, few-shot, and zero-shot settings.
Motivated by the above goal, we consider three key components in the implementation of our pre-trained model: model architecture, pre-training corpus, and pre-training strategy. For the model architecture, our pre-trained model is based on the standard Transformer (Vaswani et al., 2017) encoder-decoder architecture and is initialized with BART (Lewis et al., 2020). To capture the underlying role interactions during the dialogue process, we incorporate additional speaker embeddings (Gu et al., 2020, 2021) into token representations. For the pre-training corpus, we collect 14 open-domain dialogue datasets across multiple scenarios and 6 multi-domain customer service dialogue datasets. Furthermore, due to the development of Large Language Models (LLMs) (Zeng et al., 2022; Thoppilan et al., 2022; Scao et al., 2022) and their excellent generative ability, obtaining high-quality "dialogue-summary" parallel pre-training data has become possible. Therefore, we utilize ChatGPT (Ouyang et al., 2022) to annotate the collected multi-scenario multi-domain dialogues and obtain corresponding summaries. We refer to our pre-training corpus as LCM³DS (Large-scale ChatGPT-annotated Multi-scenario Multi-domain Multi-turn Dialogue Summarization) (see Figure 1 (a)). For the pre-training strategy, we conduct multi-stage pre-training to reduce the gap between the pre-training objective and the fine-tuning objective. Specifically, we first conduct domain-aware pre-training using the dialogue data from LCM³DS to enhance the adaptability of the pre-trained model to dialogues in multiple scenarios and domains. Then, we utilize the "dialogue-summary" parallel data from LCM³DS for task-oriented pre-training to enhance the ability of the pre-trained model to summarize multi-scenario multi-domain dialogues. We refer to our pre-trained model as MP4 (Multi-stage Pre-trained Model for Multi-scenario Multi-domain Dialogue Summarization) (see Figure 1 (b)).

We evaluate our pre-trained model on open-domain dialogue summarization datasets from two scenarios (i.e., Online-Chat (Gliwa et al., 2019) and Daily-Life (Chen et al., 2021)), as well as a customer service dialogue summarization dataset from a specific domain (i.e., Tweet (Feigenblat et al., 2021)). The experimental results indicate that MP4 significantly outperforms previous SOTA models in full fine-tuning, zero-shot, and few-shot settings, demonstrating remarkable performance improvements.
Our contributions are summarized as follows:
• We construct LCM³DS, which includes a large-scale collection of multi-scenario multi-domain dialogues and their corresponding summaries annotated by ChatGPT.
• We propose MP4, a multi-stage pre-trained model for multi-scenario multi-domain dialogue summarization.
• Our pre-trained model achieves new state-of-the-art performance on three dialogue summarization datasets from different scenarios and domains in full fine-tuning, zero-shot, and few-shot settings.
Dialogue Pre-Processing and Cleaning. We conduct a series of automated data pre-processing and cleaning steps to further improve the quality of the dialogues. For pre-processing, we perform the following steps: (1) Normalizing punctuation, special characters, and capitalization in each dialogue.
(2) Following previous studies (Dinan et al., 2019; Chen et al., 2021), we preprocess each dialogue into a dual-turn dialogue format by merging consecutive utterances from the same speaker. For cleaning, we perform the following steps: (1) Removing duplicate and highly similar dialogues using the Jaccard text similarity algorithm. (2) Deleting highly similar dialogues between the dialogue datasets and the evaluation datasets using the same algorithm as in (1), ensuring that they have no intersection.
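The following is a minimal sketch of how such Jaccard-based near-duplicate filtering could be implemented; the 0.7 threshold, the token-level similarity, and the function names are illustrative assumptions rather than the exact procedure used for LCM³DS.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def filter_near_duplicates(dialogues, held_out, threshold=0.7):
    """Drop dialogues that are near-duplicates of an already-kept dialogue or of
    any evaluation-set dialogue (a simple O(n^2) sketch; `threshold` is illustrative)."""
    kept = []
    for text in dialogues:
        if any(jaccard(text, other) >= threshold for other in kept):
            continue  # near-duplicate within the pre-training corpus
        if any(jaccard(text, ev) >= threshold for ev in held_out):
            continue  # overlaps with an evaluation dataset
        kept.append(text)
    return kept
```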
Role Adding. In order to standardize the different speaker formats of the original dialogues across various datasets, we collect a list containing over 4,000 real names. For each dialogue, we randomly select several real names from the list to assign a role group (e.g., Danny and Alejandra), where the number and order of real names in each role group correspond to the speakers in the original dialogue.
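A small sketch of this role assignment is shown below; the four-entry name list stands in for the 4,000-name list, and the (speaker, utterance) dialogue representation is an assumption for illustration.

```python
import random

# Stand-in for the full list of over 4,000 real names.
NAME_LIST = ["Danny", "Alejandra", "Priya", "Marcus"]


def assign_roles(dialogue):
    """Map each original speaker tag (e.g., 'Speaker 1') to a sampled real name,
    preserving the number and order of speakers in the original dialogue.
    `dialogue` is a list of (speaker, utterance) pairs."""
    speakers = []
    for speaker, _ in dialogue:
        if speaker not in speakers:
            speakers.append(speaker)
    names = random.sample(NAME_LIST, k=len(speakers))  # assumes enough names
    mapping = dict(zip(speakers, names))
    return [(mapping[s], u) for s, u in dialogue]
```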

Annotation
Prompt Format. We follow the previous study of InstructGPT (Ouyang et al., 2022) by inserting the text "Tl;dr:" at the end of each dialogue as a prompt and inputting it into ChatGPT (in the zero-shot setting) to obtain annotated summaries. We also investigate the performance of three different prompts for dialogue summarization in the zero-shot setting, and the details can be found in Appendix A.
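A minimal sketch of this annotation call is given below, assuming the legacy OpenAI Python client; the model name, decoding settings, and function name are illustrative assumptions and not taken from the paper.

```python
import openai  # assumes the legacy (pre-1.0) openai client interface


def annotate_summary(dialogue_text: str) -> str:
    """Zero-shot summary annotation: append the 'Tl;dr:' prompt to the dialogue
    and query ChatGPT. Model name and temperature are illustrative choices."""
    prompt = dialogue_text.strip() + "\nTl;dr:"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"].strip()
```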
Role-Replaced Data Augmentation. Dialogue summarization covers multiple scenarios and domains involving different roles, so dialogues annotated with one role format (e.g., real names) may not match the role formats of downstream datasets. To alleviate this problem, we propose a simple yet effective method that can be extended to dialogue summarization involving any role. Specifically, we directly replace the roles in the dialogues and summaries from LCM³DS to obtain an augmented parallel corpus. In this study, we perform replacements for two common types of roles, namely named coreference and customer service. An example is provided in Appendix C.
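The sketch below illustrates the role replacement on one "dialogue-summary" pair; the specific role mapping (real names to "Customer"/"Agent") is an assumed example of the customer-service case, not the paper's exact mapping.

```python
def replace_roles(dialogue, summary, role_map):
    """Replace role names in both the dialogue and its summary to create an
    augmented parallel sample, e.g. role_map = {"Danny": "Customer", "Alejandra": "Agent"}.
    `dialogue` is a list of (role, utterance) pairs."""
    def substitute(text: str) -> str:
        for old, new in role_map.items():
            text = text.replace(old, new)
        return text

    aug_dialogue = [(role_map.get(role, role), substitute(utt)) for role, utt in dialogue]
    aug_summary = substitute(summary)
    return aug_dialogue, aug_summary
```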
Compared to existing dialogue summarization datasets, LCM³DS exhibits a lower Compression Ratio, moderate Coverage, and lower Density, indicating that the summaries maintain a high degree of abstraction while covering the important content of the dialogue and retaining more details and information from the dialogue. Additionally, LCM³DS shows higher Novelty and Redundancy, which is mainly caused by the lower Compression Ratio. Furthermore, unlike existing small-scale, human-annotated, single-scenario, single-domain dialogue summarization datasets, LCM³DS is large-scale, ChatGPT-annotated, multi-scenario, and multi-domain.

Model

Dialogue Modeling
MP4 is based on the standard Transformer (Vaswani et al., 2017) encoder-decoder architecture. For multi-turn dialogue modeling, the input embedding of each token is the sum of the corresponding token, position, and speaker embeddings. Figure 2 illustrates the dialogue modeling of MP4.
Input Structure. Given the dialogue context D, we first concatenate all roles R_i and utterances U_j in the dialogue context with two additional special tokens as a separate, consecutive token sequence:

X = R_1 <eor> U_1 <eou> R_2 <eor> U_2 <eou> ... R_|D| <eor> U_|D| <eou>

where the special end-of-role token <eor> and end-of-utterance token <eou> are respectively appended to the end of each role and utterance for multi-turn dialogue separation. Then, we add the start token <s> and the end token </s> around the token sequence X as the input for MP4.
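A short sketch of this flattening step follows; the string-level formatting and the (role, utterance) input representation are assumptions for illustration, since in practice the special tokens would be added at the tokenizer level.

```python
def build_input_sequence(dialogue):
    """Flatten a dialogue (list of (role, utterance) pairs) into the MP4 input:
    <s> R1 <eor> U1 <eou> R2 <eor> U2 <eou> ... </s>"""
    pieces = [f"{role} <eor> {utterance} <eou>" for role, utterance in dialogue]
    return "<s> " + " ".join(pieces) + " </s>"


# Example:
# build_input_sequence([("Danny", "Are you coming?"), ("Alejandra", "Yes, in 10 minutes.")])
# -> "<s> Danny <eor> Are you coming? <eou> Alejandra <eor> Yes, in 10 minutes. <eou> </s>"
```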
Speaker Embeddings. To distinguish utterances in the dialogue context and capture underlying role interactions during the dialogue process, we follow previous dialogue modeling studies (Gu et al., 2020, 2021) and add additional speaker embeddings to token representations. This process is performed alternately based on the role transitions and can be extended to dialogues with an unlimited number of roles. The speaker embeddings are combined with the initial token and position embeddings and then fed into the MP4 encoder-decoder framework.
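Below is a schematic PyTorch sketch of summing token, position, and alternating speaker embeddings; the dimensions, the two-slot speaker table, and the module name are assumptions for illustration and do not reflect how MP4 is actually initialized from BART.

```python
import torch
import torch.nn as nn


class DialogueInputEmbedding(nn.Module):
    """Input embedding = token + position + speaker embeddings (sizes are illustrative)."""

    def __init__(self, vocab_size=50265, max_pos=1024, d_model=1024, n_speakers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_pos, d_model)
        self.spk = nn.Embedding(n_speakers, d_model)

    def forward(self, token_ids, speaker_ids):
        # token_ids, speaker_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.spk(speaker_ids)


def alternating_speaker_ids(turn_index_per_token):
    """Alternate speaker ids 0/1 at every role transition, so the same scheme
    extends to dialogues with any number of roles."""
    return [turn_idx % 2 for turn_idx in turn_index_per_token]
```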

Multi-Stage Pre-training
We conduct multi-stage pre-training to reduce the gap between the pre-training objective and the fine-tuning objective. Domain-aware pre-training aims to enhance the adaptability of MP4 to dialogues in multiple scenarios and domains, while task-oriented pre-training aims to enhance the ability of MP4 to summarize unstructured spoken multi-scenario multi-domain dialogues into structured written-language summaries.

Domain-Aware Pre-training
General-purpose pre-trained models (Lewis et al., 2020) are pre-trained on free-form text data with universal pre-training objectives, limiting their ability in specific domains and tasks. Therefore, it is common practice to further train these models with the language modeling objective using text from the target domain to reduce this negative impact (Zhang and Zhao, 2021; Whang et al., 2021). In this study, we conduct a domain-aware pre-training stage on MP4 using the dialogue data from LCM³DS. Specifically, we achieve this by modeling a series of dialogue reconstruction pre-training objectives inspired by BART (Lewis et al., 2020).
Token Masking. For tokens of each utterance in the dialogue, 20% of them are randomly sampled and replaced with a special <mask> token.
Token Deletion. 20% of the tokens in the dialogue utterances are randomly sampled and deleted.
Utterance Infilling. Several utterance spans are randomly sampled, and each span is replaced with a single <mask> token. The length of each utterance span is drawn from a Poisson distribution (λ = 3). 0-length spans correspond to the insertion of <mask> tokens.
Utterance Permutation. The order of all utterances in the dialogue turns is randomly shuffled.
In contrast to previous studies (Zhong et al., 2022; Wang et al., 2022b), we did not shuffle the order of roles. Therefore, MP4 needs to reconstruct the correct order of utterances and ensure the precise alignment between utterances and roles.
Utterance Masking. 20% of the utterances in the dialogue are selected and replaced with a special <uttr-mask> token. We did not perform random selection but instead followed the method of PEGASUS (Zhang et al., 2020a), using greedy search to obtain the principal gap-utterances. During the decoding process, MP4 needs to reconstruct the complete dialogue.
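As a rough illustration of two of the noising operations described above (Token Masking and Utterance Permutation), a minimal sketch is given below; the exact sampling procedure used for MP4 is simplified here, and the data representation is an assumption.

```python
import random


def mask_tokens(utterance_tokens, ratio=0.2, mask="<mask>"):
    """Token Masking: replace roughly 20% of the tokens in an utterance with <mask>."""
    tokens = list(utterance_tokens)
    n_mask = min(max(1, int(len(tokens) * ratio)), len(tokens))
    for i in random.sample(range(len(tokens)), k=n_mask if tokens else 0):
        tokens[i] = mask
    return tokens


def permute_utterances(dialogue):
    """Utterance Permutation: shuffle the utterances while keeping the roles in
    their original order, so the model must realign utterances with speakers.
    `dialogue` is a list of (role, utterance) pairs."""
    roles = [role for role, _ in dialogue]
    utterances = [utt for _, utt in dialogue]
    random.shuffle(utterances)
    return list(zip(roles, utterances))
```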
Multi-Task Learning. The model is trained with a maximum likelihood objective L_Θ. Given the training sample D = (x, y), L_Θ is defined as

L_Θ = − Σ_{t=1}^{|y|} log P(y_t | y_{<t}, x; Θ),    (1)

where Θ is the model parameters, x is the noisy dialogue, and y is the original dialogue.
During each iteration of the multi-task domain-aware pre-training stage, training samples are randomly selected from different pre-training tasks as mini-batches and used to calculate the cumulative loss and optimize the model parameters Θ.
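The following is a minimal sketch of one such multi-task iteration, assuming a HuggingFace-style sequence-to-sequence model whose forward pass returns a `.loss`; the loader structure, task-shuffling scheme, and accumulation over all tasks per step are illustrative assumptions rather than the paper's exact training loop.

```python
import random


def pretraining_step(model, optimizer, task_loaders):
    """One domain-aware pre-training iteration: draw a mini-batch from each
    noising task in random order, accumulate the reconstruction losses as in
    Eq. (1), and update the parameters Θ.
    `task_loaders` maps task names to iterators over (noisy_dialogue, original) batches."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task in random.sample(list(task_loaders), k=len(task_loaders)):
        noisy, original = next(task_loaders[task])
        # Maximum-likelihood reconstruction of the original dialogue.
        loss = model(input_ids=noisy, labels=original).loss
        total_loss = total_loss + loss
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```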

Task-Oriented Pre-training
Several task-specific summarization pre-trained models (Zhang et al., 2020a; Xiao et al., 2022; Zhong et al., 2022) reduce the gap with downstream datasets by modeling a task-oriented pre-training objective. Specifically, they typically select segments (e.g., gap-sentences or window-based utterances) of the original text (e.g., document or dialogue) as optimization targets for the decoder. Although these objectives are somewhat effective, there still exists a significant gap between the segments selected through unsupervised methods and abstractive written-language summaries. In this study, we directly utilize the "dialogue-summary" parallel data from LCM³DS for the task-oriented pre-training stage. The learning objective is similar to Eq. (1), where the training sample D = (x, y), with x representing the original dialogue and y representing the summary annotated by ChatGPT.
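A sketch of one task-oriented pre-training step is shown below, assuming a HuggingFace BART checkpoint as a stand-in for MP4 (without the speaker embeddings and custom special tokens); the example texts and decoding-free single step are purely illustrative.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# One "dialogue-summary" pair; the summary plays the role of a ChatGPT-annotated target.
dialogue = "Danny <eor> Are you coming? <eou> Alejandra <eor> Yes, in 10 minutes. <eou>"
summary = "Alejandra will arrive in 10 minutes."

inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids

# Token-level maximum-likelihood objective as in Eq. (1), with the summary as target.
loss = model(**inputs, labels=labels).loss
loss.backward()
```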

Experimental Setup
Implementation Details. MP4 is initialized with BART-large (Lewis et al., 2020), which is a denoising sequence-to-sequence pre-trained Transformer (Vaswani et al., 2017) model with 12 layers and 16 attention heads. To facilitate performance comparison, we implement four variants of MP4, including MP4 (VANILLA), MP4 (DAP), and MP4 (DAP-TOP).

Downstream Datasets. We evaluate the performance of MP4 on open-domain dialogue summarization datasets from two scenarios (i.e., Online-Chat and Daily-Life), namely SAMSum (Gliwa et al., 2019) and DIALOGSUM (Chen et al., 2021), as well as a customer service dialogue summarization dataset from a specific domain (i.e., Tweet), namely TWEETSUMM (Feigenblat et al., 2021). Table 1 provides the statistics of the downstream datasets. More details are provided in Appendix B.
Comparison Methods. We compare MP4 with three types of baselines: extractive models, abstractive models, and previous SOTA models.

Evaluation Metrics. We evaluate the full fine-tuning, zero-shot, and few-shot performance of all models using ROUGE scores (i.e., R-1, R-2, and R-L), which are standard evaluation metrics.
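As a hedged illustration of how R-1, R-2, and R-L can be computed, the sketch below uses the rouge-score package; the paper references its own ROUGE toolkit in a footnote, so this package and the example texts are substitutes for illustration only.

```python
from rouge_score import rouge_scorer

# One possible way to compute R-1, R-2, and R-L for a single reference/candidate pair.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Alejandra will arrive in 10 minutes."
candidate = "Alejandra says she will come in 10 minutes."

scores = scorer.score(reference, candidate)
print({name: round(score.fmeasure * 100, 2) for name, score in scores.items()})
```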

Full Fine-Tuning Evaluation
To demonstrate the advantages of our pre-trained model with a large number of training samples, we train the model on the entire training set for the full fine-tuning evaluation.
Settings.We provide all the hyper-parameters used for fine-tuning and inference in Appendix E.
During the evaluation, for SAMSum, we follow Liu and Lapata (2019) by testing with the top-3 best checkpoints on the validation set and reporting the average ROUGE scores. For DIALOGSUM, we follow Chen et al. (2021) by reporting the average ROUGE scores between the inference output and the multiple reference summaries. For TWEETSUMM, there is limited prior research, and the original paper (Feigenblat et al., 2021) does not provide a detailed evaluation procedure.

Full fine-tuning results (R-1 / R-2 / R-L):
Transformer (Vaswani et al., 2017): 35.91 / 8.74 / 33.50
Previous SOTA Models:
UNILMV2-base (Bao et al., 2020): 47.04 / 21.13 / 45.04
BART-large (Lewis et al., 2020): 47.28 / 21.18 / 44.83
BART-NARR (Xu et al., 2022): 47.52 / 20.82 / 45.10
LA-BART (Wang et al., 2022a): 47.28 / 21.09 / 45.11
BART-MT (Bhattacharjee et al., 2022)

Table 5: R-1/R-2/R-L results in zero-shot and few-shot settings. For the zero-shot setting, we report the results at the optimal summary length limits. For the few-shot setting, we report the average results from 5 random runs on 10 training samples (all models share the same seed set).

Zero-and Few-Shot Evaluation
Many existing studies that apply pre-trained models to dialogue summarization require a large amount of fine-tuning data, which is often impractical in new scenarios or domains. In contrast, we expect our model to quickly adapt to new scenarios or domains without the need for a large amount of fine-tuning data. To validate this hypothesis, we conduct evaluations in the zero-shot (no training samples) and few-shot (10 training samples) settings.
Obtaining such a small number of samples is feasible in practice for new scenarios or domains.
Settings. We compare the performance of BART-large (Lewis et al., 2020), MP4 (DAP), and MP4 (DAP-TOP) in the zero-shot and few-shot settings. In Appendix F, we provide all the hyper-parameters used. Specifically, for the zero-shot evaluation, since the models have not been trained on the downstream datasets, we report the results obtained using the optimal summary length limits during inference. For the few-shot evaluation, we randomly sample 10 training samples for training. Additionally, to ensure that the results are not affected by sampling variability, we conduct the same experiment five times with different random seeds (shared among all models) and report the average results.
Results. The results presented in Table 5 indicate that our pre-trained model achieves significant improvements compared to BART-large. Specifically, for the zero-shot results, MP4 (DAP-TOP) increases the R-1 score by 14.47 (27.94→42.41), 13.26 (25.50→38.76), and 9.50 (29.42→38.92) on SAMSum, DIALOGSUM, and TWEETSUMM, respectively. Moreover, the zero-shot performance of MP4 (DAP-TOP) surpasses the few-shot performance of BART-large on multiple datasets, demonstrating its powerful zero-shot capability. Additionally, the few-shot results also highlight the advantages of MP4 (DAP-TOP), indicating that our pre-trained model converges faster than other models even with only 10 training samples.

Ablation Study
To further validate the contributions of the fine-grained components in our pre-trained models, we conduct an ablation study on SAMSum in the full fine-tuning setting. Table 6 shows the evaluation results.
Speaker Embeddings. As the results show, incorporating additional speaker embeddings in dialogue modeling can capture the underlying role interactions during the dialogue process and improve the performance of dialogue summarization.

Human Evaluation
We conduct a human evaluation to further assess the performance of our pre-trained model and strong baselines under various paradigms, as well as the Ground Truth (i.e., MP4 (DAP-TOP), ChatGPT, BART-large, and Ground Truth). Specifically, we randomly select 50 samples from the test set of SAMSum. Then, we invite 3 participants to rank the four candidate summaries according to four metrics: fluency (Flu.), conciseness (Conci.), informativeness (Info.), and comprehensiveness (Comp.). The top rank indicates the best performance on that metric.
Table 7 shows the results of the human evaluation (a lower average rank is better). Our pre-trained model outperforms BART-large in all metrics but falls behind the Ground Truth. Specifically, ChatGPT achieves the first rank in fluency and comprehensiveness for the summaries generated in the zero-shot setting, surpassing the Ground Truth. However, it exhibits the weakest performance in conciseness and informativeness. The main reason for this is that ChatGPT tends to generate longer summaries that describe various aspects of the dialogue, including both important and minor details. Moreover, the longer summaries also contribute to an improved overall impression to some extent.
PTMs for Dialogue Summarization. Recently, general-purpose pre-trained models have achieved significant success in dialogue summarization tasks (Lewis et al., 2020; Raffel et al., 2020; Bao et al., 2020; Beltagy et al., 2020). Furthermore, several task-specific pre-trained models (Zhang et al., 2020a; Zhong et al., 2022) have further improved dialogue summarization. Moreover, existing state-of-the-art dialogue summarization models typically leverage pre-trained models and model the characteristics of dialogues to achieve better results, including modeling dialogue interactions (Lin et al., 2022; Tang et al., 2022), incorporating extra information (Wang et al., 2022c; Kim et al., 2022), and dialogue rewriting (Xu et al., 2022; Fang et al., 2022). Although these models are effective, they often have complex model structures that are difficult to apply within the current pre-training paradigm.

Conclusion
In this study, we propose MP4, a multi-stage pre-trained model for multi-scenario multi-domain dialogue summarization. To conduct the pre-training, we construct a large-scale ChatGPT-annotated multi-scenario multi-domain multi-turn dialogue summarization corpus called LCM³DS. Extensive experimental results demonstrate that MP4 exhibits remarkable dialogue summarization capabilities.

Limitations
Although we have demonstrated the powerful performance of MP4 in multi-scenario multi-domain dialogue summarization, there are still some limitations that provide directions for future work: (1) Due to limitations in computational resources, we did not consider long dialogues when constructing LCM³DS. Therefore, MP4 may be more suitable for short dialogue summarization. (2) MP4 is initialized with BART-large, which has only 0.4 billion parameters. In future work, we will consider using larger base models.

Please refer to Table 9.

H Examples of Generated Summaries
Please refer to Figure 4.

E Details of Fine-Tuning and Inference
The following are the hyper-parameters used in the full fine-tuning setting on each dataset.

SAMSum: -gpus 4 -steps 1150 -batch_size 16 -lr 3e-05 -warmup_steps 100 -label_smoothing 0.1 -optimizer Adam

DIALOGSUM: -gpus 4 -steps 1000 -batch_size 16 -lr 3e-05 -warmup_steps 100 -label_smoothing 0.1 -optimizer Adam

TWEETSUMM: -gpus 4 -steps 98 -batch_size 16 -lr 3e-05 -warmup_steps 10 -label_smoothing 0.1 -optimizer Adam

All models maintain consistent hyper-parameters across all datasets during inference.

F Zero- and Few-Shot Evaluation Details

For zero-shot evaluation, the optimal summary length limit hyper-parameter max_length for SAMSum, DIALOGSUM, and TWEETSUMM is 60, 40, and 80, respectively. Other hyper-parameters used during inference remain consistent with Appendix E. For few-shot evaluation (with 10 training samples), we provide the hyper-parameter settings used during training below, while the hyper-parameters used during inference are consistent with Appendix E.

Table 2: Full fine-tuning results on SAMSum test set.

Table 3: Full fine-tuning results on DIALOGSUM test set. We report the average of multiple-reference results.

Table 6: Ablation study on SAMSum in the full fine-tuning setting. The token-level tasks refer to Token Masking and Token Deletion.

Table 7: Human evaluation on SAMSum test set.