DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization

Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods have limitations: they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one from a fine-tuned summarization model and the other from important dialogue turns. We then choose one of these pseudo summaries based on differences in information distribution across dialogue types. The selected pseudo summary serves as the objective for pre-training DIONYSUS with a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.


Introduction
Text summarization aims to produce concise and accurate summaries of long texts. Recent research on pre-trained neural language models has shown success in summarizing monologues (Lewis et al., 2020; Raffel et al., 2022; He et al., 2022), such as news articles (Ravaut et al., 2022) and scientific publications (Ibrahim Altmami and El Bachir Menai, 2022; Dong et al., 2021). However, dialogue summarization presents additional challenges due to the different information distribution in dialogues.
Figure 1: A summary of a dialogue in the SAMSum dataset, where the golden summary effectively compiles relevant information (in yellow) from the entire conversation.

Self-supervised text summarization models (Wan and Bansal, 2022; Phang et al., 2022) are typically pre-trained on free-form text data, with selected sentences as the pre-training objective. While this approach can be effective for monologues such as news articles, it is less successful at summarizing semi-structured, multi-participant dialogues. As illustrated in Figure 1, information in daily chats is often dispersed across many dialogue turns, making it difficult to capture all relevant content with a few selected turns, whereas a golden summary must accurately capture vital information from the entire conversation. Furthermore, real-world dialogue summarization applications often have limited or even no labeled data, making it challenging to develop effective models. It is therefore crucial to develop dialogue summarization models that perform well in zero-shot and few-shot settings.
To address these challenges, we propose DIONYSUS, a pre-trained sequence-to-sequence model designed to summarize dialogues in any domain, even when there is a lack of labeled data. To achieve this, DIONYSUS uses pseudo summaries as its pre-training objective, dynamically selected from two sources.
First, for daily chats, where a few dialogue turns are not sufficient to summarize the dialogue, we train a summary helper on high-quality dialogue summarization datasets to generate pseudo summaries for these types of dialogues. On the other hand, for dialogues such as meeting minutes, interviews, and debates, which can be summarized through a selection of essential turns, we use a method inspired by the gap sentence generation (GSG) technique in PEGASUS to select these turns as pseudo summaries for training. For instance, the final few turns of a conversation can often effectively summarize meeting minutes. We improve upon the GSG method by using the generated summaries from the summary helper as references during gap sentence selection, as they tend to contain less noise than the full dialogue context. We refer to this source of pseudo summaries as the "Principal" and to our improved method as GSG+. Our improved method outperforms previous methods in low-resource settings across domains such as daily chats, emails, and customer service dialogues. Additionally, we study various objective strategies to determine the optimal way to select the pre-training objective from the generated summary and the "Principal." We evaluate DIONYSUS on six dialogue summarization datasets. Our best model, trained on 19 dialogue corpora, surpasses PEGASUS LARGE in a zero-shot setting across all domains. We also compare different objective strategies and find that selecting the source with the highest ROUGE score achieves the best performance. In conclusion, our contributions are:
• The development of DIONYSUS, a pre-trained sequence-to-sequence model for summarizing dialogues in any domain in a zero-shot or few-shot setting.
• The introduction of new self-supervised pre-training objectives for dialogue summarization using a summary helper and GSG+.
• The demonstration that DIONYSUS outperforms baselines on six domains in low-resource settings and can be fine-tuned with only 10 training examples to outperform vanilla T5 (Raffel et al., 2022) fine-tuning with 1,000 examples.

Approach
Figure 2 outlines the steps for constructing DIONYSUS: (1) First (§ 2.1), a summary helper is constructed from two high-quality dialogue summarization datasets; it generates a pseudo summary for each dialogue in our pre-training corpus.
(2) Next (§ 2.2), the "Principal" is extracted using GSG+ as the other pseudo summary for the dialogue.
(3) Finally (§ 2.3), various strategies are employed to select the better of the two pseudo summaries as the pre-training objective.

Summary Helper
In certain types of dialogue, such as daily chats, it can be challenging to gather all the necessary information from just a few dialogue turns because dialogue information is dispersed. To address this problem, we create a summary helper model that generates a pseudo summary for each training example in our pre-training corpus. We build our summary helper on the T5 (Raffel et al., 2022) model. To capture essential information in a dialogue, we first train the helper on the MultiWOZ dataset (Budzianowski et al., 2018; Eric et al., 2020) as annotated in DS2 (Shin et al., 2022), which contains summaries derived from dialogue states using templates. This allows us to capture essential information from each turn in the conversation. We then continue training the helper on the DialogSum (Chen et al., 2021) dataset, a human-annotated dataset in the daily-life domain. This lets us move beyond the fixed summary format introduced by the DS2 templates and produce more natural pseudo summaries.

Gap Sentence Generation Plus (GSG+)
Dialogues in certain settings, such as meetings and medical dialogues, often include summary turns that summarize the entire conversation. For example, a participant may summarize a meeting, or a doctor may explain the outcome. These summary turns can be used as a pre-training objective because they highlight the main points of the dialogue and provide a concise overview of the topic discussed.

Figure 2: A diagram of pre-training in DIONYSUS: The summary helper (§ 2.1) generates a pseudo summary (G), which is used to select dialogue turns (§ 2.2) as the "Principal" (P); various strategies (§ 2.3) then choose between the generated summary and the principal as the pre-training objective.

Algorithm 1 GSG+
1: P ← ∅
2: for j ← 1 to m do
3:    s_i := ROUGE1-F1(x_i, G) for each x_i ∈ D \ P
4:    k := argmax_i {s_i}
5:    P := P ∪ {x_k}
6: end for

To make DIONYSUS more adaptable to these scenarios, we improve the independent principal method of the Gap Sentence Generation (GSG) technique by using it to select essential summary turns as pseudo summaries for training. Our new method, Gap Sentence Generation Plus (GSG+), differs from the original GSG method by using the ROUGE1-F1 score between each dialogue turn x_i and the generated summary G from the helper in Section 2.1, rather than the remaining text D \ x_i, to determine the importance of each turn. The generated summary eliminates much of the extraneous information in the dialogue and thus tends to contain less noise than the full dialogue context, resulting in less cluttered selections. This enables us to select the top-m-scored summary turns as the "Principal," which we believe provides a more comprehensive overview of the vital information in the dialogue. For instance, when creating pseudo summaries for meeting minutes, the original GSG may select near-random dialogue turns. By using the summary helper to identify the key points, we are more likely to select the final dialogue turns as the "Principal," which can serve as a comprehensive summary.
Specifically, given a dialogue D = {x_i}_{i=1}^{n}, we use Algorithm 1 to obtain the pseudo-summary "Principal" P for the dialogue. The input for the corresponding training example is the remainder of the dialogue, D \ P. In Section 5.7, we explore the impact of the order of dialogue turns on the formation of the "Principal." Through GSG+, we can effectively identify essential summary turns and generate more accurate pseudo summaries than with the original GSG method. To produce the final pseudo summary S for each dialogue training example, we consider three strategies based on the generated pseudo summary G and the extracted "Principal" P; the selected summary serves as the pre-training objective for that example.
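Because GSG+ scores each turn independently against the generated summary G, the greedy loop in Algorithm 1 reduces to a simple top-m selection. The sketch below illustrates this with a simplified whitespace-token ROUGE1-F1 (the actual system presumably uses a standard ROUGE implementation with stemming); the function names are ours:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap on lowercased whitespace
    tokens (no stemming or stopword handling, unlike official scorers)."""
    def ngrams(text, size):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + size]) for i in range(len(toks) - size + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection
    if not cand or not ref or overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def select_principal(turns, generated_summary, m):
    """Pick the top-m turns by ROUGE1-F1 against the generated summary G,
    keeping them in their original dialogue order (see Section 5.7)."""
    ranked = sorted(range(len(turns)),
                    key=lambda i: rouge_n_f1(turns[i], generated_summary),
                    reverse=True)
    chosen = sorted(ranked[:m])  # restore original dialogue order
    principal = [turns[i] for i in chosen]
    rest = [turns[i] for i in range(len(turns)) if i not in chosen]
    return principal, rest  # rest is the model input D \ P
```

The turn most similar to G is extracted as the "Principal," and the remaining turns form the training input.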

Pre-training Objectives Strategy
All G (S = G): We always select the generated summary from the summary helper as the pre-training objective.
All P (S = P): We always select the "Principal" as the pre-training objective.
Better ROUGE: We select either G or P based on how well it recalls information from the dialogue. We use Algorithm 2 to obtain the pre-training objective by calculating the ROUGE1-F1 score between each pseudo summary and the dialogue excluding the "Principal," D \ P. Note that we use the same reference for both candidates to ensure a fair comparison.
For pre-training with the above strategies, if we choose G as the pseudo summary, we input the full dialogue. If we choose P, we input the dialogue excluding the "Principal," D \ P, to create an abstractive summarization example. However, with some probability we also keep the "Principal" in the input, using a copying mechanism to create an extractive summarization example. More information about this copying mechanism can be found in Section 5.4. Note that we do not combine the two pseudo summaries for a single training example: each example in our pre-training corpus has either G or P as its designated pseudo summary.
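A minimal sketch of the "Better ROUGE" choice, again using a simplified unigram ROUGE1-F1 and joining D \ P into a single reference string (an assumption on our part); ties are broken in favor of G arbitrarily:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    # Simplified unigram ROUGE-1 F1 (lowercased whitespace tokens, no stemming).
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def choose_objective(generated, principal_turns, remaining_turns):
    """Pick the pseudo summary (G vs. P) that scores higher against the
    shared reference D \\ P, as in the "Better ROUGE" strategy."""
    reference = " ".join(remaining_turns)   # D \ P, same reference for both
    principal = " ".join(principal_turns)
    g_score = rouge1_f1(generated, reference)
    p_score = rouge1_f1(principal, reference)
    return ("G", generated) if g_score >= p_score else ("P", principal)
```

Using the same reference for both candidates mirrors the fair-comparison requirement stated above.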

Training Corpus
To train DIONYSUS, we utilize 19 conversational corpora that do not come with pre-defined dialogue summaries, employing a self-supervised approach with pseudo summaries as the pre-training objective.
Conversational Corpora We collect 19 publicly available conversational corpora consisting of 1.7M examples after truncation for pre-training. Corpus information is listed in Table 1. We access these corpora through ConvoKit v2.5.3. This helps ensure that DIONYSUS is well-equipped to handle a variety of conversational scenarios.
DS2 This dataset (Shin et al., 2022) creates dialogue summaries for the MultiWOZ dataset (Budzianowski et al., 2018; Eric et al., 2020) using heuristic rules over the dialogue states. It covers five domains with 10,000 dialogues in total.
DialogSum This dataset (Chen et al., 2021) collects human-annotated summaries for daily-life dialogues from three datasets: DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019), and MuTual (Cui et al., 2020), as well as dialogues from an English-speaking practice website. It contains 13,460 dialogues in total.

Downstream Tasks and Metrics
We evaluate our methods on three public dialogue summarization datasets or benchmarks: SAMSum (Gliwa et al., 2019), ConvoSumm (Fabbri et al., 2021), and TweetSumm (Feigenblat et al., 2021).
SAMSum This dataset contains natural messenger-like dialogues created by linguists fluent in English. It has over 16k dialogues with summaries manually annotated by language experts.
ConvoSumm This benchmark covers four domains: New York Times comments, StackExchange, W3C emails, and Reddit. Dialogues are extracted from publicly available data, and each domain has 500 dialogues. Crowdsource workers on Amazon Mechanical Turk were hired to annotate the dialogue summaries.
TweetSumm This dataset contains 1,100 reconstructed real-world customer support dialogues from Twitter. Each dialogue has human-annotated abstractive and extractive summaries. We use only the abstractive summaries as references in our experiments.

Baselines
We compare our methods with three competitive baselines.
T5v1.1 This is an improved version of the original T5 model (Raffel et al., 2022). Because the original T5 model is pre-trained on downstream tasks in a supervised manner, the test sets of downstream tasks overlap with its pre-training data. For a fair zero-shot comparison, we choose T5v1.1, which is pre-trained on C4 without mixing in downstream tasks.
PEGASUS This pre-trained model is designed for abstractive summarization tasks. Its pre-training objective, gap sentence generation, transforms any text into an abstractive summarization example by selecting important sentences as output summaries. We use the PEGASUS LARGE checkpoint, as there is no publicly available PEGASUS BASE checkpoint.
GSG* We use the independent principal strategy of the gap sentence generation (GSG) training objective in PEGASUS, but pre-train DIONYSUS with our training corpora. We build this baseline to explore the performance gap between our pre-training objective and GSG.

Implementation Details
Following Raffel et al. (2022), to save time and computation we first conduct ablation experiments on a reduced-size T5v1.1 BASE model with 250M parameters, and then scale the best settings up to the final T5v1.1 LARGE model with 800M parameters. We use heuristics to clean our pre-training corpora: we remove dialogues with fewer than two dialogue turns, since they are too short to summarize, and strip URLs and emojis from the text. DIONYSUS is implemented with the Huggingface PyTorch Transformers library (Wolf et al., 2020). In the pre-training input, we separate dialogue turns with line breaks and add a "[Summary]" prefix. For pseudo summary creation, we use a compression ratio of 0.15 for the "Principal": for a dialogue with l turns, we select 0.15l turns as the "Principal." We explore the effect of different compression ratios in Section 5.3. We use Adam (Kingma and Ba, 2014) with weight decay for pre-training and truncate dialogue training examples to a maximum length of 512. Models are pre-trained with batch size 8 and learning rate 0.00001 on 16 Nvidia V100 GPUs until we observe no progress on validation data, or for up to 5 epochs. For the few-shot experiments in Section 5.2, we fine-tune models for up to 20 epochs with batch size 8 and learning rate 0.00005, and pick the checkpoint with the best validation performance.
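The corpus-cleaning heuristics can be sketched as follows; the URL and emoji patterns and the exact placement of the "[Summary]" prefix are our assumptions, since they are not specified in detail:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges; the exact filtering rules are not specified.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_dialogue(turns):
    """Heuristic cleanup sketch: strip URLs and emojis from each turn,
    discard dialogues with fewer than two turns (too short to summarize),
    and join the rest with line breaks under a "[Summary]" prefix."""
    cleaned = []
    for t in turns:
        t = URL_RE.sub("", t)
        t = EMOJI_RE.sub("", t).strip()
        if t:
            cleaned.append(t)
    if len(cleaned) < 2:
        return None  # discard: too short to summarize
    return "[Summary] " + "\n".join(cleaned)
```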

Results and Analysis
We focus on low-resource dialogue summarization settings because it is difficult to collect enough training examples. We evaluate DIONYSUS with "All G", "All P", and "Better ROUGE" strategies in zero-shot and few-shot settings and compare it to the baselines.

Zero-Shot Results
To evaluate the effectiveness of DIONYSUS, we conduct a zero-shot test of DIONYSUS LARGE with all strategies and compare against the baselines. We present the results in Table 2, using ROUGE1-F1, ROUGE2-F1, and ROUGEL-F1 as the standard evaluation measures for summarization. Our models show substantial improvements over the baselines on all downstream datasets. Specifically, DIONYSUS LARGE with the "Better ROUGE" strategy performs best overall (average ROUGE-1/2/L: 29.7/8.0/20.2), indicating that it benefits from both generated and extractive pseudo summaries and can adapt to various domains. The "All P" strategy performs better than the GSG* baseline on most datasets, indicating that our Gap Sentence Generation Plus method effectively selects dialogue turns that provide an accurate dialogue summary. Additionally, DIONYSUS LARGE with the "All G" and "Better ROUGE" strategies shows significant improvement over T5v1.1 LARGE (average ROUGE-2: +5.6/+6.1) and PEGASUS LARGE (average ROUGE-2: +2.2/+2.7), indicating that pre-training with our summary helper is highly beneficial. However, the "All G" strategy performs only as well as the "Better ROUGE" strategy on the SAMSum dataset, suggesting that the improvement from the summary helper is more pronounced on this particular dataset. This may be due to the similarity between the datasets used to train the helper and the SAMSum dataset, which we discuss further in Sections 5.5 and 5.6. Overall, our models outperform previous methods such as PEGASUS in a zero-shot setting, demonstrating their effectiveness and potential for further development.

Few-Shot Results
To investigate the potential for reducing annotation labor in dialogue summarization tasks, we further study few-shot dialogue summarization, reporting ROUGE1-F1, ROUGE2-F1, ROUGEL-F1, and ROUGELSum-F1 scores. Specifically, we fine-tune DIONYSUS LARGE, PEGASUS LARGE, and T5v1.1 LARGE with the first 1/10/100/1K/10K training examples from the SAMSum dataset. Figure 3 shows the results with varying training data sizes. All three models improve as the number of training examples increases. Among them, DIONYSUS LARGE consistently outperforms both PEGASUS LARGE and T5v1.1 LARGE when trained with 0 to 10,000 examples, suggesting that our pre-training helps DIONYSUS adapt to downstream tasks more quickly. We also observe that PEGASUS LARGE outperforms T5v1.1 LARGE thanks to its pre-training on summarization tasks. As shown in Figure 3, the gap between DIONYSUS LARGE and PEGASUS LARGE is particularly large with fewer than 100 training examples, indicating that our model has better recall capabilities in dialogue summarization than PEGASUS. Furthermore, even with only 10 training examples, DIONYSUS LARGE achieves higher ROUGE scores than T5v1.1 LARGE trained with 1,000 examples. This demonstrates that our model is a strong option for low-resource dialogue summarization.

Effect of Compression Ratio
In GSG+, we can select either a fixed number of dialogue turns or a fixed fraction of turns as the training objective. We define the compression ratio at the turn level as the number of selected turns over the total number of turns in the dialogue (N_principal / N_dialogue). A low compression ratio selects fewer turns as the objective, making pre-training less challenging; it also tends to yield a lower ROUGE1-F1 score against the remaining dialogue turns, so the "Better ROUGE" strategy selects more generated summaries as the objective. A high compression ratio makes pre-training more challenging, but yields a higher ROUGE score relative to the generated summaries, leading to more "Principal" objectives under the "Better ROUGE" strategy. Figure 4 shows zero-shot performance on the development sets of the SAMSum and TweetSumm datasets with compression ratios from 10% to 60%. The model with a 15% compression ratio achieves the highest ROUGE-2 score.
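Mapping the compression ratio to a turn count might look like the following; rounding up and the floor of one turn are our assumptions (the text only states that 0.15l turns are selected for a dialogue with l turns):

```python
import math

def num_principal_turns(num_turns, ratio=0.15):
    """Turn-level compression: number of "Principal" turns for a dialogue
    with num_turns turns. Rounding up and the minimum of one turn are
    assumptions; the ratio 0.15 is the paper's default."""
    return max(1, math.ceil(ratio * num_turns))
```

For example, a 20-turn dialogue at the default ratio yields 3 principal turns, while even a 3-turn dialogue still contributes at least one.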

Effect of Copying Mechanism
The copying mechanism is critical for some dialogue types, e.g., meetings and medical dialogues, in which the content of several dialogue turns can summarize the whole dialogue. As shown in Table 3, we compare the performance of the "All P" strategy against a variant in which 50% of the selected dialogue turns are retained in the input rather than removed; in that case, the input for each pre-training example is the entire dialogue D rather than D \ P, which pushes the model toward extractive summarization.

Table 3: ROUGE-1/2/L scores in the zero-shot setting for DIONYSUS BASE with the "All P" strategy and "All P" without the copying mechanism on SAMSum, ConvoSumm, and TweetSumm.

We observe that adding a random copy mechanism significantly improves overall performance. To further investigate the effect of the copying probability, we evaluate the "Better ROUGE" strategy with copying probabilities ranging from 0.15 to 0.7. In these experiments, we choose the top-2 dialogue turns as the principal, which results in 51.9% of pre-training objectives being the principal and the rest the generated summary. The results in Figure 5 show that leaving 15% of dialogue turns in the principal best enhances the overall quality of dialogue summarization.

Figure 5: Comparison of different probabilities of copying the selected sentences of the "Principal" into the input for DIONYSUS BASE with the "Better ROUGE" strategy, evaluated with ROUGE2-F1 on the SAMSum and TweetSumm development sets.

Table 4: ROUGE-1/2/L scores in the zero-shot setting for DIONYSUS BASE with the "All G" strategy and the summary helper on SAMSum, ConvoSumm, and TweetSumm.
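The copying mechanism can be sketched as a per-example coin flip; the exact sampling scheme is our assumption:

```python
import random

def build_training_input(turns, principal_idx, copy_prob=0.5, rng=None):
    """With probability copy_prob, keep the "Principal" turns in the input
    (full dialogue D, an extractive-style example); otherwise feed D \\ P
    (an abstractive-style example). copy_prob=0.5 mirrors the 50% setting
    compared in Table 3."""
    rng = rng or random.Random()
    if rng.random() < copy_prob:
        return list(turns)  # copy: principal stays in the input
    return [t for i, t in enumerate(turns) if i not in set(principal_idx)]
```

Setting copy_prob to 0 recovers the purely abstractive "All P" input, while 1 always leaves the principal in place.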

Comparison Between All G and Summary Helper
Since the summary helper model provides the generated summary as an objective candidate, the helper itself may demonstrate strong capabilities in zero-shot dialogue summarization. As shown in Table 4, we compare the helper model to our "All G" model in a zero-shot setting; the difference is that the "All G" model is trained on the pre-training corpora annotated by the helper. We find that the helper model is not on par with our model. While the helper performs well on one particular task (NYT), its overall performance is not as strong, because our model has been extensively trained on various dialogue datasets and therefore performs consistently well across a wide range of tasks and scenarios.

Table 5: Percentage of overlap between the SAMSum test set and the datasets used for pre-training. For the ConvoKit corpora, a randomly selected 10% of the total data was used to calculate the similarity.

Test-Set Overlap with Pre-Training Corpora
To ensure a fair comparison, we check for overlap between the dialogue datasets used for pre-training and a downstream test set. We calculate the similarity between every test-set target in the SAMSum dataset and every pre-training document using ROUGE-2 recall, computed as the number of overlapping bigrams divided by the total number of bigrams in the test target. We then count the number of test-set examples whose similarity to any pre-training example exceeds a given threshold. As shown in Table 5, the overlap between the SAMSum dataset and both the helper training datasets and the pre-training datasets is low for similarity thresholds between 0.4 and 1.0. This suggests that there is no significant similarity between our test set and the pre-training datasets, and that the improvement in our model is due to the pre-training process rather than test data leakage.
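The overlap check follows directly from its definition (overlapping bigrams over total bigrams in the test target); tokenization details below are our assumption:

```python
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def rouge2_recall(test_target, pretrain_doc):
    """Overlapping bigrams divided by the total bigrams in the test target."""
    ref = bigrams(test_target)
    if not ref:
        return 0.0
    overlap = sum((bigrams(pretrain_doc) & ref).values())
    return overlap / sum(ref.values())

def count_overlapping(test_targets, pretrain_docs, threshold=0.4):
    """Count test targets whose similarity to ANY pre-training document
    meets the threshold."""
    return sum(
        any(rouge2_recall(t, d) >= threshold for d in pretrain_docs)
        for t in test_targets
    )
```

Sweeping the threshold from 0.4 to 1.0 and reporting the resulting counts reproduces the style of analysis summarized in Table 5.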

Effect of the Dialogue Turns Order in Principal
Two possible orders can be used to arrange the dialogue turns in the principal. The first is to order the turns by their ROUGE1-F1 scores. The second is to keep the turns in the order of the original dialogue, without rearrangement, which preserves the original flow and structure of the conversation. We compare these two orders in the GSG* baseline. As shown in Table 6, keeping the original dialogue order improves zero-shot performance, as it provides a more nuanced understanding of the dialogue. We adopt this order for all our models.

Related Work
Dialogue summarization is a rapidly growing area of research that focuses on automatically generating concise and informative summaries of conversations (Feng et al., 2022). Unlike research on traditional documents such as news articles (Fabbri et al., 2019; Ahuja et al., 2022) or scientific papers (Lu et al., 2020; Ibrahim Altmami and El Bachir Menai, 2022), dialogue summarization is particularly relevant to multi-party interactions, such as emails, meetings (Carletta et al., 2005), medical dialogues (Zeng et al., 2020), and daily chats (Chen et al., 2021). However, many existing methods for dialogue summarization require a large training dataset with annotated summaries, a major barrier to applying these methods in real-world scenarios where annotated data is limited or unavailable. Our study examines dialogue summarization in low-resource settings, which are prevalent in many real-world applications; we aim to make dialogue summarization more practical and easier to use in various contexts with minimal effort. Pre-trained Transformer-based (Vaswani et al., 2017) language models (Devlin et al., 2019; Radford et al., 2019) have become increasingly popular for tackling the data shortage problem in natural language processing, but many of these models have limitations when it comes to dialogue summarization. PEGASUS masks multiple whole sentences and pre-trains sequence-to-sequence models to reconstruct the original text. Building on that, Wan and Bansal (2022) improve the sentence selection strategy and add modules for ensuring factuality during fine-tuning. Phang et al. (2022) extend PEGASUS with a modified architecture and long-sequence pre-training to tackle long-input summarization. He et al. (2022) propose Z-Code++, a pre-trained language model optimized for abstractive summarization with an improved encoder.
However, all these methods rely on gap sentence selection, which is not optimal for dialogue summarization. In contrast, our approach uses pseudo-summary construction tailored to different types of dialogues as the pre-training objective. By pre-training our models on large-scale dialogue corpora, we make zero-shot dialogue summarization possible.
Another line of work focuses on pre-trained models for dialogues. DialoGPT and PLATO (Bao et al., 2020) are pre-trained on large-scale conversation datasets such as Reddit, allowing them to generate responses to user input. For dialogue summarization, Jia et al. (2022) post-train pre-trained language models to rephrase dialogues into narratives and then fine-tune them for summarization. In contrast, our approach follows the T5 model's unified text-to-text format in both pre-training and fine-tuning. Zhong et al. (2022) train UNILM with a window-based denoising framework for long dialogue understanding and summarization, but this approach does not focus on low-resource settings. Zou et al. (2021) propose a pre-training paradigm that pre-trains the encoder and decoder separately in a supervised manner. Our method instead uses a self-supervised pre-training approach that applies to any dialogue dataset, making it easy to extend to larger pre-training corpora for further improvement.

Conclusion and Future Work
We present DIONYSUS, a pre-trained encoder-decoder model for zero-shot dialogue summarization in any new domain. We pre-train DIONYSUS using a self-supervised approach that generates pseudo summaries for large dialogue corpora as the pre-training objective, and we investigate the impact of various pre-training objective strategies and model sizes on dialogue summarization performance. Our experiments demonstrate that DIONYSUS outperforms state-of-the-art models on six datasets in a zero-shot setting. Furthermore, DIONYSUS can be fine-tuned with only 10 examples to outperform vanilla T5 fine-tuning with 1,000 examples. Our work makes dialogue summarization more practical and easier to use in various contexts with minimal effort. We plan to extend this method to abstractive summarization tasks to develop a general zero-shot summarization model.

Table 7 presents the results of DIONYSUS BASE in a zero-shot setting, and Figure 6 compares the few-shot results of DIONYSUS BASE with those of the T5 BASE model. These initial results demonstrate the potential for further analysis and optimization of DIONYSUS. Compared with the other baselines, DIONYSUS performs better under both zero-shot and few-shot conditions, outperforming the GSG* model. These results provide valuable insight into the capabilities of DIONYSUS and can inform the development of larger models.

B Pre-training Steps
To evaluate the performance of DIONYSUS during pre-training, we measure the ROUGE1-F1, ROUGE2-F1, ROUGEL-F1, and ROUGELSum-F1 scores on the SAMSum dataset in Figure 7. We log the model's performance every 1,000 training steps, which allows us to monitor its improvement over time and confirm that it is learning effectively.

Table 7: ROUGE-1/ROUGE-2/ROUGE-L scores of DIONYSUS BASE with different strategies, compared to T5v1.1 BASE, in a zero-shot setting on three datasets: SAMSum, ConvoSumm, and TweetSumm.