Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining

With the rapid increase in the volume of dialogue data from daily life, there is a growing demand for dialogue summarization. Unfortunately, training a large summarization model is generally infeasible due to the inadequacy of dialogue data with annotated summaries. Most existing works for low-resource dialogue summarization directly pretrain models in other domains, e.g., the news domain, but they generally neglect the huge difference between dialogues and conventional articles. To bridge the gap between out-of-domain pretraining and in-domain fine-tuning, in this work, we propose a multi-source pretraining paradigm to better leverage the external summary data. Specifically, we exploit large-scale in-domain non-summary data to separately pretrain the dialogue encoder and the summary decoder. The combined encoder-decoder model is then pretrained on the out-of-domain summary data using adversarial critics, aiming to facilitate domain-agnostic summarization. The experimental results on two public datasets show that with only limited training data, our approach achieves competitive performance and generalizes well in different dialogue scenarios.


Introduction
With the explosion in the quantity of dialogue data from the Internet and daily life, there is growing interest in automatic dialogue summarization for various scenarios and applications, such as email threads, meetings, customer service, and online chats (Murray and Carenini, 2008; Shang et al., 2018; Zou et al., 2021a,b). Unfortunately, creating large-scale dialogue datasets with annotated summaries is costly and labor-intensive, which makes it difficult to build and train large summarization models using adequate supervision signals, especially in a new domain. Hence, it is necessary to develop models for dialogue summarization in low-resource settings, where only limited or even no training examples are available.
Recently, domain adaptation approaches with large-scale pretraining have attracted much attention in low-resource summarization (Wang et al., 2019; Yang et al., 2020; Zhang et al., 2020). A similar strategy is used in dialogues, whereby external summary data from other domains, e.g., the CNN/DailyMail news dataset (Hermann et al., 2015), are introduced for model pretraining prior to the final fine-tuning on low-resource dialogue summaries. Recent works (Gliwa et al., 2019; Joshi et al., 2020) have also reported the effectiveness of pretrained summarizers for different kinds of dialogue scenarios, such as chat logs and medical conversations.
However, dialogue summary data differs from conventional articles in several inherent and significant ways, both in text style and in summary structure. (i) Dialogues generally contain multiple participants who have distinct characteristics. (ii) Rather than the formal expressions found in news documents, dialogues often comprise utterances with informal or ungrammatical phrases. (iii) The structure of a dialogue summary, including its length and level of abstraction, is quite different from that in other domains, e.g., CNN/DailyMail. Thus, considering the huge difference between dialogues and general documents, directly fine-tuning a model pretrained on other domains on dialogue summaries is not ideal.
To better leverage summary data from domains such as news or scientific articles, in this work, we introduce a novel pretraining paradigm called domain-agnostic multi-source pretraining (DAMS) to summarize dialogues in a low-resource setting. We postulate that the pretraining of dialogue summarization can be decomposed into three procedures: the pretraining of the encoder, the decoder, and the combined encoder-decoder model. Specifically, the dialogue encoder is pretrained on large-scale unannotated dialogues to learn dialogue modeling and understanding. The summary decoder is pretrained on large-scale summary-like short texts to learn a language model in the style of dialogue summaries. Furthermore, the encoder and decoder are combined and pretrained on external summary data to go through an integral process of summarization. The above pretraining processes from the three sources are performed simultaneously. By this means, DAMS exploits large-scale non-summary data in the same domain to narrow the gap between pretraining and fine-tuning. Additionally, adversarial critics are used to capture the features shared between dialogues and general documents, and to learn to perform domain-agnostic summarization.
We conducted experiments on two public dialogue summary datasets, namely SAMSum (Gliwa et al., 2019) and ADSC (Misra et al., 2015). Pretraining was conducted on datasets from multiple sources, including dialogue corpora, daily-life short text corpora, and text summarization datasets from the news domain. The experimental results show that with only limited training data of dialogue summaries, our approach achieved competitive performance and showed a promising ability to generalize to different dialogue scenarios. Our code and datasets are publicly available (see footnote 1).
In summary, our contributions are three-fold: 1) We explore the task of dialogue summarization in a low-resource setting using external multi-source corpora. 2) We design a novel pretraining strategy that bridges the gap between out-of-domain pretraining and in-domain fine-tuning for domain-agnostic summarization. 3) Comprehensive studies on two datasets show the effectiveness of our method in various aspects.

Dialogue Summarization
Dialogue summarization is a challenging and valuable task that has received much attention in recent years. Different from studies on conventional documents like news or reviews (See et al., 2017; Narayan et al., 2018; Chu and Liu, 2019), dialogue summarization is investigated in multi-party interactions such as mail threads (Rambow et al., 2004), meetings (Gillick et al., 2009; Shang et al., 2018; Zhong et al., 2021), telephone conversation records (Zechner, 2001; Gurevych and Strube, 2004), and daily chats (Gliwa et al., 2019). Most of these approaches share a similar prerequisite: a decent labeled training dataset with annotated summaries. Nevertheless, creating a large-scale dialogue summary dataset is very expensive and labor-intensive, which makes the traditional methods hard to apply in real-world applications, especially when only limited or even no training signals are available. In this work, we explore dialogue summarization in a low-resource setting, and leverage external large-scale corpora to facilitate the task, which is applicable to most dialogue scenarios.

1 https://github.com/RowitZou/DAMS

Domain Adaptation for Summarization
Since texts and their summaries across diverse domains might share similarities and benefit from each other, domain adaptation for text summarization has attracted much research interest recently (Hua and Wang, 2017; Wang et al., 2019; Zhang et al., 2020; Yang et al., 2020). Most existing works perform pretraining on large-scale out-of-domain datasets and then adapt to the in-domain summary data. For dialogue summarization, although it would be more ideal to perform adaptation from a source dialogue domain to a target dialogue domain (Sandu et al., 2010; Wang and Cardie, 2013), the inadequacy of dialogue summary data unfortunately makes it infeasible to directly train a large summarization model on the source data in an end-to-end manner. Recently, a couple of works have leveraged large-scale summary data that is more distinct from the dialogue domain, e.g., the news domain, to facilitate dialogue summarization (Gliwa et al., 2019; Joshi et al., 2020). However, the huge gap between dialogues and general articles has received little attention. Prior work has conducted pretraining on the news summary data and the dialogue non-summary data simultaneously, but the two different tasks share a single decoder, which might confuse the model about the knowledge that it learns. To better leverage the out-of-domain summary data and the in-domain non-summary data, we explore domain-agnostic summarization. It is supported by a multi-source pretraining paradigm with adversarial learning, where the encoder and the decoder are separately pretrained on the in-domain non-summary data and jointly pretrained on the out-of-domain summary data, aiming to narrow the gap between pretraining and fine-tuning.

Figure 1: The overall architecture of DAMS. The multi-source pretraining includes: (i) encoder pretraining using dialogues (green); (ii) decoder pretraining using short texts (yellow); (iii) joint pretraining using general articles with corresponding summaries (orange).

Methodology
In this section, we detail low-resource dialogue summarization under domain-agnostic multi-source pretraining (DAMS). It consists of three pretraining objectives: two reconstruction losses with denoising auto-encoders (DAEs) that learn dialogue modeling and summary-like text generation, and a sequence-to-sequence (seq2seq) training objective with the combined dialogue encoder and summary decoder that learns abstractive summarization. Additionally, two adversarial critics are attached to the encoder's output representations and the decoder's input representations, learning to perform domain-agnostic summarization. The overall framework is illustrated in Figure 1.

Multi-Source Pretraining
Despite the considerable amount of summary data in other domains such as news and scientific articles, adaptation to the dialogue domain is not easy due to the huge difference between dialogues and conventional articles. To address this issue, we postulate that abstractive dialogue summarization can be decomposed into three procedures: (i) dialogue modeling, for understanding dialogue semantics and capturing dialogue characteristics; (ii) saliency estimation based on the learned representations, to identify the important parts of the input; (iii) summary generation, grounded on the salient information and following a certain style or structure. Although the limited dialogue summary data is inadequate to train the three procedures jointly, each of them can fortunately be handled well by separate pretraining with large-scale corpora from different sources. Specifically, dialogue modeling can benefit from large-scale unannotated dialogues. The external news summary data may contribute to saliency estimation. A language model trained on daily-life short texts can generate text in the style of dialogue summaries, rather than the formal expressions of news or scientific articles.
To incorporate multi-party information, we add the name of the speaker at the beginning of each utterance. Then, we tokenize utterances into word sequences, denoted as $u_i = \{w_{i1}, \ldots, w_{im}\}$, where $w_{ij}$ is the $j$-th word in the sequence of $u_i$. For noise addition, we randomly mask 15% of the tokens in each utterance with a special [MASK] token, similar to BERT (Devlin et al., 2019). The purpose of noise addition is to encourage the denoising auto-encoder (DAE) to reconstruct the original utterances for robust representation learning.
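As an illustration, the speaker prefixing and noise addition described above can be sketched in Python as follows. This is a minimal sketch, not the released implementation: the whitespace tokenizer, the `speaker :` prefix format, and the helper names `add_speaker` and `mask_tokens` are assumptions made for illustration, and each token is masked independently with probability 0.15 rather than masking an exact 15% share.

```python
import random

MASK = "[MASK]"


def add_speaker(speaker, utterance):
    """Prefix the utterance with its speaker name and split it into tokens.

    Whitespace tokenization is an assumption; a BERT-based encoder would
    normally use its own subword tokenizer.
    """
    return [speaker, ":"] + utterance.split()


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with [MASK] independently with probability mask_prob."""
    rng = rng or random.Random()
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


# Build one noisy utterance; the clean tokens serve as the reconstruction target.
clean = add_speaker("Val", "it's raining!")
noisy = mask_tokens(clean, rng=random.Random(0))
```

The pair (`noisy`, `clean`) then plays the role of input and gold reference for utterance reconstruction.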
In this work, we employ the Transformer with multi-head attention (Vaswani et al., 2017) as the basic encoder and decoder of the DAE. Before inputting word sequences into the encoder, we prepend a special token [CLS] to each sequence, similar to BERT. The final hidden state corresponding to this token is used as the aggregate sequence representation for utterance reconstruction. Formally, we transform the modified noisy sequence $\tilde{u}_i = \{w_i^{cls}, w_{i1}, \ldots, w_{im}\}$ into a sequence of hidden vectors with the Transformer encoder $\mathrm{TF}_{\theta_e^d}$ (Eq. 1), where $e_{ij}$ is the embedding of the $j$-th word $w_{ij}$ in the word sequence, while $w_i^{cls}$ and $e_i^{cls}$ represent [CLS] and its embedding.
The decoder is an auto-regressive model that recovers the original utterance conditioned on the input representation $h_i^{cls}$. Here, we use a Transformer decoder with masked attention that is conditioned by adding $h_i^{cls}$ to each input embedding. This is a Transformer variant that removes the decoder-encoder attention layer. Formally, the generation probability is defined in Eq. 2, where $\hat{e}_{ij}$ denotes the embedding of the predicted word $\hat{w}_{ij}$ at decoding step $j$. Notably, the decoder uses the utterance representation $h_i^{cls}$ as memory instead of word-level attention or a copy mechanism, which encourages all semantics to be captured in $h_i^{cls}$. In Section 3.4, we further discuss why we do not choose word-level cross-attention. Finally, we use the original utterance $u_i$ as a gold reference to train the DAE for utterance reconstruction on large-scale dialogue corpora (Eq. 3), paving the way to dialogue modeling for the downstream summarization task.
Pretraining of summary language modeling. We use a strategy similar to that of dialogue pretraining to learn a summary language model. Here, we introduce external corpora that contain daily-life short texts or stories, e.g., BookCorpus (Zhu et al., 2015), to train the decoder to generate texts in the style of dialogue summaries. We truncate long documents into text pieces to form training samples, each of which includes several consecutive sentences. We also add noise to these text pieces and train a DAE to recover them. Specifically, given the sentence sequence of a training sample $S = \{s_1, s_2, \ldots, s_n\}$, we use the same noise addition strategy as for dialogues to construct noisy sentences, and encode them into hidden vectors by a Transformer encoder $\mathrm{TF}_{\theta_e^s}$ similar to Eq. 1.
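The conditioning used by the utterance decoder, i.e., adding $h_i^{cls}$ to each decoder input embedding, can be illustrated with a small sketch. Vectors are plain Python lists here and `condition_inputs` is a hypothetical helper introduced only for illustration; an actual model would apply the same element-wise addition to tensors inside the Transformer decoder.

```python
def condition_inputs(token_embeddings, h_cls):
    """Add the utterance representation h_cls to every decoder input embedding."""
    for emb in token_embeddings:
        if len(emb) != len(h_cls):
            raise ValueError("embedding and h_cls must have the same dimension")
    return [[e + h for e, h in zip(emb, h_cls)] for emb in token_embeddings]


# Two decoder input embeddings of dimension 3, conditioned on one utterance vector.
embeddings = [[1.0, 2.0, 3.0], [0.0, 0.5, 1.0]]
h_cls = [1.0, 1.0, 1.0]
conditioned = condition_inputs(embeddings, h_cls)
```

Because the decoder sees the utterance only through this single added vector, all information needed for reconstruction has to be packed into $h_i^{cls}$.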
The generation process, however, is different from that of utterance reconstruction. Since a summary might contain more than one sentence, we should encourage the decoder to generate all sentences of $S$ sequentially to simulate the process of summary generation. Hence, to further capture the global semantic dependency between sentences, we use another Transformer encoder to hierarchically fuse context information (Eq. 4). Here, all sentence representations derived from [CLS] tokens are fed into the hierarchical encoder for information interaction. The output vectors are then used as memories for decoder-encoder attention in a classic Transformer decoder to recover $S$. The generation probability is given in Eq. 5, where $\tilde{S}$ represents the noisy text piece and $\hat{w}_k$, $\hat{e}_k$ denote the $k$-th predicted word and its embedding. The difference between Eq. 2 and Eq. 5 is that the former reconstructs a single utterance, while the latter predicts the entire text sample. Finally, we train the language model conditioned on the noisy input $\tilde{S}$ (Eq. 6).
Pretraining of abstractive summarization. In order to pretrain end-to-end summary generation, we bridge the dialogue encoder $\mathrm{TF}_{\theta_e^d}$ with the summary decoder $\mathrm{TF}_{\theta_g^s}$ using a context encoder $\mathrm{TF}_{\theta_h^b}$ that has the same architecture as in Eq. 4. Then, we input the sentences of a document into $\mathrm{TF}_{\theta_e^d}$ and get a predicted summary from $\mathrm{TF}_{\theta_g^s}$, training the model with the objective in Eq. 7, where $D_s$ is the document and $\hat{w}_k$ represents the $k$-th word in the predicted summary. Here, we reuse $\mathrm{TF}_{\theta_e^d}$ and $\mathrm{TF}_{\theta_g^s}$ for abstractive summarization in order to bridge the gap between separate pretraining on multi-source texts and joint fine-tuning on dialogue summaries. Through an integral process of text summarization, the combined encoder-decoder model learns to capture salient information from sentence (or utterance) representations and generate summaries accordingly.

Domain-Agnostic Summarization with Adversarial Learning
Ideally, the DAE learns high-level latent content conveyed in the representations, disentangled from original attributes such as the style of informal dialogue utterances or formal news sentences, so that saliency estimation and summary generation can be adapted to the dialogue domain. However, models often learn domain-specific features, making it difficult to generalize to a new domain. To address this issue, inspired by recent works on adversarial summary generation (Liu et al., 2018; Rekabdar et al., 2019), we add an adversarial discriminator (critic) that learns to identify the domain of each representation. We use a gradient reversal mechanism (Ganin and Lempitsky, 2015) to make the feature distributions over different domains similar, i.e., as indistinguishable as possible for the discriminator. This yields domain-invariant features and encourages the summarizer to focus only on content rather than domain-specific attributes.
Here, we add two adversarial critics $D_e$ and $D_g$ on the output vectors of $\mathrm{TF}_{\theta_e^d}$ and the input vectors of $\mathrm{TF}_{\theta_g^s}$, respectively (see Figure 1). The former classifies output vectors as dialogue utterances or news sentences, and the latter tries to distinguish news articles from short texts. Each adversarial critic is a simple binary classifier consisting of a multilayer perceptron with a sigmoid activation, trained with a logistic loss, denoted as $\mathcal{L}_{D_e}$ and $\mathcal{L}_{D_g}$ for $D_e$ and $D_g$, respectively. Finally, we combine all pretraining losses and adversarial signals to jointly train the model, with a hyper-parameter $\alpha$ that adjusts the loss proportion.
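To make the interaction between a critic and gradient reversal concrete, the following sketch computes the logistic loss of a critic on one representation and the reversed gradient passed back to the encoder. It rests on several simplifying assumptions that are not taken from the paper: the critic is a single linear layer instead of a multilayer perceptron, gradients are derived by hand instead of by an autodiff framework, and the reversal is scaled by `alpha`, which we assume plays the role of the loss coefficient. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def critic_loss_and_grads(features, label, weights, bias):
    """Logistic loss of a linear critic and its gradients.

    label is 1 for one domain (e.g., dialogue) and 0 for the other (e.g., news).
    Returns (loss, gradient w.r.t. weights, gradient w.r.t. features).
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # clamp to keep log() finite
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    dz = p - label  # derivative of the logistic loss w.r.t. the logit z
    grad_weights = [dz * x for x in features]
    grad_features = [dz * w for w in weights]
    return loss, grad_weights, grad_features


def reverse_gradient(grad_features, alpha):
    """Gradient reversal: the encoder receives -alpha times the critic's gradient."""
    return [-alpha * g for g in grad_features]


loss, grad_w, grad_f = critic_loss_and_grads([1.0, -2.0], 1, [0.5, 0.25], 0.0)
encoder_grad = reverse_gradient(grad_f, alpha=0.1)
```

The critic descends along `grad_w` to classify domains better, while the encoder is updated with `encoder_grad`, which pushes its representations toward being harder to classify.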

Fine-tuning on Dialogue Summaries
After multi-source pretraining, we further stack $\mathrm{TF}_{\theta_e^d}$, $\mathrm{TF}_{\theta_h^b}$, and $\mathrm{TF}_{\theta_g^s}$ for joint fine-tuning on the dialogue summary dataset. The learning objective is similar to Eq. 7. Notably, the three modules have already been trained on appropriate data from multiple sources, which leads to faster convergence on the target dialogue summaries (see details in Section 5.3) and requires fewer training examples to achieve competitive performance.

Discussion of the Encoder-Decoder Connection Strategy
Encoder-decoder cross attention for encoding the context information is widely used in Transformer-based architectures. Large-scale pretrained models for the summarization task, e.g., BART (Lewis et al., 2020), generally exploit token-level attention to integrate the document context. In this work, we tried keeping the traditional token-level cross attention in the proposed architecture to directly connect the dialogue encoder and the summary decoder. However, we found that it is then difficult to disentangle the encoder and the decoder for separate pretraining. It is also hard to add adversarial critics to the token-level representations involved in the cross attention to learn domain-invariant features. Considering these limitations, we use an embedding concatenation strategy in the dialogue decoder $\mathrm{TF}_{\theta_g^d}$ as a DAE to learn utterance representations. The summary decoder $\mathrm{TF}_{\theta_g^s}$ still has cross attention, but its keys and values are sentence representations from the context encoder $\mathrm{TF}_{\theta_h^b}$ instead of token representations from the dialogue encoder $\mathrm{TF}_{\theta_e^d}$. Here, $\mathrm{TF}_{\theta_h^b}$ bridges the dialogue encoder and the summary decoder. It not only captures the context information of sentences (utterances), but also derives sentence-level representations that are suitable for domain identification in adversarial learning. Nevertheless, abandoning token-level attention inevitably affects fine-grained information integration. We leave the question of how to keep token-level cross attention in DAMS as future work.

Datasets
Following recent work (Feng et al., 2020), we evaluate our method on two public dialogue summary datasets, SAMSum (Gliwa et al., 2019) and ADSC (Misra et al., 2015; see footnote 3). Statistics of the dialogue datasets are shown in Table 1. SAMSum originally contains 14k training examples. To simulate a low-resource scenario, we start from the full training data and gradually reduce the number of training examples by halving the training set. For multi-source pretraining, we use the following datasets.
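The halving procedure used to simulate low-resource settings can be sketched as follows. This is only an illustration under assumptions: the helper `halving_subsets`, the fixed shuffling seed, and the choice to make each subset a prefix of the previous one (so that smaller sets are nested in larger ones) are ours and are not specified in the text.

```python
import random


def halving_subsets(examples, min_size, seed=0):
    """Return nested training subsets: the full set, then repeated halves.

    Halving stops once a subset would contain fewer than min_size examples.
    """
    pool = list(examples)
    random.Random(seed).shuffle(pool)  # fixed seed keeps the subsets reproducible
    subsets = []
    while pool and len(pool) >= min_size:
        subsets.append(pool)
        pool = pool[: len(pool) // 2]  # each subset is a prefix of the previous one
    return subsets


# Hypothetical example: 16 items stand in for the SAMSum training set.
sizes = [len(s) for s in halving_subsets(range(16), min_size=2)]
```

Nesting the subsets keeps the comparison across training-set sizes controlled, since every smaller run sees only examples that the larger runs also see.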
Dialogues. We use the Reddit Conversation Corpus (Dziri et al., 2019; see footnote 4) for the pretraining of dialogue modeling. It contains about 15M context-response pairs for training, where each dialogue context consists of 3.5 utterances on average.
Short Texts. We choose MSCOCO (Lin et al., 2014) and BookCorpus (Zhu et al., 2015) to pretrain the summary language model. MSCOCO is a standard benchmark dataset for the image caption generation task, which contains over 120K images and 600K captions describing the prominent object/action in an image. Here, we only use captions to train the generator. BookCorpus is a large-scale corpus containing 11,038 free books from the Internet. We randomly truncate long documents into text pieces as training samples (see footnote 5). Each sample contains 1.5 sentences on average, and we collect about 5M samples for training.
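One possible way to truncate a long document into short text pieces of consecutive sentences is sketched below. The exact truncation rule is not given in the text, so the uniform choice between one and `max_len` sentences per piece is an assumption; with `max_len=2` it yields about 1.5 sentences per piece on average, in line with the reported statistic. The helper name `split_into_pieces` is hypothetical, and sentence splitting is assumed to have been done beforehand.

```python
import random


def split_into_pieces(sentences, max_len=2, rng=None):
    """Cut a list of sentences into consecutive pieces of 1..max_len sentences."""
    rng = rng or random.Random()
    pieces, start = [], 0
    while start < len(sentences):
        length = rng.randint(1, max_len)  # inclusive on both ends
        pieces.append(sentences[start:start + length])
        start += length
    return pieces


# Hypothetical document of ten already-split sentences.
document = ["sentence %d" % i for i in range(10)]
pieces = split_into_pieces(document, rng=random.Random(0))
```

Each piece keeps the original sentence order, so concatenating all pieces recovers the document.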
Summarization Corpus. CNN/DailyMail (Hermann et al., 2015), Gigaword (Rush et al., 2015), and NewsRoom (Grusky et al., 2018) are used as our external summary datasets for joint pretraining. All three datasets consist of news articles or headlines with summaries from various news publications. We combine these datasets, and the total training set consists of 5.6M samples.

Comparison Methods
For comparison, we select various baseline systems from the previous literature: the basic baseline Longest-3 (Gliwa et al., 2019), which selects the longest three utterances as a summary; classic seq2seq models, including Seq2Seq+Attention (Rush et al., 2015), Transformer (Vaswani et al., 2017), and PGNet (See et al., 2017); a pipeline method FastRL (Chen and Bansal, 2018) and its variant FastRL Enhanced (Gliwa et al., 2019), which first extracts salient sentences and then refines them; convolution-based methods LightConv (Wu et al., 2019) and DynamicConv (Wu et al., 2019); methods based on graph neural networks, including D-HGN (Feng et al., 2020) and TGDGA; and a seq2seq model BERT+TRF (Liu and Lapata, 2019) that is equipped with pretrained LMs.

3 Following Feng et al. (2020), we train the model using the SAMSum corpus and perform zero-shot testing on ADSC.
4 https://github.com/nouhadziri/THRED
5 Here, we use truncated sentence sequences in BookCorpus because we did not find other suitable corpora like MSCOCO. A real daily-life corpus with short-text summaries could be better for summary decoder pretraining.
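As a reference point, the Longest-3 baseline mentioned above can be sketched in a few lines. Measuring length in whitespace tokens and returning the selected utterances in their original dialogue order are assumptions of this sketch; the helper name `longest_k` is hypothetical.

```python
def longest_k(utterances, k=3):
    """Select the k longest utterances (by token count), kept in dialogue order."""
    ranked = sorted(
        range(len(utterances)),
        key=lambda i: len(utterances[i].split()),
        reverse=True,
    )
    chosen = sorted(ranked[:k])  # restore the original order of the dialogue
    return [utterances[i] for i in chosen]


# Toy dialogue with utterances of 3, 1, 2, 4, and 5 tokens.
dialogue = ["a b c", "a", "a b", "a b c d", "a b c d e"]
summary = longest_k(dialogue)
```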

Implementation Details
At the pretraining stage, we mix the datasets from multiple sources and keep dialogues, short texts, and news summaries in a ratio of 1:1:1. The total number of data points is around 15M. Since DAMS consists of Transformer encoders and decoders, it can be easily combined with pretrained LMs. Here, we use BERT (Devlin et al., 2019) as the utterance/sentence encoder $\mathrm{TF}_{\theta_e^d}$ and use a separate optimization strategy (Liu and Lapata, 2019) to alleviate the mismatch between BERT and the other, randomly initialized parameters. We apply Adam (Kingma and Ba, 2015) ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with a learning rate of 1e-3 for BERT and 1e-2 for the other parameters. All Transformer blocks except BERT have 6 layers, 8 heads, and 768 hidden units, and the hidden size of all feed-forward layers is 2048. The loss coefficient $\alpha$ is selected from {0.01, 0.05, 0.1, 0.5} to control the adversarial signals, and we empirically find that $\alpha = 0.1$ achieves the best performance on the validation set. The model is pretrained for 250,000 steps with 10,000 warm-up steps on 2 GeForce RTX 3090 GPUs. At the fine-tuning stage, we use the last pretraining checkpoint for fine-tuning on the SAMSum dataset. We continue to train the model for 50,000 steps with 1,000 warm-up steps using Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate 1e-3). At inference time, summaries are decoded with a beam size of 3. The minimum summary length is set to 15 for SAMSum and 100 for ADSC. Checkpoints are saved and evaluated on the validation set every 2,000 steps. The best checkpoint trained on SAMSum is directly evaluated on ADSC to perform zero-shot testing.
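The 1:1:1 mixing of the three pretraining sources can be illustrated by drawing the same number of examples from each source and shuffling them together. This is a sketch under assumptions: sampling without replacement, tagging each example with its source name so that the matching objective can be applied, and the helper name `mix_sources` are illustrative choices that the text does not specify.

```python
import random


def mix_sources(sources, per_source, seed=0):
    """Sample per_source examples from every source and shuffle them together.

    sources maps a source name to a collection of examples; each returned item
    is a (source_name, example) pair.
    """
    rng = random.Random(seed)
    mixed = []
    for name, examples in sources.items():
        picked = rng.sample(list(examples), per_source)  # without replacement
        mixed.extend((name, example) for example in picked)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the dialogue, short-text, and news summary corpora.
toy_sources = {"dialogue": range(10), "short_text": range(20), "news": range(30)}
batch_pool = mix_sources(toy_sources, per_source=5)
```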

Results and Analysis
In this section, we show the main results of DAMS against other baselines for dialogue summarization, and probe the effectiveness of DAMS by explanatory experiments in various aspects.
Table 3: Results of zero-shot testing on ADSC. Models marked with * use external news summary data.
Table 2 and Table 3 show the results of automatic evaluation on the SAMSum and ADSC datasets. We evaluate summary quality using ROUGE F1 (Lin, 2004), including the unigram and bigram overlap (ROUGE-1, ROUGE-2) between system outputs and gold summaries, and the longest common subsequence (ROUGE-L). Some results are taken from the scores reported in the previous literature (Gliwa et al., 2019; Feng et al., 2020).
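For readers unfamiliar with the metrics, a simplified computation of ROUGE-N and ROUGE-L F1 is sketched below. It is not the official ROUGE toolkit: it uses lowercase-agnostic whitespace tokens as given, applies no stemming, and computes ROUGE-L over the whole summary as a single sequence, so its numbers are only indicative. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For example, a candidate that reproduces the first five of eight reference words has unigram precision 1 and recall 5/8, giving a ROUGE-1 F1 of 10/13.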

Automatic Evaluation
In Table 2, all baseline methods are categorized into two groups. The first group includes models that are directly trained on the SAMSum corpus, and methods in the second group benefit from external news summary data. DAMS with full training data outperforms all baseline methods and is significantly different from BERT+TRF (+news) with p < 0.05, which demonstrates the superiority of the multi-source pretraining strategy for dialogue summarization over the general exploitation of news summary data. Without news data, DAMS might be inferior to seq2seq models like PGNet or BERT+TRF, because these models use word-level attention or copy mechanisms, while DAMS focuses on sentence/utterance representations for domain-agnostic representation learning. We also observe that the inclusion of news summary data does not necessarily lead to a better ROUGE score (PGNet, FastRL). One possible explanation is that these models learn domain-specific features and have difficulty adapting to the dialogue domain. By contrast, with news summary data, the performance of DAMS improves substantially, which suggests that our method can successfully capture useful information from external corpora. Furthermore, we directly test models on the ADSC dataset to verify whether they can generalize well to a new scenario. From Table 3 we observe that DAMS performs best, indicating that our multi-source pretraining strategy provides well-pretrained parameters for downstream dialogue summarization, which makes the model easier to adapt to other dialogue scenarios.

Human Evaluation
Following Narayan et al. (2018), we randomly sample 100 examples from the test set of SAMSum for human evaluation. Three volunteers are invited to compare summaries produced by six systems (including the gold summary). Given a dialogue and two summaries from two of the six systems, each volunteer decides which summary is better on two dimensions: informativeness (which summary captures more important information?) and fluency (which summary is more fluent?). We collect judgments from three volunteers for each comparison to minimize inter-annotator noise. Table 4 shows the system ranking results. Each score is calculated as the percentage of times the system is selected as best minus the percentage of times it is chosen as worst, ranging from -1 (worst) to 1 (best). Gold unsurprisingly ranks best. For informativeness, volunteers show a stronger preference for DAMS. For fluency, models with pretraining (DAMS / BERT+TRF) produce more acceptable summaries. We carry out pairwise comparisons between systems (using a two-tailed binomial test; p < 0.05). In terms of informativeness, DAMS is significantly different from all other systems. For fluency, pretraining-based systems significantly differ from the other systems, and BERT+TRF is not significantly different from DAMS.
Table 5: Ablation study of adversarial learning and multi-source pretraining. $D_e$ and $D_g$ are the two critics. Dial., Short, and Summ. denote corpora of dialogues, short texts, and news summaries, respectively.
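The ranking score and the significance test used in the human evaluation can be sketched as follows. The score follows the definition in the text (share of times chosen as best minus share of times chosen as worst). For the two-tailed binomial test, we assume the null hypothesis that each of two systems wins a comparison with probability 0.5 and that ties are excluded beforehand; both function names are hypothetical.

```python
from math import comb


def best_worst_score(times_best, times_worst, total):
    """Share of comparisons won minus share lost; ranges from -1 to 1."""
    return (times_best - times_worst) / total


def binomial_two_tailed(wins, n):
    """Two-tailed p-value for `wins` successes in n fair coin flips."""
    k = max(wins, n - wins)  # use the more extreme side of the outcome
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: a system preferred in 9 of 10 comparisons against another.
p_value = binomial_two_tailed(9, 10)
```

With these hypothetical counts the p-value is 22/1024, which is below the 0.05 threshold used in the text.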

Analysis and Discussion
We also perform qualitative analysis and discuss the effect of multi-source pretraining and adversarial learning with the following experiments.
Ablation Study. Table 5 shows the results of DAMS with different settings of adversarial critics and multi-source pretraining. The system suffers a performance degradation without a critic, which indicates that a domain-invariant representation is beneficial for downstream dialogue summarization. When any of the external corpora is removed, the results drop substantially, which supports the effectiveness of multi-source pretraining.
Performance in Low-Resource Settings. To analyze model performance in low-resource settings, we gradually reduce the number of training examples in the SAMSum corpus by halving the training set. We report the results of DAMS and two baseline methods (Transformer and BERT+TRF) with different percentages of training data in Figure 2. We also report the performance of two variants of DAMS: one without pretraining and one pretrained only on the news summary data. Figure 2 shows a declining performance trend as the training data decreases. We observe that with only limited SAMSum training data (40% / 20%), DAMS still achieves competitive results, while BERT+TRF (+News) suffers from a serious performance degradation. This indicates that DAMS has a promising ability to adapt from news summaries to dialogue scenarios. Notably, using only 20% of the training data, DAMS achieves performance competitive with Transformer and DAMS (w/o Pre.) trained on the full training data, which demonstrates the effectiveness of exploiting external corpora. When the training set is cut to 5%, or even in a zero-shot setting, DAMS with multi-source pretraining outperforms all the other systems, including its variant DAMS (Pre. on News). This supports the view that our multi-source pretraining strategy is more applicable to dialogue summarization in a low-resource setting.
Convergence Rate. In Figure 3, we show the fine-tuning logs of different models on the SAMSum dataset. The left panel shows the perplexity and the right panel shows the average word accuracy. Unsurprisingly, models that benefit from pretraining have better initialized parameters, leading to faster convergence. Equipped with the multi-source pretraining strategy, DAMS performs better and even achieves a word accuracy of 40% at the beginning of fine-tuning.
Domain-Agnostic Representations. To verify that our adversarial strategy can learn domain-agnostic features, we visualize the latent space of representations in 2-D using t-SNE (Van der Maaten and Hinton, 2008), with and without the critic. In Figure 4(a), where there is no critic, the representations indeed form two separate clusters, while in Figure 4(b), hidden vectors trained with adversarial signals are effectively merged into one region, resulting in domain-agnostic representations. This encourages the summarizer to focus on content rather than domain-specific attributes for better generalization from other domains to the dialogue domain.
Case Study. Table 6 shows the system outputs for an example dialogue. Text in red marks salient information in the dialogue, which is reflected in the gold summary. From the table we can see that DAMS generates a summary that is more fluent and informative: it successfully captures critical information such as 'raining' and 'half an hour' and composes a coherent discourse.

Conclusion and Future Work
In this paper, we propose a domain-agnostic multi-source pretraining paradigm for low-resource dialogue summarization, which exploits external large-scale corpora from multiple sources to facilitate dialogue modeling, summary language modeling, and abstractive summarization. The pretraining is conducted with adversarial signals to learn domain-agnostic summarization. The experimental results verify the effectiveness and generalization ability of our method in low-resource settings. A future direction is to explore how to keep token-level cross attention in the multi-source pretraining strategy. In this way, we could adopt the strategy in models with universal Transformer architectures, e.g., BART, to benefit from large-scale pretrained language models.

Table 6: System outputs of a dialogue example from the SAMSum test set. Systems marked with * utilize external news summary data.

Dialogue
Val: it's raining!
Candy: I know, just started...
Val: r we going? we will be wet
Candy: maybe wait a little? see if stops
Val: ok. let's wait half h and than see
Candy: god idea, I call u then
Val: great :)

Gold
It's raining, so Val and Candy will wait half an hour before they go.

PGNet
Val is learning to meet Val and Val will see a little.

TRF
Val and Val don't have any news. Val will call him because they got lost.

DAMS (w/o Pre.)
Candy and Val are going to meet. Val will call Candy instead.

BERT*+TRF
Val and Candy are going for a little, but they need to wait half an hour.

DAMS*
Val and Candy are going to wait half an hour to see if it's raining.