Simple Conversational Data Augmentation for Semi-supervised Abstractive Dialogue Summarization

Abstractive conversation summarization has received growing attention while most current state-of-the-art summarization models heavily rely on human-annotated summaries. To reduce the dependence on labeled summaries, in this work, we present a simple yet effective set of Conversational Data Augmentation (CODA) methods for semi-supervised abstractive conversation summarization, such as random swapping/deletion to perturb the discourse relations inside conversations, dialogue-acts-guided insertion to interrupt the development of conversations, and conditional-generation-based substitution to substitute utterances with their paraphrases generated based on the conversation context. To further utilize unlabeled conversations, we combine CODA with two-stage noisy self-training where we first pre-train the summarization model on unlabeled conversations with pseudo summaries and then fine-tune it on labeled conversations. Experiments conducted on the recent conversation summarization datasets demonstrate the effectiveness of our methods over several state-of-the-art data augmentation baselines.


Introduction
Abstractive conversation summarization, which targets at processing, organizing and distilling human interaction activities into short, concise and natural text (Murray et al., 2006;Wang and Cardie, 2013), is one of the most challenging and interesting problems in text summarization. Recently, neural abstractive conversation summarization has received growing attention and achieved remarkable performances by adapting document summarization pre-trained models and (Gliwa et al., 2019;Yu et al., 2021) and incorporating structural information (Chen and Yang, 2020;Feng et al., 2020c;Zhu et al., 2020a;Liu et al., 2019b). However, most of these models usually require abundant human-annotated summaries to yield the state-of-the-art performances (Gliwa et al., 2019), making them hard to be applied into realworld applications (e.g. summarizing counseling sessions) that lack labeled summaries.
Data augmentation, which perturbs input data to create additional augmented data, has been utilized to alleviate the need of labeled data in various NLP tasks, and can be categorized into three major classes: (1) manipulating words and phrases at the token-level like designed word replacement (Kobayashi, 2018;Niu and Bansal, 2018), word deletion/swapping/insertion (Wei and Zou, 2019;Feng et al., 2020a), token/span cutoff (Shen et al., 2020b); (2) paraphrasing the entire input text at the sentence-level through round-trip translation (Sennrich et al., 2015;Xie et al., 2019; or syntactic manipulation (Iyyer et al., 2018;Chen et al., 2020c); and (3) adding adversarial perturbations to the original data which dramatically influences the model's predictions (Jia and Liang, 2017;Niu and Bansal, 2019;. Despite the huge success, the former two mainly perturbs sentences locally while ignoring the diverse structures and context information in dialogues to create high-quality augmented conversations for summarization. The third one might utilize context through additional backward passes, but often require significant amount of computational and memory overhead Zhu et al., 2019), especially for summarization tasks with long input.
To this end, we introduce simple and novel set of Conversational Data Augmentation (CODA) techniques for conversation summarization guided by conversation structures and context, including: (1) random swapping/deletion randomly swap or delete utterances in conversations to perturb the discourse relations, (2) dialogue-acts-guided insertion randomly insert utterances based on the Figure 1: Examples of utilizing different CODA strategies to augment the given conversation including (1) random Swapping/Deletion where last two utterances are swapped (top), (2) dialogue-acts-based Insertion where a backchannel utterance is inserted after the first utterance (middle), and (3) conditional-generation-based substitution where the first utterance is substituted with a model-generated one (bottom). dialogue acts like self-talk, repeating utterance and back-channel (Allen and Core, 1997;Sacks et al., 1978) to interrupt the conversations, and (3) conditional-generation-based substitution randomly substitute utterances in conversations based on pre-trained utterance generation models conditioned on the conversation context. Examples for operations in CODA are shown in Figure 1. To further enhance the performance when labeled summaries are limited, we extend CODA to semisupervised settings, Semi-CODA, where we combine CODA with two-stage noisy self-training (Xie et al., 2020;He et al., 2020) to utilize conversations without annotated summaries. Specifically, we repeat the process where we first generate pseudo summaries for unlabeled conversations with the base summarization model, then we pre-train a new model on pseudo data points and fine-tune the model on labeled conversations to form the updated summarization model. To sum up, our contributions are: • We propose simple yet effective data augmentation techniques for conversation summarization by considering the structures and context of conversations.
• We introduce a semi-supervised conversation summarization framework by combing CODA and two-stage noisy self-training.
• We demonstrate the effectiveness of our proposed methods through extensive experiments on two conversation summarization datasets , SAMSum (Gliwa et al., 2019) and ADSC (Misra et al., 2015).

Abstractive Conversation Summarization
Abstractive conversation summarization has received much attention recently. Other than directly apply document summarization models to conversational settings (Gliwa et al., 2019), models tailored for conversation are designed to achieve the state-of-the-art performances such as modeling conversations in a hierarchical way (Zhao et al., 2019;Zhu et al., 2020b). The rich structured information in conversations are also explored and leveraged such as dialogue acts (Goo and Chen, 2018), key point/entity sequences (Liu et al., 2019a;Narayan et al., 2021), topic segments (Liu et al., 2019c;, stage developments (Chen and Yang, 2020), discourse relations Feng et al., 2020b). External information like commonsense knowledge has also been incorporated to help understand the global conversation context as well (Feng et al., 2020c). However, current summarization models still heavily rely on abundant parallel data to achieve the state-of-the-art performances (Yu et al., 2021). Little work has focused on low-resourced settings where well-annotated summaries are limited or even unavailable. To fill this gap, in this work, we introduce a set of conversational data augmentation techniques to alleviate the dependence on labeled summaries.

Data Augmentation for NLP
Data augmentation is one of the most common approaches to mitigate the need for labeled data in various NLP tasks (Feng et al., 2021). The augmented data is usually generated by modify-ing existing data points through transformations while keeping the semantic meaning unaffected like designed word/synonym replacement (Kobayashi, 2018;Niu and Bansal, 2018;Kumar et al., 2020), word deletion/swapping/insertion (Wei and Zou, 2019), token/span cutoff (Shen et al., 2020b), and paraphrasing through round-trip translation (Sennrich et al., 2015;Xie et al., 2019;. Even though they could be directly applied to conversation summarization settings, these prior techniques mainly modify the text locally and largely ignore the structure and context information in conversations to generate more effective and diverse augmented conversations. To this end, our CODA augmentation will perturb the conversation structures and substitute paraphrases by taking into account the conversation context.

Semi-supervised Learning Methods
Semi-supervised learning methods can further reduce the dependency on labeled data and enhance the models by using large amounts of unlabeled data (Chapelle et al., 2009;Gururangan et al., 2019;. Unlabeled data is usually incorporated through consistency training (Xie et al., 2019;Chen et al., 2020b,a), co-training (Clark et al., 2018), variational auto encoders (Gururangan et al., 2019;Yang et al., 2017) or self-training (Scudder, 1965;Riloff and Wiebe, 2003;Xie et al., 2020). In this work, we focus on self-training, one of the most classic "pseudo-label" semi-supervised learning approaches (Yarowsky, 1995;Riloff and Wiebe, 2003). Self-training often iteratively incorporates unlabeled data by learning student models from pseudo labels assigned by teacher models. The teacher model could be the model trained on labeled data or the model from last iteration (Zhu and Goldberg, 2009). Recent work showed that combining self-training with better noise/augmentation techniques to perturb the input space greatly improve the performances on classification tasks (Rasmus et al., 2015;Laine and Aila, 2017;Miyato et al., 2019;Xie et al., 2020). However, their impact on language generation tasks like summarization is largely underexplored because, unlike classification tasks, the the pseudo summaries might be quite complicated and very different from human-annotated labels (He et al., 2020). Inspired by these previous selftraining work, we will combine our CODA with the two-stage noisy self-training framework (He et al., 2020) for semi-supervised abstractive conversation summarization.

Methods on Semi-Supervised CODA
In order to generate more diverse and effective augmented data for conversation summarization and alleviate the reliance on human annotations, we propose a set of simple Conversational Data Augmentation (CODA) to perturb conversations based on the conversation structures and global context (Section 3.1). We further introduce Semi-CODA under the self-training framework to utilize unlabeled conversations for semi-supervised conversation summarization (Section 3.2).

CODA
For a given conversation c = {u 0 , ..., u n } with n utterances, CODA random performs one of the conversational perturbations described below to generate augmented conversation c while preserving the semantic information of the global conversation.
Random Swapping or Deletion Utterances from different speakers in conversations usually follow Gricean Maxims (Dale and Reiter, 1995) to achieve effective communication in social situations, which requires utterances to be related to each other orderly under the context of discourse (Murray et al., 2006;Qin et al., 2017). From the perspective of perturbing discourse relations to create augmented conversations (Gui et al., 2021), we introduce two simple operations to perturb the discourse relations: (1) random swapping, which breaks the discourse relations by randomly swapping two utterances in one conversation to messes up the logic chain of utterance, and (2) random deletion, which goes against the discourse requirement by randomly deleting K r = α d ·n utterances to provide less information in the conversations, where n is the number of utterances in conversations and α d is a hyper-parameter to control the strength of the deleting perturbation, as shown in Figure 1. In practice, for one conversation c, we combine these two strategies by randomly choosing one of them to generate the augmented conversation c .
Dialogue Acts Guided Insertion Unlike structured documents, conversations have unique characteristics of interruptions (Allen and Core, 1997) such as repetitions, false-starts, reconfirmations, hesitations and backchanneling (Sacks et al., 1978), making it challenging for summarization models to reason over conversations and extract key information. Inspired by these observations, we introduce a novel dialogue-acts-guided insertion to interrupt the conversations to generate augmented data. Specifically, for a given conversation c, we randomly insert repeated utterances or insert utterances whose dialogue acts (Jurafsky and Shriberg, 1997) are interruptions including acknowledge/backchannel (e.g., "Uh-huh"), response acknowledgement (e.g., "Oh, okay"), backchannel in question form (e.g., "Is that right?"), self-talk (e.g., "What is the thing I am thinking of"), or hedge (e.g., "I don't know if I'm making any sense or not.") to generate augmented c , as shown in Figure 1 where an utterance with backchannel act "Uh-huh!" is inserted.
For inserting repeated utterances, we randomly select K r = α r · n utterances from the input conversation and directly insert them back. For other types of dialogue-acts-guided insertion, we randomly insert K d = α d · n utterances sampled from a pre-defined utterance set to random positions in the input conversation. The pre-defined set consists of 8,000 utterances: (1) utterances with desired dialogue acts from a human annotated Switchboard corpus (Jurafsky and Shriberg, 1997), and (2) utterances with desired dialogue acts from high confidence predictions using a state-of-the-art dialogue acts classifier (Raheja and Tetreault, 2019) (with 82.9% accuracy on Switchboard corpus) on SAM-Sum corpus (Gliwa et al., 2019).

Conditional Generation based Substitution
Paraphrasing has been effective as data augmentation on sentence-level tasks like sentence classification (Xie et al., 2019; as it could generate sentences with similar semantic meaning but with different word choices. However, when it comes to utterances in a dialogue, simple paraphrasing techniques like round-trip translation (Sennrich et al., 2015) might not be able to capture the context information in conversation, leading to limited diversity and low quality in its augmented utterances. To this end, we propose a conditional generation based method to generate new utterances and substitute the original utterances. , with its architecture shown in Figure 2.
We first pre-train the conditional generation model g(.; θ) which could generate an utterance u i with a masked conversation c mask = {u 0 , ..., u i−1 , u mask i , u i+1 , ..., u n } and a prompt p i as input. Specifically, during the pre-training stage, utterance u i ∈ c is randomly sampled and substituted with <MASK>. The unique tokens in u i are then randomly shuffled to form the prompt p i . We initialize the generation model g(.; θ) with BARTbase (Lewis et al., 2020), and prepend the prompt p i to the masked conversation c mask as input. The pre-training objective is: During the augmentation stage, for a random utterance u i in c, we construct the c mask and p i in the same way as the pre-training stage. We employ the random sampling strategy with a tunable temperature τ to generate u i and construct the augmented conversation c by substituting u i with u i in c; τ is a hyper-parameter to control the diversities (higher temperature would result in more diverse generations while injecting more noise). In practice, we randomly substitute K g = α g n utterances in c with generated utterances from g(.; θ).

CODA for Conversation Summarization
When training conversation summarization models f (.; θ), for any input conversation c with summary  Fine-tune f (.) on C l with CODA 6: end for s in the training set C, we randomly choose and perform one of the above augmentations to generate c in each epoch. The objective is: Note that our introduced CODA augmentation techniques can also be combined and performed in a sequential manner. CODA is agnostic to any conversation summarization models. In this work, we utilize the state-of-the-art summarization model, BART (Lewis et al., 2020), as our base model.

Semi-supervised CODA
To further improve the performance of learning with limited annotated conversations, we combine CODA with two-stage noisy self-training framework (Xie et al., 2020;He et al., 2020) for utilizing unlabeled conversations. The semi-supervised CODA algorithm is shown in Algorithm 1.
Specifically, for a parallel conversation dataset C l = {(c l i , s l i )} i=1:n where c l i is the conversation and s l i is the annotated summary, and a large unlabeled dataset C u = {(c u i )} i=1:m , where m >> n. In semi-CODA, a teacher conversation summarization model f (.; θ * ) is first trained on C l where CODA perturbations are utilized to inject noise. Then semi-CODA iteratively (1) apply the teacher model f (.; θ * ) to predict pseudo summaries on unlabeled conversations C u without any noise injected, (2) pre-train a new summarization model f (.; θ) on C u with CODA being applied, (3) finetune f (.; θ) on labeled data C l with CODA being applied and update the teacher model f (.; θ * ). The objective function of semi-CODA for annotated conversation is the same as Equation 2, while the objective function for unlabeled conversation is: Here, θ * is the parameter from the teacher model (from last iteration) and fixed within the current iteration. In practice, after step (1) in semi-CODA, we apply BERT-score (Zhang* et al., 2020) to calculate the semantic relevance between generated summaries and the unlabeled conversation, and select a subset of C u with the BERT-score higher than a threshold T for the following steps.

Datasets
To demonstrate the effectiveness of our CODA methods on a human-annotated dialogue dataset, we chose SAMSum (Gliwa et al., 2019) that contains open-domain daily-chat conversations such as arranging meetings, planning travels and chitchat. We use the original validation and test set as our validation and test set. To construct a lowresourced setting, we randomly selected 1% (147) and 5% (735) conversations in the original training set as our training set, and 50% conversations (7366) as unlabeled conversation. We also evaluated the generalizability of our methods on Argumentative Dialogue Summary Corpus (ADSC) (Misra et al., 2015) about summarizing debates. The data statics are shown in the  pre-processing, we separate every utterance in conversations with a special separator ("</s><s>") and truncate the input conversation into 800 tokens.

Baselines
We compared CODA with several state-of-the-art augmentation techniques and baselines: • BART (Lewis et al., 2020) is the state-of-theart pre-trained models for summarization. We used BART-base 1 as our base model for all the methods. We also tested AdaptSum (Yu et al., 2021) by initializing the summarization model with BART-base pre-trained on XSUM (Narayan et al., 2018) summarization task.
• Token Cutoff (Wei and Zou, 2019; Shen et al., 2020a) randomly removes tokens from the input to create perturbed conversation.
• Span Cutoff (Shen et al., 2020a) randomly eases a contiguous span of text in conversations to lead to harder perturbed conversation.

Model Settings
For the dialogue acts classifier, we directly followed the settings in Raheja and Tetreault (2019) and applied the trained classifier to predict dialogue acts of utterances in SAMSum corpus. We initialized our conditional generation model with BARTbase (Lewis et al., 2020) and trained the model on SAMSum corpus. During augmentation, the sampling temperature is 0.7. α in CODA was selected from {0.1, 0.2, 0.3, 0.5}. We utilized RoBERTalarge 3 to initialize the BERT-score (rescale with baseline) (Zhang* et al., 2020) and set the filtering threshold T = 0.25. The maximum iteration for semi-CODA was set 5. For all the methods, we used BART-base to initialize the conversation summarization model. During training, we used a batch size of 12 for 10 iterations with a 3e-5 learning rate. We used Adam optimizer with momentum    β 1 = 0.9, β 2 = 0.998. During the decoding stage, we used beam search with a beam size of 4.

Results
Using Limited Labeled Summaries We varied the number of conversations with summaries for training in both fully-supervised and semisupervised settings. The ROUGE scores using the rouge package 4 , were shown in Table 2 (1% (147) labeled data was used) and Table 3 (5% (735) labeled data was used). Compared to BART-base by pre-training on a news summarization corpus XSUM (Narayan et al., 2018), AdaptSum (Yu et al., 2021) shows similar performances, probably due to the large differences between news and daily chats. When applying Cutoff based augmentations or Round-trip Translations to generate new conversations, performances boosted compared to BART-base as more data was used in the training. Through perturbing conversation structures to generate harder conversations via randomly swapping/deleting utterances and inserting interruption utterances, Random Swapping/Deletion and Dialogue-acts-guided Insertion outperformed the baseline augmentation methods. Substituting utterances with more context-aware paraphrases from Conditional-generation-based Substitution also consistently improved Round-trip Translations. By combining all the conversational augmentation techniques, CODA achieved the best scores (e.g., with an increase of 2.8% on ROUGE-1, 3.7% on ROUGE-2 and 3.2% on ROUGE-L compared to BART-base when 1% labeled data was used). After incorporating unlabeled conversations through two-stage noisy self-training framework, all the augmentation methods showed large performance improvements over our base model BART. Compared to previous state-of-the-art data augmentations (Cutoff and Round-trip Translation), our proposed conversational augmentation techniques worked better when combined with noisy self-training as they could provide more effective perturbations. Consistently, our Semi-CODA achieved the significantly better performances especially when there are less labeled data (e.g., with an increase of 8.1% on ROUGE-1, 11.9% on ROUGE-2 and 9.2% on ROUGE-L compared to BART-base when 1% labeled data was used). Using All Labeled Summaries Table 4 summarized performances on the full setting where all the labeled data was utilized for training. CODA still showed performance gains compared to all baselines, suggesting that our proposed conversational data augmentation methods work well for conversation summarization even when a large number of labeled conversations is available for training. Human Evaluation We conducted human annotations to evaluate summaries generated by different models trained with 1% (147) conversations from SAMSum. Specifically, we asked annotators from Amazon Mechanical Turk 5 to rank summaries via a 1 (the most preferred) to 3 (the least preferred) scale, generated from BART, CODA and Semi-CODA for randomly sampled 150 conversations. Workers were paid 0.15$ for each ranking task. Every summary triples were ranked by three workers. The rank for every summary was aggregated by majority voting. The Intra-Class Correlation (ICC1k) was 0.561, indicating moderate agreement (Koo and Li, 2016)). As shown in Figure 3, our CODA and Semi-CODA received lower average rankings, which further demonstrated the effectiveness of CODA and Semi-CODA.
Out-of-domain Evaluation We then directly evaluated models trained with 1% (147) conversations with summaries from SAMSum on the debate summarization dataset ADSC (Misra et al., 2015), to investigate the generalization abilities brought by different augmentation methods and unlabeled conversations. As shown in Table 5   and Semi-CODA achieved significantly better outof-domain ROUGE scores than all the baselines, demonstrating the effectiveness of our designed conversational augmentation methods and the ways to incorporate unlabeled conversations.

Ablation Studies
Number of Iterations in Semi-CODA Here we showed the effects of iterative training in Semi-CODA. For all the iterations in Semi-CODA, we adopted the same hyperparameters. As shown in Table 6, ROUGE scores kept improving and achieved the best performance at iteration 3, and then started to converge. This indicates the effectiveness of iterative training in Semi-CODA by continually updating the teacher model to generate better pseudo summaries.
Two-stage Self-training vs. Joint Self-training One alternative in self-training is to merge the labeled conversation and conversations with pseudo summaries and train new models on them jointly (Edunov et al., 2018). We compared our two-stage training strategy in Semi-CODA with the jointlytraining with the same set of hyperparameters in Table 7. We found that two-stage training outperformed jointly training, indicating that our twostage strategy in Semi-CODA could effectively mitigate the noise from pseudo summaries.

Conclusion
In this work, we introduced a simple yet effective set of conversational data augmentation methods CODA, for improving conversation summarization in low-resourced settings. To further utilize unlabeled conversations, we proposed Semi-CODA that utilizes a two-stage noisy self-training framework. Experiments on both in-domain and out-ofdomain evaluations demonstrated that our CODA augmented conversations better compared to previous state-of-the-art augmentation methods. In the future, we plan to examine diverse conversation structures for conversation augmentation and work on zero-shot conversation summarization tasks.