Summary Grounded Conversation Generation

Many conversation datasets have been constructed in recent years using crowdsourcing. However, the data collection process can be time-consuming and presents many challenges for ensuring data quality. Since language generation has improved immensely in recent years with the advancement of pre-trained language models, we investigate how such models can be utilized to generate entire conversations, given only a summary of a conversation as input. We explore three approaches to generate summary-grounded conversations, and evaluate the generated conversations using automatic measures and human judgements. We also show that the accuracy of conversation summarization can be improved by augmenting a conversation summarization dataset with generated conversations.


Introduction
Automatic conversation systems require large quantities of data to learn task-specific language patterns and underlying conversation policies. Such data either come from human-to-human conversation logs (Lowe et al., 2015; Hardalov et al., 2018) or are collected in crowd-sourced environments, where two or more crowd-workers play specific roles under some guidelines (Zhang et al., 2018; Budzianowski et al., 2018). Since real human-to-human conversation logs are scarce, many datasets have been created using the latter approach. However, crowdsourced conversation data collection is time-consuming, costly, and presents multiple challenges to ensure data quality (Kang et al., 2018).
Conversation summarization is an emerging research area that has been understudied due to the lack of large-scale datasets. Most existing public datasets in this domain are small; for example, the AMI meeting corpus (McCowan et al., 2005) contains 137 summary transcripts. CRD3 (Rameshkumar and Bailey, 2020) is a spoken conversation dataset that consists of 159 conversations and summaries. Samsum (Gliwa et al., 2019), the only large-scale dataset for conversation summarization, contains over 16,000 open-domain conversations and summaries created artificially by humans.
* Current address: david.konopnicki@booking.com
Large-scale pre-trained language models (PLMs) (Lewis et al., 2020; Brown et al., 2020; Raffel et al., 2020) have been used in various text generation tasks (Budzianowski and Vulić, 2019; Min et al., 2020; Cachola et al., 2020). In recent studies, PLMs are used to generate training data for natural language processing (NLP) applications. For example, Anaby-Tavor et al. (2020) and Yang et al. (2020) use PLMs to create paraphrases for intent classifiers in conversation systems, and show that performance improves when the original datasets are augmented with the generated data. More recently, Mohapatra et al. (2020) generated entire conversations grounded on the instructions provided to crowd-workers, using a modular approach where different PLMs are trained for different roles.
Our Contributions: We investigate how PLMs can be utilized to generate entire conversations that are grounded on a given summary. We explore three approaches: (1) Supervised learning (SL) based conversation generation (SL-Gen), where a PLM is trained to generate an entire conversation, taking the summary of a conversation as input; (2) Reinforcement learning (RL) based conversation generation (RL-Gen), where we further improve the SL-Gen method using the quality of the generated conversations as a reward; and (3) Controlled turn-by-turn conversation generation (CN-Gen), which allows us to generate conversations turn by turn, constrained on the summary and a set of pre-defined control parameters. We evaluate the quality of the generated conversations by conducting automatic and human evaluations. We also show that once a conversation summarization dataset is augmented with the generated conversations, the performance of the downstream summarization task improves.

arXiv:2106.03337v1 [cs.CL] 7 Jun 2021

Figure 1: The RL based conversation generation framework

Summary grounded conversation generation
In the conversation summarization task, a model takes a conversation as input and learns to generate a summary. We study the inverse problem, where the input to our model is a summary and the model generates a conversation. In this section, we propose three models for this task; the hyperparameters used to train them are given in Section A of the appendix.

SL based generation (SL-Gen)
A seq2seq model can be trained for this task by providing a summary as input and generating a conversation token by token. As PLMs have shown significant improvement over traditional seq2seq architectures for text generation, we use a GPT-2 model and fine-tune it to generate a conversation given a summary as input. The input to the model follows the format: <bos>summary text <dialog>conversation text<eos>. We also use different token-type-ids to indicate the summary and the conversation text. The model is trained to optimize the cross-entropy loss.
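The input layout above can be sketched as follows. This is a minimal illustration, assuming word-level tokenization as a stand-in for GPT-2's subword tokenizer; the special-token spellings and the helper name are ours, not from the paper.

```python
def build_sl_gen_input(summary: str, conversation: str):
    """Concatenate summary and conversation with separator tokens, and
    build per-token type ids (0 = summary segment, 1 = dialog segment)."""
    prefix = f"<bos>{summary} "
    suffix = f"<dialog>{conversation}<eos>"
    text = prefix + suffix
    prefix_tokens = prefix.split()   # word-level stand-in for subword tokens
    suffix_tokens = suffix.split()
    token_type_ids = [0] * len(prefix_tokens) + [1] * len(suffix_tokens)
    return text, prefix_tokens + suffix_tokens, token_type_ids

text, tokens, type_ids = build_sl_gen_input(
    "person0 will be late.", "<Person0> I'll be late"
)
```

In an actual fine-tuning setup, the special tokens would be registered with the tokenizer and `token_type_ids` passed alongside the input ids.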

RL based generation (RL-Gen)
Many studies train text generation models with RL (Paulus et al., 2018; Li et al., 2016), where the generator network is optimized with a task-specific reward. We investigate how the quality of the generated conversation can be used as a reward to improve the generation network. To this end, we train a summary generator network, which generates a summary given a conversation. We measure the quality of a generated conversation by computing the similarity between the summary of the generated conversation (produced, in turn, by the summary generator network) and the ground-truth summary. This similarity score is used as a reward to train the conversation generation model. Our RL-based generation framework is shown in Figure 1, and its critical components are described below.

Conversation Generator: A trained SL-Gen model is used as the conversation generator, which, given a summary, can generate a conversation.

Summary Generator: We use a lightweight variant of BART (Lewis et al., 2019), named DistilBART, which is fine-tuned on the extreme summarization task (Narayan et al., 2018). We further fine-tune this instance on the conversation summarization data by providing the conversations as input and training the model to output summaries.

Reward Model: Once the Summary Generator produces a summary for the generated conversation, the reward model compares it with the ground-truth summary that was used to ground the conversation generation. As in Paulus et al. (2018), this similarity score serves as the reward.
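As a rough sketch of this reward computation, the snippet below uses a simple unigram-overlap F1 as a stand-in for the summary-similarity metric (the exact metric is not specified in this excerpt), and `summarize` is a placeholder for the trained Summary Generator.

```python
def overlap_f1(reference: str, hypothesis: str) -> float:
    """Unigram-overlap F1, a stand-in for the summary-similarity reward."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in hyp:
        if ref_counts.get(t, 0) > 0:  # clipped matching, as in ROUGE/BLEU
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    p, r = common / len(hyp), common / len(ref)
    return 2 * p * r / (p + r)

def reward(ground_truth_summary, generated_conversation, summarize):
    """Reward = similarity between the summary of the generated
    conversation and the ground-truth summary."""
    generated_summary = summarize(generated_conversation)
    return overlap_f1(ground_truth_summary, generated_summary)
```

This reward would then be fed into a policy-gradient update of the conversation generator.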

Controlled conversation generation
We propose another approach for conversation generation, CN-Gen, which grants more control over the properties of the generated conversations.
Here, we generate one utterance of the conversation at a time, as opposed to RL-Gen, where we generate the whole conversation at once. The properties of the generated conversations are controlled by adding several components to the input sequence of the model. The following three variables are used as control parameters: (1) Number of remaining turns to generate in the conversation (Num turns): during the generation of a turn, we indicate the remaining number of turns in the conversation; when generating an n-turn conversation, this starts at n for the first turn and decreases by 1 after the generation of each turn. (2) The speaker of the next turn. (3) The length of the next turn (Turn length): we define 3 categories of lengths: Short (≤ 3 tokens), Long (> 10 tokens) and Medium (otherwise).

Table 1 (excerpt). Summary: person0 will be late. person1 will order pasta with salmon and basil for her. Conversation: <Person0> I'll be late <Person1> I'll order some pasta with salmon and basil for you.
We use the following input representation to fine-tune a GPT-2 model: <bos> summary text <context> dialog context <turns to go> Num turns <speaker> speaker <turn length> turn length <turn> utterance <eos>. Changing these parameters allows us to generate different variants of conversations that are grounded on the same summary. During training, we obtain the values of the control parameters from the ground-truth conversations; at inference, we randomly select the next speaker, the number of turns of the conversation to be generated (in a range of 4-15 turns), and the next turn length. In Table 1 we show conversations of different lengths that were generated by the CN-Gen approach, grounded on the same summary, by changing the control parameters.
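A minimal sketch of this turn-by-turn input construction is shown below; the special-token spellings, helper names, and whitespace-based token counting are illustrative assumptions, not the paper's exact implementation.

```python
def turn_length_bucket(utterance: str) -> str:
    """Bucket a turn by token count: Short (<= 3), Long (> 10), else Medium."""
    n = len(utterance.split())  # whitespace split as a stand-in for tokenization
    if n <= 3:
        return "Short"
    if n > 10:
        return "Long"
    return "Medium"

def build_cn_gen_input(summary, context, turns_to_go, speaker, utterance):
    """Assemble the CN-Gen training sequence for one turn: summary,
    dialog context so far, then the three control parameters, then the turn."""
    return (
        f"<bos>{summary} <context>{context} "
        f"<turns_to_go>{turns_to_go} <speaker>{speaker} "
        f"<turn_length>{turn_length_bucket(utterance)} "
        f"<turn>{utterance}<eos>"
    )
```

At inference time, `turns_to_go`, `speaker`, and the target length bucket would be sampled rather than read from a gold conversation, and the model would generate the text after `<turn>`.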
A summary and a conversation from the Samsum dataset (Gliwa et al., 2019), along with the conversations generated by the three aforementioned algorithms, are shown in Figure 2. More examples are provided in Section B of the Appendix.

Experiments
We experiment on the Samsum (Gliwa et al., 2019) dataset, which, to the best of our knowledge, is the only public large-scale conversation summarization dataset. We pre-process the dataset by replacing personal names (e.g., John) with unique tags (e.g., <person0>). First, we evaluate the quality of the generated conversations using automatic measures and human judgements.

Quality of the generated conversations
We evaluate the quality of the conversations generated by the three approaches introduced in Section 2. In Table 2 we show the properties of the generated conversations and of the ground-truth conversations in the test set of the Samsum dataset.

Automatic Evaluation: We trained the conversation generation models on the Samsum training set and generated conversations on the test set. We compare the generated conversations with the ground-truth conversations using the measures used by Sharma et al. (2017) to evaluate conversation system responses. The results shown in Table 3 suggest that CN-Gen outperforms SL-Gen and RL-Gen on all measures.
We also compare the summaries of the generated conversations (produced by the Summary Generator) with the ground-truth summaries; the results are shown in Table 4. We believe this is a semantic evaluation of the conversations, as the summaries capture the crux of the conversations. According to the results, CN-Gen outperforms the other two methods. This, together with the previous result, suggests that the conversations produced by CN-Gen are the most similar to the ground-truth conversations.
Human Evaluation: To evaluate the quality of the generated conversations, we randomly selected 50 summaries from the Samsum test set and generated conversations using the three models. Three NLP experts were then asked to read the ground-truth summary and rate the four conversations (3 generated and the ground-truth conversation) on a [1-5] scale according to Grammaticality, Coherency, and Informativeness with respect to the ground-truth summary. Results are shown in Table 5. As expected, the ground-truth conversations obtained the highest scores on all three aspects and can be considered an upper bound for this task.

RL-Gen and CN-Gen obtained higher scores than SL-Gen, and relatively good scores compared to the ground-truth conversations. This corroborates the assumption that our proposed models generate high-quality conversations. The Welch two-sample t-test (Welch, 1947) shows that both RL-Gen and CN-Gen significantly outperform the SL-Gen model (p < 0.0001). However, there is no statistically significant difference between the results obtained from RL-Gen and CN-Gen. We report in Table 6 the average quadratic Cohen's Kappa calculated over the three possible combinations of two judges (Toledo et al., 2019).

CN-Gen obtained the best scores in the automatic evaluation, while RL-Gen got the best scores in the human evaluation. The CN-Gen conversations are longer than the RL-Gen conversations by 1.3 turns on average (see Table 2), and hence contain more word overlap with the ground truth. This results in better automatic evaluation scores for CN-Gen, while humans prefer the short, targeted conversations generated by RL-Gen.

Evaluation on the summarization task
To further evaluate the quality of the generated conversations, we augmented a conversation summarization dataset with generated conversations and evaluated the summarization model. We followed this process: (1) we randomly selected x% of the summaries of the dataset and trained our conversation generation models; (2) the trained models were applied to the remaining (100-x)% of the summaries to generate conversations; (3) those generated conversations, along with their original summaries, were added to the data, so this approach adds an extra (100-x)% of (summary, conversation) pairs to the training data; (4) the conversation summarization model (discussed in Section 2 under 'Summary Generator') was trained on the augmented data. We compare the performance of the conversation summarization model trained on the original dataset and with augmentation.
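The four-step augmentation recipe can be sketched as follows; `generate_conversation` is a placeholder for a trained generation model, and the split logic is an illustrative assumption.

```python
import random

def augment_split(pairs, x_percent, generate_conversation, seed=0):
    """Split (summary, conversation) pairs: x% are used to train the
    conversation generator; the summaries of the remaining (100 - x)% are
    fed to the generator, and the resulting (summary, generated
    conversation) pairs are added back as extra training data."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * x_percent / 100)
    rest = shuffled[cut:]  # summaries reserved for generation
    # (training of the generator on shuffled[:cut] happens elsewhere)
    augmented = [(s, generate_conversation(s)) for s, _ in rest]
    return pairs + augmented  # original data plus generated pairs

pairs = [(f"s{i}", f"c{i}") for i in range(10)]
out = augment_split(pairs, 30, lambda s: "gen:" + s)
```

With 10 pairs and x = 30, the generator trains on 3 pairs and 7 generated pairs are appended, yielding 17 pairs in total.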
Automatic Evaluation: We compare the three conversation generation methods at different augmentation percentages; the results are shown in Table 7. At all augmentation levels, the summarization models trained with augmented data outperform the summarization model trained on the original dataset (without augmentation). CN-Gen based augmentation produces the best accuracy of the three methods. One prevalent pattern is that, as the amount of augmentation data increases, accuracy increases up to a certain point and then starts to decrease; the best accuracies were found around 30% data augmentation. We believe that more augmentation leads performance to drop for the following reason: augmenting with more data leaves less data to train the conversation generation models (for 10% augmentation, the conversation generation models are trained on 90% of the data, while for 50% augmentation, the models are trained on only 50% of the data; see Table 7).

Human Evaluation: We recruited 3 NLP experts to evaluate 50 instances of summaries generated with data augmentation (RL-Gen, CN-Gen) and the respective summaries generated without augmentation (No-Aug). Here we consider two aspects with respect to a ground-truth summary: Coherency (whether the summary is easy to read) and Focus (whether the summary represents the ground-truth summary). Following Amplayo and Lapata (2020), we use the Best-Worst Scaling method. The score of each system is computed as the percentage of times it was chosen as the best system minus the percentage of times it was chosen as the worst. On the Coherency question, RL-Gen, CN-Gen and No-Aug obtained scores of 12.6, 6.6 and -4.0 respectively. On the Focus question, RL-Gen, CN-Gen and No-Aug obtained scores of 14.6, 6.0 and -2.6 respectively. These results confirm that the use of augmentation improves the quality of the summaries.
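The Best-Worst Scaling score described above (percentage of times a system is chosen as best, minus the percentage of times it is chosen as worst) can be computed as in this small sketch; the function name and input format are ours.

```python
def best_worst_scores(judgements):
    """Best-Worst Scaling: each judgement is a (best_system, worst_system)
    pair; a system's score is 100 * (times_best - times_worst) / n.
    Systems never picked as best or worst do not appear in the result."""
    counts = {}
    n = len(judgements)
    for best, worst in judgements:
        counts.setdefault(best, [0, 0])[0] += 1
        counts.setdefault(worst, [0, 0])[1] += 1
    return {sys: 100.0 * (b - w) / n for sys, (b, w) in counts.items()}

scores = best_worst_scores([("A", "C"), ("A", "B"), ("B", "C"), ("C", "B")])
```

Here system A is chosen best in 2 of 4 judgements and never worst, giving a score of 50.0.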

Conclusion
We investigated how PLMs can be utilized to generate entire conversations that are grounded on a summary. We proposed three approaches for conversation generation, SL-Gen, RL-Gen and CN-Gen, and conducted multiple automatic and human evaluations to assess the quality of the generated conversations. Both automatic and human evaluations show that, when compared to the ground-truth conversations, RL-Gen and CN-Gen obtain high scores, suggesting that the proposed models generate high-quality conversations. When a conversation summarization dataset is augmented with the generated conversations, the performance of conversation summarization improves (over 7% improvement in ROUGE-2 F1), which also suggests that the proposed methods generate high-quality conversations.
For the human evaluation of both conversations and summaries, we recruited 3 NLP researchers who have graduate degrees in NLP and machine learning. The annotation task itself was executed on the Appen.com platform. Before the official annotation, we sampled 10 tasks to get an estimate of the duration of the task and to make sure the instructions were clear enough.

Figure 3: Samples of dialogs with their corresponding summaries - ground-truth and automatically generated ones