Alternating Recurrent Dialog Model with Large-scale Pre-trained Language Models

Existing dialog system models require extensive human annotations and are difficult to generalize to different tasks. The recent success of large pre-trained language models such as BERT and GPT-2 (Devlin et al., 2019; Radford et al., 2019) has suggested the effectiveness of incorporating language priors in down-stream NLP tasks. However, how much pre-trained language models can help dialog response generation is still under exploration. In this paper, we propose a simple, general, and effective framework: Alternating Recurrent Dialog Model (ARDM). ARDM models each speaker separately and takes advantage of a large pre-trained language model. It requires no supervision from human annotations such as belief states or dialog acts to achieve effective conversations. ARDM outperforms or is on par with state-of-the-art methods on two popular task-oriented dialog datasets: CamRest676 and MultiWOZ. Moreover, we can generalize ARDM to more challenging, non-collaborative tasks such as persuasion. In persuasion tasks, ARDM is capable of generating human-like responses to persuade people to donate to a charity.


INTRODUCTION
It has been a long-standing ambition for artificial intelligence researchers to create an intelligent conversational agent that can generate human-like responses. Recently, data-driven dialog models have become increasingly popular. However, most current state-of-the-art approaches still rely heavily on extensive annotations such as belief states and dialog acts (Lei et al., 2018). Dialog content can vary considerably across dialog tasks, and having a different intent or dialog act annotation scheme for each task is costly. For some tasks, such as open-domain social chat, it is even impossible. Thus, it is difficult to apply these methods to challenging dialog tasks, such as persuasion and negotiation, where dialog states and acts are difficult to annotate.
Manning & Eric (2017) proposed a simple sequence-to-sequence architecture that requires no explicit annotations. The model learns to extract information from dialog history with attention and a copy mechanism. However, due to the limited language modeling capabilities of that model, Sequicity (Lei et al., 2018), which reuses belief states as inputs for supervision, significantly outperforms Manning & Eric (2017)'s method on recent dialog datasets. With the success of large pre-trained language models such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), we re-examine Manning & Eric (2017)'s method and investigate how large-scale pre-trained language models can help dialog tasks.
Previous large-scale pre-trained language models have been used to tackle documents with only one narrator. In dialogs, however, two speakers have different roles; therefore, their language model distributions are very different from each other. For example, customer service agents speak very differently from their customers. To address this issue, we propose ARDM, a dialog model that encodes and decodes different speaker utterances in alternating order with two pre-trained large-scale language models. To investigate whether ARDM can help dialog response generation, we evaluate its performance on three different task-oriented dialog datasets: CamRest676, MultiWOZ, and PersuasionForGood. The first two datasets are traditional information request dialog datasets with well-defined automatic evaluation metrics on task completion. By contrast, PersuasionForGood is a new dataset that focuses on persuading people to donate to a charity. There is no explicit dialog state defined in this task, as such non-collaborative dialogs have various dialog actions.
We observe that ARDM improves performance on task-oriented dialog tasks over the previous state-of-the-art methods without incorporating any explicit supervision from belief states or dialog acts. Also, due to ARDM's simplicity and generality, one can rapidly build a dialog prototype for different types of applications using only conversations, without any manual annotations. We also find that ARDM works well on complex dialogs, such as persuasion. The model generates dialog responses that successfully persuade people to donate to a charity, suggesting the potential of ARDM for use in wide-scale real-world settings.

RELATED WORK
Traditional dialog systems consist of a dialog manager to maintain dialog states and control the conversation flow. However, a dialog manager requires extensive manual annotations for training sub-modules such as the dialog state tracker or policy decision-maker. An alternative is to model dialog without explicitly modeling belief states. Specifically, Manning & Eric (2017) proposed a recurrent neural dialogue architecture using a sequence-to-sequence model that utilizes a copy mechanism to copy information directly from the raw dialog history. This method achieved state-of-the-art results on DSTC2 (Henderson et al., 2014), which is a simple restaurant booking dialog task with abundant data. However, this method did not perform well on the more complex dialog datasets CamRest676 (Wen et al., 2017) and KVRET. Sequicity (Lei et al., 2018) attributed the poor performance of Manning & Eric (2017)'s method to the omission of a belief tracker. They introduced the concept of a belief span, added the belief tracker back to the model, and achieved state-of-the-art performance.
Compared to Sequicity, Manning & Eric (2017)'s method provides a more general framework that reduces manual dialog state, user intent, and dialog act labeling by bypassing any symbolic annotations. Such a model can apply to datasets with no or partial annotations of belief states. In a real-world setting, if the dialog task introduces new slot values in belief states (e.g., a new type of food), Sequicity will suffer from belief span decoder errors in response generation. Thus, Manning & Eric (2017)'s method is potentially more robust than Sequicity in this situation. Besides, if the task requires belief states for database search, we can treat belief tracking as a separate task. We can train a good belief tracker with only a small amount of annotated data, which reduces the annotation required and makes errors easier to fix. Also, since belief states are a set of important entities condensed from dialog history (i.e., often exact words from utterances), they do not introduce extra information to the model. Therefore, a dialog model with powerful representation learning should learn a form of belief state information automatically, without human annotations as a scaffold.
The recent success of BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) suggests the possibility of using large pre-trained language models to enhance Manning & Eric (2017)'s method. Several studies have applied large pre-trained language models to dialog generation. TransferTransfo (Wolf et al., 2019) fine-tuned the pre-trained language model GPT (Radford et al., 2018) on the Persona-Chat dataset (Zhang et al., 2018) and obtained significant improvements on chitchat dialog generation, suggesting the potential of fine-tuning large pre-trained language models on other dialog response generation tasks. A more recent work (Budzianowski & Vulic, 2019) adopted the framework of TransferTransfo and made the first attempt to leverage the large pre-trained language models GPT and GPT-2 for task-oriented dialog generation, but it included belief states as part of the input and did not achieve better results than the baseline. We propose to model dialogs without any annotation, relying instead on two pre-trained large-scale language models that alternate.
Previous work shows that modeling speaker roles in conversation is beneficial for language understanding (Su et al., 2018). Other researchers model persona information to generate language with different speaking styles (Li et al., 2016; Joshi et al., 2017). Zhao & Kawahara (2019) propose a relative speaker modeling method, where only the relative role instead of the absolute identity of the speaker is modeled. Our method is similar to Zhao & Kawahara (2019) in the spirit of modeling relative speaker relationships, but we focus on learning role-specific language models through utterances from different speakers, instead of explicitly taking role embeddings as input.

APPROACH
Our goal is to leverage large pre-trained language models to improve dialog response generation. Following Manning & Eric (2017)'s approach of not using additional annotations such as dialog states or dialog acts, we propose the Alternating Recurrent Dialog Model (ARDM), which composes two separate pre-trained language models in alternating order to learn the user and system utterance distributions. Figure 1 shows an overview of ARDM.

ALTERNATING RECURRENT DIALOG MODEL
We aim to model both the user and system utterance distributions simultaneously. Given a multi-turn dialog $d$ between a user ($u$) and a system ($s$), we can represent $d$ as a series of utterances $\{u_1, s_1, u_2, s_2, \ldots, u_T, s_T\}$, where $T$ denotes the total number of turns. We decompose the probability distribution over the utterances in $d$ into two language models for the user and system respectively, denoted as $p_u$ and $p_s$. Then we define a dialog model $p(d)$ with the equation:
$$p(d) = \prod_{t=1}^{T} p_u(u_t \mid u_{<t}, s_{<t}) \, p_s(s_t \mid u_{\le t}, s_{<t}) \tag{1}$$
$p_u$ and $p_s$ are standard language models where the task is to predict the next token given the preceding context. For an utterance $u_t$ or $s_t$ with $m$ tokens $\{w_1, \ldots, w_m\}$, the joint probability of an utterance is as follows, where the conditioning on the preceding dialog history from Equation 1 is omitted for brevity:
$$p(w_1, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1}) \tag{2}$$
Finally, we train the dialog model by maximizing the likelihood over Equation 1.
We apply a simple memory mechanism to grant the model the capability of memorizing conversation history. For an utterance at turn $t$, we reuse the hidden states $h_{\le t-1}$ stored in the memory $M_{t-1}$ to obtain $h_t$, and store $h_t$ back to the memory as $M_t$. As for the pre-trained Transformer language model, we implement the memory mechanism using self-attention given the query/key/value features denoted as $Q$, $K$, $V$, where the equation is defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{3}$$
where $d_k$ is the key dimension (Vaswani et al., 2017). For simplicity, we assume there is only one layer in the Transformer, and $h_t$ is the hidden state at step $t$. Then a recurrence relation for $h_t$ is defined by computing $Q_t$, $K_{\le t}$, $V_{\le t}$ from $h_{\le t-1}$ and the current utterance. In practice, we reuse $K_{\le t-1}$ and $V_{\le t-1}$ (i.e., history keys and values) as $M_{t-1}$ instead of $h_{t-1}$ to avoid recomputing history information. Therefore, the final $h_t$ is computed as:
$$h_t = \mathrm{Attention}(Q_t, [K_{\le t-1}; K_t], [V_{\le t-1}; V_t]) \tag{4}$$
where $[\cdot\,;\cdot]$ denotes concatenation along the sequence dimension. However, one major drawback of this memory mechanism is that memory consumption grows as the number of turns increases, up to the point where the dialog cannot continue because of the memory limit. A straightforward way to solve this is to discard the distant history. But because most dialogs in our datasets fit within the GPU memory limit (i.e., approx. 1,000 tokens for an 11GB GPU), we leave the memory issue for future work.
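The memory mechanism can be illustrated with a minimal, dependency-free sketch. It assumes a single attention layer with one head, toy token vectors in place of real embeddings, and randomly initialized weights; the names `ToySpeakerModel` and `run_dialog` are hypothetical and are not taken from the authors' implementation. The point it shows is that the two speaker models own separate weights and communicate only through a shared cache of keys and values.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


class ToySpeakerModel:
    """One attention layer with its own projection weights (nothing is shared)."""

    def __init__(self, dim, seed):
        rng = random.Random(seed)

        def rand_matrix():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.Wq, self.Wk, self.Wv = rand_matrix(), rand_matrix(), rand_matrix()

    def encode(self, utterance, memory):
        """Encode one utterance token by token, appending K/V to the memory."""
        hidden = []
        for x in utterance:
            memory["K"].append(matvec(self.Wk, x))
            memory["V"].append(matvec(self.Wv, x))
            # The query sees the whole history plus the current token (causal).
            hidden.append(attend(matvec(self.Wq, x), memory["K"], memory["V"]))
        return hidden


def run_dialog(utterances, user, system, memory=None, first_turn=0):
    """Alternate the two models over the utterances, reusing cached K/V."""
    memory = {"K": [], "V": []} if memory is None else memory
    outputs = []
    for offset, utterance in enumerate(utterances):
        model = user if (first_turn + offset) % 2 == 0 else system
        outputs.append(model.encode(utterance, memory))
    return outputs, memory


user = ToySpeakerModel(dim=3, seed=0)
system = ToySpeakerModel(dim=3, seed=1)
dialog = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]], [[1.0, 1.0, 0.0]]]
outputs, memory = run_dialog(dialog, user, system)
```

Because only keys and values are cached, continuing a dialog from a saved memory gives the same hidden states as processing it in one pass, and the cache grows by one entry per token, which is the memory growth discussed above.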

TRAINING DETAILS
We initialize the user and the system language models with a large pre-trained language model, GPT-2 small, with 117M parameters (Radford et al., 2019). It is a Transformer (Vaswani et al., 2017) model with 12 heads, a hidden size of 768, and 12 layers. The model is pre-trained on a large-scale corpus called WebText, extracted from web pages linked in Reddit posts with at least three upvotes. The tokenizer is a byte pair encoding (BPE) (Sennrich et al., 2016) with a vocabulary size of 50,257 that can encode and decode any text in a lossless manner, avoiding out-of-vocabulary tokens. We follow a special format in GPT-2 as the "trigger" so that the model can perform zero-shot dialog response generation: we prefix each utterance with the role token "A:" or "B:" and suffix it with the end-of-utterance token "\n\n\n". This "trigger" approach is similar to other zero-shot scenarios mentioned in the GPT-2 paper (e.g., a "TL;DR" token can trigger GPT-2 to summarize the input text). We further fine-tune ARDM on the specific task dataset. We apply the AdamW optimizer (Loshchilov & Hutter, 2019), and the number of warm-up steps is set to the number of batches in one epoch. The learning rate is set to 3 × 10^-5, and the dropout rate is set to 0.1 for all tasks.
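As a rough illustration of the trigger format and the warm-up setting, the sketch below formats utterances and computes a learning rate per step. Several details are assumptions rather than facts from the text: that "A:" marks the user and "B:" the system, that no space follows the prefix, that warm-up is linear, and that the rate stays constant afterwards. The function names are hypothetical.

```python
USER_PREFIX = "A:"          # assumption: "A:" is the user role
SYSTEM_PREFIX = "B:"        # assumption: "B:" is the system role
END_OF_UTTERANCE = "\n\n\n"
BASE_LR = 3e-5              # learning rate stated in the text


def format_utterance(text, is_user):
    """Wrap one utterance with its role prefix and end-of-utterance suffix."""
    prefix = USER_PREFIX if is_user else SYSTEM_PREFIX
    return f"{prefix}{text}{END_OF_UTTERANCE}"


def format_dialog(utterances):
    """Format alternating utterances, assuming the user speaks first."""
    return "".join(
        format_utterance(text, is_user=(i % 2 == 0))
        for i, text in enumerate(utterances)
    )


def warmup_lr(step, warmup_steps, base_lr=BASE_LR):
    """Linear warm-up to base_lr, then constant (both shapes are assumptions).

    `step` is zero-based; `warmup_steps` would be the number of batches
    in one epoch under the setting described above.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```

For example, with 100 batches per epoch the rate would reach 3 × 10^-5 at the last batch of the first epoch under these assumptions.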

DECODING DETAILS
We decode utterances by nucleus sampling (Holtzman et al., 2019) with different hyper-parameters (top-p, top-k) for down-stream dialog tasks. We also vary the temperature T < 1 to find the best setting for each down-stream dialog task. To handle both evaluation and real-world use, we have two decoding modes. In evaluation mode, we feed all past ground truth history before turn t to generate the corresponding utterance, so that we can evaluate the quality of generated dialog responses without being concerned about the conversation flow. In a real-world use case, we do not have ground truth history, and therefore we use the memory from previously generated responses and let the model dynamically interact with a human or another bot in turns. Because dialogs have different lengths, it is hard for ARDM to decode responses efficiently using the traditional batch padding method. As a solution, we develop a dynamic dialog filtering algorithm to support fast decoding in batch. This method speeds up generation by a factor of eight. Please refer to Appendix B for the method's details.
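The decoding step for a single token can be sketched as follows. This is a generic implementation of temperature scaling followed by top-k and top-p (nucleus) filtering over a list of logits, not the authors' code; `nucleus_filter` and `sample_token` are hypothetical names, and the order in which top-k and top-p are applied is an assumption.

```python
import math
import random


def nucleus_filter(logits, top_p=0.9, top_k=0, temperature=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered, renormalized distribution."""
    dist = nucleus_filter(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With a small top-p such as 0.2, the nucleus contains a single token whenever the most likely token already has probability of at least 0.2, so decoding behaves close to greedy search; a larger top-p such as 0.9 leaves room for more diverse responses.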

EXPERIMENTS AND RESULTS
Data scarcity is one of the biggest challenges in dialog research. It is costly to collect human-human conversations under a specific setting. It is even more time-consuming to annotate belief states and dialog acts. With the success of transfer learning in NLP, we aim to mitigate the low-resource problem with a large pre-trained language model. We validate our proposed ARDM on three task-oriented dialog datasets: CamRest676, MultiWOZ, and PersuasionForGood.

CAMREST676
CamRest676 is a relatively small dataset with 408/136/136 dialogs for train/validation/test. We follow Sequicity (Lei et al., 2018) to delexicalize tokens such as restaurant names, phone numbers, and postcodes by replacing them with their slot names in utterances. We prepend database search results to the system utterance. An example database search result is "restaurant;3", where the first slot indicates the dialog domain, which is always "restaurant" in CamRest676, and the second slot represents the number of matched items in the database. We use nucleus sampling for all methods in decoding for a fair comparison. Here, we set top-p to 0.2 and temperature to 0.7 for our model. We use BLEU-4 to evaluate language generation quality and Success F1 to evaluate task success. Success F1 computes the F1 score of the generated responses on requested slots such as an address, phone number, or food type. Other than Sequicity, we also compare against using GPT-2 alone as a language model for the entire dialog.
Table 1 shows all models' results with ground truth belief states or generated belief states. We first use ground truth belief states in all methods to evaluate their response generation quality. ARDM achieves the best BLEU-4 and Success F1 scores. We observe that a pre-trained large-scale language model alone, GPT-2, achieves results similar to the previous state-of-the-art method, Sequicity with reinforcement fine-tuning. This suggests that a pre-trained large-scale language model such as GPT-2 learns meaningful representations during pre-training. However, without the alternating recurrent modeling, GPT-2 alone does not perform as well as ARDM in terms of both BLEU-4 and Success F1, especially in BLEU-4 (improved 19%). Without modeling the speaker role, the model blends the two speakers' language distributions and ignores the inherent speaker role difference.
Moreover, to test whether our model preserves its performance with even less training data, we reduce the training data to 50%, and the performance drops only slightly. With half of the training data, our method still performs significantly better than Sequicity. This result suggests that ARDM is robust in low-resource settings thanks to the large-scale pre-trained language model.
We also evaluate all models with generated belief states instead of ground truth belief states. Sequicity generates belief tracker results, and its Entity Match rate is 0.927. Our model does not have a state tracker, so we write a separate simple regular expression to extract occurrences of entities that appear in the database to support our model. This state tracker achieves an Entity Match rate of 0.960. It suggests that state tracking may be accomplished in more straightforward ways than training a neural network model on a large set of annotated data. With a simple state tracker, our proposed method still performs better than Sequicity, which trains the belief state and the response generation tasks jointly.
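To make the Success F1 metric concrete, here is a simplified sketch that micro-averages precision and recall over the slot placeholders found in delexicalized responses. It is an approximation written for illustration, not the official Sequicity evaluation script; the bracketed placeholder format (for example "[phone]") and the function names are assumptions.

```python
import re

# Assumption: delexicalized slots appear as bracketed placeholders.
SLOT_PATTERN = re.compile(r"\[([^\]]+)\]")


def slots_in(responses):
    """Collect the set of slot placeholders mentioned in a list of responses."""
    found = set()
    for response in responses:
        found.update(SLOT_PATTERN.findall(response))
    return found


def success_f1(dialogs):
    """Micro-averaged F1 over slots; dialogs is a list of (generated, reference)."""
    tp = fp = fn = 0
    for generated, reference in dialogs:
        gen_slots, ref_slots = slots_in(generated), slots_in(reference)
        tp += len(gen_slots & ref_slots)
        fp += len(gen_slots - ref_slots)
        fn += len(ref_slots - gen_slots)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


example = [
    (["the phone is [phone]"], ["[name] is at [address] , phone [phone]"]),
    (["it is located at [address]"], ["the address is [address]"]),
]
score = success_f1(example)
```

In this toy example the generated responses recover two of the four reference slots with no spurious ones, so precision is 1.0, recall is 0.5, and the F1 score is about 0.667.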
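A regular-expression state tracker of the kind described above could look like the following sketch. The slot names, the entity values, and the rule that a later mention overrides an earlier one are all assumptions made for illustration; the authors' actual expression is not given in the text.

```python
import re


def build_tracker(db_values):
    """Compile one case-insensitive pattern per slot from database values."""
    patterns = {}
    for slot, values in db_values.items():
        # Longer values first so multi-word entities win over their substrings.
        ordered = sorted(values, key=len, reverse=True)
        alternatives = "|".join(re.escape(value) for value in ordered)
        patterns[slot] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def track(patterns, user_utterances):
    """Scan user utterances in order; the latest mention of a slot wins."""
    state = {}
    for utterance in user_utterances:
        for slot, pattern in patterns.items():
            matches = pattern.findall(utterance)
            if matches:
                state[slot] = matches[-1].lower()
    return state


# Hypothetical database values for a restaurant domain.
db_values = {
    "food": ["italian", "chinese", "north indian"],
    "area": ["north", "centre"],
    "pricerange": ["cheap", "expensive"],
}
tracker = build_tracker(db_values)
state = track(
    tracker,
    ["I want a cheap restaurant in the north", "Actually, make it Italian food"],
)
```

The extracted state can then be used to query the database and produce the search result that is prepended to the system utterance.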

MULTIWOZ
Here, we only use the ground truth database search result to be consistent with other methods. We perform the delexicalization described in the original MultiWOZ paper (Budzianowski et al., 2018). We prepend the database search results to the system response as conditional input. Also, the database results now contain information about whether the booking is successful or not (i.e., succeed or fail). Note that we do not use the belief state or dialog act annotations provided by the dataset to train ARDM. We set top-p to 0.2 and the temperature to 0.7. The results are evaluated on BLEU-4, Inform Rate, and Success Rate. Inform Rate and Success Rate measure whether the system response provides the recommendations and the requested information given in the goal. We compare our model to the attention-based seq2seq model proposed as the MultiWOZ Baseline (Budzianowski et al., 2018), HDSA (Chen et al., 2019), and LaRL.
The evaluation results are shown in Table 2. Without any supervision from dialog states or dialog acts, ARDM significantly outperforms the MultiWOZ Baseline and LaRL on BLEU-4 and Inform Rate, and is on par with HDSA. However, HDSA uses dialog act supervision and a large pre-trained language model, BERT. Our model requires no annotation and achieves similar results. This suggests that our speaker role modeling and large-scale pre-training play a role similar to that of dialog act annotations. All the results show that our method's strong performance remains consistent in multi-domain dialogs.
We analyze the generated responses and find that if multiple domains have appeared in the conversation history, our model tends to make mistakes in answering in the right domain for user requests. This finding suggests that maximum likelihood estimation (MLE) has limitations in directly optimizing the metric, while reinforcement learning (RL) can substantially improve task completion in a dialog system. This is why LaRL has a higher Success Rate. However, we also observe that LaRL has a low BLEU-4 score, which indicates low readability in responses. Therefore, there is a trade-off between generation quality and task success rate in the RL setting.

PERSUASIONFORGOOD
To showcase ARDM's performance on a dialog dataset where it is much more difficult to obtain belief states and dialog act annotations, we train and evaluate our model on the PersuasionForGood dataset. In this dataset, the persuader must persuade an assigned persuadee (i.e., a person who is asked to donate) to donate money (from their task payment) to a charity called "Save the Children". This dataset has a much larger vocabulary size (8,141) than the previous task-oriented dialog datasets due to its non-collaborative dialog property. The conversation content is richer because the two speakers are negotiating back and forth. The dataset consists of 1,017 dialogs, of which only 300 are annotated with dialog acts. Therefore, models that require dialog state or dialog act annotation are not applicable to this dataset. ARDM does not require dialog acts for training, and it has already learned good language priors through pre-training, which makes it applicable to PersuasionForGood. Due to the task's similarity to chit-chat, we use the TransferTransfo (Wolf et al., 2019) model as a strong baseline. TransferTransfo is also based on a large pre-trained language model, but it uses token type embeddings to encode the role information of the speaker. We concatenate all history text together as input to generate the next utterance.
To generate diverse responses, we decode the response using nucleus sampling (Holtzman et al., 2019) with a top-p of 0.9 and a temperature of 0.7. It is impossible to conduct an automatic evaluation of task success on this task due to the lack of annotation. We use perplexity, BLEU-1, and BLEU-2 to evaluate each model's language generation quality. We also conduct a human evaluation to validate each model's task success rate. We show some generated examples in the Appendix to provide more information on both models' generation quality.
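For reference, perplexity can be computed from per-token log-probabilities as in the short sketch below. It assumes natural-log probabilities assigned by the model to the reference tokens; the function name is hypothetical and the exact tokenization used for the reported numbers is not specified in the text.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every reference token
# is as uncertain as a uniform choice among four tokens.
uniform_four = perplexity([math.log(0.25)] * 3)
```

Lower perplexity means the model assigns higher probability to the human responses, which is the sense in which it is used to compare ARDM with TransferTransfo.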

SYS: Save the Children is an international non-governmental organization that promotes children's rights, provides relief and helps support children in developing countries.
USR: That's cool. It is an organization for the good of this world.
SYS: Yes, it is. In the first two months of 2018 alone, 1,000 children were reportedly killed or injured in intensifying violence in Syria.
USR: That's sad.
SYS: We do need to do something about that. One of the things that we can do is make sure that we have a system in place that will ensure that every penny we spend goes directly to the children in need.
USR: I agree.
SYS: How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment. You can choose any amount from $0 to all your payment ($2).
USR: I am willing to donate full my payment $2.
SYS: Wow, that is awesome! That is so kind of you!
USR: You are welcome.
Table 4: A conversation between a machine persuader trained by ARDM and a human persuadee.
Because ARDM applies better speaker modeling and a recurrence mechanism, our model achieves lower perplexity than TransferTransfo. In terms of BLEU scores, TransferTransfo is better than ARDM. However, BLEU-1 cannot reflect the actual generation quality, and can sometimes be misleading. We ask 14 human evaluators to chat with the two persuasive systems ten times each to reduce the effect of model randomness. In total, we collected 140 ratings. We ask them to select a preferred chat-bot and indicate how much they are willing to donate after talking to the chat-bot. As a result, human judges prefer ARDM over TransferTransfo and tend to donate more when talking to the chat-bot produced by ARDM. Our model achieved 27% more donations compared to TransferTransfo. This indicates that our system is more persuasive. In some examples, such as the one in Table 4, our model generates coherent, natural, and persuasive responses.

ERROR ANALYSIS
Since CamRest676 is similar to MultiWOZ in terms of task content and dialog structure, we only describe the errors in MultiWOZ for simplicity. We randomly selected 30 generated error responses from our model with zero Inform and Success scores. To our surprise, we observed that 63.3% of the errors are not really mistakes; they are mainly due to the limitations of the automatic evaluator. For example, at turn one, the user asks about a restaurant, and the ground truth system response is "the [restaurant name] is located at . . . ", but the generated system response is "what food preference do you have?". Our generated response is correct with respect to the dialog context. It is narrowing down the restaurant choices before providing a restaurant recommendation. However, the evaluator sticks to the only possible response it has. Unless the user can dynamically interact with the system, there is no good way to correct such misjudgments by the automatic evaluator. We find that another 20% of the errors occur when the system asks for information the user has already provided. This type of error calls for a better history representation. Another 10% of the errors are due to ignoring the user's request for information, such as a phone number. However, when we look at the ground truth responses, some crowd workers also made such errors, so resolving these errors requires a cleaner training dataset. Finally, the remaining 6.7% of the errors are about incorrect dialog domain understanding. For example, the user is asking for a hotel, but we present a restaurant recommendation. This is because of data noise introduced during the delexicalization process, in which some domain labels are wrong.
The donation persuasion systems trained with TransferTransfo and our model have some common problems, such as inconsistency, lack of logic, and hallucination. For example, if the persuader provides information about "Save the Children" and the persuadee then asks "Can you tell me more about it?", the system ends up providing the same information as before. It also sometimes makes up facts that have never happened, such as "Save the Children has an operation about a hurricane in Hawaii". All those errors would prevent users from trusting the bot, and would therefore result in fewer donations. However, we also observe that users have a higher tolerance for errors in the persuasion setting than in the customer service setting.

DISCUSSIONS AND ETHICAL CONSIDERATION
ARDM models speakers separately on top of a large pre-trained language model. Such a simple adaptation demonstrates a substantial performance gain. We suspect this is because the interleaved structure of two language models provides a collaborative learning framework for modeling both the user and the system language distributions. The memory is the only way for the user and system models to communicate, as they do not share any weights in their networks. Thus, the user encoder needs to learn useful representations so that the system model can understand its intent. Similarly, the system model needs to do the same for the user model to improve its understanding. This alternating, repeated process forces both the user and system models to preserve the dialog history effectively in the memory. One can interpret the memory as an implicit representation of belief states or dialog acts.
Another benefit of ARDM is that we obtain both user and system utterance generators. We can let the two models talk to each other to generate new self-play dialogs (Silver et al., 2017). We show some self-play dialog examples in Appendix D. With self-play, one can rapidly build a large-scale dialog dataset using adversarial filtering (Zellers et al., 2018). Such models can also be used in reinforcement learning as user simulators to study complex dialog strategies.
Persuasion is a double-edged sword. Given the fast development of dialog systems, an ethical design principle must be in place throughout all stages of development and evaluation. First, we chose the donation task because it is a relatively simple task that benefits children. Second, when deploying persuasive agents in real conversations, we need to keep the users informed of the nature of the system. In addition to revealing the identity of the persuasive agent, the user should also have options to communicate directly with the human team behind the system. Lastly, by investigating persuasive dialog systems, we also envision using them as an educational tool for the general public to learn to defend themselves against machine persuasion.
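The self-play loop itself is simple to sketch. In the example below, the two generators are stand-in callables that return scripted lines; in practice they would be the fine-tuned user and system models decoding from the shared memory. The `self_play` and `scripted` names, the `<end>` marker, and the choice of the user speaking first are all hypothetical.

```python
def self_play(user_model, system_model, max_utterances=6, end_marker="<end>"):
    """Let two generators take turns until one ends the dialog or the cap is hit."""
    history = []
    speakers = (("USR", user_model), ("SYS", system_model))
    for step in range(max_utterances):
        role, model = speakers[step % 2]
        utterance = model(history)   # each model sees the full shared history
        history.append((role, utterance))
        if end_marker in utterance:
            break
    return history


def scripted(lines):
    """Stand-in for a trained generator: replies with the next scripted line."""
    remaining = iter(lines)
    return lambda history: next(remaining, "<end>")


user_bot = scripted(["hi, i need a cheap restaurant", "thanks <end>"])
system_bot = scripted(["[name] is a cheap option"])
dialog = self_play(user_bot, system_bot)
```

Dialogs produced this way would still need filtering before being used as training data, which is where the adversarial filtering mentioned above would come in.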

CONCLUSIONS
We propose the Alternating Recurrent Dialog Model (ARDM), a simple, general, and effective dialog method that models the user and system separately with large-scale pre-trained language models. Since ARDM does not require any annotations, it generalizes to different dialog applications. Experimental results on CamRest676 and MultiWOZ suggest that ARDM outperforms or is on par with the current state-of-the-art methods that use manual annotation information, such as belief states and dialog acts. Furthermore, we find that our model's strong performance generalizes to more complex non-collaborative dialog settings. It can generate high-quality responses to persuade people to donate to charity. However, the ease of training ARDM raises concerns about the misuse of the model in scenarios such as sales, harassment, or scams on a mass scale. We caution the public in deploying such systems in the real world.

REFERENCES
Tianyu Zhao and Tatsuya Kawahara. Effective incorporation of speaker information in utterance encoding in dialog. arXiv preprint arXiv:1907.05599, 2019.

A MULTIWOZ EVALUATOR INCONSISTENCY
We reran the baseline models to compare them with our method and found discrepancies among the results reported in different papers. To understand the reason, we compared LaRL's evaluator with the MultiWOZ Baseline's evaluator. We found that they make different assumptions in handling the "train" domain (lines 637-639 of LaRL's evaluator.py). After carefully analyzing the code and discussing with the authors of these two papers, we believe that LaRL's evaluator is more reasonable. However, in LaRL, the authors reported the MultiWOZ Baseline's scores with a different evaluator. Therefore, we re-evaluated all methods, including LaRL, HDSA, and the MultiWOZ Baseline, using the same evaluator for fairness.

ER: That is great to hear. I know your donation is a small amount but the impact this will have on the world can be significant. Your donation would be directly deducted from your task payment.
EE: That is true. I will donate one dollar to the charity.
ER: Thank you so much for your donation, it will help many children.
EE: Thank you. I hope you will continue to donate to the charity.