PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning

To build a high-quality open-domain chatbot, we introduce the effective training process of PLATO-2 via curriculum learning. There are two stages involved in the learning process. In the first stage, a coarse-grained generation model is trained to learn response generation under the simplified framework of one-to-one mapping. In the second stage, a fine-grained generative model augmented with latent variables and an evaluation model are further trained to generate diverse responses and to select the best response, respectively. PLATO-2 was trained on both Chinese and English data, and its effectiveness and superiority are verified through comprehensive evaluations, achieving new state-of-the-art results.


Introduction
Recently, task-agnostic pre-training with large-scale transformer models has achieved great success in natural language processing (Devlin et al., 2019), including open-domain dialogue generation. For instance, based on the general language model GPT-2 (Radford et al., 2019), DialoGPT (Zhang et al., 2020) is further trained for response generation using Reddit comments. To obtain a human-like open-domain chatbot, Meena (Adiwardana et al., 2020) scales up the network parameters to 2.6B and employs more social media conversations in the training process, leading to significant improvements in response quality. To mitigate the undesirable toxic or biased traits of large corpora, Blender (Roller et al., 2021) fine-tunes the pre-trained model with human-annotated datasets and emphasizes the desirable conversational skills of engagingness, knowledge, empathy and personality.
In addition to these attempts on model scale and data selection, PLATO (Bao et al., 2020) aims to tackle the inherent one-to-many mapping problem to improve response quality. One-to-many mapping refers to the fact that one dialogue context might correspond to multiple appropriate responses. It is widely recognized that the capability of modeling the one-to-many relationship is crucial for open-domain dialogue generation (Zhao et al., 2017; Chen et al., 2019). PLATO explicitly models this one-to-many relationship via discrete latent variables, aiming to boost the quality of dialogue generation. PLATO has a modest scale of 132M network parameters and was trained with 8M samples, achieving relatively good performance among conversation models of a similar scale. However, directly scaling up PLATO encounters training instability and efficiency issues, which might result from the difficulty of capturing the one-to-many semantic relationship from scratch.
In this work, we try to scale up PLATO to PLATO-2 and introduce an effective training schema via curriculum learning (Bengio et al., 2009). There are two stages involved in the whole learning process, as shown in Figure 1. In the first stage, under the simplified one-to-one mapping modeling, a coarse-grained generation model is trained for response generation under different conversation contexts. This model tends to capture the typical patterns of diversified responses, sometimes resulting in general and dull responses during inference. Despite the problem of safe responses, this coarse-grained model is still highly effective in learning the general concepts of response generation. The curriculum learning then continues to the second stage, which contains the training of a fine-grained generation model and an evaluation model. The fine-grained generation model explicitly models the one-to-many mapping relationship via latent variables for diverse response generation. To select the most appropriate response, an evaluation model is trained to estimate the bi-directional coherence between the dialogue context and responses. Distinct from the multi-task PLATO, the separate design of fine-grained generation and evaluation enables each model to concentrate on its corresponding task, exempting it from multi-task disturbance (Standley et al., 2020).

[Figure 1: training phase (self-attention visualization and training objectives) and inference phase, illustrated with the context "How's your vacation?" and the response "Amazing! I had a wonderful trip to Hawaii."]
Compared with PLATO, PLATO-2 leverages curriculum learning to learn response generation gradually, from the general concept of one-to-one mapping to the complex concept of one-to-many mapping. With curriculum learning, we successfully scale the model up to billions of parameters, achieving new state-of-the-art results. Besides open-domain chitchat, the models learned in these two stages can also benefit task-oriented conversation and knowledge grounded dialogue, respectively, whose effectiveness is verified thoroughly in DSTC9 (Gunasekara et al., 2020).
To sum up, we trained PLATO-2 with three model sizes: 1.6B, 314M and 93M parameters. In addition to the English models, we also trained Chinese models with massive social media conversations.

Methodology
The backbone of PLATO-2 consists of transformer blocks with pre-normalization (Radford et al., 2019). Distinct from conventional Seq2Seq models, there are no separate encoder and decoder networks in our architecture. PLATO-2 keeps a unified network for bi-directional context encoding and uni-directional response generation through a flexible attention mechanism (Dong et al., 2019).

Curriculum Learning
In this work, we carry out effective training of PLATO-2 via curriculum learning. As shown in Figure 2, there are two stages involved in the learning process: during stage 1, a coarse-grained baseline model is trained for general response generation under the simplified one-to-one mapping relationship; during stage 2, two models of fine-grained generation and evaluation are further trained for diverse response generation and response coherence estimation respectively.

General Response Generation
It is well known that there exists a one-to-many relationship in open-domain conversations, where a piece of context may have multiple appropriate responses. Since conventional approaches try to fit the one-to-one mapping, they tend to generate generic and dull responses. Nevertheless, one-to-one mapping is still an efficient way to capture the general characteristics of response generation. As such, we first train a coarse-grained baseline model to learn general response generation under the simplified relationship of one-to-one mapping. Given one training sample of context and response (c, r), we minimize the following negative log-likelihood (NLL) loss:

$\mathcal{L}_{NLL} = -\mathbb{E}\,\sum_{t=1}^{T} \log p(r_t \mid c, r_{<t})$,

where T is the length of the target response r and $r_{<t}$ denotes the previously generated words. Since response generation is a uni-directional decoding process, each token in the response only attends to those before it, shown as dashed orange lines in Figure 2. As for the context, bi-directional attention is enabled for better natural language understanding, shown as blue lines in Figure 2.
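The flexible attention pattern described above can be sketched as a mask matrix. This is a minimal NumPy illustration under our own naming and layout conventions, not PLATO-2's actual implementation: context positions attend bidirectionally among themselves, while each response position attends to the full context plus earlier response tokens.

```python
import numpy as np

def flexible_attention_mask(context_len: int, response_len: int) -> np.ndarray:
    """Build a (L, L) unified attention mask: 1 = may attend, 0 = masked.
    Context tokens attend bidirectionally within the context; response
    tokens attend to the full context and causally among themselves."""
    total = context_len + response_len
    mask = np.zeros((total, total), dtype=np.int64)
    # Every token may attend to the full (bi-directional) context.
    mask[:, :context_len] = 1
    # Uni-directional (causal) attention among response tokens;
    # context rows never attend to response columns.
    for t in range(response_len):
        row = context_len + t
        mask[row, context_len:row + 1] = 1
    return mask
```

Broadcasting this mask over attention logits (setting masked positions to a large negative value before softmax) realizes the shared encoder-decoder design without separate networks.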

Diverse Response Generation
Based upon the coarse-grained baseline model, diverse response generation is warm-started and further trained under the relationship of one-to-many mapping. Following the previous work PLATO, a discrete latent variable z is introduced for one-to-many relationship modeling. z is a K-way categorical variable, with each value corresponding to a particular latent speech act in the response. The model first estimates the latent act distribution of the training sample p(z|c, r) and then generates the response with the sampled latent variable p(r|c, z). It is notable that the two tasks of response generation and latent act recognition are trained jointly within the shared network. The NLL loss of diverse response generation is defined as:

$\mathcal{L}_{NLL} = -\mathbb{E}_{z \sim p(z \mid c, r)}\,\sum_{t=1}^{T} \log p(r_t \mid c, z, r_{<t})$,

where z is the latent act sampled from p(z|c, r). As sampling is not differentiable, we approximate it with Gumbel-Softmax (Jang et al., 2017). The posterior distribution over latent values is estimated through the task of latent act recognition:

$p(z \mid c, r) = \mathrm{softmax}(W_1 h_{[M]} + b_1) \in \mathbb{R}^{K}$,

where $h_{[M]} \in \mathbb{R}^{D}$ is the final hidden state of the special mask token [M], and $W_1 \in \mathbb{R}^{K \times D}$ and $b_1 \in \mathbb{R}^{K}$ denote the weight matrix and bias of one fully-connected layer.
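The Gumbel-Softmax relaxation mentioned above can be sketched as follows. This is an illustrative NumPy version; the temperature `tau` is a hypothetical hyper-parameter here (its actual value is not stated in this section), and a real implementation would operate on framework tensors so gradients flow through the relaxed sample.

```python
import numpy as np

def gumbel_softmax(logits: np.ndarray, tau: float = 1.0, rng=None) -> np.ndarray:
    """Differentiable approximation to sampling the K-way categorical
    latent act z. `logits` stand for the unnormalized posterior scores
    (e.g. W1 h_[M] + b1); returns a relaxed one-hot vector over K values."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise; small epsilons guard against log(0).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()
```

Lowering `tau` pushes the output toward a hard one-hot sample, while higher values keep it smooth and easier to differentiate through.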
To facilitate the training process of discrete latent variables, the bag-of-words (BOW) loss (Zhao et al., 2017) is also employed:

$\mathcal{L}_{BOW} = -\mathbb{E}_{z \sim p(z \mid c, r)}\,\sum_{t=1}^{T} \log f_{r_t}$,

where $f_{r_t}$ denotes the estimated probability of word $r_t$ under the function f, which tries to predict the words within the target response in a non-autoregressive way:

$f = \mathrm{softmax}(W_2 h_z + b_2) \in \mathbb{R}^{|V|}$,

where V refers to the whole vocabulary and $h_z$ is the final hidden state of the latent variable. Compared with the NLL loss, the BOW loss discards word order and forces the latent variable to capture the global information of the target response. To sum up, the objective of the fine-grained generation model is to minimize the following integrated loss:

$\mathcal{L}_{gen} = \mathcal{L}_{NLL} + \mathcal{L}_{BOW}$
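The BOW loss can be sketched numerically as follows. In this assumed setup, `word_logits` stand for the output of the softmax head over the latent state h_z (whose weights are not specified here), and the loss simply sums the negative log-probabilities of the target words, ignoring their order.

```python
import numpy as np

def bow_loss(word_logits: np.ndarray, target_ids) -> float:
    """Bag-of-words loss over a single sample: `word_logits` has shape
    (|V|,); `target_ids` are the vocabulary indices of the target
    response's words. Word order is deliberately ignored."""
    m = word_logits.max()
    # Stable log-softmax over the vocabulary.
    log_probs = word_logits - m - np.log(np.exp(word_logits - m).sum())
    return float(-sum(log_probs[t] for t in target_ids))
```

Because every target word is predicted from h_z alone, the latent variable is pressured to encode global response content rather than local next-word statistics.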

Response Coherence Estimation
By assigning distinct values to the latent variable, the fine-grained generation model is able to produce multiple high-quality and diverse responses. To select the most appropriate response from these candidates, one straightforward way is to rank them according to p(z|c)p(r|c, z). However, it is widely recognized that the prior distribution p(z|c) is difficult to estimate and the uniform distribution is not an effective approximation. To this end, we adopt an alternative approach and train an evaluation model in the second stage, estimating the coherence between each response and the given dialogue context. The loss of response coherence estimation (RCE) is defined as follows:

$\mathcal{L}_{RCE} = -\log p(l_r = 1 \mid c, r) - \log p(l_{r^-} = 0 \mid c, r^-)$

The positive training samples come from the dialogue context and the corresponding target response (c, r), with coherence label $l_r = 1$. The negative samples are created by randomly selecting responses from the corpus (c, r^-), with coherence label $l_{r^-} = 0$. In addition to our coherence evaluation function p(l_r|c, r), two other functions are widely used for response selection. One is the length-averaged log-likelihood (Adiwardana et al., 2020), which considers the forward response generation probability p(r|c). The other is maximum mutual information (Zhang et al., 2020), which considers the backward context recovery probability p(c|r). However, the forward score favors safe and generic responses due to the property of maximum likelihood, while the backward score tends to select responses with a high overlap with the context, resulting in repetitive conversations. By contrast, the discriminative function p(l_r|c, r) considers the bi-directional information flow between the dialogue context and response. Our coherence evaluation ameliorates the aforementioned problems, and its effectiveness is verified in the experiments.
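The positive/negative sampling scheme above can be sketched as follows. This is a simplified illustration of the sampling idea, not the exact data pipeline: each (context, response) pair yields a positive sample, and a negative pairs the same context with a response drawn at random from elsewhere in the corpus.

```python
import random

def build_rce_samples(dialogues, rng=None):
    """Create (context, response, label) triples for response coherence
    estimation: label 1 for the true pair, label 0 for a randomly
    mismatched response. `dialogues` is a list of (context, response)."""
    rng = rng or random.Random()
    responses = [r for _, r in dialogues]
    samples = []
    for context, response in dialogues:
        samples.append((context, response, 1))
        negative = response
        while negative == response:          # avoid sampling the true response
            negative = rng.choice(responses)
        samples.append((context, negative, 0))
    return samples
```

The evaluation model is then trained with binary cross-entropy on these triples, which is exactly the RCE loss above summed over one positive and one negative sample.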
To maintain the capacity of distributed representation, the task of masked language model (MLM) (Devlin et al., 2019) is also included in the evaluation network. Within this task, 15% of the input tokens are masked at random and the network needs to recover them. The MLM loss is defined as:

$\mathcal{L}_{MLM} = -\mathbb{E}\,\sum_{m \in M} \log p(x_m \mid x_{\setminus M})$,

where x refers to the input tokens of context and response, $\{x_m\}_{m \in M}$ stands for the masked tokens, and $x_{\setminus M}$ denotes the rest of the unmasked ones.
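The masking step can be sketched as below. This is a simplified version of standard MLM preprocessing: it masks 15% of positions and records the originals as recovery targets, omitting BERT's 80/10/10 replacement scheme, which this section does not describe.

```python
import random

def mask_tokens(tokens, mask_token="[M]", ratio=0.15, rng=None):
    """Randomly mask `ratio` of the input tokens for the MLM task.
    Returns (masked_sequence, {position: original_token}) so the network
    can be trained to recover the masked tokens."""
    rng = rng or random.Random()
    n_mask = max(1, int(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets
```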
To sum up, the objective of the evaluation model is to minimize the following integrated loss:

$\mathcal{L}_{eval} = \mathcal{L}_{RCE} + \mathcal{L}_{MLM}$

Inference
For open-domain chitchat, inference is carried out with the second stage's models as follows. 1) Diverse response generation: conditioned on each latent value $z \in \{1, \cdots, K\}$, a corresponding candidate response $r_z$ is produced by the fine-grained generation model $p(r_z \mid c, z)$. 2) Response coherence estimation: the evaluation model performs ranking and selects the candidate with the highest coherence value as the final response, $r^* = \arg\max_{r_z} p(l_{r_z} = 1 \mid c, r_z)$.
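The two-step inference procedure can be sketched as a generate-then-rank loop. Here `generate` and `score` are stand-in callables for the fine-grained generation model p(r|c, z) and the evaluation model p(l_r = 1|c, r); they are assumptions for illustration, not actual PLATO-2 APIs.

```python
def infer(context, generate, score, K=20):
    """Generate one candidate response per latent value z in {0..K-1},
    then return the candidate the evaluation model scores as most
    coherent with the context."""
    candidates = [generate(context, z) for z in range(K)]
    return max(candidates, key=lambda r: score(context, r))
```

For example, `infer(ctx, model.decode, ranker.coherence)` would produce K diverse candidates and keep the one the ranker prefers, which is the selection rule given above.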

Training Details
PLATO-2 has three model sizes: a standard version of 1.6B parameters, a small version of 314M parameters, and a tiny version of 93M parameters. Detailed network and training configurations are summarized in the Appendix. The main hyper-parameters used in the training process are listed as follows. The maximum sequence lengths of context and response are both set to 128. K is set to 20 for the discrete latent variable (Bao et al., 2020; Chen et al., 2019). We use Adam (Kingma and Ba, 2015) as the optimizer, with a learning rate scheduler consisting of a linear warmup and an inverse-square-root (invsqrt) decay (Vaswani et al., 2017). To train the large-scale model with a relatively large batch size, we employ gradient checkpointing (Chen et al., 2016) to trade computation for memory. The training was carried out on 64 Nvidia Tesla V100 32G GPU cards. It takes about 3 weeks for the 1.6B parameter model to complete the curriculum learning process.
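The warmup-then-invsqrt schedule can be sketched as follows, in the style of Vaswani et al. (2017). The `warmup_steps` and `peak_lr` values here are illustrative placeholders; the actual settings are in the configuration table in the Appendix.

```python
def lr_schedule(step: int, warmup_steps: int = 4000, peak_lr: float = 1e-3) -> float:
    """Linear warmup to `peak_lr` over `warmup_steps`, then inverse
    square-root decay: lr = peak_lr * sqrt(warmup_steps / step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```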

Compared Methods
The following methods have been compared in the experiments.

Evaluation Metrics
We carry out both automatic and human evaluations in the experiments. In automatic evaluation, to assess the model's capacity for lexical diversity, we use the corpus-level metric distinct-1/2 (Li et al., 2016a), which is defined as the number of distinct uni- or bi-grams divided by the total number of generated words.
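The distinct-n definition above can be computed directly; this short sketch takes tokenized responses and follows the stated formula (unique n-grams over total generated words).

```python
def distinct_n(responses, n: int) -> float:
    """Corpus-level distinct-n: number of unique n-grams across all
    generated responses divided by the total number of generated words.
    `responses` is a list of token lists."""
    ngrams = set()
    total_words = 0
    for tokens in responses:
        total_words += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / max(total_words, 1)
```

Higher values indicate less repetition across the generated corpus; a model that always emits the same safe response scores near zero.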
In human evaluation, we employ four utterance-level and dialogue-level metrics: coherence, informativeness, engagingness and humanness. Three crowd-sourcing workers are asked to score the response/dialogue quality on a scale of [0, 1, 2], with the final score determined through majority voting. The higher the score, the better. These criteria are discussed as follows, with scoring details provided in the Appendix.
• Coherence is an utterance-level metric, measuring whether the response is relevant and consistent with the context.
• Informativeness is also an utterance-level metric, evaluating whether the response is informative given the context.
• Engagingness is a dialogue-level metric, assessing whether the annotator would like to talk with the speaker for a long conversation.
• Humanness is also a dialogue-level metric, judging whether the speaker is a human being or not.

Experimental Results
In the experiments, we include both static and interactive evaluations.

Self-Chat Evaluation
Self-chats have been widely used in the evaluation of dialogue systems (Li et al., 2016b; Bao et al., 2019; Roller et al., 2021), where a model plays the role of both partners in the conversation. Compared with human-bot conversations, self-chat logs can be collected efficiently at a lower cost. As reported in Li et al. (2019), self-chat evaluations exhibit high agreement with human-bot chat evaluations. In the experiments, we ask the bot to perform self-chats and then invite crowd-sourcing workers to evaluate the dialogue quality. The way to start the interactive conversation needs special attention. As pointed out by Roller et al. (2021), if starting with 'Hi!', the partners tend to greet each other and only cover shallow topics in a short conversation. Therefore, to expose the model's weaknesses and explore its limits, we choose to start the interactive conversation with pre-selected topics. We use the classical 200 questions as start topics (Vinyals and Le, 2015) and ask the bot to perform self-chats given the context. There are 10 utterances in each dialogue, including the input start utterance. We carry out automatic evaluation on the 200 self-chat logs and randomly select 50 conversations for human evaluation. (As the Chinese vocabulary contains 30K BPE tokens, the Chinese small model has 22.5M more parameters than the English small model.)
The compared models are divided into three groups. The first group includes the PLATO 132M model and the PLATO-2 93M model, which have similar model scales. The second group includes the DialoGPT 345M model and the PLATO-2 314M model, both trained on Reddit comments and of similar model scales. The third group includes the Blender 2.7B model and the PLATO-2 1.6B model, both first trained on Reddit comments and further fine-tuned with BST conversations. In human evaluation, two self-chat logs, which come from the same group and share the same start topic, are displayed to three annotators. One example is given in Figure 3. As suggested in ACUTE-Eval (Li et al., 2019), we ask the crowd-sourcing workers to pay attention to only one speaker within a dialogue. In the evaluation, they need to give scores on coherence and informativeness for each of P1's utterances, and assess P1's overall quality on engagingness and humanness.
The self-chat evaluation results are summarized in Table 1. These results indicate that the PLATO-2 1.6B model obtains the best performance across human and automatic evaluations. In the first group, PLATO-2 achieves better performance than PLATO at a similar model scale, which might mainly result from the stable curriculum learning and large-scale conversation data. In the second group, DialoGPT tends to generate repetitive conversations due to its backward scoring function, resulting in poor performance in interactive evaluation. In the third group, PLATO-2 outperforms the state-of-the-art open-domain chatbot Blender. The gap between Blender and PLATO-2 on the corpus-level metric distinct-1/2 suggests that PLATO-2 has a better capacity for lexical diversity. In addition, the differences among these three groups suggest that enlarging model scales and exploiting human-annotated conversations help improve dialogue quality.

Human-Bot Chat Evaluation
In the Chinese evaluation, it is difficult to carry out self-chats with Microsoft XiaoIce, as there is no publicly available API. Therefore, we collect human-bot conversations through its official Weibo platform. The interactive conversation also starts with a pre-selected topic and continues for 7-14 rounds. 50 diverse topics are extracted from the high-frequency topics of a commercial chatbot, including travel, movies, hobbies and so on. The collected human-bot conversations are distributed to crowd-sourcing workers for evaluation. The human and automatic evaluation results are summarized in Table 2. XiaoIce obtains higher distinct values, which may be because it uses a retrieval-based strategy in response generation. The human evaluations indicate that our PLATO-2 model achieves significant improvements over XiaoIce across all the human evaluation metrics.

Static Evaluation
Besides the interactive evaluation, we also employ static evaluation to analyze the models' performance. In static evaluation, each model produces a response for a given multi-turn context. The following strong models are involved in the evaluation: Meena, Blender, DialoGPT and PLATO-2 1.6B. To compare with Meena, we include the 60 static samples provided in the appendix of its paper and generate corresponding responses with the other models. We also include 60 test samples about daily life from Daily Dialog (Li et al., 2017) and 60 test samples about in-depth discussion from Reddit. Given that the measurement of humanness usually needs multi-turn interaction, this metric is excluded from static evaluation. The evaluation results are summarized in Table 3. It can be observed that PLATO-2 is able to produce coherent, informative and engaging responses across different chat scenarios. The average Fleiss's kappa (Fleiss, 1971) is also computed to measure inter-annotator agreement.

Case Analysis
To further analyze the models' features, two self-chat examples of Blender and PLATO-2 are provided in Figure 3. Although both models are able to produce high-quality engaging conversations, they exhibit distinct discourse styles. Blender tends to switch topics quickly within a short conversation, covering alcohol, hobbies, movies and work. The emergence of this style might be related to the BST fine-tuning data. For instance, persona chat in BST is about the exchange of personal information between two partners, where topics need to switch quickly so the partners can learn more about each other. Due to the task settings of data collection, some human-annotated conversations might be a little unnatural. Nevertheless, fine-tuning with BST conversations is essential to mitigate the undesirable toxic traits of large corpora and emphasize the desirable skills of human conversations. Distinct from Blender, PLATO-2 can stick to the start topic and conduct in-depth discussions. The reasons might be two-fold. First, our model is able to generate diverse and informative responses thanks to the accurate modeling of the one-to-many relationship. Second, the evaluation model helps select a coherent response and stick to the current topic. We asked crowd-sourcing workers to annotate which model's in-depth discussion is better w.r.t. the start topic. The comparison results are shown in Table 4, which also verifies the above analysis of discourse styles.

Why Does PLATO-2 Perform Better?
Why does PLATO-2 achieve better performance compared with Meena, Blender and other state-of-the-art models? As analyzed above, the major reasons might come from two aspects: fine-grained generation and evaluation. First, PLATO-2 employs a discrete latent variable for one-to-many relationship modeling, which enables the generation of high-quality and diverse responses. Second, the evaluation model in PLATO-2 is effective at selecting the most appropriate response from the candidates.
In fact, these two aspects are associated with the curriculum learning of the second stage, which models the one-to-many relationship for open-domain conversations. By contrast, Meena and Blender are learned under the one-to-one mapping relationship, similar to the first stage of PLATO-2. To dissect the effects of the models from these two stages, we further ask crowd-sourcing workers to evaluate the models' self-chat logs on the dialogue-level metrics. The comparison results are summarized in Table 5. These results verify the effectiveness of curriculum learning in PLATO-2.

Further Exploration of PLATO-2
In addition to open-domain chitchat, there are two other kinds of dialogues in conversational AI (Gao et al., 2018): knowledge grounded dialogue, and task-oriented conversation. Similar to opendomain conversation, the one-to-many mapping relationship also exists in knowledge grounded dialogue (Kim et al., 2020): given a dialogue context, multiple pieces of knowledge might be applicable for the response generation. Therefore, the one-to-many mapping models of the second stage can also be adapted for knowledge grounded dialogue. By expanding the network input with the knowledge segment, the background knowledge is encoded and grounded for response generation.
Distinct from the open-domain conversation and knowledge grounded dialogue, task-oriented conversations usually need to accomplish a specific goal. Accordingly, the conversation flow would become less diverse and concentrated on task completion. Therefore, the one-to-one mapping generation model of the first stage can be used for the end-to-end task-oriented conversation.
To explore the PLATO-2 two-stage framework, we participated in several tasks of DSTC9 (Gunasekara et al., 2020), including interactive evaluation of open-domain conversation (Track3-task2), static evaluation of knowledge grounded dialogue (Track3-task1), and end-to-end task-oriented conversation (Track2-task1). PLATO-2 achieved first place in all three tasks (Bao et al., 2021). To sum up, the benefits brought by the two-stage curriculum learning of PLATO-2 are two-fold. First, given the difficulties of scaling up PLATO, the two-stage curriculum learning is an essential ingredient for the successful training of the 1.6B parameter PLATO-2. Second, the two-stage PLATO-2 adapts well to multiple conversational tasks, indicating its potential as a unified pre-training framework for conversational AI.

Related Work
Related work includes large-scale language models and open-domain dialogue generation.

Large-scale Language Models. Pre-trained large-scale language models have brought many breakthroughs on various NLP tasks. GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) are representative uni-directional and bi-directional language models, trained on general text corpora. By introducing pre-normalization and modifying weight initialization, GPT-2 (Radford et al., 2019) successfully extends the model scale from 117M to 1.5B parameters. To cope with memory constraints, Megatron-LM (Shoeybi et al., 2019) exploits model parallelism to train an 8.3B parameter model on 512 GPUs. GPT-3 (Brown et al., 2020) further trains a 175B parameter autoregressive language model, demonstrating strong performance on many NLP tasks. The development of large-scale language models is also beneficial to the task of dialogue generation.

Open-domain Dialogue Generation. On the basis of GPT-2, DialoGPT (Zhang et al., 2020) is trained for response generation using Reddit comments. To obtain a human-like open-domain chatbot, Meena (Adiwardana et al., 2020) scales up the network parameters to 2.6B and utilizes more social media conversations in the training process. To emphasize the desirable conversational skills of engagingness, knowledge, empathy and personality, Blender (Roller et al., 2021) further fine-tunes the pre-trained model with human-annotated conversations. In addition to these attempts on model scale and data selection, PLATO introduces a discrete latent variable to tackle the inherent one-to-many mapping problem and improve response quality. In this work, we explore the effective training of PLATO-2 via curriculum learning.

Conclusion
In this work, we discuss the effective training of open-domain chatbot PLATO-2 via curriculum learning, where two stages are involved. In the first stage, one coarse-grained model is trained for general response generation. In the second stage, two models of fine-grained generation and evaluation are trained for diverse response generation and response coherence estimation. Experimental results demonstrate that PLATO-2 achieves substantial improvements over the state-of-the-art methods in both Chinese and English evaluations.

A Data Cleaning Process
PLATO-2 has English and Chinese models, with training data extracted from open-domain social media conversations. As the comments are formatted in message trees, any conversation path from the root to a tree node can be treated as one training sample, with the node as response and its former turns as context. To improve the generation quality, we carry out elaborate data cleaning. A message node and its sub-trees will be removed if any of the following conditions is met.
1) The number of BPE tokens is more than 128 or less than 2.
2) Any word has more than 30 characters, or the message has more than 1024 characters.
3) The percentage of alphabetic characters is less than 70%.
4) The message contains a URL.
5) The message contains special strings, such as r/, u/, &amp.
6) The message has a high overlap with its parent's text.
7) The message is repeated more than 100 times.
8) The message contains offensive words.
9) The subreddit is quarantined.
10) The author is a known bot.
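A subset of these rules can be sketched as a single filter function. This is an illustrative sketch only: thresholds such as the 90% parent-overlap cutoff are assumptions, and rules requiring corpus statistics or external lists (repetition counts, offensive-word lists, bot authors, quarantined subreddits) are omitted.

```python
import re

def keep_message(text: str, parent_text: str = "",
                 max_chars: int = 1024, max_word_len: int = 30) -> bool:
    """Return True if the message passes the character-level cleaning
    rules: length limits, alphabetic ratio, no URLs or special strings,
    and low overlap with the parent message."""
    if len(text) > max_chars:
        return False
    words = text.split()
    if not words or any(len(w) > max_word_len for w in words):
        return False
    # Rule 3: at least 70% alphabetic characters.
    if sum(ch.isalpha() for ch in text) / len(text) < 0.7:
        return False
    # Rules 4-5: URLs and Reddit-specific special strings.
    if re.search(r"https?://|www\.", text):
        return False
    if any(s in text for s in ("r/", "u/", "&amp")):
        return False
    # Rule 6: high overlap with the parent (threshold is illustrative).
    parent_words = set(parent_text.split())
    if parent_words and len(set(words) & parent_words) / len(set(words)) > 0.9:
        return False
    return True
```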
After data cleaning, the English training data contains 684M (context, response) samples and the Chinese training data contains 1.2B (context, response) samples. Each English/Chinese sample has 2.78/2.82 utterances and each utterance has 26.29/22.20 tokens on average.

B Training Configurations
PLATO-2 has three model sizes: a standard version of 1.6B parameters, a small version of 314M parameters, and a tiny version of 93M parameters. The 1.6B parameter model has 32 transformer blocks and 32 attention heads, with an embedding dimension of 2048. The 314M parameter model has 24 transformer blocks and 16 attention heads, with an embedding dimension of 1024. The 93M parameter model has 12 transformer blocks and 12 attention heads, with an embedding dimension of 768. The training configurations of PLATO-2 1.6B are provided in Table 6. The training was carried out on 64 Nvidia Tesla V100 32G GPU cards. It takes about 3 weeks for the 1.6B parameter model to complete the curriculum learning process.

C Chinese Case Analysis
We also provide two human-bot chat examples of XiaoIce and PLATO-2 in Figure 4, with original interactive logs shown on the left and translated logs on the right. It can be observed that some responses produced by XiaoIce are not coherent with the contexts and there are some abrupt changes of topics. By contrast, the interaction with PLATO-2 is more coherent and engaging.

D Scoring Criteria in Human Evaluation
The criteria used in human evaluation are provided in Table 7.

E Response Selection Comparison
We carry out further experiments to compare the performance of distinct scoring functions in response selection. Firstly, a Chinese response selection dataset is constructed: 100 dialogue contexts are selected from the test set and 10 candidate responses are retrieved for each context with a commercial chatbot. Secondly, we ask crowd-sourcing workers to annotate whether each candidate response is coherent with the context. Thirdly, we train three 336M parameter models as the scoring functions, including the forward response generation probability p(r|c), the backward context recovery probability p(c|r) and the bi-directional coherence probability p(l_r|c, r). Their results on the annotated response selection dataset are summarized in Table 8. The metrics of mean average precision (MAP), mean reciprocal rank (MRR) and precision at position 1 (P@1) are employed. These results indicate that PLATO-2's evaluation model is better at selecting appropriate responses.
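The three ranking metrics can be computed from the scored candidate lists as sketched below; each query's candidates are represented here by their binary coherence labels sorted best-first by the scoring function under evaluation.

```python
def ranking_metrics(ranked_labels):
    """Compute (MAP, MRR, P@1) for a list of queries. Each entry is the
    candidates' binary coherence labels, sorted by the scoring function
    with the highest-scored candidate first."""
    ap_sum = rr_sum = p1_sum = 0.0
    for labels in ranked_labels:
        hits, precisions = 0, []
        for rank, label in enumerate(labels, start=1):
            if label:
                hits += 1
                precisions.append(hits / rank)   # precision at each hit
        ap_sum += sum(precisions) / max(len(precisions), 1)
        # Reciprocal rank of the first coherent candidate (0 if none).
        rr_sum += next((1.0 / r for r, l in enumerate(labels, 1) if l), 0.0)
        p1_sum += float(labels[0] == 1)
    n = len(ranked_labels)
    return ap_sum / n, rr_sum / n, p1_sum / n
```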