Combining Curriculum Learning and Knowledge Distillation for Dialogue Generation

Curriculum learning, a machine training strategy that feeds training instances to the model from easy to hard, has been shown to facilitate the dialogue generation task. Meanwhile, knowledge distillation, a knowledge transfer methodology between teacher and student networks, can yield a significant performance boost for student models. Hence, in this paper, we introduce a combination of curriculum learning and knowledge distillation for efficient dialogue generation models, where curriculum learning can help knowledge distillation from both the data and model aspects. To start with, from the data aspect, we cluster the training cases according to their complexity, which is calculated from various types of features such as sentence length and coherence between dialog pairs. Furthermore, we employ an adversarial training strategy to identify the complexity of cases at the model level. The intuition is that, if a discriminator can tell whether the generated response is from the teacher or the student, then the case is a difficult one that the student model has not yet adapted to. Finally, we use self-paced learning, which is an extension of curriculum learning, to assign weights for distillation. In conclusion, we arrange a hierarchical curriculum based on the above two aspects for the student model under the guidance of the teacher model. Experimental results demonstrate that our methods achieve improvements compared with competitive baselines.


Introduction
Along with the enormous prosperity of social media on the Internet, there is a resurgent interest in developing open-domain dialogue systems. However, the complexity of conversations crawled from the Internet may vary significantly (Sachan and Xing, 2016; Cai et al., 2020; Lison and Bibauw, 2017). To adapt to this phenomenon, some prior works (Cai et al., 2020; Sachan and Xing, 2016; Feng et al., 2019) employ curriculum learning (Bengio et al., 2009), in which a model is first taught with easy samples and more difficult ones are gradually added. For example, Cai et al. (2020) propose an adaptive multi-curricula learning framework to train the dialogue model on easy-to-complex data based on various concepts of difficulty, including the specificity and repetitiveness of the response and the relevance between the query and the response. Also, Wan et al. (2020) address this problem by introducing self-paced learning (Kumar et al., 2010), which is a special kind of curriculum learning (Eppe et al., 2019). Wan et al. (2020) measure the level of confidence on each training example, where an easy sample is one on which the currently trained model has high confidence. Both curriculum learning and self-paced learning suggest that samples should be selected in a meaningful order for training. The difference is that curriculum learning relies on pre-training or human intuitions to fix the order, while the emphasis of self-paced learning is determined dynamically by the model itself.
* Corresponding authors: JunFei Liu and Dongyan Zhao.
On the other hand, knowledge distillation (Hinton et al., 2015) is one of the most popular techniques for training efficient models; it aims to transfer the knowledge encoded in a pretrained teacher network into a student model. Ba and Caruana (2014) point out that the teacher's probability predictions can capture the logarithm relationships between labels that are not obvious in the one-hot ground-truth label. Moreover, the teacher model can spread uncertainty over multiple outputs when it is not confident of its prediction. As a consequence, student models can obtain a significant performance boost under the guidance of a teacher. Since the knowledge transferred from the teacher to the student also varies in difficulty, it is intuitive to apply curriculum learning during this distillation process.
To the best of our knowledge, very little is known about how curriculum learning and knowledge distillation work together. Hence, in this work, we propose a dialogue generation model that combines curriculum learning and knowledge distillation. First, at the data level, we employ different types of features, such as sentence length, word and utterance entropy, and coherence between dialog pairs, to estimate the complexity of each training example. We preliminarily cluster the training cases according to their data-level complexity. Then, based on these difficulty scores, we construct a learning curriculum for the student model. Second, at the model level, we employ an adversarial training strategy to evaluate the model-aware complexity. Concretely, the student model is adversarially trained to fool a discriminator, while the discriminator aims to distinguish the outputs of the student and teacher networks. We measure the hardness of each sample by taking both the value and the history of the discriminator into account, based on the following two intuitions. (1) The discriminator defines an objective of progressive difficulty (Doan et al., 2019): if the discriminator can successfully distinguish the output, then it is a hard case, and vice versa. (2) The model evolves during training, and therefore an additional evaluation pass is needed to measure the change in performance (Matiisen et al., 2020). In this paper, we consider the change in the discriminator. If the change is negative, we take this to mean that the sample is difficult to train. Then, based on these model-level difficulty scores, we further transfer the knowledge from the teacher to the student network gradually by incorporating the self-paced learning methodology.
Our contributions are summarized as follows: (1) We make an empirical study of the combination of curriculum learning methods and knowledge distillation for efficient dialogue generation models. (2) We arrange a hierarchical curriculum based on the above two aspects (data and model) for the distillation model. (3) We apply the proposed framework to two real-life open-domain conversation datasets; automatic and manual evaluations show that our approach can be used to enhance the performance of dialogue models.

Related Work
Our work is at the intersection of curriculum learning (Bengio et al., 2009) and knowledge distillation (Hinton et al., 2015) for training dialogue generation models.

Neural Dialogue Generation
Since Ritter et al. (2011) proposed a data-driven approach that adopts phrase-based statistical machine translation for dialog systems, more and more researchers have focused on generation-based conversation systems. A popular framework for dialogue generation uses extra information such as conversation topics (Xing et al., 2017), persona profiles (Song et al., 2019), user emotions (Zhou et al., 2018), or outsourced knowledge to benefit the dialogue model with more diverse response generation (Serban et al., 2017; Zhao et al., 2017; Gu et al., 2019). Latent variables also benefit the model with more diverse response generation (Zhao et al., 2017). This paper improves the dialogue model from a different angle: we make an empirical study of the combination of curriculum learning methods and knowledge distillation.

Knowledge Distillation
Our study is also related to the knowledge distillation method (Hinton et al., 2015), which employs a teacher model and tries to minimize the KL divergence between the teacher distribution and the student distribution. In Romero et al. (2015), the student network is trained not only using the soft targets, but also using hints from the intermediate layers.
Knowledge distillation was first introduced for classification tasks as a way to compress large networks into smaller models. Kim and Rush (2016) extend this to neural machine translation, and subsequent work has proposed further applications to the dialogue generation task. However, these papers do not consider the order of the learning schedule. Our method differs from theirs in that we borrow the idea of curriculum learning for knowledge distillation.

Curriculum Learning in NLP
Inspired by the human learning process, curriculum learning (Bengio et al., 2009) was proposed as a machine learning strategy that feeds training instances to the model from easy to hard. It has been applied to many NLP tasks. To name a few, Sachan and Xing (2016) propose and study heuristics that define a measure of easiness and learn the curriculum by selecting samples using this measure. More recently, a multi-domain curriculum has been learned for neural machine translation. Xu et al. (2020) use curriculum learning to distinguish easy examples from difficult ones for natural language understanding by reviewing the training set in a crossed way. Our paper differs from these works because we arrange a hierarchical curriculum based on the above two aspects (data and model) for the distillation model.

Problem Formulation
The overall network architecture is shown in Figure 1. The teacher and student models use the same basic architecture, an encoder-decoder (Cho et al., 2014) generative dialogue model based on Variational Autoencoders (VAEs) (Kingma and Welling, 2014).
In our model, there are three elements: the dialogue context $X = x_1, x_2, \ldots, x_i$, the response $Y = y_1, y_2, \ldots, y_i$, and a latent variable $z$. The dialogue context $X$ is composed of several history utterances. The response $Y$ is the reply to the given context. The latent variable $z$ is used to capture the latent distribution over the replies with a standard Gaussian prior, which is defined as follows:
$$p(z) = \mathcal{N}(0, I) \tag{1}$$
Our task is to model the true probability of a response $Y$ given an input $X$, which can be estimated as:
$$p(Y \mid X) = \int_z p(Y \mid X, z)\, p(z)\, dz \tag{2}$$
The hierarchical curriculum strategy for the distillation model consists of two parts: one for the data level and the other for the model level. At the data level, easier context-response pairs are presented to the student model before harder ones. At the model level, we design curriculum schedules to gradually transfer knowledge from the teacher to the student, which controls the difficulty of the softened labels that are distilled from teacher to student. The samples for which the discriminator cannot differentiate between the output provided by the student and that provided by the teacher are assumed to be the easier ones. Starting from easier samples, the model progressively strengthens the relation between the teacher and student models. In the rest of this paper, we give detailed descriptions of the proposed approach.
The complexity features used at the data level (e.g., Xu et al., 2018) compensate each other by capturing the information in a sentence pair from different aspects. All these features are from previous research and here we integrate them: we first use the methods from these papers to compute the scores for individual sentences, then normalize the scores, and finally add all the scores together as a total score. We rank all sentence pairs according to their total scores and break down the dataset $D_o$ into $N$ subsets, in which examples with similar complexity are categorized into the same subset.
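The ranking-and-splitting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: min-max normalization, equal-sized contiguous subsets, and the names `min_max_normalize` and `split_by_complexity` are assumptions introduced here, since the text only states that scores are normalized, summed, ranked, and divided into N subsets.

```python
from typing import Dict, List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale one feature to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_by_complexity(
    feature_scores: Dict[str, Sequence[float]], n_subsets: int
) -> List[List[int]]:
    """Return example indices grouped into subsets ordered from easy to hard.

    feature_scores maps a feature name (e.g. "length") to one raw score per
    training pair; higher is assumed to mean more complex.
    """
    columns = [min_max_normalize(scores) for scores in feature_scores.values()]
    n_examples = len(columns[0])
    # Total complexity score: sum of the normalized features.
    totals = [sum(col[i] for col in columns) for i in range(n_examples)]
    ranked = sorted(range(n_examples), key=lambda i: totals[i])
    # Cut the ranking into contiguous chunks of near-equal size (assumption).
    base, extra = divmod(n_examples, n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        size = base + (1 if k < extra else 0)
        subsets.append(ranked[start:start + size])
        start += size
    return subsets


subsets = split_by_complexity(
    {"length": [5, 20, 10, 40], "entropy": [1.0, 3.0, 2.0, 4.0]}, n_subsets=2
)
```

In this toy call, the two lowest-scoring pairs form the first (easiest) subset and the two highest-scoring pairs form the second.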

Output Knowledge Distillation
Knowledge distillation describes a class of methods for transferring knowledge from a teacher network to a student network. In our model, the student network $S_\theta$ has the same architecture as the teacher model $T_\theta$ but different parameters. The teacher is trained beforehand, and we freeze its parameters when training the student network.
We transfer the knowledge from teacher to student by minimizing the distance between the output of the student network and the soft label generated by the teacher network. We use the cross-entropy loss to compare the two sets of logits, as in Romero et al. (2015). To further improve the sequence-to-sequence student model, hard-assigned labels are also utilized. The final student network is trained to optimize the compound objective of Equation (3), where H refers to the cross-entropy and V is a parameter to indicate the temperature of distillation. Later, we will use the model-level curriculum learning method to process λ in Section 2.5. Note that the first term in Equation (3) corresponds to the traditional cross-entropy between the output of the softmax layer of the (student) network and the word distribution in the response Y, whereas the second term learns from the softened output of the teacher network to strengthen its supervision of the student. The teacher model is trained on the whole dataset in its original order, while training of the student model starts from the step that consists of the examples with the lowest difficulty. After that, the data of the next step is aggregated into the current training dataset.
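As a rough illustration of the two-term objective and the aggregating schedule described above, the following Python sketch computes a hard-label cross-entropy plus a temperature-softened teacher term for a single decoding step. The weighting `soft_weight`, the default `temperature`, and all function names are assumptions for illustration; the exact form of Equation (3) is not reproduced here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; higher temperature gives softer outputs."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: Sequence[float], predicted: Sequence[float]) -> float:
    """H(target, predicted) for two distributions over the vocabulary."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    gold_index: int,
    temperature: float = 2.0,   # assumed value
    soft_weight: float = 0.5,   # assumed weight of the teacher term
) -> float:
    """Hard-label term plus softened teacher term for one token."""
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature), softmax(student_logits, temperature)
    )
    return hard + soft_weight * soft


def cumulative_stages(subsets: Sequence[Sequence[int]]) -> List[List[int]]:
    """Stage k trains on subsets 0..k, i.e. each new subset is aggregated."""
    stages, seen = [], []
    for subset in subsets:
        seen = seen + list(subset)
        stages.append(seen)
    return stages


loss = distillation_loss([0.0, 0.0], [0.0, 0.0], gold_index=0)
stages = cumulative_stages([[0, 2], [1, 3]])
```

In a real model the loss would be averaged over all tokens of a response and computed on tensors; the scalar version is only meant to make the two terms explicit.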

Latent Space Knowledge Distillation
In order to guide the student's learning process for the output layer, we introduce hints (Romero et al., 2015), which are representations from an intermediate layer of the teacher network. Instead of adopting the classic student-teacher strategy of forcing the output of a student network to exactly mimic the soft targets produced by a teacher network, we introduce adversarial networks to transfer the knowledge from teacher to student. Due to the discrete nature of natural language tokens (Shen et al., 2017; Xu et al., 2017), it is difficult to pass the gradient update from the discriminator to the generator. We therefore choose to discriminate the variable z in the high-level latent space rather than the tokens directly (Gu et al., 2019).
During latent space knowledge distillation, we generate the student's latent variable representation by training the student network $S_\theta$ adversarially against a discriminator D while keeping the teacher frozen. The discriminator D attempts to classify its input as teacher or student by maximizing the discriminator loss of Equation (4) (Goodfellow et al., 2014), where W denotes the Wasserstein distance between the teacher and student distributions (Arjovsky et al., 2017). We choose the Wasserstein distance as the divergence since the WGAN has been shown to produce good results in text generation (Zhao et al., 2018). $z_t$ and $x_t$ denote the latent variable and query representation in $T_\theta$; $z_s$ and $x_s$ denote the latent variable and query representation in $S_\theta$. The student network attempts to generate similar outputs that fool the discriminator D. D is implemented as a feed-forward neural network that takes the concatenation of z and x as input and outputs a real value.
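The adversarial setup above can be sketched with a toy critic in plain Python. This is only an illustration under stated assumptions: a single linear layer stands in for the feed-forward discriminator, the sign convention follows the usual WGAN formulation with the teacher treated as the "real" side, and the gradient penalty is omitted. None of the names below come from the paper.

```python
from typing import Sequence


def linear_critic(
    z: Sequence[float], x: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Toy critic D([z; x]) returning a real value (stand-in for an MLP)."""
    features = list(z) + list(x)  # concatenation of latent variable and query
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated input size")
    return sum(w * f for w, f in zip(weights, features)) + bias


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def critic_objective(
    teacher_scores: Sequence[float], student_scores: Sequence[float]
) -> float:
    """Quantity the critic maximizes: E[D(z_t, x_t)] - E[D(z_s, x_s)]."""
    return mean(teacher_scores) - mean(student_scores)


def student_adversarial_loss(student_scores: Sequence[float]) -> float:
    """Quantity the student minimizes in order to fool the critic."""
    return -mean(student_scores)


score = linear_critic([1.0, 2.0], [3.0], weights=[0.5, 0.5, 1.0])
gap = critic_objective([1.0, 3.0], [0.0, 1.0])
student_loss = student_adversarial_loss([0.0, 1.0])
```

A large gap means the critic separates teacher and student latent variables easily; as the student improves, the gap is expected to shrink.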

Model-Level Difficulty Evaluation
In the first step, we selected data based on the definition of data difficulty. In this step, we select the teacher's knowledge by using curriculum learning based on the performance of the GAN. A GAN can be said to share aspects with curriculum learning: the discriminator defines an objective of progressive difficulty (Doan et al., 2019). We consider two different metrics as scores for measuring generator progress in our curriculum approach. (1) Discriminator evaluation, where $Score_i$ is the difficulty score of the $i$-th sample.
For comparison, we also use the loss value of the distance between the outputs of the teacher and student networks to measure the sample difficulty. (1) Loss value.

Self-paced Learning
In this section, we aim to decide the order of output distillation. Rather than distilling all samples from teacher to student equally, we start training from simple samples and gradually select complex samples to join the training process of the model. That is to say, we need to determine the value of V for Equation (3). Conventional self-paced learning selects the samples based on the loss value, whereas we replace it with our difficulty score described in the last section. We then use self-paced learning to estimate V through an optimization in which $f(\lambda, V)$ determines the way to compute the value of $v_i$ and $\lambda$ is the self-paced adjustment parameter. With $V \in \{0, 1\}^n$ and $f(\lambda, V)$ defined accordingly, the optimal V can be calculated directly, where $\lambda$ is used to control the learning pace of self-paced learning. Suppose T is the total number of training steps and t is the current training step. During training, to select the training instances with the desired difficulty, we resort to a pre-defined pacing function $\lambda = f(t)$ to control how fast the output will be distilled from teacher to student. We define three different pacing functions, named linear-scheduler, log-scheduler and exp-scheduler, to make a smooth transition from teacher to student models and to verify the effectiveness of the proposed model. The linear-scheduler increases at a constant rate during training. The log-scheduler increases quickly at first and then slowly, while the exp-scheduler does the opposite. We will compare the effects of these three methods in the next section.
In order to incorporate self-paced learning into the distillation process, we reformulate our objective function (Equation 3) accordingly. In conclusion, our hierarchical curriculum learning framework is described in Algorithm 1.
Algorithm 1 (excerpt)
Use a GAN to distill the latent variable by using Equation (4)
5: Calculate the difficulty score
6: Acquire the self-paced learning arrangement and distill the output by using Equation (8)
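To make the pacing idea concrete, here is a hedged Python sketch of binary self-paced weights driven by three pacing functions. The closed-form rule (weight 1 when the difficulty score is below λ) is the standard hard self-paced scheme and is assumed here; the exact scheduler formulas, the range of λ, and the assumption that difficulty scores lie in [0, 1] are illustrative choices, since the text only describes the schedulers qualitatively.

```python
import math
from typing import List, Sequence


def pacing(
    t: int,
    total_steps: int,
    kind: str = "linear",
    lam_min: float = 0.0,
    lam_max: float = 1.0,
) -> float:
    """Map training step t to the pace parameter lambda (assumed forms)."""
    progress = min(max(t / total_steps, 0.0), 1.0)
    if kind == "linear":
        frac = progress                                   # constant rate
    elif kind == "log":
        frac = math.log1p(progress * (math.e - 1.0))      # fast, then slow
    elif kind == "exp":
        frac = (math.exp(progress) - 1.0) / (math.e - 1.0)  # slow, then fast
    else:
        raise ValueError(f"unknown pacing kind: {kind}")
    return lam_min + (lam_max - lam_min) * frac


def self_paced_weights(scores: Sequence[float], lam: float) -> List[int]:
    """Hard self-paced rule: keep a sample only if its difficulty is below lam."""
    return [1 if s < lam else 0 for s in scores]


difficulty = [0.1, 0.5, 0.9]  # hypothetical per-sample difficulty scores
early = self_paced_weights(difficulty, pacing(2, 10, "linear"))
late = self_paced_weights(difficulty, pacing(10, 10, "linear"))
```

With these toy scores, only the easiest sample is distilled early in training, and all three are included by the final step.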

Datasets
We conduct experiments on two English conversation datasets that have been widely used in open-domain dialogue generation. (1) DailyDialog (Li et al., 2017): a collection of real-world daily conversations for English learners. It is a multi-turn dataset, and we treat each turn as a single-turn training pair in this work. (2) PersonaChat: it is collected from pairs of crowdsourced workers chit-chatting with each other, conditioned on assigned personas. In our experiments, we only use the conversation text and process it in the same way as DailyDialog.

Evaluation Methods
Automatic Evaluation Method. It is challenging to assess the quality of the generated responses. In this paper, we adopt several evaluation methods to measure different aspects of our results. BLEU (Papineni et al., 2002): it evaluates dialog systems by measuring the word overlap between the generated reply and the ground truth. We compute BLEU scores for $n \le 4$ using smoothing techniques. Entropy-based metrics: these include word and sentence entropy as in Serban et al. (2017), which reflect the diversity of responses. Length: as proposed by Mou et al. (2016), the length of an utterance is an objective, surface metric that reflects the substance of a generated reply.
Human Evaluation Method. Considering the limitations of the existing automatic evaluation metrics, we also adopt human judgments. We use DailyDialog as the evaluation corpus since it is more similar to our daily conversations and makes it easier for annotators to judge. We randomly sample 100 cases, and three well-educated volunteers are recruited to do the manual evaluation. For each query-reply pair, volunteers are asked to rate it on three levels: 0, 1, and 2. A score of 0 indicates that the selected sentences are either irrelevant or disfluent with grammatical errors; 1 is for a reply that is relevant but not informative enough; 2 means that the query and reply are highly related and the reply is natural. We calculate the ratio of each score (0, 1 and 2) for each model. To examine the agreement among the volunteers, we also calculate the Fleiss kappa (Fleiss and Cohen, 2016) of the human annotations on all models.
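For readers unfamiliar with the agreement statistic mentioned above, the following Python sketch computes Fleiss' kappa from raw annotator labels. It follows the standard definition of the statistic; the input layout (one list of labels per query-reply pair) and the function name are assumptions made for this illustration, and the toy ratings are not data from the paper.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]], n_categories: int) -> float:
    """Fleiss' kappa for items rated by the same number of annotators.

    ratings[i] holds the category labels (0 .. n_categories - 1) that the
    annotators assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c]: how many annotators put item i into category c.
    counts = [[list(row).count(c) for c in range(n_categories)] for row in ratings]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[c] for row in counts) / (n_items * n_raters)
        for c in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)
    if p_chance == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example: three annotators, scores in {0, 1, 2}.
perfect = fleiss_kappa([[0, 0, 0], [1, 1, 1], [2, 2, 2]], n_categories=3)
partial = fleiss_kappa([[0, 0, 1], [1, 1, 1]], n_categories=3)
```

Values near 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.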

Comparison Models
To ascertain the effectiveness and applicability of our approach, we re-implement the following methods for comparison: (1) S2S: a sequence-to-sequence model with an attention mechanism as in Shang et al. (2015). (2) CVAE: a latent variable model using a conditional variational autoencoder trained with KL annealing and a BoW loss as in Zhao et al. (2017). (3) Curriculum (Cai et al., 2020): it employs adaptive multi-curricula learning to schedule a committee of organized curricula for dialogue learning. (4) KD (Tahami et al., 2020): it uses two dialogue models as the student and the teacher. The framework uses a teacher-student setting where the student learns from both the ground-truth labels and the soft labels provided by the teacher.

Training and Evaluation Details
For the teacher and student models, we use gated recurrent units (GRU) (Cho et al., 2014). The dimension of the latent variable z is set to 64. The initial weights for all fully connected layers are sampled from a uniform distribution over [-0.02, 0.02]. The generators as well as the discriminator D are 3-layer feed-forward networks with ReLU non-linearity and hidden sizes of 200, 200 and 400, respectively. The gradient penalty is used when training D (Nair and Hinton, 2010) and its hyperparameter λ is set to 10. We set the vocabulary size to 20,000 and map all out-of-vocabulary words to a special token <unk>. The word embedding size is 200. The maximum utterance length is set to 40. The baselines are implemented with the same set of hyper-parameters. All the models are implemented with PyTorch (see footnote 2).

Evaluation Results
Automatic Evaluation Results The automatic evaluation results of our proposed method and baselines on the two datasets are shown in Table 1.
We make the following observations. (1) Our model outperforms the baselines on almost all the evaluation metrics on the two datasets. The overall performance of our model supports our hypothesis that it achieves a better trade-off on the whole.
(2) Specifically, in terms of BLEU scores, compared to S2S, CVAE, KD and Curriculum, our model obtains performance gains of 16.7%, 11.2%, 10.2% and 9.5% on DailyDialog. On PersonaChat, our model outperforms the baselines with absolute improvements of about 8.2%, 4.9%, 3.3% and 3.6%. This indicates that our model generates more relevant responses, with the highest BLEU scores on both datasets. (3) To show that the responses of our model are on average more diverse than those of other models, we compute the average sentence entropy and word entropy; our model produces responses with higher entropy on both datasets compared to the baseline models. In particular, the sentence entropy is considerably higher. (4) We also report the average length of the responses produced by each model. Since long responses tend to contain richer content, the results provide quantitative evidence for our claim that our model generates responses with richer content than other models.
2 https://pytorch.org/

Human Evaluation Results
The results of human evaluation against all baseline methods are listed in Table 2. The kappa scores on all models are larger than 0.5, which indicates reasonable agreement among the annotators. From the results we can again observe that, similar to the automatic evaluation results, our model consistently achieves the best performance, which further demonstrates the effectiveness of our proposed method.

Further Analysis

Ablation Study
There are four important parts in the proposed framework: Data Level Curriculum (DC), Output Distillation (OD), Middle Layer Distillation (MD), and Model Level Curriculum (MC). We remove them one at a time. The results suggest that the performance benefits from not only curriculum learning but also knowledge distillation. Specifically, we find that MC is slightly more important to the overall performance. Meanwhile, removing any of the other parts also decreases the performance on most evaluation metrics, which further supports the effectiveness of combining these two techniques.

The Effect of Different settings of subsets in Data Level Curriculum
We further explore the effect of the number of subsets in our data-level curriculum strategy, which also decides the granularity of sample selection in one epoch. Experiments are conducted on the two datasets and we report BLEU scores in Figure 2. We select a wide range of choices, from 2 to 20. In general, our approach significantly outperforms the baseline system on the test set across the different settings, which indicates that it is robust and effective. We also evaluate extreme situations. For example, when we divide our dataset into 100 groups, the BLEU score is 0.295 (0.011 below our worst-performing baseline), which is as expected because overly small subsets lead to overfitting.

The Effect of Different Scheduler Functions
Since we design three pacing functions for the model-level curriculum arrangement, we compare and analyze the proposed functions in experiments. We conduct experiments on the two datasets, and the performance of the different pacing functions can be found in Table 5. We have the following two observations. (1) The exp-scheduler method consistently outperforms the others on the two datasets. We suspect that this is because, with the exp-scheduler function, the student network initially learns less from the teacher model and therefore has more time to learn a better discriminator.
(2) Compared with other pacing functions, the linear-scheduler pacing function results in the worst performance, which indicates the effectiveness of changing learning speed.

The Effect of Different Model-Level Curriculum Strategy
To gain further insight into the different model-level curriculum strategies, we present the results in Table 6. We can see that $D_{change}$ achieves the best results when compared to the baseline and other methods, which indicates that D does reflect the complexity of the student model relative to the teacher. The loss-based complexity performs worse than D and $D_{change}$. We suspect that this is because the loss function is not as good a signal as the discriminator for judging model complexity.

Case Studies
To empirically analyze the quality of generated responses, we present examples generated by our model and the baselines in Table 7. For each query, we show the best samples of generated responses from each model. In the table, we see that our model generates longer and more informative replies than the others.

Error analysis
To enhance the performance of our model in the future, we take the worst cases in human judgment as examples to analyze our errors. We find that although our model improves the response diversity significantly, it still has a "safe response" problem. Comparing with the responses generated by the teacher model, we find that the "safe responses" generated by the teacher model can greatly affect the student: 80.1% of the "safe responses" come from the teacher model. That is, the soft labels generated by a teacher model largely determine the performance of its student model. Therefore, in the future, we will study methods that can learn the good parts of the teacher model and filter out its bad parts.

Conclusion
In this work, we consider open-domain dialogue systems. To induce model learning from effective teachers, we propose a learnable distillation model that dynamically distills knowledge through hierarchical curriculum learning. Experiments conducted on two public conversation datasets show that our proposed framework is able to boost the performance of existing dialogue systems. Moreover, our framework is not limited to the neural dialogue generation task. In the future, we would like to extend our method to other text generation tasks (e.g., abstractive summarization) and examine its adaptability to these tasks.