Diversifying Neural Dialogue Generation via Negative Distillation

Generative dialogue models suffer badly from the generic response problem, limiting their applications to a few toy scenarios. Recently, an interesting approach, namely negative training, has been proposed to alleviate this problem by reminding the model not to generate high-frequency responses during training. However, its performance is hindered by two issues: it ignores responses that are low-frequency but generic, and it introduces responses that are low-frequency but meaningless. In this paper, we propose a novel negative training paradigm, called negative distillation, to keep the model away from undesirable generic responses while avoiding the above problems. First, we introduce a negative teacher model that can produce query-wise generic responses, and then the student model is required to maximize its distance from the teacher using multi-level negative knowledge. Empirical results show that our method significantly outperforms previous negative training methods.


Introduction
In the past few years, data-driven response generation (Vinyals and Le, 2015; Shang et al., 2015; Vougiouklis et al., 2016) has achieved impressive performance, drawing continuously increasing attention from academia and industry. Conventionally, under the guidance of maximum likelihood estimation (MLE), neural dialogue models are trained to maximize the probability of generating the corresponding reference given any query. Unfortunately, due to the many-to-one phenomenon (see Table 1), a characteristic of the dialogue task (Csáky et al., 2019), these models are prone to produce safe but generic responses (e.g., "I don't know" (Li et al., 2016)), which is an obstacle to the wide deployment of generative dialogue systems. Some researchers redesigned the training objective to encourage diverse responses instead of MLE, such as MMI (Li et al., 2016), AdaLabel (Wang et al., 2021), and IAT (Zhou et al., 2021). Besides, several studies (Kulikov et al., 2019; Holtzman et al., 2020) proposed more advanced decoding strategies to alleviate the problem of generic responses. In essence, the above methods boost the diversity of responses by reminding the model what should be said.¹

¹ The code and preprocessed data are available at https://github.com/Yiwei98/dialogue-negative-distillation.
However, inspired by negative training (Kim et al., 2019; Ma et al., 2021), we argue that it is also necessary to tell the dialogue model what not to say. To alleviate the problem of generic responses, He and Glass (2020) negatively update the parameters when high-frequency responses are identified. Li et al. (2020a) punish the generation of repetitive or high-frequency tokens by using the unlikelihood objective (Welleck et al., 2020).
Although the negative-training based methods enhance the diversity of responses, two drawbacks remain. First, they regard high-frequency tokens or utterances as negative candidates. However, the high-frequency response problem is only a sub-problem of the generic response problem (He and Glass, 2020): responses that are low-frequency but generic escape punishment. Even worse, we have observed that some generic responses followed by a low-frequency but meaningless subsequence can avoid being identified as high-frequency, which inevitably sacrifices the fluency of responses (see Analysis). Second, these methods ignore the implicit negative knowledge in neural networks that characterizes negative candidates at multiple levels. We contend that negative training is more effective with richer information (e.g., hierarchical representations).
To tackle the above problems and further improve the diversity of responses, we propose a novel negative training paradigm called Negative Distillation (ND). Knowledge distillation (KD) (Hinton et al., 2015; Jiao et al., 2020) takes the teacher as a positive role model and induces the student to imitate it. In contrast, we train the teacher as a negative role model and remind the student to get rid of those bad behaviors.

Table 1: The many-to-one phenomenon in DailyDialog. All five queries above have the same I don't know-like responses. The corresponding source entropy (Csáky et al., 2019) scores are much higher than the median score (0.92) of the whole training set. This phenomenon leads to the generic response problem.
Specifically, we first collect a negative training set by using a filtering method called Source Entropy (Csáky et al., 2019). This filtering method can retrieve all many-to-one cases of the raw dataset. Note that the "one" is usually a generic response. Then, we train a dialogue model on the above subset as the negative teacher. Given queries, the negative teacher can provide a set of negative candidates (i.e., generic and dull responses) that the student is prone to generate, which avoids the first drawback mentioned before. Therefore, the student obtains query-wise bad behaviors for Negative Distillation. To conduct the negative update holistically, we design two negative objectives, including soft unlikelihood loss on the prediction layer and reverse square error on the intermediate layer. In this way, the negative distillation fully exploits multilevel negative knowledge to force the student to generate non-generic responses.
Our contributions are summarized as follows: • We propose a novel and effective negative training paradigm called Negative Distillation. It constructs query-wise generic responses as the negative candidates.
• We design two negative objectives to utilize multi-level information to further boost the performance of negative distillation.
• We perform extensive experiments and detailed analysis to verify the effectiveness of the negative distillation framework and the superiority compared with previous negative training methods.

Method
In this section, we first introduce the negative teacher, then describe the negative distillation on the prediction layer and the intermediate layer, respectively, and finally present the progressive optimization objective. Algorithm 1 shows the whole training details.

Background
Dialogue Generation with MLE Let $Q = \{q_1, q_2, ..., q_{T_q}\}$ and $R = \{r_1, r_2, ..., r_{T_r}\}$ be a (query, response) pair, where $T_q$ and $T_r$ denote the lengths of the query and the response, respectively. The generative dialogue model aims to learn a conditional probability distribution $p_\theta(R|Q)$. Maximum likelihood estimation (MLE) is usually used to train the model, which can also be expressed as minimizing the negative log-likelihood:

$\mathcal{L}_{mle} = -\sum_{t=1}^{T_r} \log p_\theta(r_t \mid r_{<t}, Q)$  (1)

Considering one characteristic of the dialogue task, i.e., allowing the response to be varied, the many-to-one phenomenon occurs frequently in dialogue corpora. However, with MLE-based training, this phenomenon causes the model to produce generic responses.
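As a concrete sketch, the MLE objective can be implemented as a masked token-level negative log-likelihood. The function below is illustrative (names and the padding convention are ours, not the paper's), assuming per-step logits of shape (T_r, V) and integer target ids:

```python
import torch
import torch.nn.functional as F

def mle_loss(logits, targets, pad_id=0):
    """Token-level negative log-likelihood, averaged over non-padding
    target tokens. logits: (T_r, V); targets: (T_r,) integer ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out log p(r_t | r_<t, Q) for each reference token r_t.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()
    return (nll * mask).sum() / mask.sum()
```

This is numerically equivalent to cross-entropy with an ignore index, which is how MLE training is typically implemented in practice.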
Unlikelihood Training The unlikelihood (UL) loss (Welleck et al., 2020) was proposed to address undesirable model behaviors (e.g., repetitive or high-frequency tokens). It forces the model to minimize the probability of generating negative candidates, which is formulated as:

$\mathcal{L}_{UL} = -\sum_{t=1}^{T_r} \sum_{c \in \mathcal{C}_t} \log (1 - p_\theta(c \mid r_{<t}, Q))$  (2)

where $\mathcal{C}_t$ consists of negative candidates (e.g., overused frequent words), a subset of the vocabulary.
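A minimal sketch of the token-level unlikelihood term (our own illustrative code, not the original implementation), with the candidate sets C_t given explicitly as lists of token ids, one list per decoding step:

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, negative_candidates):
    """Unlikelihood loss (Welleck et al., 2020): penalize probability
    mass assigned to negative candidate tokens. logits: (T, V);
    negative_candidates: one list of candidate token ids per step (C_t)."""
    probs = F.softmax(logits, dim=-1)
    loss = 0.0
    for t, cands in enumerate(negative_candidates):
        for c in cands:
            # -log(1 - p(c)) grows as the model favors candidate c.
            loss = loss - torch.log(1.0 - probs[t, c] + 1e-8)
    return loss / max(len(negative_candidates), 1)
```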
Knowledge Distillation Traditional knowledge distillation (KD) transfers useful knowledge from a large, strong teacher network $T$ to a small student network $S$. The distillation loss aligns the softened predictions of the teacher and the student, denoted $f^T(x)$ and $f^S(x)$:

$\mathcal{L}_{KD} = \sum_{x \in \mathcal{D}} L(f^T(x), f^S(x))$  (3)

where $L(\cdot)$ is a measurement function that calculates the distance between probability distributions, $x$ is the input text, and $\mathcal{D}$ denotes the training set.
In this work, we replace the positive teacher in vanilla KD with a negative teacher, aiming to provide negative knowledge for the student to conduct negative training and avoid undesirable behaviors.

Negative Teacher
To improve the diversity of responses, the dialogue model should be told which responses are generic. For negative distillation, a negative teacher is required to produce possible generic responses given any query. In this work, we adopt the widely used Transformer (Vaswani et al., 2017) as the underlying model for both the teacher and the student. We introduce the Source Entropy filtering method (Csáky et al., 2019) to identify and collect the many-to-one cases for the negative training set. The source entropy is defined as:

$H_{src}(r, \mathcal{D}) = -\sum_{i} p(q_i \mid r) \log p(q_i \mid r)$  (4)

where $p(q_i|r)$ is the conditional probability calculated from the relative frequency of (query, response) pairs, $r$ is a response, $q_i$ is a query corresponding to the response $r$, and $\mathcal{D}$ represents the raw training set. A higher source entropy indicates that the response $r$ corresponds to more queries, i.e., the many-to-one problem is more serious. We select the top 50% of dialogue pairs $(q, r)$ by source entropy as the negative training set $\mathcal{D}_N$, which contains a much higher proportion of generic responses than the raw training set.
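The filtering step can be sketched as follows (illustrative code, not the paper's; we use the natural logarithm, though the ranking of responses is independent of the log base):

```python
import math
from collections import Counter, defaultdict

def source_entropy_filter(pairs, keep_ratio=0.5):
    """Rank (query, response) pairs by source entropy
    H(r) = -sum_i p(q_i|r) log p(q_i|r), where p(q_i|r) is the relative
    frequency of q_i among queries paired with r, and keep the top
    `keep_ratio` fraction as the negative training set D_N."""
    by_response = defaultdict(Counter)
    for q, r in pairs:
        by_response[r][q] += 1
    entropy = {}
    for r, q_counts in by_response.items():
        total = sum(q_counts.values())
        entropy[r] = -sum((c / total) * math.log(c / total)
                          for c in q_counts.values())
    ranked = sorted(pairs, key=lambda qr: entropy[qr[1]], reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]
```

Generic responses such as "I don't know" appear with many distinct queries, so their source entropy is high and they land in D_N.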
After that, we train the teacher N on the negative training set D N by Equation 1. The teacher will naturally produce generic responses for any input query. More importantly, it will provide richer negative knowledge for the student, including soft logits in the prediction layer and implicit features in the intermediate layers.

Negative Distillation
In this section, we conduct the negative distillation for the student based on the multi-level negative knowledge.

ND for Prediction Layer
The softened logits in the prediction layer contain more information than the ground-truth labels, such as the similarity between labels (Wang et al., 2021). Therefore, conventional KD transfers knowledge by narrowing the gap between the probability distributions of the teacher $T$ and the student $S$:

$\mathcal{L}_{kd} = -\sum_{t=1}^{T_r} \sum_{v \in \mathcal{V}} p_T(v \mid r_{<t}, Q) \log p_S(v \mid r_{<t}, Q)$  (5)

As for negative distillation, the extra knowledge in the softened logits of the negative teacher reflects how to generate dull responses for the input query. Therefore, we propose a soft unlikelihood loss to maximize the distance between the predictions of the negative teacher $N$ and the student $S$:

$\mathcal{L}_{pred} = -\sum_{t=1}^{T_r} \sum_{v \in \mathcal{V}} p_N(v \mid r_{<t}, Q) \log (1 - p_S(v \mid r_{<t}, Q))$  (6)

where $p_N$ and $p_S$ are calculated by:

$p(v \mid r_{<t}, Q) = \frac{\exp(z_v / t)}{\sum_{v'} \exp(z_{v'} / t)}$  (7)

where $z$ are the output logits and $t$ is a temperature coefficient used to soften the probability distribution over words.
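A sketch of the soft unlikelihood loss described above, under our reading that the teacher's softened distribution weights a log(1 - p_S) penalty (illustrative only; the exact formulation in the paper may differ):

```python
import torch
import torch.nn.functional as F

def soft_unlikelihood_loss(student_logits, teacher_logits, temperature=1.0):
    """Push the student's distribution away from the negative teacher's
    softened distribution. Roughly: -sum_v p_N(v) * log(1 - p_S(v)),
    with p = softmax(z / t). Both logit tensors have shape (T, V)."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    p_student = F.softmax(student_logits / temperature, dim=-1)
    # The loss is large when the student places mass where the teacher does.
    loss = -(p_teacher * torch.log(1.0 - p_student + 1e-8)).sum(dim=-1)
    return loss.mean()
```

Minimizing this term drives the student's probability mass away from the tokens the negative teacher favors, rather than from a single one-hot target.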
It should be emphasized that previous negative training methods only use high-frequency words or phrases with one-hot representations as targets, ignoring the rich information in the softened logits (e.g., generic words have similar probabilities). In the Analysis section, we demonstrate the superiority of softened logits over hard targets (i.e., one-hot representations).
ND for Intermediate Layer In addition to the output knowledge from the prediction layer, implicit knowledge is also embedded in the intermediate layers, such as hidden states and attention matrices. To keep the student away from undesirable behaviors (i.e., producing generic responses) more effectively, we further incorporate this knowledge into negative distillation. Specifically, the distance between the features of the negative teacher and those of the student should also be increased. In this work, we propose a new measurement function, called mean reverse square error (MRSE), to calculate this distance:

$MRSE(A, B) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(a_i - b_i)^2 + 1}$  (8)

where $A$ and $B$ are feature matrices of the negative teacher and the student, respectively, and $n$ is the number of elements in each matrix; minimizing MRSE pushes the two feature matrices apart. Since responses are generated in the decoding phase, we only conduct negative distillation on the intermediate layers of the decoder. For each decoder layer, the negative distillation objective for hidden states is defined as:

$\mathcal{L}_{hid} = \sum_{l=1}^{L} MRSE(H^l_N, H^l_S)$  (9)

where $H^l_N$ and $H^l_S$ are the output hidden states of the $l$-th decoder layer of $N$ and $S$, respectively.
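The exact form of MRSE is not fully specified in this excerpt; a plausible reading of "mean reverse square error" is the element-wise reciprocal of the squared error (plus one for stability), so that minimizing it pushes the two feature matrices apart while staying bounded. The sketch below encodes that assumption:

```python
import torch

def mrse(a, b):
    """Assumed mean reverse square error: 1/n * sum 1/((a_i - b_i)^2 + 1).
    It is 1 when the features coincide and approaches 0 as they diverge,
    so *minimizing* it maximizes the teacher-student feature distance.
    The '+1' constant is our assumption for numerical stability."""
    return (1.0 / ((a - b) ** 2 + 1.0)).mean()
```

Unlike negating a plain squared error, this form cannot be driven to negative infinity, which keeps the negative objective from dominating training.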
As attention weights can capture substantial linguistic knowledge (Clark et al., 2019), it is beneficial for the student to further conduct negative distillation on the attention matrices, which are computed as follows:

$A = \frac{QK^\top}{\sqrt{d_k}}$  (10)

where $Q$ and $K$ are the matrices of queries and keys, respectively, and $d_k$ is a scaling factor. Following Jiao et al. (2020), the attention matrix $A$ is used to calculate the distance rather than its softmax version $softmax(A)$. Similar to Equation 9, the negative distillation objective for attention matrices is formulated as:

$\mathcal{L}_{att} = \sum_{l=1}^{L} MRSE(A^l_N, A^l_S)$  (11)

where $A^l_N$ and $A^l_S$ are the attention matrices of the $l$-th decoder layer of $N$ and $S$, respectively.

Algorithm 1: Negative Distillation
Input: the raw training set D and the negative training set D_N.
% Train the negative teacher.
repeat
    Optimize N by minimizing L_mle(N) on D_N using Eq. 1
until convergence
% Negative distillation.
repeat
    Optimize S by minimizing L(S) on D using Eq. 13
until convergence
Output: S: the trained student.
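The two-stage procedure of Algorithm 1 can be sketched with stand-in models (everything here, including the linear "models" and loss callables, is illustrative scaffolding, not the paper's implementation):

```python
import torch
from torch import nn

def train_negative_distillation(D, D_N, mle_loss, nd_loss, alpha, steps=100):
    """Stage 1: fit the negative teacher N on the filtered subset D_N.
    Stage 2: train the student S on the full set D with MLE plus the
    negative objective, weighted by the progressive coefficient alpha(step).
    D and D_N are lists of (input, target) tensor pairs."""
    teacher = nn.Linear(8, 8)   # stand-in for the Transformer teacher N
    student = nn.Linear(8, 8)   # stand-in for the Transformer student S

    # Stage 1: train the negative teacher with MLE on D_N.
    opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-2)
    for step in range(steps):
        x, y = D_N[step % len(D_N)]
        loss = mle_loss(teacher(x), y)
        opt_t.zero_grad(); loss.backward(); opt_t.step()

    # Stage 2: freeze the teacher; train the student with the combined loss.
    teacher.requires_grad_(False)
    opt_s = torch.optim.Adam(student.parameters(), lr=1e-2)
    for step in range(steps):
        x, y = D[step % len(D)]
        loss = mle_loss(student(x), y) + alpha(step) * nd_loss(student(x), teacher(x))
        opt_s.zero_grad(); loss.backward(); opt_s.step()
    return student
```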

Progressive Optimization
The overall loss combines the above negative distillation objectives:

$\mathcal{L}_{nd} = \mathcal{L}_{pred} + \mathcal{L}_{hid} + \mathcal{L}_{att}$  (12)

with the MLE objective:

$\mathcal{L}(S) = \mathcal{L}_{mle}(S) + \alpha \, \mathcal{L}_{nd}$  (13)

where $\alpha$ is a hyper-parameter that balances the importance of supervised learning and negative distillation. For negative distillation, it is better for the student to acquire the ability to say something before being reminded of what not to say. Thus, we perform progressive distillation that first warms up the negative distillation ratio and then gradually cools it down. Inspired by the derivative of the sigmoid function:

$\sigma'(z) = \sigma(z)(1 - \sigma(z))$  (14)

which shows a gradual rise-fall trend, we define the balance coefficient $\alpha$ as:

$\alpha = \lambda \, \sigma(z)(1 - \sigma(z))$  (15)

where $\lambda$ controls the peak value and $z$ is calculated by:

$z = \beta (s - \gamma)$  (16)

where $s$ is the training step, and $\beta$ and $\gamma$ control the telescopic and translation transformations, respectively.
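The schedule can be sketched as below; we assume the form z = β(s − γ), from the description of β and γ as scaling and translation controls, and use the reported DailyDialog settings (λ = 4, γ = 25600, β = 6/γ) as defaults:

```python
import math

def progressive_alpha(step, lam=4.0, gamma=25600, beta=None):
    """Progressive distillation weight: alpha rises then falls following
    the sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)), with
    z = beta * (step - gamma) (our assumed parameterization). With
    lam = 4 the peak value is exactly 1, since max sigma'(z) = 1/4."""
    if beta is None:
        beta = 6.0 / gamma  # reported setting: beta = 6 / gamma
    z = beta * (step - gamma)
    sig = 1.0 / (1.0 + math.exp(-z))
    return lam * sig * (1.0 - sig)
```

The weight peaks at step = γ and decays toward 0 on both sides, so negative distillation is strongest in the middle of training, after the student has learned to generate and before final convergence.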

Datasets
In our experiments, two widely used dialogue datasets are employed to evaluate the proposed method: DailyDialog, which collects conversations that are similar to human daily communication (Li et al., 2017b), and OpenSubtitles, which consists of large-scale dialogues extracted from movie subtitles (Tiedemann, 2009). In this work, we focus on the single-turn dialogue generation, thus we pre-process these two datasets into the (query, response) pairs. Table 2 provides the statistics of both datasets.

Experimental Settings
We take the Transformer-based sequence-tosequence model (Vaswani et al., 2017) as the underlying model for all approaches.
Following the settings of Transformer in Csáky et al. (2019), both encoder and decoder contain 6 layers, in which the self-attention module has 8 attention heads and the number of feed-forward units is 2048. The size of hidden states is set to 512 and the dimension is 64 for query, key, and value. Please refer to Appendix A for more details.
For the proposed approach, both the negative teacher network and the student network have the same settings in terms of the network architecture and hyper-parameters. λ in Equation 15 is set to 4, making the peak value equal to 1. γ is 25600 and β is 6/γ. For the temperature coefficient t, we simply set it to 1.

Baselines
We compare the proposed negative distillation (ND) approach with the standard Transformer, two existing negative training approaches, and two additional diversity-promoting approaches: • Standard The vanilla Transformer-based sequence-to-sequence model with MLE-based training (i.e., the cross-entropy loss).
• NT (Negative Training) (He and Glass, 2020) During training, it first counts the frequency of all generated utterances and then conducts the negative update based on the high-frequency utterances.
• UL (Unlikelihood Training) (Li et al., 2020a) Different from NT, it calculates the frequency of all generated words instead of utterances and penalizes the high-frequency words by introducing an unlikelihood loss term.
• CVAE (Zhao et al., 2017) A dialogue response generation model using conditional VAE to improve the diversity of generated responses.
• FACE (Jiang et al., 2019) It uses the frequency-aware cross-entropy loss to tackle the low-diversity problem.
All the baselines use the same architecture and hyper-parameters as ours. Following He and Glass (2020) and Li et al. (2020a), we use greedy search as the decoding strategy for all baselines and our method. We also evaluate performance with beam search (size 5) and obtain similar results (see 3.6 for details). Details for the baselines are described in Appendix B.

Automatic Evaluation
Metrics To evaluate whether negative distillation can effectively reduce generic responses, we adopt Dist-{1,2,3} (distinct) (Li et al., 2016) to reflect the lexical diversity of the generated responses. It is a widely used metric that counts the proportion of unique unigrams/bigrams/trigrams. LF (low-frequency token ratio) (Li et al., 2020b) further measures the diversity of responses by calculating the ratio of low-frequency words in the generated responses; the low-frequency threshold is set to 100. Besides, it is necessary to verify whether the models can ensure consistency while improving diversity. We therefore use KL-{1,2} (KL divergence) (Csáky et al., 2019), which measures the distribution distance between the generated and the ground-truth responses, to reflect how well a model approximates the ground-truth unigram/bigram distributions. BLEU (Chen and Cherry, 2014) is also reported; it measures the n-gram overlap between the generated responses and the ground-truth references.
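Both diversity metrics are simple corpus statistics; the implementations below are illustrative (tokenization and the training-frequency table are assumed to be given):

```python
def distinct_n(responses, n):
    """Dist-n (Li et al., 2016): ratio of unique n-grams to total
    n-grams over all generated responses (each a list of tokens)."""
    ngrams = [tuple(resp[i:i + n])
              for resp in responses
              for i in range(len(resp) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def low_frequency_ratio(responses, train_counts, threshold=100):
    """LF (Li et al., 2020b): fraction of generated tokens whose
    frequency in the training corpus is below `threshold`."""
    tokens = [tok for resp in responses for tok in resp]
    low = sum(1 for tok in tokens if train_counts.get(tok, 0) < threshold)
    return low / max(len(tokens), 1)
```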
Results Table 3 shows the results obtained at the lowest point of the validation loss. Our approach outperforms all baselines in diversity (Dist and LF) by a significant margin on both datasets, demonstrating that ND can effectively alleviate the generic response problem by using multi-level negative information. The KL and BLEU scores of ND are close to or better than those of Standard, which verifies that our method maintains the consistency of responses while improving their diversity. To some extent, both NT and UL improve the diversity of words, especially for trigrams, but their low LF scores indicate that they reduce high-frequency words without increasing the number of low-frequency ones. Worse, the BLEU and KL-2 scores of these two methods and of CVAE decline sharply. This suggests that previous negative training approaches and other diversity-enhancement methods may dramatically harm the consistency and fluency of responses, which is not in line with the goals of a dialogue system. Our method obtains similar results with beam search; please refer to 3.6 for details.

Human Evaluation
Apart from automatic evaluations, we conduct human evaluations to further verify the effectiveness of our method over previous negative training methods. We randomly select 50 samples from the test set of DailyDialog, and three well-educated annotators are invited to judge which of the responses generated by ND and the baselines is better (i.e., win, tie, or loss) in terms of informativeness, relevance, and fluency. Informativeness reflects how much information related to the query is contained in the generated response. Relevance reflects how coherent the generated response is to its query. Fluency reflects how likely the generated response is to have been produced by a human. Table 4 summarizes the human evaluation results. The proposed approach is better overall than all baselines. Specifically, ND achieves better performance than Standard in terms of informativeness and relevance, and remains competitive in fluency. Compared with both NT and UL, our approach shows significant advantages, especially in fluency, indicating that their punishment of high-frequency tokens or utterances leads to serious non-fluency and inconsistency problems. We use Fleiss's kappa (Fleiss, 1971) to measure inter-annotator agreement.

Experimental Analysis
We conduct extensive analysis on DailyDialog to investigate the effectiveness of negative distillation in more detail.

Ablation study We study the effects of the different negative distillation objectives by ablating the prediction-layer distillation (w/o L_pred), the attention distillation (w/o L_att), the hidden-state distillation (w/o L_hid), and the whole negative distillation (w/o L_neg, i.e., Standard). The results in Table 5 show that all three proposed negative distillation objectives help improve diversity. The significant decline for w/o L_hid indicates that the negative information in intermediate layers is very important for ND. w/o L_att performs better than w/o L_hid, which we attribute to the richer information in hidden states.
Does source entropy work? To verify whether the source entropy filtering method can collect generic responses, we select the top 50% and the bottom 50% of the sorted training set as D_t and D_b, respectively. Then we train N_t and N_b on the corresponding subsets. From Table 6, we can see that N_b outperforms N_t in all diversity-related metrics, indicating the effectiveness of source entropy.

Can the negative knowledge be transferred? We take N_t and N_b as the negative teachers for the students S_t and S_b, respectively, and conduct negative distillation on both. The results in Table 7 demonstrate that S_t obtains larger gains in diversity than S_b, indicating that S_t gets rid of more negative knowledge.

Study of soft targets To evaluate the superiority of soft targets for negative distillation, we sample responses (i.e., hard targets) by greedy search on the predictions of the negative teachers for comparison. The results in Table 9 show that ND with soft targets diversifies the responses more effectively, demonstrating the advantage of the richer negative information (e.g., the similarity between labels) in soft targets. Moreover, we randomly select responses from the negative training set D_N as negative targets. The sharp decline in performance proves that the negative teacher produces targeted generic responses.
Effect of progressive distillation In order to verify the effectiveness of progressive negative distillation, we conduct negative distillation with fixed α. The value is obtained by calculating the average of α in Equation 15 across the convergence steps.
The results in Table 8 demonstrate that the progressive distillation policy helps the student exploit negative knowledge more effectively. Besides, note that ND with a fixed α still outperforms the Standard model.

Evaluation results with beam search He and Glass (2020) and Li et al. (2020a) choose greedy decoding due to its simplicity and higher diversity compared with beam decoding. However, we find that both NT and UL tend to generate long but non-fluent and incoherent responses, so we conduct beam search with a length penalty added. Table 10 summarizes the results: both baselines obtain better KL and BLEU scores than with greedy search, owing to shorter responses. ND outperforms the baselines on all metrics, confirming the effectiveness of our method.

Case study Table 11 shows some cases generated by the proposed method and the baselines. Standard prefers generic and meaningless responses. Both NT and UL tend to generate a short generic sentence followed by an incoherent and non-fluent subsequence. In contrast, ND produces diverse and coherent responses.

Related work
Diverse Dialogue Learning There are two lines of work addressing the generic response problem. One line promotes diversity from the positive view, which is outside the scope of our work; it includes MMI (Li et al., 2016), GAN (Li et al., 2017a; Zhang et al., 2018), CVAE (Zhao et al., 2017), BT (Su et al., 2020), FACE (Jiang et al., 2019), AdaLabel (Wang et al., 2021), IAT (Zhou et al., 2021), and Nucleus Sampling (Holtzman et al., 2020). The other line alleviates the generic response problem using negative training. He and Glass (2020) regard the high-frequency response problem as a sub-problem of the generic response problem and conduct negative updates for high-frequency responses during training. Li et al. (2020a) focus on high-frequency tokens rather than utterances and punish them using the unlikelihood objective (Welleck et al., 2020). Both handle the generic response problem only from the angle of reducing frequency, and thus cannot capture all the characteristics of generic replies.
Negative Training for Dialogue Learning Negative training has been studied extensively for retrieval-based dialogue learning (Humeau et al., 2020; Nugmanova et al., 2019), while we focus on dialogue generation in this work. He and Glass (2020) use negative training to prevent generic and malicious responses in dialogue models. Li et al. (2020a) generalize unlikelihood training to dialogue generation to reduce repetition and improve specificity and coherence. Lagutin et al. (2021) propose implicit unlikelihood training to minimize repetition. Our work proposes a new negative training paradigm that improves the diversity of dialogue responses while avoiding the poor consistency and fluency of previous work.

Conclusion
We present a novel negative training paradigm to improve the diversity of dialogue responses. It formulates conventional negative training as a knowledge distillation process, which has rarely been explored before. The negative teacher can produce the corresponding generic and dull responses given any query, which naturally avoids the problems that hinder previous negative training methods. Besides, we further boost the performance of negative distillation by exploiting richer information, i.e., multi-level features. Extensive experiments validate the superiority of our proposed method over prior negative training work. A limitation of our work is that we only focus on the generic response problem. For future work, we will extend the proposed negative distillation to handle other generation problems, such as inconsistency and lacking personas or emotions.

A Details for Implementations
Here are some implementation details of our experiments. Dropout (Srivastava et al., 2014) is applied to the self-attention module, the feed-forward layer, and the activation layer, with a rate of 0.1 for all three. We also use label smoothing (Szegedy et al., 2016) with a smoothing value of 0.1. The batch size is set to 256. We use the Adam optimizer (Kingma and Ba, 2015) and employ the warm-up trick (He et al., 2016) to adjust the learning rate during training. The warm-up steps $s_{wp}$ are 128k and 256k for DailyDialog and OpenSubtitles, respectively. The learning rate is computed as follows:

$lr = d_{model}^{-0.5} \cdot \min(s^{-0.5}, \; s \cdot s_{wp}^{-1.5})$

where $lr$ is the learning rate at the $s$-th training step and $d_{model}$ is the size of the hidden states. We implement all approaches with PyTorch 1.7 and conduct all experiments on an RTX 3090.
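We assume the standard Transformer warm-up schedule of Vaswani et al. (2017), which rises linearly for the first s_wp steps and then decays as the inverse square root of the step; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=128000):
    """Warm-up learning-rate schedule:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    Peaks exactly at step = warmup_steps (128k for DailyDialog,
    256k for OpenSubtitles in the reported settings)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```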

B Baselines
For NT, the threshold r_thres is set to 1% and the weight coefficient λ_POS is set to 1, as the authors suggest. For UL, we search the mixing hyper-parameter α over [1, 10, 100, 1000]; 1000 is selected for its best performance. Both NT and UL are fine-tuned from the well-trained Standard model. For CVAE, we set the latent size to 256 and 64 for DailyDialog and OpenSubtitles, respectively. For FACE, we use the "output frequency" and "pre-weight" version, as the authors suggested.
We also compare the proposed method (ND) with AdaLabel (Wang et al., 2021), although AdaLabel alleviates the generic response problem through target regularization rather than negative training. The results in Table 12 confirm the superior performance of our method in improving the diversity of generated responses. In addition, the negative distillation method can readily be extended to other generation problems, whereas AdaLabel mainly focuses on diversity.