Diversifying Dialog Generation via Adaptive Label Smoothing

Neural dialogue generation models trained with the one-hot target distribution suffer from the over-confidence issue, which leads to poor generation diversity as widely reported in the literature. Although existing approaches such as label smoothing can alleviate this issue, they fail to adapt to diverse dialog contexts. In this paper, we propose an Adaptive Label Smoothing (AdaLabel) approach that can adaptively estimate a target label distribution at each time step for different contexts. The maximum probability in the predicted distribution is used to modify the soft target distribution produced by a novel light-weight bi-directional decoder module. The resulting target distribution is aware of both previous and future contexts and is adjusted to avoid over-training the dialogue model. Our model can be trained in an end-to-end manner. Extensive experiments on two benchmark datasets show that our approach outperforms various competitive baselines in producing diverse responses.


Introduction
The success of neural models has greatly advanced the research of dialog generation (Wang et al., 2020). However, most of these models suffer from a low-diversity issue where models tend to generate bland and generic responses such as I don't know or I'm OK (Li et al., 2016). Although various approaches have been proposed to tackle this issue (Li et al., 2016; Zhao et al., 2017; Du et al., 2018; Zhou et al., 2018; Welleck et al., 2020; Zheng et al., 2020b), there are still remarkable gaps between responses generated by neural models and those from humans (Holtzman et al., 2020). Further, some existing methods may even harm the fluency or coherence when improving the diversity of generated responses (Ippolito et al., 2019; Massarelli et al., 2020; Zheng et al., 2020a).

* Equal contribution. † Corresponding author: aihuang@tsinghua.edu.cn

Figure 1: Post: "So, what exactly do you do around here?" Response: "I make the robots seem more ___"

Recently, Jiang and de Rijke (2018) and Jiang et al. (2019) show that there is a strong connection between the low-diversity problem and the over-confidence issue, i.e., over-confident dialogue models tend to produce low-diversity responses. One of the reasons can be attributed to the supervision target. Specifically, training a dialogue generation model with the Maximum Likelihood Estimation (MLE) objective under the hard target (i.e., a one-hot distribution as the ground truth) makes the model favor high-frequency tokens and produce over-confident probability estimates (Gowda and May, 2020), which ultimately leads to poor calibration (Mukhoti et al., 2020) and thus low diversity (Jiang et al., 2019). Hinton et al. (2015) and Yang et al. (2018) suggest that the ideal training target should be a soft target that assigns probability mass to multiple valid candidates (see Figure 1). With such a soft target, the over-confidence issue can be alleviated (Müller et al., 2019), and thus the diversity of the output responses can be improved.
Unfortunately, the ideal soft target is challenging to obtain. Early works try to tackle this issue using label smoothing (Szegedy et al., 2016), i.e., a small probability is uniformly assigned to non-target words. However, the target distribution constructed in this way is far from ideal: First, the probability of the target word is chosen manually and fixed, and thus cannot adapt to different contexts. However, as Holtzman et al. (2020) demonstrate, human text distribution exhibits remarkable fluctuations in the per-token perplexity. We argue that different target probabilities should be used for different contexts. Second, the uniform assignment of probability mass to non-target words ignores the semantic relationship between the context and each word. Ideally, a word should receive more probability mass if it is more relevant to the context. For the example shown in Figure 1, the word "fun" is more likely to appear after the context "I make the robots seem more ___" than the word "bank".
To address the above issues, we propose an Adaptive Label Smoothing (AdaLabel) method that can dynamically estimate a soft target distribution at each time step for different contexts. Specifically, for each target word y_t in the training data, the probability distribution predicted by the current model is first obtained. The maximum probability p_max in this distribution measures the confidence of the current prediction, i.e., a higher p_max means higher confidence. To avoid over-confidence, we use p_max as the supervision signal for the target word y_t in the training process, so that the model will not be optimized toward y_t when it already predicts y_t correctly. A word-level factor is also introduced to facilitate the learning of low-frequency words.
Moreover, we introduce a novel auxiliary decoder module D_a to produce the supervision signals for the non-target words at each training step. D_a contains only one transformer block, and it is optimized to predict words based on bi-directional contexts. A novel target-masked attention scheme is devised to prevent D_a from seeing the target word during training. This scheme also enables parallel training and inference of D_a.
We perform extensive experiments on two benchmark datasets: DailyDialog and OpenSubtitles. Our method outperforms various competitive baselines and significantly improves the diversity of generated responses while ensuring fluency and coherence. Our major contributions are summarized as follows: 1. We propose AdaLabel, a method that can produce a soft target distribution considering the current context and the model's confidence. Specifically, AdaLabel ensures that the dialogue model will not be optimized toward the target word y_t if y_t has been correctly predicted. This prevents our model from becoming over-confident.
2. We introduce a light-weight bi-directional decoder that can produce context-aware supervision signals for non-target words. A novel Target-Mask attention scheme is devised to facilitate the parallel training and inference of this decoder.
3. Extensive experiments on two benchmark dialogue datasets, with both automatic and human evaluation results, show that our method helps to alleviate the over-confidence issue and significantly improves response diversity.

Related work
Diversity Promotion: Existing approaches for solving the low-diversity issue of neural dialogue models generally fall into two categories. The first category is training-based, where new training objectives are designed (Li et al., 2016; Zhang et al., 2018; Gao et al., 2019) or latent variables are introduced (Zhao et al., 2017; Zhou et al., 2018) in the dialogue model. Some methods also try to refine the training target used in the MLE loss (Jiang et al., 2019; Li et al., 2019), or directly penalize trivial responses with auxiliary loss terms (Welleck et al., 2020; Li et al., 2020). Unlike these existing approaches, our method adaptively adjusts the training target by utilizing the current predictions.
The second category is decoding-based, in which different heuristic decoding rules are designed (Holtzman et al., 2020;Kulikov et al., 2019). Note that these decoding techniques are independent of the model setting, and our method can be used in combination with these techniques.
Confidence Calibration: Modern deep neural networks suffer from the over-confidence issue (Guo et al., 2017; Kumar and Sarawagi, 2019), and various remedies have been proposed (Pereyra et al., 2017; Mukhoti et al., 2020; Lin et al., 2017). Following Jiang and de Rijke (2018) and Jiang et al. (2019), our method is proposed to tackle the over-confidence issue in order to improve the diversity of the generated responses. However, different from existing approaches, our method enables more flexible control over the target distribution.
Knowledge Distillation: Another important technique similar to our work is knowledge distillation, in which a learned teacher model is distilled to a student model by minimizing a KL term (Hinton et al., 2015; Kim and Rush, 2016). The most related work to ours is the C-MLM approach, in which a BERT model is fine-tuned to be a teacher. The primary difference between our approach and C-MLM is that our auxiliary decoder D_a is a one-layer module that is jointly trained with the dialogue model, whereas the BERT teacher in C-MLM contains many more parameters and is trained with an expensive pre-train-then-fine-tune process. Moreover, the target-masked attention scheme in D_a enables parallel inference of v for each training sequence Y; in contrast, multiple independent forward passes are required for the BERT teacher.

Figure 2: Overview of constructing the adaptive soft target q' using AdaLabel: the maximum probability p_max in the predicted distribution p is used to obtain an adaption factor ε, which is further used to combine the hard target q and the auxiliary distribution v into q'. A bi-directional auxiliary decoder D_a is used to produce v.

Background: MLE with Hard Target
The goal of generative dialogue modeling is to learn a conditional probability distribution p(Y|X), where X is the dialogue context, Y = y_1, ..., y_T is a response word sequence, and y_i ∈ V is a word from the vocabulary V. In an auto-regressive manner, p(Y|X) is factorized as ∏_t p(y_t|y_<t, X). For each target word y_t in the training sequence Y, a conventional MLE training approach tries to optimize the following cross-entropy loss:

L_MLE = − Σ_{k=1}^{|V|} q_k · log p(w_k | y_<t, X),    (1)

where q is a one-hot distribution (i.e., a hard target) that assigns a probability of 1 to the target word y_t and 0 otherwise, i.e., q_k = 1 only when w_k = y_t. For simplicity of notation, we omit the dependency on y_t in the notation of each distribution, i.e., different target words y_t in Y correspond to different values of q and p.
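As a concrete illustration of Eq. 1, the one-hot target collapses the cross-entropy sum to a single log-probability. A minimal plain-Python sketch (the helper name and toy distribution are ours; real implementations operate on batched tensors):

```python
import math

def hard_target_ce(p, target_idx):
    """Eq. 1 with a one-hot target q: the sum -sum_k q_k * log p_k
    collapses to -log p(y_t | y_<t, X), since q_k = 0 for k != y_t."""
    return -math.log(p[target_idx])

# Toy predicted distribution over a 4-word vocabulary.
p = [0.1, 0.6, 0.2, 0.1]
loss = hard_target_ce(p, target_idx=1)  # only the target word contributes
```

This is why the hard target pushes the model to place ever more mass on a single token: the loss is minimized only by driving p(y_t) toward 1.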

Method Overview
We propose to adaptively construct a soft target distribution q' to replace q in Eq. 1. Specifically,

q' = ε · q + (1 − ε) · v,    (2)

where ε ∈ [0, 1] is an adaption factor and v is an auxiliary distribution vector that depends on the current time step (see Figure 2 for an overview).
In this study, we constrain v to assign zero probability to the target word y_t and non-zero probabilities to the non-target words V_{≠y_t} = {y_i | y_i ∈ V, y_i ≠ y_t}. This constraint allows us to explicitly control the supervision assigned to y_t. Specifically, the first term ε · q and the second term (1 − ε) · v in Eq. 2 respectively determine how much probability q' assigns to y_t and V_{≠y_t}. This setting differs from conventional knowledge distillation (Kim and Rush, 2016) because it facilitates more flexible control over q', so that we can use the factor ε to determine the supervision signal provided for the target word y_t. The following sections detail how to compute ε and v.
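The construction of q' in Eq. 2 can be sketched in a few lines of plain Python (names and the toy distributions are ours; the real implementation mixes tensors over the full vocabulary):

```python
def adaptive_soft_target(eps, target_idx, v):
    """Eq. 2: q' = eps * q + (1 - eps) * v, where q is one-hot at the
    target word and v places zero mass on it.  The target word thus
    receives exactly eps, and the non-target words share (1 - eps)."""
    assert abs(v[target_idx]) < 1e-12, "v must assign zero to the target"
    return [eps * (1.0 if k == target_idx else 0.0) + (1.0 - eps) * v[k]
            for k in range(len(v))]

# Toy auxiliary distribution over a 4-word vocabulary; target is word 1.
q_prime = adaptive_soft_target(eps=0.6, target_idx=1, v=[0.5, 0.0, 0.3, 0.2])
```

Because v is itself a distribution with zero mass on y_t, q' always sums to one, and ε alone controls the supervision for the target word.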

Target Word Probability
We control the probability of the target word y_t in q' by manipulating the adaption factor ε in Eq. 2. Specifically, for a training dialogue pair (X, Y) and each target word y_t ∈ Y, the current distribution p(·|y_<t, X) is first calculated, and the maximum probability in this distribution is obtained:

p_max = max_{w ∈ V} p(w | y_<t, X).    (3)

ε is then obtained as:

ε = max(p_max, λ),    (4)

where λ serves as a lower bound of ε (i.e., ε ≥ λ). The basic intuition behind Eq. 4 is to set ε = p_max when p_max is reasonably large. This design prevents our model from receiving supervision sharper than p_max when the current prediction is confident enough.
Further, to ensure that the target word y_t always receives the largest probability in q', i.e., to ensure ε > (1 − ε) · max(v) (see Eq. 2), where max(v) is the maximum probability over the non-target words V_{≠y_t}, we have to enforce ε > max(v) / (1 + max(v)). Thus we propose to calculate the lower bound λ of ε as:

λ = max(v) / (1 + max(v)) + η,    (5)

where η > 0 is a hyper-parameter that controls the margin between the probabilities of the target word and the non-target words in q'.
To facilitate faster convergence and better learning of low-probability words, an empirical factor α ∈ [0, 1] is further introduced to adjust the calculation of ε on the basis of Eq. 4:

ε = α · max(p_max, λ) + (1 − α),    (6)

where α is calculated as the relative ratio to p_max:

α = p(y_t | y_<t, X) / p_max,    (7)

and p(y_t|y_<t, X) is the probability of the target word y_t. Note that Eq. 6 and Eq. 4 are equivalent if α = 1. Intuitively, α accelerates the training of low-frequency words: if y_t is of low frequency in the corpus, then y_t is usually under-trained and p(y_t|y_<t, X) is generally small. This leads to a small α and thus increases the probability of y_t in q'. Note that ε, λ, and α are all time-step-specific variables, whereas η is a fixed hyper-parameter; this allows the values to adapt to dynamic contexts. In our experiments, Eq. 6 is used to calculate ε.
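Putting Eqs. 3-7 together, the per-step computation of ε might look as follows. This is a sketch under our reading of the text (in particular, the exact algebraic forms of Eqs. 5 and 6 are reconstructions, and all names are ours):

```python
def adaption_factor(p, v, target_idx, eta=0.2):
    """Sketch of Eqs. 3-7: returns the per-step target probability eps.
    p is the model's predicted distribution, v the auxiliary distribution
    (zero mass on the target word), eta the margin hyper-parameter."""
    p_max = max(p)                          # Eq. 3: model's confidence
    lam = max(v) / (1.0 + max(v)) + eta     # Eq. 5: lower bound of eps
    alpha = p[target_idx] / p_max           # Eq. 7: in [0, 1]
    base = max(p_max, lam)                  # Eq. 4
    return alpha * base + (1.0 - alpha)     # Eq. 6: alpha = 1 -> Eq. 4
```

Note how the interpolation realizes the stated intuition: an under-trained target word yields a small α, which pushes ε toward 1 and hence toward the hard target, while a confidently predicted word yields ε ≈ p_max.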

Non-target Words Probabilities
The auxiliary distribution v in Eq. 2 is calculated using an auxiliary decoder D_a, which is a single-layer transformer-based decoder that is jointly optimized with the generation model. Figure 3 shows the structure of D_a, in which a novel target-masked attention scheme is devised to mask each target word y_t in the self-attention module of the decoder when calculating the corresponding v (see Figures 3b and 3c). In this way, bi-directional contexts can be utilized when predicting the auxiliary distribution v for y_t. Moreover, it is important to use only one decoder layer in D_a, because stacking multiple layers in D_a would leak the information of y_t into v.

Figure 3: (a) The auxiliary decoder D_a; (b) the target-masked attention scheme used to compute the auxiliary distribution v for the target word y_3 (specifically, y_2 is used as the query and y_3 is masked); (c) the attention pattern used in the target-masked attention scheme, where white dots represent masked positions.
Note that using one layer in D_a does not necessarily downgrade its performance (Kasai et al., 2021). Our experimental results in Section 5.1 indicate that, with the help of bi-directional contexts, the accuracy of D_a largely outperforms that of the uni-directional dialogue decoder, which is much deeper than D_a. Moreover, for a training response Y, the structure of D_a enables us to infer the auxiliary distributions for all the target words in Y in parallel within a single forward pass. This differs from the BERT teacher used in C-MLM, in which multiple independent forward passes are needed to obtain the teacher distributions for all the words in Y.
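The target-masked attention pattern of Figure 3c can be sketched as a boolean mask (this is our reading of the figure; plain nested lists stand in for attention-mask tensors):

```python
def target_masked_attention_pattern(seq_len):
    """Boolean attention mask for the auxiliary decoder D_a: the query
    at position i predicts the word at position i + 1, so it may attend
    to every position except i + 1.  True marks a masked position."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len - 1):
        mask[i][i + 1] = True  # hide each target word from its own query
    return mask
```

Since each query sees both past and future tokens (everything except its own target), a single forward pass yields bi-directional predictions for every position at once, which is what makes the parallel inference of v possible.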
When training D_a, the following standard MLE loss is optimized for each target word y_t:

L_a = − Σ_{k=1}^{|V|} q_k · log p_a(w_k | y_{≠t}, X),    (8)

in which the notation of q_k follows Eq. 1 and p_a denotes the distribution predicted by D_a from the bi-directional context. The outputs of D_a are used as the logits to infer v, which is further used in Eq. 2. Specifically, the logit of the target word y_t is masked to −∞ before the Softmax to ensure that y_t always receives zero probability in v. Moreover, we also follow the approach of Tang et al. (2020) to truncate the head and tail of the remaining logits before inferring v in Eq. 2, i.e., all the logits are ranked in descending order, and only the logits ranked from n to m are kept while the rest are masked to −∞. This sets the head and tail probabilities in v to zero. We argue that truncating the tail probabilities of v filters noise, while truncating the head probabilities encourages the dialogue model to focus more on low-probability words. In our experiments, we set n = 2 and m = 500. An extensive hyper-parameter search indicates that our method is not sensitive to the values of n and m.

Table 1: Dataset statistics.

               Train    Valid    Test
DailyDialog    65.8K    6.13K    5.80K
OpenSubtitles  1.14M    20.0K    10.0K
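The head/tail truncation of v described above might be sketched as follows (a toy plain-Python version with names of our own choosing; the actual implementation masks logit tensors in batch):

```python
import math

def truncated_aux_distribution(logits, target_idx, n=2, m=500):
    """Infer v from D_a's logits: the target logit is masked to -inf,
    then only logits ranked n..m (descending, 1-indexed) are kept before
    the softmax; all other words receive zero probability."""
    logits = list(logits)
    logits[target_idx] = float("-inf")  # v assigns zero mass to y_t
    order = sorted(range(len(logits)), key=lambda k: logits[k], reverse=True)
    keep = set(order[n - 1:m])          # ranks n..m survive the truncation
    kept = [logits[k] if k in keep else float("-inf")
            for k in range(len(logits))]
    z = sum(math.exp(x) for x in kept if x != float("-inf"))
    return [math.exp(x) / z if x != float("-inf") else 0.0 for x in kept]
```

With n = 2 the single largest remaining logit is dropped, which is what shifts probability mass away from the head word and toward lower-probability candidates.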
There are two major differences between our auxiliary decoder D a and the teacher model used in conventional knowledge distillation approaches: First, conventional teacher models usually carry more parameters than their students, whereas D a is rather light-weight. Second, conventional teacher models are typically pre-trained before being utilized in the distillation process, whereas D a is trained jointly with our dialogue model.

Dataset
We use two benchmark datasets for open-domain dialogue generation: DailyDialog (Li et al., 2017) is a high-quality multi-turn dialogue dataset that is collected from daily conversations. OpenSubtitles 1 contains dialogues collected from movie subtitles. Moreover, we follow Li et al. (2016) and Jiang et al. (2019) to focus on short conversations, i.e., dialogues with posts or responses longer than 100 tokens are removed. See Table 1 for more details.

Implementation Details
The backbone of our model is the transformer-based sequence-to-sequence model (Vaswani et al., 2017), and most hyper-parameters follow Cai et al. (2020). Specifically, the encoder and decoder each contain 6 layers. Each layer has 8 attention heads, and the hidden size is set to 512. The auxiliary decoder D_a follows the same hyper-parameter setting as the dialogue decoder, but it only contains one layer. The WordPiece tokenizer provided by BERT (Devlin et al., 2019) is used, and the Adam optimizer (Kingma and Ba, 2015) is employed to train our model from random initializations with a learning rate of 1e-4. η in Eq. 5 is set to 0.2 for all datasets. See Appendix A for more details.

1 http://opus.nlpl.eu/OpenSubtitles.php

Baselines
We compared our method with two groups of baselines that try to tackle the over-confidence issue.
The first group modifies the training target used to compute the loss function: 1) LS (Szegedy et al., 2016) uses the label smoothing approach to construct a target distribution by mixing the one-hot target with a uniform distribution; 2) FL (Lin et al., 2017) uses the focal loss to down-weight well-classified tokens at each time step; 3) FACE (Jiang et al., 2019) uses the frequency-aware cross-entropy loss to balance per-token training losses; specifically, relatively low losses are assigned to high-frequency words to explicitly tackle the over-confidence issue (we used the best-performing "Pre-weigh" version in our experiments); 4) F²-softmax factorizes the target distribution based on token frequencies.
The second group of baselines adds a penalty term to the standard MLE loss: 5) CP (Pereyra et al., 2017) adds a confidence penalty term to regularize the entropy of the model so that over-confident predictions are penalized; 6) UL (Welleck et al., 2020) adds an unlikelihood loss term to penalize frequently generated words; 7) NL (He and Glass, 2020) works similarly to UL, except that a negative loss term is used instead of the unlikelihood loss term; 8) D2GPo (Li et al., 2019) augments the MLE loss with a data-dependent Gaussian prior objective to assign different losses to different non-target words.
We also compared with: 9) CE, a vanilla Seq2Seq model trained with the cross-entropy loss. For fair comparison, the C-MLM model is not used as our baseline, since the BERT teacher in C-MLM requires a large amount of extra data to pre-train. Nevertheless, AdaLabel still surpasses C-MLM on various metrics (see Appendix F for more analysis).
All our baselines are adapted from the authors' official code, with the same backbone architecture and hyper-parameters as our model (see details in Appendix B). Following the original settings, a train-and-refine strategy is used in baselines 3, 6, and 7, i.e., these baselines are refined based on CE. We follow the setting of Jiang et al. (2019) and use a deterministic decoding scheme (specifically, greedy decoding) for our model and all baselines. Note that our method can be adapted to other decoding schemes such as beam search or top-K sampling; see Appendix C for a more detailed analysis.

Automatic Evaluation
Metrics: We first used automatic metrics to evaluate our method: 1) Distinct (Dist) (Li et al., 2016) calculates the proportion of unique n-grams (n=1, 2) in the generated responses and is widely used to measure response diversity. 2) Entropy (Ent) (Zhang et al., 2018) evaluates how even the empirical n-gram (n=1, 2) distribution is; higher scores indicate more diverse responses. 3) Low-Frequency Token Ratio (LF) (Li et al., 2019) further measures diversity by counting the ratio of low-frequency words in the generated responses; we chose words with a frequency of less than 100 in each corpus as low-frequency words. Over-confident models tend to omit low-frequency words (i.e., obtain low LF scores) and yield less diversified responses. 4) BLEU (Papineni et al., 2002) measures the n-gram (n=2, 3, 4) overlap between the generated responses and the references.
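For instance, the Dist-n metric can be computed as follows (a minimal sketch; tokenization and corpus handling are simplified, and the helper name is ours):

```python
def distinct_n(responses, n):
    """Dist-n (Li et al., 2016): ratio of unique n-grams to total
    n-grams over all generated responses (each response is a token list)."""
    ngrams = [tuple(r[i:i + n])
              for r in responses
              for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A model that repeats the same generic response across the test set produces few unique n-grams and therefore a low Dist score.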
Results: As shown in Table 2, our method AdaLabel outperforms all the baselines by large margins on both datasets. We can further observe that: 1) AdaLabel achieves the best diversity scores (Dist-1,2, Ent-1,2, and LF). This indicates that our method yields better training targets that help to produce more diverse responses; 2) The models that explicitly tackle the over-confidence issue (i.e., AdaLabel and FACE) generally outperform the other baselines in diversity-related metrics. For example, FACE obtains the second-best diversity scores (i.e., Dist, Ent, and LF) on the OpenSubtitles dataset. This verifies our motivation that alleviating the over-confidence issue helps to produce more diverse responses.
Note that our method also outperforms all the baselines using the stochastic decoding scheme. Please refer to Appendix C for more details.

Manual Evaluation
Metrics: Pairwise manual evaluations are conducted to further validate our method. Specifically, for a given dialogue post, our model's response is paired with the one from a baseline. Three individual annotators were employed to rank each response pair from three aspects: 1) Fluency (Flu.): which response is more fluent; 2) Coherency (Coh.): which response is more coherent to the context; 3) Informativeness (Info.): which response contains more informative content. We also asked the annotator to choose an overall preferred response (Pref.). Ties were allowed.
Results: 200 posts were randomly sampled from each of the two datasets, and 3.6K response pairs were generated in total. The inter-rater annotation agreement was measured using Fleiss' kappa κ (Fleiss, 1971). In particular, the κ values on the DailyDialog and OpenSubtitles datasets were 0.59 and 0.55, respectively, indicating moderate agreement.
As shown in Table 3, AdaLabel outperforms all the baselines on the informativeness measure. This means that our method can respond with more informative content. We can further observe that: 1) All models achieve competitive fluency, because it is easy for neural models to produce fluent responses by yielding trivial responses like "I don't know". However, our model surpasses most baselines in terms of fluency while ensuring high diversity scores. This demonstrates the superiority of our method in producing high-quality responses.
2) AdaLabel produces more coherent responses compared to most baselines. This verifies that our model does not sacrifice response quality when achieving high diversity scores. In fact, by controlling the model's confidence, more low-frequency words are encouraged, and thus AdaLabel can produce more relevant and coherent responses. This claim is further supported by the observation that our model achieves the best overall preference score among all the baselines.

Ablation study
Ablation studies were performed to verify the effect of each component in our method. Specifically, two groups of variants were tested. The first group validates the effectiveness of the calculated target word probability ε: 1) w/o ε directly sets a fixed value for ε in Eq. 2; the specific value of ε is searched from 0.1 to 0.7 with a stride of 0.1; 2) w/o α omits the empirical factor α in calculating ε, i.e., the value of ε in Eq. 2 is calculated using Eq. 4 instead of Eq. 6.
The second group validates the effectiveness of the non-target word probabilities v produced by D_a: 3) Orig. v does not truncate the head of v when inferring it from D_a; note that the truncation of the tail of v is still applied, since its effectiveness has already been proved in previous studies (Tang et al., 2020; Tan et al., 2019); 4) Uniform uses a uniform distribution as v in Eq. 2; note that, different from the baseline LS, the value of ε in this ablation model is calculated using Eq. 6, whereas the value of ε in LS is fixed; 5) Rand uses a random distribution as v in Eq. 2; 6) BERT fine-tunes a pre-trained BERT model to produce v, as in C-MLM.
Note that our dialogue model may benefit from the multi-task training of D_a, since D_a shares the same encoder with our dialogue model, and optimizing Eq. 8 may help the encoder to capture better features. For a fair comparison, we kept the task of optimizing D_a in ablation models 4-6, although D_a is not used to infer v. Table 4 shows the results of the ablation models on the DailyDialog dataset. As can be seen from the first two rows, our method of adaptively calculating ε improves the performance of our model by a large margin, and the empirical adjustment factor α further improves performance by facilitating the learning of low-probability words. The performance of ablation models 3-6 in Table 4 shows that v captures a reliable distribution and helps our model produce more diverse responses. Moreover, truncating the head of v enables the dialogue model to focus more on low-frequency words and thus facilitates more informative responses.
It is also interesting to note that our auxiliary decoder D_a surpasses the BERT teacher used in C-MLM in helping the dialogue model to produce more diverse responses. This further proves the effectiveness of D_a, considering that BERT contains six times more parameters than D_a and consumes much more computational resources.

Auxiliary Decoder
To further test the performance of D_a, we evaluated the average accuracy of D_a when predicting each target word in the test set (first row in Table 5). Specifically, a target word y_t in the reference response is regarded as correctly predicted if it is top-ranked in the predicted distribution p(·|y_<t, X). A better decoder is generally expected to obtain a higher accuracy. Table 5 also reports the accuracy of the uni-directional dialogue decoders in AdaLabel and CE. It can be seen that D_a makes substantially more accurate predictions with the help of bi-directional contexts, using only one layer. Moreover, the dialogue model's decoder in AdaLabel, which is guided by D_a, achieves better accuracy than that of CE. This further proves that our light-weight D_a is capable of producing an effective v.

Prediction Confidence
We also visualized the distribution of confidence scores assigned by each dialogue model to high-frequency words. Figure 4 shows the results of the four best performing models on the OpenSubtitles dataset. The spikes of high confidence scores observed in Figures 4b and 4d indicate that CE and FACE assign extremely high confidence scores to a large number of high-frequency words. Although the smoothed labels in LS manage to alleviate these high-confidence spikes (Figure 4c), a considerable number of words still receive high confidence scores in LS. Our model outperforms all the baselines in avoiding over-confident scores, thus alleviating the over-confidence issue. A similar trend is also observed on the DailyDialog dataset (see Appendix D for results of all models on both datasets).

Predicted Rare Word Distribution
Over-confident models produce less diversified responses because they usually under-estimate rare words. To evaluate the effectiveness of AdaLabel, we tested whether AdaLabel encourages more "rare words" in its generations. Specifically, the ratio of generated tokens corresponding to different token frequency bins is calculated, and the results on the OpenSubtitles dataset are shown in Figure 5. It can be seen that AdaLabel produces more rare words in the generated responses than other baselines. Similar results are also observed on the DailyDialog dataset (see Appendix E).

Conclusion
We address the low-diversity issue of neural dialogue models by introducing an adaptive label smoothing approach, AdaLabel. In our method, the probability of each target word is estimated based on the current dialogue model's prediction, and the probabilities for these non-target words are calculated using a novel auxiliary decoder D a . A target-masked attention scheme is introduced in D a to help capture forward and backward contexts. We evaluate our method on two benchmark datasets: DailyDialog and OpenSubtitles. Extensive experiments show that our method effectively alleviates the over-confidence issue and improves the diversity of the generated responses. As future work, we believe this method is extensible to other text generation tasks.

A Implementation Details
This appendix describes the implementation details of our model. All our experiments are implemented with Python 3.7.4, PyTorch 1.7.1, and the OpenNMT package (Klein et al., 2017). Training is performed on one TITAN Xp GPU. Our model's backbone is the transformer-based sequence-to-sequence model; the encoder and decoder each contain 6 transformer layers with 8 attention heads, and the hidden size is set to 512. The dimension of the feed-forward layer is also 512. The WordPiece tokenizer provided by BERT-base-uncased is used (the vocabulary contains 30,522 tokens). The total number of parameters in our model is about 90M. The Adam optimizer is employed to train our model from random initializations with β1 = 0.9, β2 = 0.999, ε = 1e−9, and a learning rate of 1e-4. The batch size is set to 64 with gradient accumulation of 2, so that 2 × 64 samples are used for each parameter update. The model is evaluated every 1000 steps on the validation set. We use early stopping with a patience of 10 and 30 for DailyDialog and OpenSubtitles, respectively; specifically, the model stops training when the evaluation perplexity and accuracy have not improved for "patience" evaluations. Model training takes 4 hours and 3 days on DailyDialog and OpenSubtitles, respectively.
The auxiliary distribution produced by the auxiliary decoder is smoothed with the temperature scaling approach. The temperature used in this process is searched in [1, 1.5, 2]; temperature values of 1.5 and 1.0 are used for DailyDialog and OpenSubtitles, respectively. The hyper-parameter η is set to 0.2 for all datasets. The fixed value of ε in our ablation model w/o ε is searched in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], and we find that the value of 0.1 works best.
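The temperature scaling mentioned above can be sketched as follows (plain-Python version with names of our own choosing; a temperature τ > 1 flattens the distribution):

```python
import math

def temperature_softmax(logits, tau=1.5):
    """Temperature scaling: logits are divided by tau before the softmax.
    tau > 1 smooths (flattens) the resulting distribution; tau = 1
    recovers the ordinary softmax."""
    scaled = [x / tau for x in logits]
    mx = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - mx) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

Smoothing v in this way spreads the auxiliary probability mass over more non-target candidates before it is mixed into the soft target of Eq. 2.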

B Baseline Implementation Details
This appendix contains more implementation details of our baselines. All the baselines utilize the same backbone architecture and basic hyper-parameter settings as our model (see Appendix A). The hyper-parameters specialized for each baseline are determined with grid search based on the Dist scores.

C Automatic Evaluation Results with Other Decoding Schemes
This appendix reports the automatic evaluation results of our model and all the baselines when different decoding schemes are used. Specifically, Table 6 shows the results for the beam search decoding scheme (beam size of 5), and Table 7 shows the results when the top-K decoding scheme (k = 10) is used. Note that for F²-softmax, we use the decoupled top-k sampling as the authors suggested.
As can be seen from Table 6 and 7, our method outperforms all the baselines on the diversityrelated scores (i.e., Dist, Ent, and LF) by a large margin. This indicates that our method can produce more diverse responses even with the stochastic based decoding scheme.
We also include the results of AdaLabel when the greedy decoding scheme is used in Tables 6 and 7 (the second line from the bottom). It is interesting to see that the greedily decoded responses from AdaLabel are more diverse than those of some baselines decoded using the sampling scheme (see Table 7). Moreover, AdaLabel with the greedy decoding scheme achieves the best BLEU among all the baselines on both datasets.

D Prediction Confidence
This appendix reports the prediction confidence scores assigned by each model to high-frequency words. Specifically, words occupying the top 40% of the frequency mass in the training set of each dataset are regarded as high-frequency words. Figure 6 shows the results of our model and all the baselines on the DailyDialog dataset, and Figure 7 shows the corresponding results on the OpenSubtitles dataset. It can be seen that most of our baselines assign extremely high confidence scores (nearly 1.0) to these high-frequency words, resulting in a spike of high confidence scores in the plotted distribution. Our model outperforms all the baselines in avoiding assigning extremely high confidence scores to these high-frequency words.

E Predicted Rare Word Distribution on DailyDialog
This appendix shows the distribution of rare words in the generated responses on the DailyDialog dataset (see Figure 8). It can be seen that more "rare words" are predicted by our method on the DailyDialog dataset. This observation is in line with the results on the OpenSubtitles dataset as reported in Section 5.3.

F Use BERT Model to Obtain v
This appendix provides more experimental results comparing with the C-MLM model: 1) CMLM exactly follows the original C-MLM setting, i.e., the teacher distribution produced by the BERT model is merged with the one-hot distribution using a fixed ε. 2) CMLM+ε adaptively adjusts the value of ε using Eq. 6 of our paper. 3) CMLM+ε+D_a adds an additional training task to optimize the auxiliary decoder D_a on the basis of CMLM+ε; it is expected that optimizing D_a helps our dialogue encoder to capture better representations. The trained D_a is not used in the training and inference phases of our dialogue model. Note that the last model, CMLM+ε+D_a, is the same as ablation model 6 (BERT) reported in our paper. As can be seen from Table 8, our approach of adaptively changing ε helps to produce better dialogue responses, and the training of D_a helps our dialogue encoder to learn better representations.

G Case study
We sampled some generated cases on the DailyDialog and OpenSubtitles datasets. The results of our model and some competitive baselines are shown in Table 9 and Table 10. It can be seen that the responses generated by our method are coherent to the context and contain richer content. Moreover, our model also produces more rare words, which makes our responses more diverse.