Adaptive Bridge between Training and Inference for Dialogue Generation

Although exposure bias has been widely studied in some NLP tasks, it faces unique challenges in dialogue response generation, the representative one-to-various generation scenario. In real human dialogue, there are many appropriate responses for the same context, differing not only in expression but also in topic. Because the gap between the various ground-truth responses and the generated synthetic response is therefore much larger, exposure bias is more challenging in the dialogue generation task. Moreover, as MLE encourages the model to learn only the words common to different ground-truth responses while ignoring the interesting and specific parts, exposure bias may further aggravate the common response generation problem, producing replies such as “I don’t know” and “HaHa?”. In this paper, we propose a novel adaptive switching mechanism, which learns to automatically transition between ground-truth learning and generated learning according to a word-level matching score, such as the cosine similarity. Experimental results on both the Chinese STC dataset and the English Reddit dataset show that our adaptive method achieves a significant improvement in terms of metric-based evaluation and human evaluation, compared with state-of-the-art exposure bias approaches. Further analysis on the NMT task also shows that our model can achieve a significant improvement.


Introduction
Auto-regressive models (ARM) are widely used for natural language generation (NLG) tasks, such as machine translation (Sutskever et al., 2014; Wu et al., 2018), dialogue response generation (Li et al., 2017), image captioning (Lin et al., 2014) and video description (Donahue et al., 2015). They utilize the encoder-decoder framework to predict the next token conditioned on the previous tokens, and minimize the cross-entropy between the generation and the ground truth as their objective function. Specifically, at training time, the ground truth is used as the previous tokens, which forces the model to directly learn the distribution of the ground truth. At inference time, however, the previous tokens come from the ARM decoder itself, so their distribution differs from the input distribution seen during training.
* Work done at Data Science Lab, JD.com. † Corresponding author.
Although this discrepancy, named exposure bias, has been studied in some classic NLG tasks, such as neural machine translation (NMT) (Venkatraman et al., 2015; Zhang et al., 2019), it faces unique challenges in dialogue response generation, the representative one-to-various generation scenario. In human dialogue, given a context, people can give many relevant and appropriate responses, differing not only in expression but also in topic. Take Dialogue 1 in Table 1 as an example: given the context "I heard that Guangzhou has become a summer resort", response 1 and response 2 share the same topic but use different tokens. In this various-expression situation, as in the NMT task, the data distribution and the model distribution are relatively easy to fit, even with the exposure bias problem. In the different-topics situation, however, the data distribution often differs from the model distribution, because it is too divergent and covers the distinct word distributions of each topic. Through data analysis, we find that in the dialogue generation task the gap between the various ground-truth responses and the generated sentences is larger than in NMT. We calculate overlap measures at the word level and the semantic level, i.e., BLEU and cosine similarity, between the generated sentences and the ground-truth sentences. On the NMT WMT'14 dataset, the BLEU and similarity are 27.38 and 0.96, respectively, while on the dialogue Reddit dataset they are 2.17 and 0.81, respectively. The overlap measures of the dialogue generation task are thus significantly lower than those of the NMT task, which indicates the severity of the exposure bias problem in dialogue generation.
Moreover, as Maximum Likelihood Estimation (MLE) encourages the model to learn only the words common to different ground-truth responses while ignoring the interesting and specific parts, exposure bias may aggravate the common response problem of the generation model, owing to the strict matching between the generated response and the ground-truth responses. Take Dialogue 2 in Table 1 as an example: response 1 is "What kind of cat is this? So cute, I want it too.", response 2 is "I really want a cat like this to play with my son." and response 3 is "Wow, it's so cute, I have a kitten like this in my house.". If we train the model with strict word-level matching between the generated response and the ground truth, it can learn only the common words, i.e., "So cute, I want it", and ignores the specific parts, i.e., "What kind of cat is this?". Therefore, it is beneficial to relax the strict matching mechanism for the dialogue generation task.
In this paper, we propose a novel Adaptive switch mechanism as a Bridge (AdapBridge), which introduces the generator distribution into the training phase and learns to automatically transition between ground-truth learning and generated learning according to word-level matching scores, such as the cosine similarity. Specifically, at each training step, we calculate the cosine similarity of each generated word with respect to all its ground truths. If the matching score is greater than the threshold, the generated word is fed to the decoder; otherwise, the ground-truth word is fed for training. The threshold increases as the training epochs grow. With this adaptive sampling scheme, the switch mechanism can consider the generation quality of every word, i.e., the relevance between the generated word and the ground truth, to decide whether to use generated learning.
We evaluate the proposed models on two public datasets, i.e., the Chinese STC dataset and the English Reddit dataset. Experimental results show that our models significantly outperform the state-of-the-art exposure bias models with respect to both metric-based evaluations and human judgments. Further analysis on the NMT task also shows that our model can achieve a significant improvement.
The main contributions of this paper include:
• We study the exposure bias problem in the dialogue generation task, one of the one-to-various generation scenarios, and find that exposure bias may further lead to the common response generation problem.
• We propose the adaptive switch mechanism with word-level matching scores to determine the training input source, in order to resolve the common response problem.
• We evaluate AdapBridge on two public dialogue datasets and conduct rigorous experiments to demonstrate the effectiveness of our proposed models. Further analysis on the NMT task also shows that our model can achieve a significant improvement.

Related Work
This section briefly introduces recent research progress in the literature related to this work.
To solve the exposure bias problem in auto-regressive or seq2seq models (Sutskever et al., 2014; Welleck et al., 2019; Holtzman et al., 2019), Venkatraman et al. (2015) proposed Data as Demonstrator (DAD), which augments the training set with tokens predicted by the model so that the training set better matches the test distribution. The Scheduled Sampling (SS) method randomly samples previously generated words to replace the ground-truth words in the model input at training time. Zhang et al. (2019) explored this method further by sampling the previous words with decay, not only from a word-level oracle but also from a sentence-level oracle with a semantic metric. The main idea of this family of methods is to introduce the model's own predictions into its input at training time, reducing the discrepancy between training and inference and thereby alleviating the exposure bias problem. In comparison to those methods and related ideas (Qi et al., 2020; Goodman et al., 2020), our proposed method scores each generated word and adaptively determines whether each input word during training comes from the ground truth or from the model's prediction.
Alternative methods based on Reinforcement Learning (RL) (Williams, 1992) have been explored for generation tasks, in particular for NMT. Mixed Incremental Cross-Entropy Reinforce (MIXER) (Ranzato et al., 2016) leverages a hybrid loss function that combines cross-entropy and REINFORCE to directly optimize the metrics used at test time, such as BLEU or ROUGE. There are many other similar works (Shen et al., 2016; Shao et al., 2018). More recently, text generation via Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), known as Text GANs, has attracted the attention of researchers (Nie et al., 2019; Zhou et al., 2019; Wu et al., 2021; Scialom et al., 2020). They frame the problem under the GAN paradigm and use RL-based algorithms (Williams, 1992) to estimate gradients, because text generation is discrete. However, neither RL nor Text GANs can avoid the high variance of gradient estimation caused by sparse rewards, which makes the training process unstable and limits improvements.
Different from traditional methods, our proposed model can adaptively determine whether the current input word is from ground truth or from generation with the word-level matching scores.

Proposed Method
Given a context sentence $X^k = \{x^k_1, x^k_2, \cdots, x^k_{S_k}\}$ and a target response sentence $Y^k = \{y^k_1, y^k_2, \cdots, y^k_{T_k}\}$, where $S_k$ and $T_k$ are the lengths in words of the context and the response, respectively, a dialogue generation model based on the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) directly models the response probability:

$$P(Y^k \mid X^k, \theta) = \prod_{t=1}^{T_k} p(y^k_t \mid y^k_{<t}, X^k, \theta), \tag{1}$$

where $\theta$ are the parameters of the model and $y^k_{<t}$ denotes the previous ground-truth words. Given a set of training examples $D = \{X^k, Y^k\}_{k=1}^{N}$, the standard training objective is to minimize the negative log-likelihood of all the training data:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{N} \log P(Y^k \mid X^k, \theta) = -\sum_{k=1}^{N} \sum_{t=1}^{T_k} \log p(y^k_t \mid y^k_{<t}, X^k, \theta),$$

where $N$ is the number of training examples. Unlike at training time, during inference the probability of each target word $p(y^k_t \mid y^k_{<t}, X^k, \theta)$ in Equation 1 is conditioned on the previously generated words $y^{k*}_{<t}$ rather than on the ground truth $y^k_{<t}$, because the ground-truth words are not available at inference time. This discrepancy is called exposure bias.
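As a small worked illustration of the training objective described above, the sketch below computes the negative log-likelihood from per-token probabilities. The probability values are made-up numbers and the function names are hypothetical; in a real model each value would be $p(y^k_t \mid y^k_{<t}, X^k, \theta)$ obtained with the ground-truth prefix as decoder input.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one response.

    step_probs[t] is the probability the model assigns to the t-th
    ground-truth token given the ground-truth prefix (teacher forcing).
    """
    return -sum(math.log(p) for p in step_probs)

def corpus_nll(all_step_probs):
    """Sum of the per-response negative log-likelihoods over a corpus."""
    return sum(sequence_nll(probs) for probs in all_step_probs)

# Toy example with invented probabilities for two short responses.
toy_corpus = [[0.5, 0.25], [0.5]]
print(corpus_nll(toy_corpus))
```

Because the log of a product is the sum of the logs, minimizing this quantity is the same as maximizing the product of the per-token conditional probabilities of each response.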

AdapBridge
The architecture of our model is illustrated in Figure 1. We first define a sampling function:

$$\hat{y}^k_{<t} = \mathrm{SamFun}(y^k_{<t}, y^{k*}_{<t}), \tag{4}$$

where $y^k_{<t}$ and $y^{k*}_{<t}$ are the inputs of the sampling function, representing the ground-truth words and the generated words, respectively, and $\hat{y}^k_{<t}$ denotes the decoder inputs produced by the sampling function at the $t$-th time step, which may contain both ground-truth and generated words.

Figure caption (fragment, referring to Figure 1): $I(> \beta)$ is an indicator function that outputs 1 if its input is above $\beta$ and 0 otherwise. $\alpha$ and $\beta$ both increase as the training epochs grow.

In this framework, to predict the $t$-th target word $y^{k*}_t$, we follow these steps:
• The decoder predicts the first $t-1$ words $y^{k*}_{<t}$, which serve as the previously generated words.
• Use the sequences $y^{k*}_{<t}$ and $y^k_{<t}$ (the ground-truth words) as inputs of SamFun (Equation 4), and obtain its output $\hat{y}^k_{<t}$.
• Replace the model inputs $y^k_{<t}$ in Equation 1 with $\hat{y}^k_{<t}$, then predict the $t$-th word $y^{k*}_t$.
SamFun can be any function, e.g., random sampling. Its role is to replace the corresponding ground-truth words in $y^k_{<t}$ with the generated words $y^{k*}_{<t}$. We propose a novel SamFun called AdapBridge, which is shown in Figure 2.
The main idea of AdapBridge is simple: we first use the model to generate all words with Equation 1, and then compute the pairwise cosine similarity between the generated words and the ground-truth words. If a generated word has been learned well enough (it is identical to a ground-truth word or a synonym of one), its maximum cosine similarity will be close to 1 and above the threshold β shown in Figure 2. We can then use this generated word to replace the ground-truth word, which introduces the generator distribution into the training phase. The algorithm is summarized in Table 2.

Sampling with Increase
The probability α and the threshold β determine how often the sampling function is applied and how similar a generated word must be to the ground truth, respectively. Note that when α = 0 training is identical to standard training with ground-truth inputs, while when α = 1 the model is trained as in inference. If α is set too low (≈ 0), the decoder inputs will almost always be ground truth, and the model will not be able to cope with the unexpected words it predicts at inference time. On the other hand, if α is set too high (≈ 1) at the beginning of training, the model will yield tokens almost randomly because it is not yet well trained, which may lead to slow convergence. Similarly, because the model cannot produce words with high cosine similarity scores at the beginning, β should start low so that some ground-truth words can be replaced by generated words, and its value should increase as the training steps or epochs grow. In this sense, both α and β should increase as training proceeds. In summary, the probability α determines whether the switching mechanism is executed, and the threshold β determines which words in the generated sentence replace ground-truth words, according to their cosine similarity scores.

Table 2: Summary of the algorithm.
Input:
• Sequence of generated words $y^{k*}_{<t}$;
• Sequence of ground-truth words $y^k_{<t}$;
• Word embedding of the model, with the size of the shared vocabulary;
• Number of epochs n.
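The switching step described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the two sequences are assumed to have the same length, each generated word is compared with every ground-truth word of the same prefix, and the embedding is a plain list of vectors indexed by token id.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def adap_bridge(gt_ids, gen_ids, embedding, alpha, beta, rng=random):
    """Mix ground-truth and generated token ids for the decoder input.

    With probability 1 - alpha the switch is skipped and the ground truth
    is returned unchanged. Otherwise a generated token replaces the
    ground-truth token at its position when its maximum cosine similarity
    to the ground-truth words exceeds beta.
    """
    if len(gt_ids) != len(gen_ids):
        raise ValueError("sequences must have the same length")
    if rng.random() >= alpha:
        return list(gt_ids)
    mixed = []
    for gt_id, gen_id in zip(gt_ids, gen_ids):
        best = max(cosine(embedding[gen_id], embedding[g]) for g in gt_ids)
        mixed.append(gen_id if best > beta else gt_id)
    return mixed

# Toy usage: a 4-word vocabulary with orthogonal embeddings (an assumption).
toy_embedding = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
mixed = adap_bridge([0, 1, 2], [0, 3, 1], toy_embedding,
                    alpha=1.0, beta=0.5, rng=random.Random(0))
print(mixed)
```

In the toy call, the generated token 3 matches no ground-truth word and is replaced by the ground truth, whereas the generated token 1 is kept because it matches a ground-truth word exactly, even though that word sits at a different position.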

Increase Function of α
We define α as an increasing function of the number of training epochs n (Equation 5), where k ≥ 1 is a hyper-parameter that determines the speed of convergence. In addition, we add a parameter w, which keeps α close to 0 during the first w epochs of training. It is usually set to half of the total number of training epochs to ensure that the model is trained well enough to generate reasonable words. The curve of this function can be seen in Figure 3.

Increase Function of β
At the beginning of training, the cosine similarity scores of the words generated by the model are generally low, so lowering β appropriately can help. On the other hand, towards the end of training, β should increase, because after the first w epochs the model has been trained well enough to generate meaningful words. However, unlike α (see Figure 3), β cannot start from zero. Intuitively, even at the beginning, a certain threshold is needed to ensure that the generated words that replace ground-truth words are of high quality. We thus propose a schedule that increases β as a function of the α computed with Equation 5, where γ is the lowest similarity score threshold. Throughout training, a ground-truth word can be replaced only when the score of the generated word is greater than γ. Like Equation 5, this function is strictly monotonically increasing, and its curve can be seen in Figure 3.
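To make the qualitative behaviour of the two schedules concrete, here is a sketch that uses assumed functional forms. The logistic ramp for α and the linear map from α to β are illustrative choices that only satisfy the properties stated in the text (strictly increasing, α small well before epoch w, β never below γ); they are not the paper's Equation 5 or its β formula, and the function names are hypothetical.

```python
import math

def alpha_schedule(epoch, w, k=1.0):
    """Assumed logistic ramp centred at epoch w; larger k gives a slower rise.

    Written in a numerically stable form so large |epoch - w| cannot overflow.
    """
    z = (epoch - w) / k
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def beta_schedule(alpha, gamma):
    """Assumed linear map: beta rises from gamma (alpha = 0) to 1 (alpha = 1)."""
    return gamma + (1.0 - gamma) * alpha

# Toy run: 20 epochs, w set to half the epochs, gamma chosen arbitrarily.
for epoch in range(0, 21, 5):
    a = alpha_schedule(epoch, w=10, k=2.0)
    print(epoch, round(a, 3), round(beta_schedule(a, gamma=0.6), 3))
```

Any other strictly increasing pair of functions with the same endpoints would serve the illustration equally well; the point is only that β tracks α while staying above the floor γ.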

Experiments
In this section, we evaluate our proposed model on both Chinese STC and the English Reddit dataset.

Experimental Settings
We first introduce datasets, baselines, parameters setting and evaluation measures.

Datasets
We use two public one-to-many single-turn dialogue datasets in our experiments. The Chinese Weibo dataset, named STC, consists of 4,391,266, 23,532 and 21,161 dialogue context-response pairs for the training, validation and testing sets, respectively. We remove those pairs in which the context contains the response or vice versa, and obtain 4,295,557, 23,039 and 20,749 pairs for the three sets. The average number of responses per context in STC is 19.7. The English Reddit dialogue corpus, named Reddit, is extracted from Reddit comments and posts by a script. The original data consists of 6 million dialogues from all even months of 2011. We use the official script for tokenization, and remove duplicates and sentences shorter than 3 or longer than 64 tokens. If a context has more than 20 responses, we randomly select 20 of them and ignore the others; the average number of responses per context in Reddit is 6.2. Finally, we randomly split the data into training, validation and testing sets, which contain 1,107,860, 23,183 and 12,429 pairs, respectively.
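The preprocessing rules described above can be sketched as a single filter. This is an illustration only: the paper applies the containment rule to STC and the duplicate, length and response-cap rules to Reddit, whereas the hypothetical function below applies them together, measures length in whitespace-separated tokens, and tests containment on raw strings.

```python
import random
from collections import defaultdict

def filter_pairs(pairs, min_len=3, max_len=64, max_responses=20, seed=0):
    """Filter (context, response) string pairs.

    Drops exact duplicates, sentences outside [min_len, max_len] tokens,
    and pairs where one side contains the other; then keeps at most
    max_responses randomly chosen responses per context.
    """
    rng = random.Random(seed)
    seen = set()
    by_context = defaultdict(list)
    for context, response in pairs:
        if (context, response) in seen:
            continue  # exact duplicate
        seen.add((context, response))
        lengths = (len(context.split()), len(response.split()))
        if not all(min_len <= n <= max_len for n in lengths):
            continue  # too short or too long
        if context in response or response in context:
            continue  # one side contains the other
        by_context[context].append(response)
    kept = []
    for context, responses in by_context.items():
        if len(responses) > max_responses:
            responses = rng.sample(responses, max_responses)
        kept.extend((context, r) for r in responses)
    return kept

# Toy usage with invented sentences.
toy_pairs = [
    ("how are you today", "i am fine thanks"),
    ("how are you today", "i am fine thanks"),
    ("hi there my friend", "hi"),
]
print(filter_pairs(toy_pairs))
```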

Baselines and Parameters Setting
Three baselines are used for comparison: the Transformer-based model (Vaswani et al., 2017), and Random Sampling with the word-level oracle (RS-Word) and with the sentence-level oracle (RS-Sentence) (Zhang et al., 2019). For STC, we use Chinese words as input and set the vocabulary size to 10,599. For Reddit, context-response pairs are encoded using byte-pair encoding (BPE) (Sennrich et al., 2016) with a vocabulary of 11,527 tokens. For a fair comparison among all baseline models and our model, the dimension of all word embeddings is 512, and the beam size in testing is 5. The Transformer model has 6 layers in both the encoder and the decoder, with 8 heads in multi-head attention. All parameters are initialized from the uniform distribution over $[-0.1, 0.1]$. We adopt the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and a weight decay of $10^{-8}$. We set the learning rate to 0.0007 and the maximum number of tokens in a batch to 8192, with an update frequency of 2. We run all models on 4 Tesla P40 GPU cards with PyTorch. The code will be released when this paper is accepted.

Evaluation Measures
Both quantitative metrics and human judgments are used in our experiments. Specifically, the quantitative metrics include traditional metrics, such as PPL and the BLEU score (Papineni et al., 2002), and the Distinct metric (Li et al., 2016), which was recently proposed to evaluate the diversity of the generated responses by counting the distinct unigrams and bigrams in them. We also evaluate each generated response by calculating its BLEU score against every reference response, and use the highest BLEU score to represent the quality of the generated response. The average of these highest BLEU scores over the testing set is named AH-BLEU. BLEU scores are calculated with the NLTK toolkit.
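The Distinct and average-highest computations described above can be sketched as follows. Several details are assumptions: Distinct-n is normalized here by the total number of generated n-grams, the function names are hypothetical, and `clipped_unigram_precision` is only a simple stand-in scorer, not BLEU. For AH-BLEU, a sentence-level BLEU function would be passed as `score_fn` instead.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over the generated responses."""
    all_ngrams = [g for tokens in responses for g in ngrams(tokens, n)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def average_highest(hypotheses, references_per_hyp, score_fn):
    """Score each hypothesis against each of its references, keep the
    highest score per hypothesis, and average over the test set."""
    best = [max(score_fn(ref, hyp) for ref in refs)
            for hyp, refs in zip(hypotheses, references_per_hyp)]
    return sum(best) / len(best)

def clipped_unigram_precision(reference, hypothesis):
    """Stand-in scorer: fraction of hypothesis tokens also in the reference,
    with counts clipped by the reference counts."""
    overlap = Counter(hypothesis) & Counter(reference)
    return sum(overlap.values()) / len(hypothesis) if hypothesis else 0.0

# Toy usage with invented token lists.
generated = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
print(distinct_n(generated, 1), distinct_n(generated, 2))
```

Taking the maximum over references rewards a response that matches any one of the valid replies, which suits the one-to-many setting better than averaging over all references.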
For human evaluation, we randomly sample 300 contexts together with the responses generated for them by the different models. Three annotators (all students majoring in computer science) are asked to score each context-response pair on a scale where 3, 2 and 1 mean relevant, common and no-relevant, respectively, based on the coherence of the generated response with respect to the context. The mean score of a model is the average of all scores given by the three annotators to the context-response pairs generated by that model. In addition, to obtain the relative scores of the different models, we also have the ground-truth context-response pairs evaluated by the annotators.

Experimental Results
In this section, we demonstrate our experimental results on the two public datasets. Table 3 shows the quantitative evaluation results. From this table, we can see that models with switch mechanism, such as RS-Word, RS-Sentence and AdapBridge, outperform the traditional Transformer-based model in terms of BLEU, Distinct-1 and Distinct-2 evaluations. The results show that the switch mechanism plays an important role in the dialogue generation task.

Metric-based Evaluation
RS-Word and RS-Sentence both replace ground-truth tokens with generated tokens using random scheduled sampling. However, both perform worse than our proposed model, because our model considers the relevance between the generated words and the ground truth through word-level matching scores. Taking the BLEU score on the STC dataset as an example, the BLEU-4 score of AdapBridge is 2.17, which is better than those of RS-Word and RS-Sentence, i.e., 2.05 and 2.12. In particular, our model achieves the best AH-BLEU-2 score on both datasets, a significant performance gain. This shows that the responses of our model are of higher quality than those of the baselines.
The diversity of responses can be evaluated by the Distinct score. As shown in Table 3, our AdapBridge achieves significant performance gains. Taking the Reddit results in Table 3 as an example, the proposed AdapBridge model improves on the Transformer, RS-Word and RS-Sentence models by 2.65, 0.37 and 0.48 Distinct-2 points, respectively. We can also note that our model has the highest Distinct score on both the STC and Reddit datasets, which indicates that our model can generate more diverse responses and avoid generating common responses. In summary, compared with the baselines, our proposed AdapBridge model is able to generate high-quality and diverse responses. We also conducted a significance test, and the result shows that the improvements of our model are significant on both datasets, i.e., p-value < 0.01.

Table 3: Quantitative evaluation results on the STC and Reddit datasets. Columns: Model, PPL, BLEU-2(%), BLEU-4(%), DIS-1(%), DIS-2(%), AH-BLEU-2(%).

Human Evaluation
The human evaluation results are shown in Table 4. The percentages of relevant, common and no-relevant responses are given to evaluate the quality of the responses generated by the different models. From the results we can see that our AdapBridge gets the highest score in human evaluation. Taking STC as an example, compared with Transformer, RS-Word Oracle and RS-Sentence Oracle, AdapBridge achieves performance gains of 22.38%, 5.73% and 7.65% on the relevant score. For the Mean score, we observe that our AdapBridge generates the most relevant responses while generating fewer no-relevant responses, which indicates that the responses generated by our model are attractive to annotators. We also conducted a significance test, and the result shows that the improvements of our model are significant on both datasets, i.e., p-value < 0.01.

Case study
In this section, we conduct case studies to show that our model can generate more relevant and diverse responses than the baseline models. We give two examples in Table 5. In Example 1, the response of Transformer is "Is this a fish or a fish?", which is an unreasonable sentence, because common sense requires the two occurrences of "fish" to be different words. The response of RS-Word repeats "want to eat fish" twice, which is part of the context. Although the response of RS-Sentence, "How lovely! I want to eat, too.", is relevant, it conforms to a common response paradigm, such as "how · · · , I want it, too" or "what's this, I want to · · · ". If the context contains food, animals, locations, etc., such responses all seem appropriate, which makes them unattractive to humans. In contrast, the response generated by our AdapBridge is more specific and relevant, i.e., "What kind of fish is this?". We can see a similar phenomenon in Example 2 of Table 5: with the context "Venice, the city of water, the city of dreams.", Transformer repeats content from the context, and the responses of RS-Word and RS-Sentence are both common responses of the kind mentioned above. Compared with the responses generated by the baseline models, the AdapBridge response "I want to go here with my parents." is more relevant and attractive. These results indicate that our proposed model can generate high-quality and attractive responses with the adaptive switch mechanism.

AdapBridge on NMT
The method we propose can also be easily adapted to neural machine translation (NMT). With this task we want to investigate whether AdapBridge can help improve the performance of NMT, a classic natural language generation task. We perform experiments on the WMT'14 English→German (En→De) dataset, which contains 3,900,502, 39,414 and 3,003 sentences for the training, validation and testing sets, respectively. We train the Transformer-based model with the same settings as described in Section 4.1.2, and then measure the translation quality with BLEU. The evaluation results are listed in Table 6. From the results, we can see that our method also achieves significant performance gains, improving on the Transformer-based model by 0.95 BLEU-4 points on average. For the BLEU-2 score, our model is slightly lower than the RS-Sentence model, as in the results in Table 3; this can be attributed to the sentence-level information of the RS-Sentence oracle. To analyze the gap between the ground-truth and generated sentences, we calculate the cosine similarity between the hidden representations of ground-truth sentences and generated sentences with a trained BERT model (Wolf et al., 2020), and obtain similarity scores of 0.96 and 0.81 on the WMT'14 and Reddit datasets, respectively. At the same time, we also notice that the BLEU score of NMT is much higher than that of the dialogue generation task. These overlap measures indicate the severity of the exposure bias problem in dialogue generation, as analyzed in Section 1.

Table 6: Evaluation results on the WMT'14 En→De dataset. Columns: Model, BLEU-2(%), BLEU-4(%).

Conclusion
In this paper, we propose a novel adaptive switch mechanism with word-level matching scores, named AdapBridge, to address the problem of exposure bias in the dialogue generation task. Our core idea is to use the word-level matching scores to determine, at each training step, whether the input comes from the ground truth or from the model's prediction. Experimental results show that our model significantly outperforms previous baseline models. Further analysis on NMT also indicates that our model can achieve significant improvements on different generation tasks.
In future work, we plan to design further scoring methods, e.g., BERTScore or BLEU, to guide the model to select better words. It would also be interesting to extend our AdapBridge model to other generation tasks, such as abstractive summarization.