A Three-Stage Learning Framework for Low-Resource Knowledge-Grounded Dialogue Generation

Neural conversation models have shown great potential for generating fluent and informative responses by introducing external background knowledge. Nevertheless, it is laborious to construct such knowledge-grounded dialogues, and existing models usually perform poorly when transferred to new domains with limited training samples. Therefore, building a knowledge-grounded dialogue system under the low-resource setting is still a crucial issue. In this paper, we propose a novel three-stage learning framework based on weakly supervised learning which benefits from large-scale ungrounded dialogues and an unstructured knowledge base. To better cooperate with this framework, we devise a variant of Transformer with a decoupled decoder which facilitates the disentangled learning of response generation and knowledge incorporation. Evaluation results on two benchmarks indicate that our approach can outperform other state-of-the-art methods with less training data, and even in the zero-resource scenario, our approach still performs well.


Introduction
Neural dialogue systems have made rapid progress in recent years thanks to advances in sequence generation technology (Vinyals and Le, 2015; Vaswani et al., 2017). Though such neural models are able to reply with plausible responses given the dialogue history, people can still feel a clear gap when they converse with chatbots, compared with conversation with humans. To bridge the gap and generate fluent and informative responses, a number of approaches have been proposed that leverage external knowledge. Knowledge-grounded dialogue is the task of generating an informative response based on both the dialogue history and a collection of external knowledge (Dinan et al., 2019). The forms of knowledge are diverse, and in this work, we focus only on knowledge in the form of unstructured documents.
Generally, it is difficult to construct large-scale conversations that are naturally grounded on documents for learning a response generation model (Zhao et al., 2020a), and most previous methods (Li et al., 2019; Kim et al., 2020; Dinan et al., 2019) perform poorly when transferred to a new domain with limited training samples. There is therefore growing interest in low-resource dialogue response generation, which aims to leverage past experience to improve performance with limited labeled training examples from the target corpus.
To address this issue, we propose to absorb useful information from other easily accessible heterogeneous datasets to enhance the performance of the knowledge-based dialogue model under the low-resource setting. Based on this assumption, we propose a novel Three-Stage Learning Framework (TSLF). TSLF divides the parameters of a model into dialogue-related and knowledge-integration-related groups. In the first stage, we use supervised learning to pre-train the dialogue-related parameters on general dialogues (e.g., online forum comments), and perform domain-adaptive pre-training (Gururangan et al., 2020) to initialize the knowledge-related parameters on an unlabeled knowledge base (e.g., items in Wikipedia). In the second stage, inspired by distant supervision in relation extraction (Mintz et al., 2009), we match a set of pseudo-knowledge to each ungrounded dialogue to construct a lower-quality knowledge-grounded dialogue dataset, and further co-pretrain the above two groups of parameters on this dataset. In the third stage, the trained model is fine-tuned on the target low-resource dataset. The flow of TSLF is shown in Figure 1.
In order to better cooperate with the disentangled learning mechanism in TSLF, we devise the Knowledge-Aware Transformer (KAT), a variant of the vanilla Transformer (Vaswani et al., 2017) whose parameters are decoupled, which facilitates the separate learning of dialogue generation and knowledge incorporation. As shown in Figure 2, besides the dialogue history, KAT also accepts a set of knowledge as additional input. KAT has a knowledge-aware decoder which obtains information from the dialogue context and the background documents through cross-attention and integrates them through a controller.
We conduct experiments on two knowledge-grounded dialogue generation benchmarks, Wizard-of-Wikipedia (Dinan et al., 2019) and CMU_DoG (Zhou et al., 2018). Evaluation results in terms of both automatic metrics and human judgment indicate that, using only about 1/4 of the training data on Wizard (1/16 on CMU_DoG), our approach outperforms competitive baselines that are learned from the full crowd-sourced training corpora. Even without using any training data of the target dataset, our method still performs well.
The contributions of this work are summarized as follows: (1) We propose a novel three-stage learning framework that leverages weakly supervised learning to help build a low-resource knowledge-grounded dialogue generation model; (2) We devise the knowledge-aware Transformer, a knowledge-grounded neural conversation model with a novel dynamic knowledge selection mechanism, which can fully exploit the external knowledge to generate fluent and informative dialogue responses; (3) Our KAT-TSLF achieves strong performance under the full-data, low-resource and even zero-resource scenarios.

Approach
Low-resource knowledge-grounded dialogue generation is a task that requires a method to learn from experience $E$, which consists of direct experience $E_d$ containing limited monolingual context-knowledge-response triples and indirect experience $E_i$, to improve the performance in response generation measured by the evaluation metric $P$. The direct experience $E_d$ refers to the training samples of the target corpus $D_l = \{(U_i, K_i, Y_i)\}_{i=1}^{m_1}$ ($U_i$ is the dialogue context, $Y_i$ is the response, and $K_i = \{K_j\}_{j=1}^{s}$ is the set of external knowledge documents of the $i$-th sample), which are under low-resource settings. In this work, we consider $E_i$ as a large-scale ungrounded dialogue dataset $D_d = \{(U_i, Y_i)\}_{i=1}^{m_2}$, a knowledge base $D_k = \{K_i\}_{i=1}^{m_3}$ ($m_2, m_3 \gg m_1$) and a pretrained language model, which are easy to obtain. In the following, we first introduce our KAT, and then show how to train it from coarse to fine under our TSLF.

Knowledge-Aware Transformer
KAT accepts $U$ and $K = \{K_i\}_{i=1}^{s}$ as inputs and generates a response $\hat{Y}$. It consists of three components: a dialogue context encoder (DE) to encode $U$, a knowledge encoder (KE) to encode $K$, and a decoder to incorporate the dialogue history, dynamically select knowledge and generate the response. The architecture of KAT is shown in Figure 2.

Encoder
We define DE as a Transformer encoder, and its output is represented as $U \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the hidden state dimension. Similarly, KE is defined as another Transformer encoder, and it encodes each document individually. KE is followed by a concatenation operation that concatenates all document representations: $K = [K_1; \ldots; K_s] \in \mathbb{R}^{sz \times d}$, where $K_i \in \mathbb{R}^{z \times d}$ is the output of KE for the $i$-th document and $z$ is the sequence length of each document. $K$ and $U$ are used as the input of the decoder.

Knowledge-Aware Decoder
Generally, not all knowledge in $K$ contributes to the generation of the response, so the model should have the ability to select knowledge. Different from Dinan et al. (2019) and Kim et al. (2020), who perform knowledge selection in the encoding phase (or in a pipeline), we leave it to the decoding phase. Based on the Transformer decoder, we propose a cross-attention-based decoder which can select knowledge dynamically and generate informative responses.
Knowledge Integration Block (KIB) As shown in the right part of Figure 2, we add a new block after the dialogue history attention block in each Transformer decoder layer. It takes the output of the previous block as the query and the memory from $K$ as the key and value. The output of this block is obtained by the multi-head attention mechanism (Vaswani et al., 2017). During decoding, KIB can dynamically select different knowledge according to the dialogue context and the tokens that have been generated up to the current time step.
Controller To control the knowledge and context contributions in each layer, we add a gate after KIB. Denote $h_k$ as the output of KIB and $h_c$ as the residual from the previous block. The output of the controller is a gated combination of $h_k$ and $h_c$, where the gate is computed with a learnable parameter $w \in \mathbb{R}^{2d}$ and the sigmoid function $\sigma$.
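To make the interaction between KIB and the controller concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it uses a single attention head without learned projections, and the gate form `g = sigmoid(w · [h_k; h_c])` with output `g * h_k + (1 - g) * h_c` is an assumption that is merely consistent with the description of a gate parameterized by $w \in \mathbb{R}^{2d}$. The function names `knowledge_attention` and `controller` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def knowledge_attention(query, memory):
    """Single-head scaled dot-product attention of one decoder state
    (the query) over the knowledge memory K (one row per knowledge token).
    Simplification: no learned projections and no multiple heads."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, row)) / math.sqrt(d) for row in memory]
    weights = softmax(scores)
    h_k = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(d)]
    return h_k, weights


def controller(h_k, h_c, w):
    """ASSUMED gate form: g = sigmoid(w . [h_k; h_c]),
    output = g * h_k + (1 - g) * h_c. w has length 2d."""
    z = sum(wi * xi for wi, xi in zip(w, h_k + h_c))
    g = 1.0 / (1.0 + math.exp(-z))
    out = [g * a + (1.0 - g) * b for a, b in zip(h_k, h_c)]
    return out, g


# Toy example with d = 2 and three knowledge token states.
h_c = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_k, weights = knowledge_attention(h_c, memory)
out, g = controller(h_k, h_c, w=[0.0, 0.0, 0.0, 0.0])
```

With an all-zero `w` the gate is exactly 0.5, so knowledge and context contribute equally; training `w` would let each layer shift this balance.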

Three-Stage Learning Framework
For further discussion, we denote $\theta_d$, $\theta_k$ and $\theta_a$ as the learnable parameters of the green, yellow and pink parts in Figure 2, respectively. $\theta_d$ is related to context encoding and response generation, $\theta_k$ is related to knowledge representation and integration, and these two parts are disentangled. In order to benefit from a wealth of heterogeneous corpora, we propose a three-stage learning framework. In TSLF, we first initialize $\theta_d$ and $\theta_k$ in a decoupled scheme by training on ungrounded dialogues and unstructured knowledge documents respectively, then co-optimize them with $\theta_a$ by weakly supervised learning, and finally transfer KAT to the target low-resource dataset. An illustration of TSLF is shown in Figure 1.

Stage I
We choose the state-of-the-art Transformer-based encoder-decoder model BART (Lewis et al., 2020) as the backbone and pre-train it on $D_d$ with the dialogue response generation task. Besides, inspired by Gururangan et al. (2020), we conduct domain-adaptive pre-training on unlabeled knowledge documents to improve the knowledge representation ability. Specifically, 15% of the tokens in a text $K$ are replaced with <mask> or noise words, and another Transformer tries to rebuild $K$ from $\tilde{K}$, where $\tilde{K}$ is the corrupted $K$. We take the encoder and the cross-attention block in each decoder layer from this Transformer ($\theta_k^{+}$) and initialize $\theta_k$ with them.
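The corruption step used for domain-adaptive pre-training can be sketched as follows. This is an illustration only: the text states that 15% of tokens are replaced with <mask> or noise words, but the split between the two is not given, so the `noise_prob` parameter, the `noise_vocab` list and the function name `corrupt` are assumptions introduced here.

```python
import random

MASK = "<mask>"


def corrupt(tokens, noise_vocab, ratio=0.15, noise_prob=0.2, seed=0):
    """Return a corrupted copy of `tokens` and the corrupted positions.

    About `ratio` of the positions are selected; each selected token is
    replaced by a random noise word with probability `noise_prob`
    (ASSUMED value) and by the mask token otherwise.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_corrupt = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = rng.sample(range(len(tokens)), n_corrupt)
    corrupted = list(tokens)
    for pos in positions:
        if rng.random() < noise_prob:
            corrupted[pos] = rng.choice(noise_vocab)
        else:
            corrupted[pos] = MASK
    return corrupted, sorted(positions)


# Toy document of 20 tokens; the reconstruction target would be `tokens`.
tokens = [f"w{i}" for i in range(20)]
noise_vocab = ["alpha", "beta", "gamma"]
corrupted, positions = corrupt(tokens, noise_vocab)
```

A denoising model would then be trained to map `corrupted` back to `tokens`.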
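Algorithm 1 Construction of $D_p$
Input: ungrounded dialogues $D_d$, documents $D_k$, threshold $\gamma$ and number of negative samples $o$
Output: $D_p$
for each context-response pair in $D_d$ do
    retrieve the most similar document from $D_k$ with $I$ and obtain its score
    if score > $\gamma$ then
        for each of the $o$ negative samples do
            add a negative document to the pseudo-knowledge set
        end for
        add the dialogue with its pseudo-knowledge set to $D_p$
    end if
end for
return $D_p$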

Stage II
In stage I, $\theta_d$ and $\theta_k$ are trained separately, and the connection between knowledge and dialogue has not yet been established. If KAT is fine-tuned directly on the low-resource dataset $D_l$, it may cause inconsistency problems, so we add a warm-up process.
Intuitively, responses from humans carry clues to the relevance of the knowledge candidates (Zhao et al., 2020b), so the knowledge document that promotes the flow of dialogue usually has a high textual similarity with the response. Based on this assumption, we construct a set of pseudo-knowledge for some dialogues in $D_d$ to form a new weak supervision dataset $D_p$ according to Algorithm 1.
$I(\text{query}, \text{documents})$ retrieves the document with the highest similarity to the query (e.g., measured by TF-IDF or BM25). Context-response pairs of low quality, i.e., those whose best retrieval score does not exceed the threshold $\gamma$, are removed. In knowledge-grounded dialogue corpora, only a few documents in the knowledge pool are valuable, and the others are noise. The negative samples are designed to simulate this situation and make the distribution of knowledge in $D_p$ closer to that of the target dataset.
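The construction of $D_p$ can be sketched in Python as below. Several details are assumptions made for illustration: the response is used as the retrieval query, a toy token-overlap score stands in for BM25, and negative samples are drawn uniformly at random from the remaining documents. The names `overlap_score` and `build_pseudo_grounded` are hypothetical.

```python
import random


def overlap_score(query, document):
    """Toy similarity: number of distinct shared lowercase tokens.
    Stand-in for a real retrieval score such as BM25."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def build_pseudo_grounded(dialogues, documents, gamma, num_neg, seed=0):
    """Build weakly supervised (context, knowledge set, response) triples.

    A dialogue is kept only if its best-matching document scores above
    `gamma`; `num_neg` randomly drawn documents are added as noise.
    """
    rng = random.Random(seed)
    pseudo = []
    for context, response in dialogues:
        scores = [overlap_score(response, doc) for doc in documents]
        best = max(range(len(documents)), key=scores.__getitem__)
        if scores[best] <= gamma:
            continue  # low-quality pair: no document is similar enough
        others = [i for i in range(len(documents)) if i != best]
        negatives = rng.sample(others, min(num_neg, len(others)))
        knowledge = [documents[best]] + [documents[i] for i in negatives]
        rng.shuffle(knowledge)  # do not leak the position of the match
        pseudo.append((context, knowledge, response))
    return pseudo


documents = [
    "the eiffel tower is in paris france",
    "bananas are rich in potassium",
    "python is a programming language",
    "the moon orbits the earth",
]
dialogues = [
    ("where is the eiffel tower ?", "the eiffel tower is in paris"),
    ("hi", "lol ok"),
]
pseudo = build_pseudo_grounded(dialogues, documents, gamma=3, num_neg=2)
```

In this toy run the second dialogue is dropped because no document overlaps with its response, while the first is kept together with one matched document and two noise documents.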
We perform weakly supervised learning on $D_p$ to warm up KAT, co-optimizing $\theta_d$, $\theta_k$ and $\theta_a$.

Stage III
After warming up on $D_p$, KAT is fine-tuned on the target low-resource dataset $D_l$. Without fine-tuning, KAT can also be applied directly to zero-resource response generation.
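The three stages can be summarized by their training objectives. The block below is a sketch that assumes standard negative log-likelihood training at every stage; the loss names $\mathcal{L}_{\mathrm{I}}^{d}$, $\mathcal{L}_{\mathrm{I}}^{k}$, $\mathcal{L}_{\mathrm{II}}$ and $\mathcal{L}_{\mathrm{III}}$ are introduced here for illustration, $\tilde{K}$ denotes the corrupted version of a document $K$, and the exact form used in training is an assumption.

```latex
\begin{align}
\mathcal{L}_{\mathrm{I}}^{d}(\theta_d)
  &= -\sum_{(U, Y) \in D_d} \log p(Y \mid U; \theta_d) \\
\mathcal{L}_{\mathrm{I}}^{k}(\theta_k^{+})
  &= -\sum_{K \in D_k} \log p(K \mid \tilde{K}; \theta_k^{+}) \\
\mathcal{L}_{\mathrm{II}}(\theta_d, \theta_k, \theta_a)
  &= -\sum_{(U, K, Y) \in D_p} \log p(Y \mid U, K; \theta_d, \theta_k, \theta_a) \\
\mathcal{L}_{\mathrm{III}}(\theta_d, \theta_k, \theta_a)
  &= -\sum_{(U, K, Y) \in D_l} \log p(Y \mid U, K; \theta_d, \theta_k, \theta_a)
\end{align}
```

Under this reading, stages II and III share the same objective and differ only in the dataset: the automatically constructed $D_p$ for warm-up and the target corpus $D_l$ for fine-tuning.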

Datasets and Evaluation Methods
We conduct extensive experiments on two public English knowledge-grounded datasets: Wizard-of-Wikipedia (Dinan et al., 2019) and CMU_DoG (Zhou et al., 2018). Wizard-of-Wikipedia is a chit-chat dataset between two agents, and the two participants are not quite symmetric: one plays the role of a knowledgeable expert (which we refer to as the wizard) while the other is a curious learner (the apprentice). Each wizard turn is associated with ∼60 sentences retrieved from Wikipedia, each containing ∼30 words, and most of them are noise. The test set is split into two subsets, test seen and test unseen. The difference between the two is that the former contains some topics that overlap with the training set. CMU_DoG also contains conversations between two workers who know the background documents and try to discuss the content in depth. Different from Wizard-of-Wikipedia, which spans multiple topics, CMU_DoG mainly focuses on film reviews.
Reddit Conversation Corpus is a large-scale open-domain dialogue corpus cleaned by Dziri et al. (2018), which consists of ∼15M samples for training and ∼0.8M samples for validation. Following Zhao et al. (2020a) and Li et al. (2020), we merge the training and validation data of RedditCC as $D_d$. Besides, we split ∼0.5M Wikipedia articles provided by ParlAI (Miller et al., 2017) into ∼6.6M sentences as $D_k$. The information retrieval function $I$ mentioned in Sec. 2.2.2 is implemented by Apache Lucene with the BM25 algorithm, and the size of $D_p$ is ∼0.1M. $\gamma$ and $o$ are set to 16.4 and 39, respectively.
Following the common practice in evaluating open-domain dialogue generation, we choose perplexity (PPL), corpus-level BLEU (Papineni et al., 2002), sentence-level ROUGE (Lin, 2004) and corpus-level DISTINCT (Li et al., 2016) as automatic metrics. A higher DISTINCT score means the response has a larger vocabulary that could express more information. BLEU is computed with the NLTK library (Bird, 2006) and ROUGE is calculated with the code published with Kim et al. (2020). Besides quantitative evaluation, we also recruit three human annotators to do qualitative analysis on response quality. For each dataset, we randomly select 100 samples, and each sample contains the conversation history, response, and external knowledge set (for Wizard-of-Wikipedia, we only provide the ground-truth knowledge). The annotators then judge the quality of the responses from three aspects, namely context coherence, language fluency and knowledge relevance, and assign a score in {0, 1, 2} to each response for each aspect. Each response receives 3 scores per aspect, and the agreement among the annotators is measured via Fleiss' kappa (Fleiss, 1971).
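Two of the measures above are simple enough to sketch directly. The code below shows a common formulation of corpus-level DISTINCT-n (unique n-grams divided by total n-grams over all responses) and the standard Fleiss' kappa formula. It is an illustration rather than the evaluation script used for the reported numbers; tokenization and any normalization details are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def distinct_n(responses, n):
    """Corpus-level DISTINCT-n: unique n-grams / total n-grams.
    `responses` is a list of token lists."""
    ngrams = Counter()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


def fleiss_kappa(table):
    """Fleiss' kappa. table[i][j] is the number of annotators who gave
    item i the score j; every row must sum to the same number of raters.
    Undefined (division by zero) if all ratings fall in one category."""
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


responses = [["i", "like", "tea"], ["i", "like", "coffee"]]
dist1 = distinct_n(responses, 1)
dist2 = distinct_n(responses, 2)

# Three annotators, scores in {0, 1, 2}, perfect agreement on three items.
kappa = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
```

For the toy responses there are 4 unique unigrams out of 6 and 3 unique bigrams out of 4, and the perfectly consistent rating table yields a kappa of 1.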

Baselines
We compare our approach with the following baselines: (1) ITDD: a Transformer-based architecture which incrementally represents multi-turn dialogues and knowledge, and conducts response decoding in two passes; (2) BART$_{cat}$: a simple BART-based model that takes the concatenation of the dialogue context and all knowledge as the input of BART for response generation. BART sets a constraint on the maximum number of tokens it can handle, and we directly truncate the text that exceeds the length limit; (3) BART$_{skt}$: SKT is a variational model that builds on prior work by introducing BERT and considering the knowledge selection history in multi-turn dialogue (Kim et al., 2020). We feed the knowledge candidate selected by SKT to BART for response generation. It should be noted that training SKT requires human labels that indicate the ground-truth knowledge, which are crucial to the performance of the model. For a fair comparison, we use $I$ to reselect the knowledge label; (4) DRD: another low-resource dialogue model which devises a disentangled response decoder with a copy mechanism (See et al., 2017) and uses a two-stage framework to learn it (Zhao et al., 2020a). DRD is not open source, so we cannot make a very detailed comparison with it; (5) ZRKGC: a double latent variable model that achieves state-of-the-art performance in zero-resource knowledge-grounded dialogue generation. ZRKGC is based on UNILM (Dong et al., 2019) with 110M parameters, whose performance is close to BART, so we do not replace the backbone of ZRKGC.
Figure 3: Comparison with DRD in the low-resource setting (legend: Ours w/o BART, DRD). DRD does not provide results when the training data is less than 1/16 (1/8 in CMU_DoG). In order to save space, we merge the Wizard seen and unseen into one subfigure.

Implementation Details
The knowledge pool of the target dataset is usually very large (e.g., ∼60 sentences in Wizard). To reduce the time overhead, following Kim et al. (2020), we only keep the first 40 sentences. We use the base version of BART with 139M parameters in our work, and the number of parameters of KAT is 196M. The batch size in stages I, II and III is 2048, 128 and 16, respectively. The maximum sequence length of the source and target is 256 and 64, respectively. All models are optimized with AdamW (Loshchilov and Hutter, 2017) with a learning rate of 5e-5 for 3 epochs. We employ beam search in response decoding (with the number of beams ranging from 1 to 3) as implemented by Wolf et al. (2020).

Evaluation Results
Tables 1, 2 and 3 report the evaluation results on automatic metrics, and we have the following observations: (1) In the full-data scenario, KAT achieves state-of-the-art performance without using any additional corpora, which means that KAT itself is an excellent dialogue model. Besides, additional resources are unnecessary when there is abundant training data, so TSLF has little effect in this setting; (2) KAT-TSLF achieves performance comparable to BART$_{cat}$ and BART$_{skt}$ even though the baselines have leveraged all training data, while our model is only learned with 1/4 of the training data on Wizard (1/16 on CMU_DoG). We compare the low-resource performance with DRD, and the results are shown in Figure 3. For a fair comparison, we remove the pre-trained language model and reduce the number of model parameters. We can see that KAT-TSLF outperforms DRD (especially on CMU_DoG). The comparison with BART$_{cat}$ is supplemented in Figure 4; (3) Although our TSLF is mainly designed for low-resource scenarios, under the zero-resource setting (i.e., without stage III), the performance of KAT-TSLF also surpasses ZRKGC in most evaluation metrics; (4) Responses generated by KAT have higher DIST-n, which means that KAT can better obtain information from multiple pieces of knowledge and generate more diverse texts. Table 4 reports the human evaluation results. We observe that responses from our KAT-TSLF are more fluent and more contextually coherent than those from BART$_{skt}$ and ZRKGC. Compared with our low-resource model, SKT has stronger knowledge relevance in the case of full data, thanks to its well-designed knowledge selection module.
Table 4: Human evaluation results on Wizard-of-Wikipedia and CMU_DoG. CC, LF and KR mark context coherence, language fluency and knowledge relevance, respectively. In the zero-resource setting, our KAT-TSLF outperforms ZRKGC. Besides, our model surpasses BART$_{skt}$ (full data) in most metrics with only 1/8 of the training data.

Ablation Study
We conduct ablation experiments on Wizard and CMU_DoG, and the results are shown in Figure 4. To verify the effect of TSLF, we first remove stage I, stage II, and both stages I and II, respectively. Inserting a new module into an already well-trained large-scale pre-trained language model causes inconsistency problems, which require a lot of data to reconcile, so after removing stage II or both stages I and II, the performance of KAT in the low-resource setting drops sharply. Although the quality of the automatically constructed warm-up dataset $D_p$ is lower than that of the target dataset $D_l$, it still helps to establish the connection between the knowledge representation component and the dialogue component. Besides, when we do not pre-train $\theta_k$ on unlabeled documents, the results drop slightly, which demonstrates that it is still helpful to tailor a pre-trained model to the domain of a target task. In addition, replacing negative sampling with top-k retrieval increases the inconsistency with the knowledge distribution of the target dataset, leading to performance degradation. Moreover, the controller also has an effect on the generalization of the model. It can help KAT quickly adapt to new domains by adjusting the proportion of knowledge and context in the response. In order to improve the generalization performance with limited training data, some works (Chen and Shuai, 2021; Zhao et al., 2020a) fix most of the parameters during fine-tuning. We also tried freezing the knowledge encoder and context encoder in stage III or in both stages II and III, and the results show that the performance does not improve, indicating that with the help of stage II, our model hardly overfits.
In order to verify the effect of our TSLF on other models, we combine BART$_{cat}$ with TSLF. Since the parameters of BART are tightly coupled, we can only apply stage II to it. Experimental results show that the performance is improved significantly under the low-resource setting.

Discussions
Case Study Table 5 shows a case from Wizard, from which we can see that the response from our model with zero data not only smoothly captures the ground-truth knowledge (highlighted in blue), but also expands the topic with proper pieces of other knowledge (highlighted in yellow). ZRKGC generates sentences that are inconsistent with the facts. Although BART$_{skt}$ chooses the correct knowledge, its narrative is too straightforward, and there is repetition. We show some other cases in the supplementary material.
Comparison with DRD If we ignore the details, DRD is actually a special case of our method that skips stage II. During pre-training, DRD completely separates dialogue-related components and knowledge-representation-related components, which makes it difficult to effectively integrate dialogue and knowledge with only a small number of samples during fine-tuning. So when the training data is extremely small, DRD can hardly work. Besides, in order to prevent overfitting, DRD has to limit the number of parameters of the knowledge integration component and fix the other parameters when fine-tuning, which limits the performance of the model. In addition, the complex model structure makes it difficult for DRD to use pre-trained language models.
KAT vs. BART$_{cat}$ BART (as well as most other pre-trained language models) has a limit on the maximum number of input tokens, so useful knowledge is likely to be truncated. For example, there are about 60 external documents per sample in Wizard, and about 40 documents will be truncated. In theory, KAT can accept an unlimited amount of knowledge, so this should be one of the reasons why KAT performs better than BART$_{cat}$.

Related Work
Open-domain end-to-end dialogue response generation is inspired by the success of applying neural sequence-to-sequence models to machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). Very recently, in order to generate fluent, coherent and informative responses, many approaches have been proposed that introduce external background documents (Ghazvininejad et al., 2018; Yavuz et al., 2019; Li et al., 2019; Lin et al., 2020). Besides documents (Dinan et al., 2019; Zhou et al., 2018), there are many forms of knowledge, such as images (Huber et al., 2018) and triples in knowledge graphs (Tuan et al., 2019). Dinan et al. (2019) propose to divide knowledge-grounded dialogue into two steps: knowledge selection and dialogue generation. PostKS, SKT (Kim et al., 2020), PIPM (Chen et al., 2020) and SKT-KG (Zhan et al., 2021) use the prior and posterior distributions of knowledge to improve the accuracy of knowledge selection. Zhao et al. (2020b) devise a reinforcement learning method to train a knowledge selector without ground-truth knowledge labels. DeepCopy (Yavuz et al., 2019), ITDD and KIC (Lin et al., 2020) improve the structure of the decoder so that it can better integrate knowledge. Since knowledge-guided dialogue corpora need to be constructed through crowdsourcing, datasets such as Wizard-of-Wikipedia (Dinan et al., 2019) are relatively small. Zhao et al. (2020a) and Li et al. (2020) propose to conduct knowledge-grounded conversation under the low-resource and zero-resource settings, respectively. We do not compare with Lin et al. (2020) and Zhao et al. (2020b) since they did not release their entire source code.
Our three-stage learning framework is inspired by Zhao et al. (2020a), which uses ungrounded dialogues and unstructured documents to train a knowledge-grounded dialogue model that can work in low-resource situations. In addition, the design of stage II is inspired by the distant supervision technique in the relation extraction task (Mintz et al., 2009). The idea of KAT is also encouraged by the disentangled decoder (Raghu et al., 2019) and the recent breakthroughs in variants of Transformer (Hashemi et al., 2020; Izacard and Grave, 2020).

Conclusion
We study knowledge-grounded dialogue generation under a low-resource setting by proposing a three-stage learning framework and a knowledge-aware Transformer. Evaluation results on two benchmarks indicate that our model achieves state-of-the-art performance with less training data. Besides, KAT-TSLF exhibits good generalization ability in the zero-resource scenario.

Broader Impact
Incorporating knowledge into dialogue systems has been the pursuit of researchers in this field for many years. This kind of system will make AI dialogue more natural, and it will be more favored by people when the technology does not require a large amount of manually annotated data. More importantly, a knowledge-based dialogue system can fundamentally change the experience of human-machine dialogue, because the system can develop as the external knowledge base is updated. One day, people may be able to obtain effective information through simple conversations. However, every coin has two sides. In addition to the well-known problems caused by large pre-training datasets for end-to-end dialogue models, special knowledge bases that are deliberately tailored can also be used to make the generated dialogues biased, just as search engines inadvertently spread biased content created by someone. In order to prevent this technology from being abused, we look forward to more research effort on detecting fake, biased or offensive content. At the same time, we recommend that developers choose content carefully when building a knowledge base for a dialogue system. Good external knowledge can adjust the behavior of the dialogue model in the response process and help the model overcome the biases hidden in large-scale social media datasets.