A Pre-training Strategy for Zero-Resource Response Selection in Knowledge-Grounded Conversations

Recently, many studies have emerged towards building retrieval-based dialogue systems that can effectively leverage background knowledge (e.g., documents) when conversing with humans. However, it is non-trivial to collect large-scale dialogues that are naturally grounded on background documents, which hinders the effective and adequate training of knowledge selection and response matching. To overcome this challenge, we decompose the training of knowledge-grounded response selection into three tasks: 1) query-passage matching; 2) query-dialogue history matching; and 3) multi-turn response matching, and jointly learn all three tasks in a unified pre-trained language model. The former two tasks help the model with knowledge selection and comprehension, while the last task matches the proper response with the given query and background knowledge (dialogue history). In this way, the model learns to select relevant knowledge and distinguish proper responses with the help of ad-hoc retrieval corpora and a large number of ungrounded multi-turn dialogues. Experimental results on two benchmarks of knowledge-grounded response selection indicate that our model achieves performance comparable with several existing methods that rely on crowd-sourced data for training.


Introduction
Along with the very recent prosperity of artificial-intelligence-empowered conversation systems, many studies have focused on building human-computer dialogue systems (Wen et al., 2017; Zhang et al., 2020) with either retrieval-based methods (Wang et al., 2013; Wu et al., 2017; Whang et al., 2020) or generation-based methods (Serban et al., 2016; Zhang et al., 2020), both of which predict the response with only the given context. In fact, unlike a person who may associate the conversation with the background knowledge in his or her mind, a machine can only capture limited information from the query message itself. As a result, it is difficult for a machine to properly comprehend the query and to predict a proper response that makes the conversation more engaging.

* Equal Contribution. † Corresponding author: Rui Yan (ruiyan@ruc.edu.cn).
To bridge the knowledge gap between humans and machines, researchers have begun grounding dialogue agents with background knowledge (Zhang et al., 2018; Dinan et al., 2019; Li et al., 2020), and many impressive results have been obtained.
In this paper, we consider the response selection problem in knowledge-grounded conversation and specify the background knowledge as unstructured documents, which are common sources in practice. Given a conversation context and a set of knowledge entries, the task requires one 1) to select proper knowledge and grasp a good comprehension of the selected document materials (knowledge selection); and 2) to distinguish from a candidate pool the true response that is relevant to and consistent with both the conversation context and the background documents (response matching).
While there exist numerous knowledge documents on the Web, it is non-trivial to collect large-scale dialogues that are naturally grounded on these documents for training a neural response selection model, which hinders the effective and adequate training of knowledge selection and response matching. Although some benchmarks built upon crowd-sourcing have been released by recent works (Zhang et al., 2018; Dinan et al., 2019), their relatively small training sizes make it hard for dialogue models to generalize to other domains or topics. Thus, in this work, we focus on a more challenging and practical scenario: learning a knowledge-grounded conversation agent without any knowledge-grounded dialogue data, known as the zero-resource setting.
Since knowledge-grounded dialogues are unavailable for training, learning a grounded response selection model raises greater challenges. Fortunately, there exist large amounts of unstructured knowledge (e.g., web pages or wiki articles), passage search datasets (e.g., query-passage pairs from ad-hoc retrieval tasks) (Khattab and Zaharia, 2020), and multi-turn dialogues (e.g., context-response pairs collected from Reddit) (Henderson et al., 2019), which can benefit the learning of knowledge comprehension, knowledge selection, and response prediction respectively. Besides, in multi-turn dialogues, the background knowledge and the conversation history (excluding the latest query) are symmetric in terms of the information they convey, so we assume that the dialogue history can be regarded as another form of background knowledge for response prediction.
Based on the above intuition, in this paper we decompose the training of the grounded response selection task into several sub-tasks and jointly learn all of them in a unified model. To take advantage of the recent breakthroughs in pre-training for natural language tasks, we build the grounded response matching model on the basis of pre-trained language models (PLMs) (Devlin et al., 2019; Yang et al., 2019), which are trained with large-scale unstructured documents from the web. On this basis, we further train the PLM jointly with a query-passage matching task, a query-dialogue history matching task, and a multi-turn response matching task. The former two tasks help the model not only with knowledge selection but also with knowledge (and dialogue history) comprehension, while the last task matches the proper response with the given query and background knowledge (dialogue history). In this way, the model learns to select relevant knowledge and distinguish proper responses with the help of a large number of ungrounded dialogues and ad-hoc retrieval corpora. During testing, we first utilize the trained model to select proper knowledge, and then feed the query, dialogue history, selected knowledge, and response candidate into our model to calculate the final matching degree. In particular, we design two strategies to compute the final matching score.
In the first strategy, we directly concatenate the selected knowledge and the dialogue history into a long sequence of background knowledge and feed it into the model. In the second strategy, we first compute the matching degree between each query-knowledge pair and the response candidates, and then integrate all matching scores.
We conduct experiments with benchmarks of knowledge-grounded dialogue that are constructed by crowd-sourcing, such as the Wizard-of-Wikipedia Corpus (Dinan et al., 2019) and the CMU DoG Corpus (Zhou et al., 2018a). Evaluation results indicate that our model achieves comparable performance on knowledge selection and response selection with several existing models trained on crowd-sourced benchmarks.
Our contributions are summarized as follows:
• To the best of our knowledge, this is the first exploration of knowledge-grounded response selection under the zero-resource setting.
• We propose decomposing the training of grounded response selection models into several sub-tasks, so as to empower the model in knowledge selection and response matching through these tasks.
• We achieve response selection performance comparable with several existing models learned from crowd-sourced training sets.

Related Work
Early studies of retrieval-based dialogue focus on single-turn response selection where the input of a matching model is a message-response pair (Wang et al., 2013;Ji et al., 2014;Wang et al., 2015).
Recently, researchers have paid more attention to multi-turn context-response matching and usually adopt the representation-matching-aggregation paradigm to build the model. Representative methods include the dual-LSTM model (Lowe et al., 2015), the sequential matching network (SMN) (Wu et al., 2017), the deep attention matching network (DAM) (Zhou et al., 2018b), the interaction-over-interaction network (IoI), and the multi-hop selector network (MSN) (Yuan et al., 2019). More recently, pre-trained language models (Devlin et al., 2019; Yang et al., 2019) have shown significant benefits for various NLP tasks, and some researchers have tried to apply them to multi-turn response selection. Vig and Ramea (2019) exploit BERT to represent each utterance-response pair and fuse these representations to calculate the matching score; Whang et al. (2020) and Xu et al. (2020) treat the context as a long sequence and conduct context-response matching with BERT. Besides, Gu et al. (2020a) integrate speaker embeddings into BERT to improve the utterance representation in multi-turn dialogue.
To bridge the knowledge gap between humans and machines, researchers have investigated grounding dialogue agents with unstructured background knowledge (Ghazvininejad et al., 2018; Zhang et al., 2018; Dinan et al., 2019). For example, Zhang et al. (2018) build a persona-based conversation dataset that employs the interlocutor's profile as the background knowledge; Zhou et al. (2018a) publish a dataset where conversations are grounded in articles about popular movies; Dinan et al. (2019) release another document-grounded dataset with Wiki articles covering a wide range of topics. Meanwhile, several retrieval-based knowledge-grounded dialogue models have been proposed, such as the document-grounded matching network (DGMN) and the dually interactive matching network (DIM) (Gu et al., 2019), which let the dialogue context and all knowledge entries interact with the response candidate respectively via the cross-attention mechanism. Gu et al. (2020b) further propose to pre-filter the context and the knowledge and then use the filtered context and knowledge to perform matching with the response. Besides, with the help of the gold knowledge index annotated by human wizards, Dinan et al. (2019) consider jointly learning knowledge selection and response matching in a multi-task manner, or training a two-stage model.

Model
In this section, we first formalize the knowledge-grounded response matching problem, and then introduce our method, from preliminaries on response matching with PLMs to the details of the three pre-training tasks.

Problem Formalization
We first describe a standard knowledge-grounded response selection task such as Wizard-of-Wikipedia. Suppose that we have a knowledge-grounded dialogue dataset $D = \{(k_i, c_i, r_i, y_i)\}_{i=1}^{N}$, where $k_i = \{p_1, p_2, \ldots, p_{l_k}\}$ represents a collection of knowledge with $p_j$ the $j$-th knowledge entry (a.k.a. passage) and $l_k$ the number of entries; $c_i = \{u_1, u_2, \ldots, u_{l_c}\}$ denotes a multi-turn dialogue context with $u_j$ the $j$-th turn and $l_c$ the number of dialogue turns. It should be noted that in this paper we denote the latest turn $u_{l_c}$ as the dialogue query $q_i$, and the dialogue context excluding the query as the history $h_i = c_i \setminus \{q_i\}$. $r_i$ stands for a candidate response; $y_i = 1$ indicates that $r_i$ is a proper response for $c_i$ and $k_i$, otherwise $y_i = 0$; and $N$ is the number of samples in the dataset. The goal of knowledge-grounded dialogue is to learn a matching model $g(k, c, r)$ from $D$ such that for any new $(k, c, r)$, $g(k, c, r)$ returns the matching degree between $r$ and $(k, c)$. Finally, one can collect the matching scores of a series of candidate responses and conduct response ranking.
Zero-resource grounded response selection is then formally defined as follows. There is a standard multi-turn dialogue dataset $D_c = \{(c_i, r_i, y_i)\}$ and an ad-hoc retrieval dataset $D_p = \{(q_i, p_i, z_i)\}$, where $q_i$ is a query and $p_i$ stands for a candidate passage; $z_i = 1$ indicates that $p_i$ is a relevant passage for $q_i$, otherwise $z_i = 0$. Our goal is to learn a model $g(k, h, q, r)$ from $D_c$ and $D_p$ such that for any new input $(k, h, q, r)$, the model can select proper knowledge $\hat{k}$ from $k$ and calculate the matching degree between $r$ and $(\hat{k}, q, h)$.

Preliminary: Response Matching with PLMs
Pre-trained language models have been widely used in many NLP tasks due to their strong ability of language representation and understanding. In this work, we consider building a knowledge-grounded response matching model with BERT. Specifically, given a query $q$, a dialogue history $h = \{u_1, u_2, \ldots, u_{n_h}\}$ where $u_i$ is the $i$-th turn in the history, and a response candidate $r = \{r_1, r_2, \ldots, r_{l_r}\}$ with $l_r$ words, we concatenate all sequences into a single consecutive token sequence with special tokens:
$$x = [\mathrm{CLS}]\, u_1\, [\mathrm{SEP}] \ldots u_{n_h}\, [\mathrm{SEP}]\, q\, [\mathrm{SEP}]\, r\, [\mathrm{SEP}],$$
where $[\mathrm{CLS}]$ and $[\mathrm{SEP}]$ are the classification symbol and the segment separation symbol respectively.
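For illustration, the concatenation above can be sketched as follows. The whitespace tokenizer, the helper name, and the segment assignment (0 for the history/query side, 1 for the response) are simplifications for this sketch, not the actual WordPiece pipeline.

```python
# Sketch of the input construction: dialogue history turns, the query, and
# the response candidate are joined into one token sequence with [CLS]/[SEP]
# markers, plus segment ids. Token strings stand in for WordPiece ids.

def build_input(history_turns, query, response):
    """Return (tokens, segment_ids) for one history/query/response triple."""
    tokens, segments = ["[CLS]"], [0]
    for turn in history_turns + [query]:
        turn_tokens = turn.split()
        tokens.extend(turn_tokens + ["[SEP]"])
        segments.extend([0] * (len(turn_tokens) + 1))
    resp_tokens = response.split()
    tokens.extend(resp_tokens + ["[SEP]"])
    segments.extend([1] * (len(resp_tokens) + 1))
    return tokens, segments

tokens, segments = build_input(["hi there", "hello"], "how are you", "fine thanks")
```

A real implementation would map tokens to vocabulary ids and add position embeddings inside BERT; only the sequence layout is shown here.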
For each token in x, BERT uses a summation of three kinds of embeddings, including WordPiece embedding (Wu et al., 2016), segment embedding, and position embedding.
Then, the embedding sequence of $x$ is fed into BERT, giving us the contextualized embedding sequence $\{E_{[\mathrm{CLS}]}, E_2, \ldots, E_{l_x}\}$. $E_{[\mathrm{CLS}]}$ is an aggregated representation vector that contains the semantic interaction information between the query, history, and response candidate. Finally, $E_{[\mathrm{CLS}]}$ is fed into a non-linear layer to calculate the final matching score:
$$g(h, q, r) = \sigma\big(W_2\, f(W_1 E_{[\mathrm{CLS}]} + b_1) + b_2\big),$$
where $W_{\{1,2\}}$ and $b_{\{1,2\}}$ are trainable parameters for the response selection task, $f$ is a non-linear activation, and $\sigma$ is the sigmoid function.
In knowledge-grounded dialogue, each dialogue is associated with a large collection of knowledge entries $k = \{p_1, p_2, \ldots, p_{l_k}\}$ (the scale of the knowledge referenced by each dialogue usually exceeds the input length limit of PLMs). The model is required to select $m$ ($m \geq 1$) knowledge entries based on the semantic relevance between the query and each entry, and then performs response matching with the query, the dialogue history, and the highly relevant knowledge. Specifically, we denote $\hat{k} = (\hat{p}_1, \ldots, \hat{p}_m)$ as the selected knowledge entries, and feed the resulting input sequence (selected knowledge, dialogue history, query, and response candidate) into BERT in the same manner.
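The selection step above can be sketched as a simple rank-and-truncate; the `score` argument is a stand-in for the trained BERT matcher, and the toy word-overlap scorer is only for illustration.

```python
# Minimal sketch of knowledge selection: rank all entries by the
# query-entry relevance score g(q, p) and keep the top m.

def select_knowledge(query, entries, score, m=2):
    """Return the m entries with the highest relevance score to the query."""
    ranked = sorted(entries, key=lambda p: score(query, p), reverse=True)
    return ranked[:m]

def overlap(q, p):
    # Toy relevance: word overlap with the query (the real model uses BERT).
    return len(set(q.split()) & set(p.split()))

picked = select_knowledge(
    "cats and dogs",
    ["about cats", "about dogs and cats", "weather"],
    overlap,
)
```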

Pre-training Strategies
On the basis of BERT, we further jointly train the model with three tasks: 1) query-passage matching; 2) query-dialogue history matching; and 3) multi-turn response matching. The former two tasks help the model with knowledge selection and knowledge (and dialogue history) comprehension, while the last task matches the proper response with the given query and background knowledge (dialogue history). In this way, the model learns to select relevant knowledge and distinguish the proper response, with the help of a large number of ungrounded dialogues and ad-hoc retrieval corpora.

Query-Passage Matching
Although there exists a huge amount of conversation data on social media, it is hard to collect sufficient dialogues that are naturally grounded on knowledge documents. Existing studies (Dinan et al., 2019) usually extract the relevant knowledge before response matching, or jointly train knowledge retrieval and response selection in a multi-task manner. However, both approaches need in-domain knowledge-grounded dialogue data (with gold knowledge labels) for training, making the models hard to generalize to new domains. Fortunately, the ad-hoc retrieval task (Harman, 2005; Khattab and Zaharia, 2020) in the information retrieval area provides a potential way to simulate the process of knowledge seeking. To take advantage of the parallel data in the ad-hoc retrieval task, we incorporate a query-passage matching task to help knowledge selection and knowledge comprehension in our task.
Given a query-passage pair $(q, p)$, we first concatenate the query $q$ and the passage $p$ into a single consecutive token sequence with special tokens separating them:
$$S^{qp} = [\mathrm{CLS}]\, w^p_1 \ldots w^p_{n_p}\, [\mathrm{SEP}]\, w^q_1 \ldots w^q_{n_q}\, [\mathrm{SEP}],$$
where $w^p_i$ and $w^q_j$ denote the $i$-th token of the knowledge entry $p$ and the $j$-th token of the query $q$ respectively. For each token in $S^{qp}$, the token, segment, and position embeddings are summed and fed into BERT. It is worth noting that here we set the segment embedding of the knowledge to be the same as that of the dialogue history. Finally, we feed the output representation of $[\mathrm{CLS}]$, $E^{qp}_{[\mathrm{CLS}]}$, into an MLP to obtain the final query-passage matching score $g(q, p)$. The loss function of each training sample for the query-passage matching task is defined as
$$\mathcal{L}_p = -\log \frac{\exp(g(q, p^+))}{\exp(g(q, p^+)) + \sum_{j=1}^{\delta_p} \exp(g(q, p^-_j))},$$
where $p^+$ stands for the positive passage for $q$, $p^-_j$ is the $j$-th negative passage, and $\delta_p$ is the number of negative passages.
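This contrastive objective, softmax cross-entropy over one positive and several sampled negatives, can be sketched numerically as follows; `matching_loss` is a hypothetical helper operating on raw scores rather than on the BERT model itself.

```python
import math

# Sketch of the per-sample matching loss: softmax cross-entropy over the
# positive score and delta sampled negative scores,
#   loss = -log( exp(g+) / (exp(g+) + sum_j exp(g-_j)) ).

def matching_loss(pos_score, neg_scores):
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

loss = matching_loss(2.0, [0.5, -1.0, 0.0])
```

The same form is reused for the query-history and response matching losses; raising the positive score relative to the negatives drives the loss toward zero.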

Query-Dialogue History Matching
In multi-turn dialogues, the conversation history (excluding the latest query) is a piece of supplementary information for the current query and can be regarded as another form of background knowledge during response matching. Besides, due to the natural sequential relationship between dialogue turns, the dialogue query usually shows strong semantic relevance with the previous turns in the dialogue history. Inspired by these characteristics, we design a query-dialogue history matching task with the multi-turn dialogue context, so as to enhance the capability of the model to comprehend the dialogue history given the dialogue query and to rank relevant passages with these pseudo query-passage pairs. Specifically, we first concatenate the dialogue history into a long sequence. The task requires the model to predict whether a query $q = \{w^q_1, \ldots, w^q_{n_q}\}$ and a dialogue history sequence $h = \{w^h_1, \ldots, w^h_{n_h}\}$ are consecutive and relevant. We concatenate the two sequences into a single consecutive sequence with $[\mathrm{SEP}]$ tokens:
$$S^{qh} = [\mathrm{CLS}]\, w^h_1 \ldots w^h_{n_h}\, [\mathrm{SEP}]\, w^q_1 \ldots w^q_{n_q}\, [\mathrm{SEP}].$$
For each word in $S^{qh}$, the token, segment, and position embeddings are summed and fed into BERT. Finally, we feed $E^{qh}_{[\mathrm{CLS}]}$ into an MLP to obtain the final query-history matching score $g(q, h)$. The loss function of each training sample for the query-history matching task is defined as
$$\mathcal{L}_h = -\log \frac{\exp(g(q, h^+))}{\exp(g(q, h^+)) + \sum_{j=1}^{\delta_h} \exp(g(q, h^-_j))},$$
where $h^+$ stands for the true dialogue history for $q$, $h^-_j$ is the $j$-th negative dialogue history randomly sampled from the training set, and $\delta_h$ is the number of sampled negative histories.

Multi-turn Response Matching
The above two tasks are designed to empower the model with knowledge or history comprehension and knowledge selection. In this task, we aim to train the model to match reasonable responses based on the dialogue history and query. Since we treat the dialogue history as a special form of background knowledge and the two share the same segment embeddings in the PLM, our model can acquire the ability to identify the proper response with either the dialogue history or the background knowledge through the multi-turn response matching task.
Specifically, we format the multi-turn dialogues as query-history-response triples and require the model to predict whether a response candidate $r = \{w^r_1, \ldots, w^r_{n_r}\}$ is appropriate for a given query and history. Similarly, we feed an embedding sequence, of which each entry is a summation of the token, segment, and position embeddings, into BERT. Finally, we feed $E^{hqr}_{[\mathrm{CLS}]}$ into an MLP to obtain the final response matching score $g(h, q, r)$.
The loss function of each training sample for the multi-turn response matching task is defined as
$$\mathcal{L}_r = -\log \frac{\exp(g(h, q, r^+))}{\exp(g(h, q, r^+)) + \sum_{j=1}^{\delta_r} \exp(g(h, q, r^-_j))},$$
where $r^+$ is the true response for the given $q$ and $h$, $r^-_j$ is the $j$-th negative response candidate randomly sampled from the training set, and $\delta_r$ is the number of negative response candidates.

Joint Learning
We adopt a multi-task learning manner and define the final objective function as
$$\mathcal{L} = \mathcal{L}_p + \mathcal{L}_h + \mathcal{L}_r.$$
In this way, all tasks are jointly learned so that the model can effectively leverage the two training corpora and learn to select relevant knowledge and distinguish the proper response.

Calculating Matching Score
After learning the model from $D_c$ and $D_p$, we first rank the knowledge entries $\{p_i\}_{i=1}^{l_k}$ according to $g(q, p_i)$ and then select the top $m$ entries $\{\hat{p}_1, \ldots, \hat{p}_m\}$ for the subsequent response matching process. We design two strategies to compute the final matching score $g(k, h, q, r)$. In the first strategy, we directly concatenate the selected knowledge and the dialogue history into a long sequence of background knowledge and feed it into the model to obtain the final matching score:
$$g(k, h, q, r) = g(\hat{p}_1 \oplus \ldots \oplus \hat{p}_m \oplus h,\ q,\ r), \qquad (9)$$
where $\oplus$ denotes the concatenation operation.
In the second strategy, we treat each selected knowledge entry and the dialogue history equally as background knowledge, and compute the matching degree between each piece of background knowledge, the query, and the response candidate with the trained model. Consequently, the matching score is defined as an integration of a set of knowledge-grounded response matching scores:
$$g(k, h, q, r) = g(h, q, r) + \max_{i \in \{1, \ldots, m\}} g(\hat{p}_i, q, r), \qquad (10)$$
where $m$ is the number of selected knowledge entries. We name the model with the two strategies PTKGC cat and PTKGC sep respectively. We compare the two strategies through empirical studies, as reported in the next section.
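The two inference-time strategies can be sketched as follows, assuming a trained matcher `g(background, query, response) -> float`; the toy word-overlap matcher stands in for the BERT model and is purely illustrative.

```python
# Sketch of the two scoring strategies: PTKGC_cat concatenates the selected
# knowledge with the history into one background sequence; PTKGC_sep scores
# each knowledge entry separately and aggregates with a max, plus the
# history-only score.

def score_cat(g, knowledge, history, query, response):
    background = " ".join(knowledge) + " " + history
    return g(background, query, response)

def score_sep(g, knowledge, history, query, response):
    return g(history, query, response) + max(g(p, query, response) for p in knowledge)

def toy_g(background, query, response):
    # Toy matcher: word overlap between background+query and the response.
    return len(set((background + " " + query).split()) & set(response.split()))

cat = score_cat(toy_g, ["a b", "c d"], "e", "f", "a c e")
sep = score_sep(toy_g, ["a b", "c d"], "e", "f", "a c e")
```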

Datasets and Evaluation Metrics
Training Set. We adopt the MS MARCO passage ranking dataset (Nguyen et al., 2016), built on Bing search, for the query-passage matching task. The dataset contains 8.8M passages gathered from Web pages in Bing's results to real-world queries, and each passage contains an average of 55 words. Each query is associated with sparse relevance judgments, with one (or very few) passages marked as relevant. The training set contains about 500k pairs of queries and relevant passages, and another 400M pairs of queries and passages that have not been marked as relevant, from which the negatives are sampled in our task.
For the query-dialogue history matching task and the multi-turn response matching task, we use the multi-turn dialogue corpus constructed from Reddit (Dziri et al., 2018). The dataset contains more than 15 million dialogues, and each dialogue has at least 3 utterances. After pre-processing, we randomly sample 2.28M/20K dialogues as the training/validation set. For each dialogue session, we regard the last turn as the response, the second-to-last as the query, and the rest as the positive dialogue history. Negative dialogue histories are randomly sampled from the whole dialogue set. On average, each dialogue contains 4.3 utterances, and the average utterance length is 42.5.
Test Set. We tested our proposed method on Wizard-of-Wikipedia (WoW) (Dinan et al., 2019) and CMU DoG (Zhou et al., 2018a). Both datasets contain multi-turn dialogues grounded on a set of background knowledge and are built with crowd-sourcing on Amazon Mechanical Turk. In WoW, the given knowledge collection is obtained from Wikipedia and covers a wide range of topics and domains, while in CMU DoG, the underlying knowledge focuses on the movie domain. Unlike CMU DoG, WoW provides the gold knowledge index for each turn. Two test configurations (i.e., test-seen and test-unseen) are provided in WoW. Following existing works (Dinan et al., 2019), positive responses are true responses from humans and negative ones are randomly sampled. The ratio between positive and negative responses is 1:99 for WoW and 1:19 for CMU DoG. More details of the two benchmarks are shown in Appendix A.1.
Evaluation Metrics. Following previous works on knowledge-grounded response selection (Gu et al., 2020b), we employ recall at position $k$ among $n$ candidates, $R_n@k$ (where $n = 100$ for WoW, $n = 20$ for CMU DoG, and $k \in \{1, 2, 5\}$), as the evaluation metric.
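For a single test instance, the $R_n@k$ metric above reduces to checking whether the gold response is ranked within the top $k$ of the $n$ scored candidates; a minimal sketch:

```python
# Sketch of R_n@k for one test instance: rank the n candidate scores and
# count a hit if the gold response falls in the top k. Corpus-level R_n@k
# is the average of this indicator over all instances.

def recall_at_k(scores, gold_index=0, k=1):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1 if gold_index in ranked[:k] else 0

scores = [0.9, 0.95, 0.3, 0.1]  # gold response at index 0 is ranked second
hit_at_1 = recall_at_k(scores, k=1)
hit_at_2 = recall_at_k(scores, k=2)
```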

Implementation Details
Our model is implemented in PyTorch (Paszke et al., 2019). Without loss of generality, we select English uncased BERT base (110M) as the matching model. During training, the maximum lengths of the knowledge (a.k.a. passage), the dialogue history, the query, and the response candidate were set to 128, 120, 60, and 40 respectively. Intuitively, the last tokens in the dialogue history and the first tokens in the query and response candidate are more important, so when a sequence exceeds the maximum length we cut off the earlier tokens of the dialogue history, but truncate the query and response candidate from the end. We set a batch size of 32 for multi-turn response matching and query-dialogue history matching, and 8 for query-passage matching, in order to train the tasks jointly despite the unequal numbers of training examples. We set $\delta_p = 6$, $\delta_h = 1$, and $\delta_r = 12$ for the query-passage matching, query-dialogue history matching, and multi-turn response matching tasks respectively. In particular, the negative dialogue histories are sampled from the other training instances in a batch. The model is optimized with the Adam optimizer, with the learning rate set to 5e-6 and scheduled by warmup and linear decay. A dropout rate of 0.1 is applied to all linear transformation layers. The gradient clipping threshold is set to 10.0. Early stopping on the corresponding validation data is adopted as a regularization strategy. During testing, we vary the number of selected knowledge entries $m \in \{1, \ldots, 15\}$ and set $m = 2$ for PTKGC cat and $m = 14$ for PTKGC sep, as these achieve the best performance.
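The direction-dependent truncation rule described above (keep the tail of the dialogue history, but the head of the query and response candidate) can be sketched with a small helper; the function name and interface are ours, not from the released implementation.

```python
# Sketch of length truncation: history keeps its LAST tokens (its tail is
# closest to the query), while query and response keep their FIRST tokens.

def truncate(tokens, max_len, keep="head"):
    if len(tokens) <= max_len:
        return tokens
    return tokens[:max_len] if keep == "head" else tokens[-max_len:]

history = truncate(list(range(10)), 4, keep="tail")  # keeps the last 4 tokens
query = truncate(list(range(10)), 4, keep="head")    # keeps the first 4 tokens
```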
Baselines on CMU DoG 1) Starspace (Wu et al., 2018) selects the response by the cosine similarity between a concatenated sequence of dialogue context, knowledge, and the response candidate represented by StarSpace (Wu et al., 2018); 2) BoW MemNet (

Evaluation Results
Performance of Response Selection. Table 1 and Table 2 report the evaluation results of response selection. We can observe that PTKGC sep is consistently better than PTKGC cat over all metrics on the two datasets, demonstrating that individually representing each knowledge-query-response triple with BERT can lead to a better matching signal than representing a single long sequence. Our explanation for this phenomenon is that there is information loss when a long sequence composed of the knowledge and dialogue history passes through the deep architecture of BERT: the earlier the different knowledge entries and the dialogue history are fused together, the more information of the dialogue history or background knowledge is lost in matching. Particularly, on WoW, in terms of R@1, our PTKGC sep achieves performance comparable with existing state-of-the-art models learned from the crowd-sourced training set, indicating that the model can effectively learn how to leverage external knowledge for response selection through the proposed pre-training approach.
Notably, we observe that our PTKGC sep performs worse than DIM and FIRE on CMU DoG. Our explanation is that the dialogues and knowledge in CMU DoG focus on the movie domain, while our training data, including the ad-hoc retrieval corpora and multi-turn dialogues, come from open domains.
Performance of Knowledge Selection. We also assess the ability of models to predict the knowledge selected by human wizards on the WoW data. The results are shown in Table 4. The performance of our method is comparable with various supervised methods trained on the gold knowledge index. In particular, on the test-seen set our model is slightly worse than Transformer (w/ pretrain), while on the test-unseen set it achieves slightly better results. These results demonstrate the advantages of our pre-training tasks and the good generalization ability of our model.

Discussions
Ablation Study. We conduct a comprehensive ablation study to investigate the impact of different inputs and different tasks. First, we remove the dialogue history, the knowledge, and both of them from the model, denoted as PTKGC sep (q+k), PTKGC sep (q+h), and PTKGC sep (q) respectively. According to the results in the first four rows of Table 3, both the dialogue history and the knowledge are crucial for response selection, as removing either generally causes a performance drop on the two datasets. Moreover, the background knowledge is more critical, as removing it causes a more significant performance degradation than removing the dialogue history.
Then, we remove each pre-training task individually from PTKGC sep and denote the resulting models as PTKGC sep -X, where $X \in \{\mathcal{L}_p, \mathcal{L}_h\}$ stands for the query-passage matching task and the query-dialogue history matching task respectively. Table 4 shows the ablation results for knowledge selection. Both tasks are useful for learning knowledge selection, and query-passage matching plays a dominant role, since the performance of knowledge selection drops dramatically when this task is removed from pre-training. The last two rows of Table 3 show the ablation results for response selection. We report these results when only 1 knowledge entry is provided, since the knowledge recalls of the ablated models and the full model are very close when $m$ is large ($m = 14$). Both tasks are helpful, and the performance of response selection drops more when the query-passage matching task is removed. In particular, $\mathcal{L}_p$ plays the more important role, and the performance on the test-unseen set of WoW drops more noticeably when either training task is removed.
To further investigate the impact of our pre-training tasks on the performance of multi-turn response selection (without considering the grounded knowledge), we conduct an ablation study, and the results are shown in Table 5. We observe that the performance of the response matching model (with no grounded knowledge) drops noticeably when one or both of the pre-training tasks are removed. In particular, the query-passage matching task contributes more to response selection.
The impact of the number of selected knowledge entries. We further study how the number of selected knowledge entries ($m$) influences the performance of PTKGC sep . Figure 2 shows how the performance of our model changes with respect to different numbers of selected entries. We observe that the performance increases monotonically until the number of entries reaches a certain value, and then stays stable as the number keeps increasing. The results are reasonable: more knowledge entries can provide more useful information for response matching, but once the knowledge is sufficient, additional entries introduce noise into matching.

Conclusion
In this paper, we study response matching in knowledge-grounded conversations under a zero-resource setting. In particular, we propose decomposing the training of knowledge-grounded response selection into three tasks and jointly train all tasks in a unified pre-trained language model. The model learns to select relevant knowledge and distinguish proper responses with the help of ad-hoc retrieval corpora and a large number of multi-turn dialogues. Experimental results on two benchmarks indicate that our model achieves performance comparable with several existing methods trained on crowd-sourced data. In the future, we would like to explore the ability of our proposed method in retrieval-augmented dialogues.

A Appendices
A.1 Datasets

We tested our proposed method on Wizard-of-Wikipedia (WoW) (Dinan et al., 2019) and CMU DoG (Zhou et al., 2018a). Both datasets contain multi-turn dialogues grounded on a set of background knowledge and are built with crowd-sourcing on Amazon Mechanical Turk.
In the WoW dataset, one of the paired speakers is asked to play the role of a knowledgeable expert with access to the given knowledge collection obtained from Wikipedia, while the other plays a curious learner. The dataset consists of 968 complete knowledge-grounded dialogues for testing. It is worth noting that the gold knowledge index for each turn is available in the dataset. Response selection is performed at every turn of a complete dialogue, which results in 7512 test samples in total. Following the setting of the original paper, positive responses are true responses from humans and negative ones are randomly sampled. The ratio between positive and negative responses is 1:99 in the test sets. Besides, the test set is divided into two subsets: Test Seen and Test Unseen. The former shares 533 common topics with the training set, while the latter contains 58 new topics not covered by the training or validation sets.
The CMU DoG data contains knowledge-grounded human-human conversations in which the underlying knowledge comes from wiki articles and focuses on the movie domain. Similar to Dinan et al. (2019), the dataset was built in two scenarios. In the first scenario, only one worker can access the provided knowledge collection and is responsible for introducing the movie to the other worker; in the second scenario, both workers know the knowledge and are asked to discuss the content. Different from WoW, the gold knowledge index for each turn is unknown in both scenarios. Since the data size for an individual scenario is small, we merge the data of the two scenarios following the setting of previous work. Finally, there are 537 dialogues for testing. We evaluate the performance of response selection at every turn of a dialogue, which results in 6637 test samples. We adopted the shared version in which 19 negative candidates were randomly sampled for each utterance from the same set. More details about the two benchmarks can be seen in Table 6.

A.2 Baselines for Knowledge Selection
To compare the performance of knowledge selection, we choose the following baselines from Dinan et al. (2019): (1) Random: the model randomly selects a knowledge entry from the set of knowledge entries; (2) IR Baseline: the model uses simple word overlap between the dialogue context and the knowledge entry to select relevant knowledge; (3) BoW MemNet: the model is based on a memory network where each memory item is a bag-of-words representation of a knowledge entry, and the gold knowledge labels for each turn are used to train the model; (4) Transformer: the model trains a context-knowledge matching network based on the Transformer architecture; (5) Transformer (w/ pretrain): the model is similar to the former, but the Transformer is pre-trained on Reddit data and fine-tuned for the knowledge selection task.

A.3 Low-Resource Evaluation

As an additional experiment, we also evaluate the proposed model in a low-resource setting. We randomly sample a portion t ∈ {10%, 50%, 100%} of the training data from WoW and use it to fine-tune our model. The results are shown in Table 7. With only 10% of the training data, our model can significantly outperform existing models, indicating the advantages of our pre-training tasks. With 100% of the training data, our model achieves a 2.7% improvement in terms of R@1 on test-seen and a 4.7% improvement on test-unseen.