Dialogue Response Selection with Hierarchical Curriculum Learning

We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that models trained with random negative samples are not ideal in real-world scenarios, we propose a hierarchical curriculum learning framework that trains the matching model in an "easy-to-difficult" scheme. Our learning framework consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability to find the matching clues between the dialogue context and a response candidate. As for IC, it progressively strengthens the model's ability to identify the mismatching information between the dialogue context and a response candidate. Empirical studies on three benchmark datasets with three state-of-the-art matching models demonstrate that the proposed learning framework significantly improves model performance across various evaluation metrics.


Introduction
Building intelligent conversation systems is a longstanding goal of artificial intelligence and has attracted much attention in recent years (Shum et al., 2018; Kollar et al., 2018). An important challenge for building such conversation systems is the response selection problem, that is, selecting the best response to a given dialogue context from a set of candidate responses (Ritter et al., 2011).
To tackle this problem, different matching models have been developed to measure the matching degree between a dialogue context and a response candidate (Wu et al., 2017; Lu et al., 2019; Gu et al., 2019). Despite their differences, most prior works train the model with data constructed by a simple heuristic. For each context, the human-written response is considered as positive (i.e., an appropriate response) and the responses from other dialogue contexts are considered as negatives (i.e., inappropriate responses). In practice, the negative responses are often randomly sampled, and the training objective ensures that the positive response scores higher than the negative ones.

* The main body of this work was done during an internship at Tencent Inc. The first two authors contributed equally. Yan Wang is the corresponding author.

Table 1: A dialogue context between two speakers A and B, together with positive (P1, P2) and negative (N1, N2) response candidates.
Recently, some researchers (e.g., Lin et al., 2020) have raised the concern that randomly sampled negative responses are often too trivial (i.e., totally irrelevant to the dialogue context). Models trained with trivial negative responses may fail to handle strong distractors in real-world scenarios. Essentially, the problem stems from ignoring the diversity in context-response matching degree: all random responses are treated as equally negative regardless of their different distracting strengths.

For example, Table 1 shows a conversation between two speakers along with two negative responses (N1, N2). For N1, one can easily rule out its appropriateness, as it unnaturally diverges from the TV show topic. On the other hand, N2 is a strong distractor, as it overlaps significantly with the context (e.g., fantasy series and Game of Thrones). Only with close observation can we find that N2 does not maintain the coherence of the discussion, i.e., it starts a parallel discussion about an actor in Game of Thrones rather than elaborating on the enjoyable properties of the TV series. We also observe a similar phenomenon on the positive side: for different training context-response pairs, the pairwise relevance also varies. In Table 1, two positive responses (P1, P2) are provided for the given context. For P1, one can easily confirm its validity, as it naturally replies to the context. As for P2, while it elaborates on the enjoyable properties of the TV series, it does not exhibit any obvious matching clues (e.g., lexical overlap with the context). Therefore, to correctly identify P2, its relationship with the context must be carefully reasoned about by the model.

Inspired by the above observations, in this work we propose to employ the idea of curriculum learning (CL) (Bengio et al., 2009). The key to applying CL is to specify a proper learning scheme under which all training examples are learned.
By analyzing the characteristics of the concerned task, we tailor-design a hierarchical curriculum learning (HCL) framework. Specifically, our learning framework consists of two complementary curriculum strategies, corpus-level curriculum (CC) and instance-level curriculum (IC), covering the two distinct aspects of response selection. In CC, the model gradually increases its ability to find matching clues through an easy-to-difficult arrangement of positive context-response pairs. In IC, we sort all negative responses according to their distracting strength so that the model's capability of identifying mismatching information can be progressively strengthened.
Notably, our learning framework is independent of the choice of matching models. For a comprehensive evaluation, we evaluate our approach on three representative matching models, including the current state of the art. Results on three benchmark datasets demonstrate that the proposed learning framework leads to remarkable performance improvements across all evaluation metrics.
In a nutshell, our contributions can be summarized as: (1) We propose a hierarchical curriculum learning framework to tackle the task of dialogue response selection; and (2) Empirical results on three benchmark datasets show that our approach can significantly improve the performance of various strong matching models, including the current state of the art.

Background
Given a training dataset D = {(c_i, r_i)}_{i=1}^{|D|} of context-response pairs, the learning of a matching model s(·, ·) is to correctly identify the positive response r_i conditioned on the dialogue context c_i from a set of negative responses R_i^-. The learning objective is typically defined as

L = − Σ_{(c_i, r_i) ∈ D} log [ exp(s(c_i, r_i)) / ( exp(s(c_i, r_i)) + Σ_{r^- ∈ R_i^-} exp(s(c_i, r^-)) ) ],  (1)

where m = |R_i^-| is the number of negative responses associated with each training context-response pair.
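Concretely, the objective treats each context's positive response and its m sampled negatives as an (m+1)-way classification problem. A minimal sketch in plain Python (the function name and toy scores are illustrative, not from the paper's code):

```python
import math

def matching_loss(pos_score, neg_scores):
    """Softmax cross-entropy over one positive and m negative responses,
    mirroring Eq. (1): push s(c_i, r_i) above every s(c_i, r^-)."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# The loss shrinks as the positive score rises above the negatives.
easy = matching_loss(5.0, [0.0, 0.1, -0.2])
hard = matching_loss(0.0, [0.0, 0.1, -0.2])
assert easy < hard
```

In practice the scores come from the matching model s(·, ·) and the loss is averaged over a batch, typically with a numerically stable log-sum-exp rather than direct exponentials.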
In most existing studies (Wu et al., 2017; Gu et al., 2019), the training negative responses R_i^- are randomly selected from the dataset D. Recently, Lin et al. (2020), among others, proposed different approaches to strengthen the training negatives. In testing, for any context-response pair (c, r), the model gives a score s(c, r) that reflects the pairwise matching degree, which allows the user to rank a set of response candidates for response selection.

Overview
We propose a hierarchical curriculum learning (HCL) framework for training neural matching models. It consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). Figure 1 illustrates the relationship between these two strategies. In CC ( §3.2), the training context-response pairs with lower difficulty are presented to the model before harder pairs. In this way, the model gradually increases its ability to find the matching clues contained in the response candidate. As for IC ( §3.3), it controls the difficulty of the negative responses associated with each training context-response pair. Starting from easier negatives, the model progressively strengthens its ability to identify the mismatching information (e.g., semantic incoherence) in the response candidate. The following gives a detailed description of the proposed approach.

Figure 1: On the left, two training context-response pairs with different difficulty levels are presented (the upper one is more difficult than the lower one; P denotes the positive response). For each training instance, three associated negative responses (N1, N2 and N3) are shown with increasing difficulty from bottom to top. In the negative responses, words that also appear in the dialogue context are marked in italic.

Corpus-Level Curriculum
As described above, the corpus-level curriculum (CC) arranges the ordering of the training context-response pairs. The model first learns to find easier matching clues from the pairs with lower difficulty. As the training progresses, harder cases are presented to the model to learn less obvious matching signals. Two examples are shown in the left part of Figure 1. For the easier pair, the context and the positive response are semantically coherent and lexically overlapping (e.g., TV series and Game of Thrones), and such matching clues are simple for the model to learn. As for the harder case, the positive response can only be identified via numerical reasoning, which makes it harder to learn.
Difficulty Function. To measure the difficulty of each training context-response pair (c_i, r_i), we adopt a pre-trained ranking model G(·, ·) ( §3.4) to calculate its relevance score G(c_i, r_i). Here, a higher score G(c_i, r_i) corresponds to a higher relevance between c_i and r_i, and vice versa. Then, for each pair (c_i, r_i) ∈ D, its corpus-level difficulty is defined as its normalized rank

d_cc(c_i, r_i) = rank(G(c_i, r_i)) / |D|,  (2)

where rank(·) sorts the relevance scores of all pairs in D in descending order (the most relevant pair has rank 1). A lower difficulty score indicates that the pair (c_i, r_i) is easier for the model to learn, and vice versa.
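One concrete way to realize such a difficulty measure (a sketch; the exact normalization here is our assumption) is to convert each pair's relevance score into its normalized descending rank, so that the most relevant, easiest pair gets the lowest difficulty:

```python
def corpus_difficulty(relevance_scores):
    """Map each pair's relevance score G(c_i, r_i) to a difficulty in (0, 1]:
    higher relevance -> smaller descending rank -> lower difficulty."""
    n = len(relevance_scores)
    order = sorted(range(n), key=lambda i: -relevance_scores[i])
    difficulty = [0.0] * n
    for rank, i in enumerate(order, start=1):
        difficulty[i] = rank / n
    return difficulty

# The pair with the highest score gets difficulty 1/n, the lowest gets 1.0.
assert corpus_difficulty([0.9, 0.1, 0.5]) == [1/3, 1.0, 2/3]
```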
Pacing Function. In training, to select the training context-response pairs with appropriate difficulty, we define a corpus-level pacing function p_cc(t), which controls the pace of learning from easy to hard instances. At time step t, p_cc(t) represents the upper limit of difficulty, and the model is only allowed to use the training instances (c_i, r_i) whose corpus-level difficulty score d_cc(c_i, r_i) is lower than p_cc(t). In this work, we adopt a simple linear form

p_cc(t) = min(1.0, p_cc(0) + (1 − p_cc(0)) · t / T),

where p_cc(0) is a predefined initial value. During the training warm-up stage (the first T steps), we learn a basic matching model with an easy subset of the training data, in which the difficulty of all samples is lower than p_cc(t). After p_cc(t) reaches 1.0 (at time step T), the corpus-level curriculum is completed and the model can then freely access the entire dataset. Figure 2(a) gives an illustration of the corpus-level curriculum.
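Under the linear schedule described above, the pacing function can be sketched as follows (the function and parameter names are ours):

```python
def p_cc(t, T, p0=0.3):
    """Corpus-level pacing: ramp linearly from p_cc(0)=p0 toward 1.0 over
    the first T steps, then stay at 1.0 (full access to the dataset)."""
    return min(1.0, p0 + (1.0 - p0) * t / T)

# During warm-up, only pairs with d_cc <= p_cc(t) are eligible for sampling:
difficulties = [0.1, 0.4, 0.8]
eligible = [d for d in difficulties if d <= p_cc(0, T=100)]  # -> [0.1]
```

Because p_cc(t) is clipped at 1.0, training after step T is identical to standard training on the full dataset.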

Instance-Level Curriculum
As a complement to CC, the instance-level curriculum (IC) controls the difficulty of negative responses.

Figure 2(a): At each step, (1) p_cc(t) is computed based on the current step t; and (2) a batch of context-response pairs is uniformly sampled from the training instances whose corpus-level difficulty is lower than p_cc(t) (shaded area in the example).

For an arbitrary training context-response pair (c_i, r_i), while its associated negative responses can be any response r_j (j ≠ i) in the training set, the difficulties of different r_j are diverse. Some examples are presented in the right part of Figure 1. The negative responses with lower difficulty are simple to spot, as they are often obviously off topic. For the harder negatives, the model needs to identify the fine-grained semantic incoherence between them and the context.

The main purpose of IC is to select negative responses with appropriate difficulty based on the state of the learning process. At the beginning, the negative responses are randomly sampled from the entire training set, so that most of them are easy to distinguish. As the training evolves, IC gradually increases the difficulty of negative responses by sampling them from the responses with higher difficulty (i.e., from a harder subset of the training data). In this way, the model's ability to find the mismatching information is progressively strengthened, making it more robust against strong distractors in real-world scenarios.
Difficulty Function. Given a specific training instance (c_i, r_i), we define the difficulty of an arbitrary response r_j (j ≠ i) as its rank in the list of relevance scores sorted in descending order:

d_ic(c_i, r_j) = rank of G(c_i, r_j) among {G(c_i, r_k) : r_k ∈ D, k ≠ i}, sorted in descending order.  (3)

In this formula, the response r_h with the highest relevance score, i.e., r_h = argmax_{r_j ∈ D, j ≠ i} G(c_i, r_j), has a rank of 1, thus d_ic(c_i, r_h) = 1. The response r_l with the lowest relevance score, i.e., r_l = argmin_{r_j ∈ D, j ≠ i} G(c_i, r_j), has a rank of |D|, thus d_ic(c_i, r_l) = |D|. A smaller rank means the corresponding negative response is more relevant to the context c_i, and thus more difficult for the model to distinguish.
Pacing Function. Similar to CC, in IC the pace of learning from easy to difficult negative responses is controlled by an instance-level pacing function p_ic(t), which adjusts the size (in log scale) of the sampling space from which the negative responses are drawn. Given a training instance (c_i, r_i), at time step t, the negative examples are sampled from the responses r_j (j ≠ i) whose rank is smaller than 10^{p_ic(t)} (i.e., d_ic(c_i, r_j) ≤ 10^{p_ic(t)}); in other words, the negatives are sampled from the subset of the training data consisting of the top-10^{p_ic(t)} responses most relevant to c_i. The smaller p_ic(t) is, the harder the sampled negatives will be. In this work, we define p_ic(t) as

p_ic(t) = max(k_T, k_0 − (k_0 − k_T) · t / T),

where T is the same as in the corpus-level pacing function p_cc(t), and k_0 = log_10 |D|, meaning that, at the start of training, the negative responses are sampled from the entire training set D. k_T is a hyperparameter smaller than k_0. After p_ic(t) reaches k_T (at step T), the instance-level curriculum is completed; for the following training steps, the size of the sampling space is fixed at 10^{k_T}. An example of p_ic(t) is depicted in Figure 2(b).
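The shrinking sampling space can be sketched as follows (a linear schedule in log space; the helper names are ours):

```python
import math
import random

def p_ic(t, T, corpus_size, k_T=3.0):
    """Instance-level pacing: the log10 size of the negative sampling space
    shrinks linearly from k_0 = log10|D| to k_T over the first T steps."""
    k0 = math.log10(corpus_size)
    return max(k_T, k0 - (k0 - k_T) * t / T)

def sample_negatives(ranked_responses, t, T, m=5, k_T=3.0):
    """Draw m negatives from the top-10^{p_ic(t)} candidates, where
    ranked_responses lists candidates by descending relevance to c_i."""
    limit = int(10 ** p_ic(t, T, len(ranked_responses), k_T))
    pool = ranked_responses[:limit]
    return random.sample(pool, min(m, len(pool)))
```

At t = 0 the pool is the whole corpus (mostly easy negatives); after step T it is fixed at the 10^{k_T} most relevant, hardest candidates.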

Hierarchical Curriculum Learning
Model Training. Our learning framework jointly employs the corpus-level and instance-level curricula. For each training step, we construct a batch of training data as follows. First, we select the positive context-response pairs according to the corpus-level pacing function p_cc(t). Then, for each instance in the selected batch, we sample its associated negative examples according to the instance-level pacing function p_ic(t). Details of our learning framework are presented in Algorithm 1.

Algorithm 1: Hierarchical Curriculum Learning
  Input: dataset D, matching model, trainer T, pacing functions p_cc(t) and p_ic(t)
  for each training step t do
      Uniformly sample one batch of context-response pairs, B_t, from all (c_i, r_i) ∈ D such that d_cc(c_i, r_i) ≤ p_cc(t), as shown in Figure 2;
      For each (c_i, r_i) ∈ B_t, sample m negative responses r_j with d_ic(c_i, r_j) ≤ 10^{p_ic(t)};
      Invoke the trainer, T, using B_t and the sampled negatives as input, to optimize the model using Eq. (1);
  end
  Output: trained matching model
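One HCL training step can then be sketched end-to-end (all names are illustrative; the pacing functions are passed in as callables):

```python
import random

def hcl_batch(difficulties, neg_ranks, t, T, batch_size, m, p_cc, p_ic):
    """Build one training batch under both curricula.
    difficulties[i]: corpus-level difficulty d_cc of pair i.
    neg_ranks[i]: response indices sorted by descending relevance to context i."""
    # Corpus-level curriculum: restrict to pairs easy enough for step t.
    limit = p_cc(t, T)
    eligible = [i for i, d in enumerate(difficulties) if d <= limit]
    batch = random.sample(eligible, min(batch_size, len(eligible)))
    # Instance-level curriculum: sample negatives from the current pool.
    pool_size = int(10 ** p_ic(t, T))
    negatives = {i: random.sample(neg_ranks[i][:pool_size], min(m, pool_size))
                 for i in batch}
    return batch, negatives
```

The returned batch and its per-instance negatives would then be fed to the trainer to optimize the matching model with Eq. (1).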
Fast Ranking Model. As described in Eq. (2) and (3), our framework requires a ranking model G(·, ·) that efficiently measures the pairwise relevance of millions of possible context-response combinations. In this work, we construct G(·, ·) as a non-interaction matching model with a dual-encoder structure, so that we can precompute representations of all contexts and responses offline and store them in a cache. For any context-response pair (c, r), its pairwise relevance G(c, r) is defined as

G(c, r) = E_c(c)^T · E_r(r),  (4)

where E_c(c) and E_r(r) are the dense context and response representations produced by a context encoder E_c(·) and a response encoder E_r(·).
Offline Index. After training the ranking model on the same response selection dataset D with the in-batch negative objective (Karpukhin et al., 2020), we compute the dense representations of all contexts and responses contained in D. Then, as described in Eq. (4), the relevance scores of all possible combinations of the contexts and responses in D can be computed through the dot product of their representations. After this step, we compute the corpus-level and instance-level difficulty of all possible combinations and cache them in memory for fast access during training.
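The offline scoring step amounts to one large matrix of dot products plus a per-context ranking; a dependency-free sketch follows (the helper names are ours, and a real system would use vectorized math or an approximate-search index instead):

```python
def dot(u, v):
    """Inner product between two cached dense representations."""
    return sum(a * b for a, b in zip(u, v))

def relevance_matrix(context_vecs, response_vecs):
    """Eq. (4): G(c, r) = E_c(c) . E_r(r) for every context-response pair."""
    return [[dot(c, r) for r in response_vecs] for c in context_vecs]

def rank_matrix(relevance):
    """Per-context instance-level ranks: 1 = most relevant (hardest negative)."""
    ranks = []
    for row in relevance:
        order = sorted(range(len(row)), key=lambda j: -row[j])
        row_ranks = [0] * len(row)
        for rank, j in enumerate(order, start=1):
            row_ranks[j] = rank
        ranks.append(row_ranks)
    return ranks
```

Both matrices are computed once and cached, so curriculum lookups during training are constant-time.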

Related Work
Dialogue Response Selection. Early studies in this area were devoted to response selection for single-turn conversations (Wang et al., 2013; Tan et al., 2016). Recently, researchers have turned to the scenario of multi-turn conversations, and many sophisticated neural network architectures have been devised (Wu et al., 2017; Gu et al., 2019; Gu et al., 2020). There is also an emerging line of research studying how to improve existing matching models with better learning algorithms; for example, one prior work proposed to adopt a Seq2seq model as a weak teacher to guide the training process.

Curriculum Learning. Curriculum learning (Bengio et al., 2009) is reminiscent of the cognitive process of human learning. Its core idea is to first learn easier concepts and then gradually transition to more complex ones according to a predefined learning scheme. Curriculum learning (CL) has demonstrated its benefits in various machine learning tasks (Spitkovsky et al., 2010; Ilg et al., 2017; Svetlik et al., 2017; Platanios et al., 2019). Recently, Penha and Hauff (2020) employed the idea of CL to tackle the response selection task. However, they only applied curriculum learning on the positive side, ignoring the diversity of the negative responses.

Datasets and Evaluation Metrics
We test our approach on three benchmark datasets.
Douban Dataset. This dataset (Wu et al., 2017) consists of multi-turn Chinese conversation data crawled from the Douban group. The sizes of the training, validation and test sets are 500k, 25k and 1k, respectively. In the test set, each dialogue context is paired with 10 candidate responses. Following previous works, we report the results of Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Precision at Position 1 (P@1). In addition, we also report R_10@1, R_10@2 and R_10@5, where R_n@k means recall at position k among n candidates.
Ubuntu Dataset. This dataset (Lowe et al., 2015) contains multi-turn dialogues collected from chat logs of the Ubuntu Forum. The sizes of the training, validation and test sets are 500k, 50k and 50k, respectively. Each dialogue context is paired with 10 response candidates. Following previous studies, we use R_2@1, R_10@1, R_10@2 and R_10@5 as evaluation metrics.
E-Commerce Dataset. This dataset (Zhang et al., 2018) consists of Chinese conversations between customers and customer service staff from Taobao. The sizes of the training, validation and test sets are 500k, 25k and 1k, respectively. In the test set, each dialogue context is paired with 10 candidate responses. R_n@k is employed as the evaluation metric.
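For reference, the ranking metrics used above can be computed per test instance as follows (a sketch with hypothetical helper names; the positive response's index among the n candidates is assumed known):

```python
def recall_at_k(scores, positive_idx, k):
    """R_n@k for one instance: 1 if the positive response is ranked in the
    top-k of the n scored candidates, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return int(positive_idx in ranked[:k])

def reciprocal_rank(scores, positive_idx):
    """Per-instance reciprocal rank; MRR is its mean over the test set."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1.0 / (ranked.index(positive_idx) + 1)
```

When each context has exactly one positive candidate, average precision reduces to the reciprocal rank, so MAP and MRR only differ on datasets (such as Douban) where a context can have multiple appropriate responses.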

Baseline Models
In the experiments, we compare our approach with the following models that can be summarized into three categories.
BERT-based Matching Models. Given the recent advances of pre-trained language models (Devlin et al., 2019), Gu et al. (2020) proposed the SA-BERT model, which adapts BERT to the task of response selection and is the current state-of-the-art model on the Douban and Ubuntu datasets.

Implementation Details
For all experiments, we set the value of p_cc(0) in the corpus-level pacing function p_cc(t) to 0.3, meaning that all models start training with the context-response pairs whose corpus-level difficulty is lower than 0.3. For the instance-level pacing function p_ic(t), the value of k_T is set to 3, meaning that, after IC is completed, the negative responses of each training instance are sampled from the top-10^3 most relevant responses. In the experiments, each matching model is trained for 40,000 steps with a batch size of 128, and we set T in both p_cc(t) and p_ic(t) to half of the total training steps, i.e., T = 20,000. To build the context and response encoders in the ranking model G(·, ·), we use 3-layer Transformers with a hidden size of 256. We select two representative models (SMN and MSN) along with the state-of-the-art SA-BERT to test the proposed learning framework. To better simulate the true testing environment, the number of negative responses (m in Eq. (1)) is set to 5.
Result and Analysis

Main Results

Table 2 shows the results on the Douban, Ubuntu, and E-Commerce datasets, where X+HCL means training the model X with the proposed HCL framework.
We can see that HCL significantly improves the performance of all three matching models on all evaluation metrics, showing the robustness and universality of our approach. We also observe that, when trained with HCL, a model without a pre-trained language model (MSN) can even surpass the state-of-the-art SA-BERT, which uses one, on the Douban dataset. These results suggest that, while the training strategy is under-explored in previous studies, it can be decisive for building a competent response selection model.

Effect of CC and IC
To reveal the individual effects of CC and IC, we train different models on the Douban dataset by removing either CC or IC. The experimental results are shown in Table 3, from which we see that both CC and IC make positive contributions to the overall performance when used alone. Utilizing only IC leads to larger improvements than utilizing only CC, suggesting that the ability to identify mismatching information is a more important factor for the model to achieve its optimal performance. However, the best performance is achieved when CC and IC are combined, indicating that the two curricula are complementary to each other.

Contrast to Existing Learning Strategies
Next, we compare our approach with other learning strategies proposed recently (Penha and Hauff, 2020; Lin et al., 2020). We use Semi, CIR, and Gray to denote these approaches, where CIR is from Penha and Hauff (2020) and Gray, from Lin et al. (2020), is the current state of the art. We conduct experiments on the Douban and Ubuntu datasets, and the results for the three matching models are listed in Table 4. From the results, we can see that our approach consistently outperforms the other learning strategies in all settings. The performance gains are even more remarkable given the simplicity of our approach: it requires neither running additional generation models (Lin et al., 2020) nor re-scoring negative samples at different epochs.

Further Analysis on HCL
In this part, we study how the key hyperparameters affect the performance of HCL, including the initial difficulty of CC, p_cc(0), and the curriculum length of IC, k_T. In addition, we also investigate the effect of different ranking model choices.
Initial Difficulty of CC. We run sensitivity analysis experiments on the Douban dataset with the SMN model by tuning p_cc(0) in the corpus-level pacing function p_cc(t). The results of P@1 and R_10@2 for different values of p_cc(0) are shown in Figure 3(a). We observe that when p_cc(0) is small (i.e., p_cc(0) ≤ 0.3), the model performances are relatively similar. When p_cc(0) approaches 1.0, the results drop significantly. This concurs with our expectation that, in CC, the model should start learning with training context-response pairs of lower difficulty. Once p_cc(0) becomes 1.0, CC is disabled, resulting in the lowest model performances.
Curriculum Length of IC. Similar to p_cc(0), we run sensitivity analysis experiments by tuning k_T in the instance-level pacing function p_ic(t); Figure 3(b) shows the results. We observe that a too small or too large k_T results in performance degradation. When k_T is too small, after IC is completed, the negative examples are only sampled from a very small subset of the training data consisting of responses with high relevance. In this case, the sampled responses might be false negatives that should be deemed positive cases, and learning to treat them as true negatives can harm the model performance. On the other hand, as k_T increases, the effect of IC becomes less obvious. When k_T = log_10(500k) (|D| = 500k), IC is completely disabled, leading to a further decrease in model performance.
Ranking Model Architecture. Lastly, we examine the effect of the choice of ranking model architecture. We build two ranking model variants by replacing the Transformer modules E_c(·) and E_r(·) in Eq. (4) with other modules. For the first variant, we use a 3-layer BiLSTM with a hidden size of 256; for the second, we use the BERT-base model (Devlin et al., 2019). We then train the matching models using the proposed HCL but with different ranking models as the scoring basis.
The results on the Douban dataset are shown in Table 5. We first compare the performance of the different ranking models by directly using them to select the best response; the results are shown in the "Ranking Model" row of Table 5. Among the three variants, BERT performs the best, but it is still less accurate than the sophisticated matching models. Second, we study the effect of different ranking models on matching model performance. We see that, for different matching models, Transformers and BERT perform comparably, while the results from BiLSTM are much worse. This leads to the conclusion that, while the choice of ranking model does affect the overall results, improving the ranking model does not necessarily improve the matching models once the ranking model reaches a certain accuracy.

Conclusion
In this work, we propose a novel hierarchical curriculum learning framework for training response selection models for multi-turn conversations. During training, the proposed framework simultaneously employs corpus-level and instance-level curricula to dynamically select suitable training data based on the state of the learning process. Extensive experiments and analysis on three benchmark datasets show that our approach significantly improves the performance of various strong matching models on all evaluation metrics.