Do It Once: An Embarrassingly Simple Joint Matching Approach to Response Selection

Existing matching models for response selection adopt the independent matching (IM) approach: to complete a prediction, they have to perform N independent matches, where N is the number of response options. In this paper, we explore a joint matching (JM) approach which performs matching only once, regardless of the number of options. JM does not change the structure of the matching component but only modifies its input and output format. It also enables a cheap but effective data augmentation method. Extensive experiments on the MuTual dataset demonstrate that, even with the simplest formulation, JM outperforms the IM approach by a large margin and reduces training time by over half.


Introduction
The availability of large-scale datasets has driven the development of neural dialogue systems. One important task in dialogue systems is response selection, which plays an essential role in retrieval-based chatbots (Ji et al., 2014). It aims to select the best-matched response from a set of response options for a dialogue. As shown in Figure 1, given a dialogue context and four response options, we need to choose the only logically correct one.
Previous work in response selection follows an independent matching (IM) approach and computes a matching score for each of the N response options independently. Various matching models following this approach have been proposed (Zhou et al., 2016; Wu et al., 2017; Zhou et al., 2018; Chaudhuri et al., 2018; Tao et al., 2019; Yuan et al., 2019). Despite its success in the pre-BERT era, we argue that the IM approach does not make full use of the ability of pretrained encoders (such as BERT and RoBERTa) to encode multiple sentences, and hence may hinder both efficiency and effectiveness. Specifically, to complete a prediction, the IM approach has to perform N independent matches, which means N gradient computations (where N is the number of response options). Besides, the dialogue context is repeatedly encoded N times, which further contributes to the inefficiency. The other drawback is that options in these models are independent and agnostic of each other. In reality, humans often compare all the options and utilize their correlations to make a comprehensive decision.

[Figure 1: An example dialogue with four response options. ✘ A: Sorry. I won't smoke in the hospital again. ✓ B: OK. I won't smoke. Could you please give me a menu? ✘ C: Could you please tell the customer over there not to smoke? We can't stand the smell. ✘ D: Sorry. I will smoke when I get off the bus.]
In this paper, we describe a joint matching (JM) approach for this task. For any matching model, we do not change its inner structure but only modify its input and output format. Specifically, we first add a special token at the start of each option, and then concatenate all options into a single sequence. The option sequence is then matched as a whole with the dialogue context. Finally, we extract the vectors corresponding to the special tokens to calculate matching scores. Note that JM can complete a prediction with a single match, which means it requires only one gradient computation and one context encoding. Besides, thanks to the self-attention mechanism (Vaswani et al., 2017) of BERT-based matching models, options can now directly attend to each other, rather than being agnostic of each other.
Another advantage of the JM approach is that it naturally enables a simple yet effective data augmentation method. The basic idea is that, since options are sequentially concatenated in JM, new training instances can be easily created by changing the permutation order of the options. Therefore, a dialogue with N response options can create at most N! (N factorial) times as many training instances.
We conduct experiments on the MuTual dataset (Cui et al., 2020), a publicly available English dataset for multi-turn dialogue response selection. Results show that JM outperforms IM on three matching models and significantly reduces training time. Besides, the permutation-based data augmentation method gives further improvement.

Model
The overview of IM and JM is shown in Figure 2. We describe the details in the following subsections.¹

Background
Given a dialogue context consisting of a sequence of utterances {U_i}, i = 1, ..., M, and a set of N response options {O_j}, j = 1, ..., N, the goal of response selection is to select the logically correct option Ô.
Previous work (Cui et al., 2020) shows that pretrained matching models define the state of the art on this task. Similar to using BERT for sentence-pair classification (Devlin et al., 2019), they first concatenate the context (sentence A) and a candidate response (sentence B) as the BERT input (i.e., "[CLS] Excuse me ... [SEP] Sorry ... [SEP]"). On top of BERT, a fully-connected layer transforms the [CLS] token representation into the matching score. To complete a prediction, N independent matches have to be made, where N is the number of options.

¹ The code is at https://github.com/gitzlh/JM-Matching
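To make the IM input format concrete, here is a minimal, purely illustrative sketch (the helper name and toy strings are ours; a real system would use the BERT tokenizer rather than string concatenation). The key point is that the encoder must process N separate sequences, each repeating the full context:

```python
# Illustrative sketch of the independent-matching (IM) input format.
# Each of the N options is paired with the full dialogue context, so the
# context is re-encoded once per option.

def build_im_inputs(context_utterances, options):
    """Build one BERT-style sentence-pair input per response option."""
    context = " ".join(context_utterances)
    return [f"[CLS] {context} [SEP] {option} [SEP]" for option in options]

context = ["Excuse me, sir. No smoking on the bus.", "Oh, sorry."]
options = ["Sorry. I won't smoke in the hospital again.",
           "OK. I won't smoke. Could you please give me a menu?"]

inputs = build_im_inputs(context, options)
# len(inputs) == len(options): one forward (and backward) pass per option.
```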

Joint Matching
Instead of conducting N independent matches, we take the first step outside the IM framework and explore a joint matching approach for this task. We first add a special token [OP]_j at the start of the j-th option; this token is used to aggregate the matching information between the context and the j-th option into a single vector. We then concatenate all the options into a single sequence. Formally,

O = [OP]_1 O_1 [OP]_2 O_2 ... [OP]_N O_N    (1)

For the dialogue context, we concatenate all the utterances into a single sequence. Formally,

C = U_1 U_2 ... U_M    (2)

The two sequences are then separated with a [SEP] token and fed into our pretrained encoder. Formally,

X = [CLS] C [SEP] O [SEP]    (3)

For any BERT-based matching model, suppose the output embeddings of the model are H ∈ R^{|X|×d}. To perform scoring, we first extract the outputs corresponding to [OP]_j and denote them as h_{[OP]_j}. The only new parameters learned are a score vector W ∈ R^d. The probability of option j being the answer is computed as a dot product between h_{[OP]_j} and W, followed by a softmax over all of the options. Formally,

P(j | X) = softmax_j(h_{[OP]_j} · W)    (4)

The training objective is the log-likelihood of the correct answer.
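The input construction and scoring head above can be sketched in a few lines of plain Python (function names and the toy [OP]j token strings are ours; the actual model relies on the BERT/RoBERTa tokenizer and learned embeddings, and the score vector W here is just a placeholder list):

```python
import math

def build_jm_input(context_utterances, options):
    """Concatenate context and all options into one sequence (Eqs. 1-3)."""
    context = " ".join(context_utterances)                      # Eq. (2)
    option_seq = " ".join(f"[OP]{j} {opt}"                      # Eq. (1)
                          for j, opt in enumerate(options, 1))
    return f"[CLS] {context} [SEP] {option_seq} [SEP]"          # Eq. (3)

def score_options(op_vectors, w):
    """Dot each [OP]_j output vector with W, then softmax (Eq. 4)."""
    logits = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in op_vectors]
    z = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A single forward pass over the joint input thus yields all N option probabilities at once, rather than one probability per pass as in IM.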
In this way, the JM approach only needs to match the context sequence and the option sequence once. Compared with the IM approach, JM is computationally efficient in two ways: 1) it encodes the dialogue context only once, instead of N times (though this benefit is partially offset by the fact that the complexity of the transformer grows quadratically with the length of the input); 2) more importantly, IM approaches need to compute gradients N times for each training step, whereas JM requires only one gradient computation. Besides, in each self-attention layer of the BERT-based matching model, options can directly attend to and interact with each other. This process mimics how humans solve multiple-choice questions, in that we often compare all the options before making a decision.
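A back-of-envelope cost model (our own illustration, not the paper's analysis) makes the trade-off concrete. Assuming M context utterances, N options, and an average utterance/option length of L tokens, self-attention cost grows with the square of the sequence length:

```python
# Rough self-attention cost model (illustrative only; real wall-clock time
# also depends on layer count, hardware, and padding).

def im_cost(m, n, length):
    # N independent matches, each over the context plus one option.
    return n * ((m * length + length) ** 2)

def jm_cost(m, n, length):
    # One joint match over the context plus all N options.
    return (m * length + n * length) ** 2

# e.g. M=6 utterances, N=4 options, L=15 tokens:
# im_cost(6, 4, 15) -> 44100, jm_cost(6, 4, 15) -> 22500
```

Under these toy numbers JM is still cheaper per prediction, and the N-fold saving in gradient computations comes on top of this.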

Permutation-Based Data Augmentation
Another advantage of JM is that it naturally enables a permutation-based data augmentation (PBDA) method, which can generate high-quality labeled data to improve response selection.
Specifically, since the input of our model is organized as Equation 3, we can create new training instances by simply changing the concatenation order of the options. For example, from a training instance with options in their original order, we can create a new one by shuffling the options (see Figure 2). Correspondingly, the ground-truth label of the new training instance may change. In this way, a single dialogue can create at most N! times as many training instances.
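The procedure above amounts to permuting the option list and relabeling the answer. A minimal sketch (helper names are ours) using `itertools.permutations`:

```python
import itertools
import random

def augment(options, answer_idx, k):
    """Create up to k new (options, answer) instances by permuting options.

    answer_idx is the index of the ground-truth option in the original
    order; each new instance relabels it to the option's new position.
    """
    perms = list(itertools.permutations(range(len(options))))
    random.shuffle(perms)                 # sample permutations without bias
    instances = []
    for perm in perms[:k]:
        new_options = [options[i] for i in perm]
        new_answer = perm.index(answer_idx)   # where the gold option landed
        instances.append((new_options, new_answer))
    return instances
```

With 4 options, requesting k = 24 yields all 4! distinct orderings, matching the maximum augmentation factor on MuTual.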

Dataset
We evaluate our model on the MuTual dataset (Cui et al., 2020), a human-labeled, open-domain, reasoning-based dataset for multi-turn response selection. Compared with previous datasets (Lowe et al., 2015; Zhang et al., 2018; Welleck et al., 2019), MuTual is more challenging since it requires some reasoning ability: models that achieve close-to-human performance on previous datasets still perform far behind humans on MuTual. The statistics of MuTual are shown in Table 1. Note that since MuTual has 4 options for each dialogue, PBDA can thus create at most 24 (4!) times as many training instances.

Settings
We use PyTorch to implement JM on three matching models. We adopt AdamW (Loshchilov and Hutter, 2018) as our optimizer, and the peak learning rate and warmup proportion are set to 1e-5 and 0.06, respectively. We use the largest batch size that fits in the memory of our GPU and use gradient accumulation for an effective batch size of 32. Dropout (Srivastava et al., 2014) is employed before the score layer with a rate of 0.1. We train our models for 15 epochs and choose the checkpoint that reports the highest R@1 on the validation set. Following previous work (Cui et al., 2020), we evaluate our model with recall at position 1 in 4 candidates (R@1), recall at position 2 in 4 candidates (R@2), and Mean Reciprocal Rank (MRR).

Table 2 compares IM and JM on three matching models on the MuTual dataset. Note that we experiment only with pretrained matching models, given their overwhelming advantage over non-pretrained models. For a fair comparison, we report baseline results both from the official MuTual report (Cui et al., 2020) and from our own implementation.

Main Performance
Our first observation is that RoBERTa-based models significantly outperform BERT-based models, suggesting that RoBERTa is a more powerful feature extractor. More importantly, we note that our JM approach outperforms the IM approach on all three matching models. For example, BERT-JM improves over BERT-IM by 6% (absolute) in R@1. We attribute this to the fact that the JM approach concatenates all the options as the model input, so that, thanks to the self-attention mechanism, the options can directly attend to each other. In this way, JM can make a more comprehensive decision and boost performance, especially on challenging datasets like MuTual.
The conclusion holds true for larger pretrained matching models such as RoBERTa-large. As shown in Table 2, RoBERTa-large-JM brings about a 3% (absolute) improvement over RoBERTa-large-IM in terms of R@1 and even surpasses human performance in terms of R@2.

Training Time
In the task of response selection, the scalability of a model becomes an issue as the number of options increases. In this subsection, we compare JM and IM with respect to training time.³ As shown in Table 3, RoBERTa-JM reduces training time by 55% compared with RoBERTa-IM. A more detailed analysis shows that the reduction is mostly attributable to the backward-propagation process: to complete a prediction, RoBERTa-IM performs N independent matches and thus requires N gradient computations, a costly process. It also needs to encode the dialogue context N times, leading to further computational inefficiency, especially in multi-turn settings. By contrast, RoBERTa-JM requires only a single match.⁴
³ Both models are trained on a single NVIDIA TITAN Xp GPU.
⁴ We note that this benefit is partially offset by the transformer's quadratic complexity with regard to the length of the input. Suppose that the average length of a dialogue utterance and a response option is L; then the time complexity

Data Augmentation
In this subsection, we conduct experiments to verify the effectiveness of PBDA.
As shown in Table 4, PBDA 4x brings a 1% (absolute) improvement in terms of both R@1 and R@2, showing that by simply permuting the options, we can create high-quality training instances. Besides, when increasing the data size by 8 times, we observe another 2% (absolute) improvement in terms of R@1. An interesting observation is that PBDA 24x does not further improve model performance, showing that there is a limit to the improvement brought by data augmentation.

Context Length
Following previous work (Wu et al., 2017), we further investigate how JM performs across different context lengths. As demonstrated in Figure 3, the performance of BERT-JM and RoBERTa-JM is generally satisfactory, except for a slight deterioration when the context has more than seven utterances. JM can also handle short contexts with only two utterances.
On the other hand, RoBERTa-large-JM consistently performs better than RoBERTa-JM, and when the context becomes longer, the gap becomes larger. It also gives more stable performance across different context lengths, further showing the strong representation ability of RoBERTa-large.

Conclusions
In this paper, we take the first step outside the independent matching framework and explore a joint matching approach for response selection. We also present an effective permutation-based data augmentation method. Experiments on the MuTual dataset demonstrate both the effectiveness and the efficiency of our approach, and the proposed data augmentation method further improves model performance.