Attend, Select and Eliminate: Accelerating Multi-turn Response Selection with Dual-attention-based Content Elimination



Introduction
Constructing intelligent dialogue systems has attracted wide attention in the field of natural language processing (NLP) in recent years. There are two approaches widely used for dialogue systems: generation-based and retrieval-based methods. The former views conversation as a generation problem (Vinyals and Le, 2015; Serban et al., 2016; Zhang et al., 2020b), while the latter aims to select the optimal response from candidates given a dialogue context (Wu et al., 2017; Tao et al., 2019b; Xu et al., 2021; Han et al., 2021; Feng et al., 2022). Since retrieval-based methods can usually provide fluent and informative responses, they are widely adopted in a variety of industrial applications such as XiaoIce (Shum et al., 2018) from Microsoft and AliMe Assist (Li et al., 2017) from Alibaba.

Table 1: A dialogue example from the Ubuntu Corpus. The light gray words are eliminated in shallow layers, the light red words are eliminated in intermediate layers, and the black words are retained throughout and sent to the deeper layers for context-response matching.
We focus on multi-turn response selection in retrieval-based dialogue systems in this paper. Recent advances in pre-trained language models (Devlin et al., 2019) have further pushed the research frontier of this field by providing a much more powerful backbone for representation learning (Whang et al., 2020; Gu et al., 2020) and dialogue-oriented self-supervised learning (Xu et al., 2021; Zhang and Zhao, 2021; Han et al., 2021). Although significant performance improvements have been achieved by these PLM-based response selection models, they usually suffer from substantial computational cost and high inference latency due to the growing model size, presenting challenges for their deployment in resource-limited real-world applications. Therefore, there is an urgent need to accelerate PLM-based response selection models while maintaining their satisfactory performance.
To accelerate PLM-based multi-turn response selection, one direct idea is to avoid unnecessary calculation when jointly modeling the dialogue context and response. Through empirical observation, we find that there are many unimportant contents that are either redundant (i.e., repeated by many context turns) or less relevant to the topic, especially in lengthy dialogue contexts (Zhang et al., 2018). If accurately identified and appropriately eliminated, removing the unnecessary calculation on them brings minimal performance degradation. Drawing inspiration from Goyal et al. (2020), we propose an inference framework together with a post-training strategy customized for PLM-based multi-turn response selection, where unimportant contents are progressively identified and dropped as the calculation goes from shallow layers to deep ones. In this framework, we seek to answer three research questions (RQs): (1) how to accurately identify these unimportant contents, (2) how to properly decide the intensity of elimination for these contents under various computation demands, and (3) how to eliminate unnecessary calculations on these contents at the minimum cost of performance degradation. Our answers, illustrated in Table 1, are as follows. For RQ1, we propose a dual-attention-based method to measure the relative importance of tokens in the context and response, which accords with our empirical observation. For RQ2, we adopt evolutionary search (Cai et al., 2019) to build the Pareto Frontier of the performance-efficiency map and choose proper retention configurations (i.e., configurations defining how many tokens are passed to the next layer at each layer) from the frontier. For RQ3, we notice the gap between the proposed efficient inference framework and training, and employ knowledge distillation (Hinton et al., 2015) to mitigate this gap by forcing the model with progressively eliminated contents to mimic the predictions of the original model with no content elimination.
We evaluate our proposed method on three benchmarks for multi-turn response selection: Ubuntu (Lowe et al., 2015), Douban (Wu et al., 2017), and E-commerce (Zhang et al., 2018). Experimental results show that our method can accelerate the inference of PLM-based multi-turn response selection models with acceptable performance degradation under various computation constraints, while significantly outperforming previous acceleration methods. We also conduct comprehensive analyses to thoroughly investigate the effectiveness of the proposed components.
We summarize the contributions of this paper as follows: (1) We propose Attend, Select and Eliminate (ASE), an efficient inference framework customized for PLM-based multi-turn response selection models that identifies and progressively eliminates unimportant contents. (2) We propose a knowledge-distillation-based post-training strategy to mitigate the training-inference gap and decrease the performance degradation caused by content elimination. (3) We conduct comprehensive experiments on three benchmarks to verify the effectiveness of our method and demonstrate its superiority over other acceleration methods.

Related Work
Recently, methods based on pre-trained models have become popular. Whang et al. (2020) introduced the next sentence prediction and masked language model tasks of PLMs into the conversation corpus for post-domain training, then treated the context as a long sequence and directly fine-tuned the model to compute context-response matching scores. Xu et al. (2021) introduce self-supervised learning tasks to increase the difficulty of model training, and the results show the effectiveness of these tasks. From the perspective of data augmentation, BERT-FP (Han et al., 2021) splits the context into multiple sets of short context-response pairs and introduces a conversational relevance task, achieving state-of-the-art performance.
Although pre-trained models are powerful, they also bring some problems. The expensive computational cost and high inference latency hinder the further deployment of PLMs to a certain extent. Some works try to alleviate this problem; one branch reduces the model size, e.g., distillation (Jiao et al., 2020; Wang et al., 2021; Liu et al., 2022a,b), structural pruning (Michel et al., 2019; Fan et al., 2019; Gordon et al., 2020; Hou et al., 2020), and quantization (Zafrir et al., 2019; Shen et al., 2020; Zhang et al., 2020a; Bai et al., 2021). Goyal et al. (2020) adopt an attention-based strategy to select important tokens with a fixed length configuration, but the speedup ratio cannot be chosen as needed: one full training run only yields a model with a fixed speedup.
Since the existing method of Goyal et al. (2020) is mainly evaluated on single-sentence or sentence-pair tasks, it is not fully suitable for response selection, where the model needs to understand the relationship among all the utterances in a dialogue session and learn the interaction of the utterances closely related to the response. Therefore, we propose to select and eliminate token representations based on context-to-response and response-to-context attention (i.e., dual-attention, DualA), which makes good use of the context-response relationship.

Task Formulation
Consider a dialogue dataset D = {(c_i, r_i, y_i)}_{i=1}^{n}. Each sample in the dataset is a triple consisting of a context c_i, a response r_i, and a ground-truth label y_i. c_i = {u_1, u_2, ..., u_l} is a dialogue context with l utterances, where {u_j}_{j=1}^{l} are arranged in temporal order. r_i is a response candidate; y_i = 1 indicates that r_i is a proper response for the context c_i, and y_i = 0 otherwise. The core problem of this research is to learn a matching model M(·, ·) that measures the matching degree between a context and a response.
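To make the formulation concrete, the following sketch represents the dataset as (context, response, label) triples and scores each pair with a hypothetical stand-in matcher; `toy_matcher` and its token-overlap scoring are illustrative assumptions, not the learned PLM-based model M(·, ·) described in the paper.

```python
def toy_matcher(context_utterances, response):
    """Score a (context, response) pair in [0, 1] by token overlap.

    A hypothetical stand-in for a learned matching model M(c, r).
    """
    context_tokens = set(tok for u in context_utterances for tok in u.split())
    response_tokens = set(response.split())
    if not response_tokens:
        return 0.0
    return len(context_tokens & response_tokens) / len(response_tokens)

# Dataset D as a list of (c_i, r_i, y_i) triples.
dataset = [
    (["how do i install java", "which version do you need"], "you need java 8", 1),
    (["how do i install java", "which version do you need"], "my cat is hungry", 0),
]

scores = [toy_matcher(c, r) for c, r, _ in dataset]
```

A real matcher would encode the concatenated context and response with a PLM and output a matching probability, but the interface — context and candidate in, score out — is the same.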

Methodology
We aim to accelerate the inference of PLM-based multi-turn response selection models by proposing Attend, Select and Eliminate (ASE), which progressively identifies and eliminates unimportant contents to avoid unnecessary calculations. The overall framework is illustrated in Figure 1. There are three crucial questions that need to be answered: (1) how to accurately identify the unimportant contents, (2) how to properly decide the intensity of content elimination, and (3) how to effectively mitigate the training-inference gap in our framework and decrease the performance degradation. In the remainder of this section, we elaborate on our method by answering these three research questions.

Content Selection
In the specific scenario of multi-turn dialogue, the input consists of a lengthy context with multiple turns and a single-sentence candidate response, and the model aims to measure their semantic similarity.
To achieve this goal, existing PLM-based methods calculate the interaction of all contents without distinction, regardless of the varying importance of contents, many of which are redundant or topic-irrelevant. In order to eliminate them for inference acceleration, we first need to accurately identify them during the encoder forward pass, as in Figure 2.

Empirical Methods
The multi-turn context accounts for a large proportion of the input pair (c_i, r_i), making it a good place to start content selection. For a multi-turn context, the easiest way is to conduct content selection at the sentence level. Empirically, the last few utterances in the dialogue context are closer to the response in the dialogue flow, so they might be more important than the utterances at the beginning. Hence, we can simply select the last k utterances of the original context as the new context (i.e., c_i = {u_j}_{j=l+1-k}^{l}) and concatenate them with the candidate response, a setting we denote as Last_k. Similarly, we can select other context utterances, such as the first k utterances or k randomly selected utterances, which we denote as First_k and Rand_k, respectively.
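The three empirical strategies amount to simple slicing and sampling over the list of utterances. A minimal sketch (function names are our own; the paper does not prescribe an implementation):

```python
import random

def last_k(context, k):
    """Last_k: keep the last k utterances of the context."""
    return context[-k:]

def first_k(context, k):
    """First_k: keep the first k utterances of the context."""
    return context[:k]

def rand_k(context, k, seed=0):
    """Rand_k: keep k randomly chosen utterances, preserving temporal order."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(context)), min(k, len(context))))
    return [context[i] for i in idx]

context = ["u1", "u2", "u3", "u4", "u5"]
selected = last_k(context, 3)  # the new context, to be concatenated with the response
```

Each selected subset would then be concatenated with the candidate response and fed to the matching model.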

Dual-attention-based Content Selection
Although simply adopting empirical methods (e.g., Last_k) yields plausible results, as will be shown in our experiments later, this approach takes all of the last k utterances without distinction, regardless of the varying importance of utterances and tokens. A more reasonable way is to conduct content selection in a more fine-grained manner (i.e., at the token level). Recent works have shown that the importance of a token can be measured by the total attention weight it receives from other tokens (Goyal et al., 2020; Kim and Cho, 2021), which we denote as AM. However, AM treats all tokens in the input sequence equally, neglecting the imbalanced relationships between tokens in the context and the response. Intuitively, for a token in the context, the attention it receives from other context tokens reflects its importance within the context, which we call self-importance, while the attention it receives from response tokens reflects its importance for semantic matching, which we call mutual-importance. Therefore, we propose to disentangle the attention received by a token into two parts: (1) the self-attention within the context or response and (2) the mutual-attention between the context and the response, and to consider them jointly when measuring the importance of a token; we call this method DualA. Specifically, take a token w in the context as an example, as in Figure 2(a). We use the averaged attention weight posed by the response tokens on it as its mutual-importance score, formulated as:

g_{c,mutual}(w) = (1 / (H |T_res|)) Σ_{h=1}^{H} Σ_{w' ∈ T_res} A_h(w', w),   (1)

where T_res denotes the set of tokens belonging to the response, A_h(w', w) represents the attention received by token w from token w' on head h, and H denotes the number of attention heads. For the self-importance of w, we adopt the averaged attention weight posed by the other context tokens on it:

g_{c,self}(w) = (1 / (H |T_ctx|)) Σ_{h=1}^{H} Σ_{w' ∈ T_ctx} A_h(w', w),   (2)

where T_ctx denotes the set of context tokens. We then jointly consider the self-importance and mutual-importance of w via a weighted sum of g_{c,self}(w) and g_{c,mutual}(w):

g_c(w) = α_c g_{c,self}(w) + β_c g_{c,mutual}(w),   (3)

where α_c and β_c satisfy 0 ≤ α_c, β_c ≤ 1 and α_c + β_c = 1 and are the weights for calculating the overall importance score of context tokens. Similarly, we can calculate the overall importance score for tokens in the response, with the only difference lying in the weights α_r and β_r for response tokens:

g_r(w) = α_r g_{r,self}(w) + β_r g_{r,mutual}(w).   (4)

It should be noted that our method can be viewed as a generalization of the typical attention-based importance measurement (Goyal et al., 2020), and can flexibly balance the influence of the self-attention and mutual-attention parts.
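The dual-attention scoring above can be sketched directly from a layer's attention tensor. This is an illustrative NumPy implementation under our own conventions (an `(H, L, L)` tensor where entry `[h, i, j]` is the attention token `i` pays to token `j`); the function name and exact normalization are assumptions, not the authors' code:

```python
import numpy as np

def dual_attention_scores(attn, ctx_idx, res_idx,
                          alpha_c, beta_c, alpha_r, beta_r):
    """Compute DualA importance scores for every token.

    attn    : (H, L, L) attention weights; attn[h, i, j] is the attention
              that query token i pays to key token j on head h.
    ctx_idx : indices of context tokens; res_idx: indices of response tokens.
    Returns a length-L array of importance scores.
    """
    H, L, _ = attn.shape
    scores = np.zeros(L)
    # Context tokens: self-importance from other context tokens,
    # mutual-importance from response tokens (averaged over heads and queries).
    for j in ctx_idx:
        g_self = np.mean([attn[:, i, j].mean() for i in ctx_idx if i != j])
        g_mutual = np.mean([attn[:, i, j].mean() for i in res_idx])
        scores[j] = alpha_c * g_self + beta_c * g_mutual
    # Response tokens: symmetric, with response-specific weights.
    for j in res_idx:
        g_self = np.mean([attn[:, i, j].mean() for i in res_idx if i != j])
        g_mutual = np.mean([attn[:, i, j].mean() for i in ctx_idx])
        scores[j] = alpha_r * g_self + beta_r * g_mutual
    return scores
```

Setting the self weight to 1 and the mutual weight to 0 (or summing over all queries indiscriminately) recovers a plain attention-based score in the spirit of AM, which is the sense in which DualA generalizes it.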

Retention Configuration Search
After establishing the basis for evaluating token importance, the model needs to determine a retention configuration, i.e., to properly decide the intensity of content elimination and how many tokens to keep and pass to the deeper encoder layers.
Given a PLM-based model M(θ) with m encoder layers, where θ denotes the parameters of M, a retention configuration s = (l_1, l_2, ..., l_m) is a monotonically non-increasing sequence in which l_j indicates that l_j tokens are kept from the output of the (j-1)-th encoder layer and passed to the j-th encoder layer. According to s, the model M(θ) keeps and eliminates the corresponding number of tokens at each encoder layer; M(θ) thus attains faster inference, but its performance may degrade.
In theory, there is a combinatorially large number of possible configurations s. Using evolutionary algorithms (Cai et al., 2019), we search for the Pareto Frontier that makes optimal tradeoffs between performance and efficiency and can satisfy various given computation constraints.
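Two ingredients of such a search can be sketched concisely: mutating a retention configuration while preserving its monotonically non-increasing shape, and filtering candidate (performance, cost) points down to the Pareto Frontier. The mutation step size, fitness evaluation, and population mechanics are our own illustrative assumptions, not the paper's search details:

```python
import random

def mutate(config, max_len):
    """Perturb one layer's retention count, then re-impose l_1 >= l_2 >= ... >= l_m."""
    cfg = list(config)
    j = random.randrange(len(cfg))
    cfg[j] = max(1, cfg[j] + random.choice([-8, 8]))
    cfg[0] = min(cfg[0], max_len)
    for i in range(1, len(cfg)):  # enforce monotonic non-increase
        cfg[i] = min(cfg[i], cfg[i - 1])
    return tuple(cfg)

def pareto_front(points):
    """Keep the (performance, cost) points not dominated by any other point.

    A point p is dominated if some other point has performance >= p's
    and cost <= p's.
    """
    return [p for p in points
            if not any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)]
```

A full evolutionary search would repeatedly mutate configurations sampled from the current front, evaluate each candidate's accuracy and FLOPs on the validation set, and re-apply `pareto_front` to the growing pool.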

Training Framework
In the aforementioned sections, we have introduced our accelerated inference framework for PLM-based multi-turn response selection models. Here, we present our training framework. Given a pre-trained language model such as BERT (Devlin et al., 2019), we first adapt it to the task of multi-turn response selection with the SOTA method (i.e., BERT-FP (Han et al., 2021)) on a multi-turn response selection dataset, obtaining the model M(θ). Then we conduct retention configuration search (described in Sec. 4.2) based on our proposed method DualA to obtain a set of optimal retention configurations S*. Now, with the trained model M(θ) and the n retention configurations in S*, we can build n acceleration settings for model inference with various speedup ratios, denoted as G = {M(θ, s_1), ..., M(θ, s_n)}. Although one can directly utilize M(θ, s_j) for faster inference, we argue that there is a gap between training and our proposed accelerated inference framework: the previously trained model M(θ) never encountered inputs whose tokens are progressively eliminated from shallow layers to deep layers. Therefore, we propose to mitigate this training-inference gap with once-for-all self-distillation. Specifically, we fix M(θ) as the teacher and make a copy of it as the student. During self-distillation, the teacher receives the complete inputs without content elimination and produces a probability distribution p_{M(θ)}(c_i, r_i) over whether the response is appropriate to the context. For the student, in order to ensure that it can be customized to all retention configurations S* simultaneously with the same parameters θ*, we randomly sample a configuration s_j and compute its output distribution under the content elimination setting as p_{M(θ', s_j)}(c_i, r_i), which is used to compute the KL-divergence with the teacher's outputs following Hinton et al. (2015):

L_KD = KL( p_{M(θ)}(c_i, r_i) || p_{M(θ', s_j)}(c_i, r_i) ).

After self-distillation, we obtain the adapted model M(θ*) customized for all the searched optimal retention configurations S*, making our final inference acceleration settings G* = {M(θ*, s_1), ..., M(θ*, s_n)} efficient at the minimum cost of performance degradation.
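One once-for-all self-distillation step reduces to a KL-divergence between the teacher's full-input output distribution and the student's output distribution under a randomly sampled retention configuration. A minimal NumPy sketch of that loss computation (the logits here are made-up placeholders standing in for the two forward passes):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Teacher: forward pass on the complete input, no content elimination.
teacher_logits = np.array([2.0, -1.0])   # placeholder logits
# Student: forward pass under a randomly sampled retention configuration s_j.
student_logits = np.array([1.5, -0.5])   # placeholder logits

loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

In training, this loss (combined with the cross-entropy loss, as noted in the experimental settings) is backpropagated only through the student, while the teacher M(θ) stays fixed.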

Dataset
We evaluate our framework on three widely used multi-turn response selection benchmarks: the Ubuntu Corpus (Lowe et al., 2015), the Douban Corpus (Wu et al., 2017), and the E-commerce Corpus (Zhang et al., 2018).

Experimental Settings
We use BERT-FP's trained model to search on the validation set and obtain k (k < 20) different retention-length configurations. We adopt the weighted sum of the distillation loss and the cross-entropy loss as the training objective, running 5 to 8 epochs. We employ the recall rate R_n@k as the evaluation metric. Since some samples in the Douban corpus have more than one true candidate response, we additionally use MAP, MRR, and P@1, following Tao et al. (2019b) and Yuan et al. (2019). For inference efficiency, we use the FLOPs (floating-point operations) speedup ratio relative to the BERT model as the measure, as it is agnostic to the choice of underlying hardware. To avoid pseudo-improvements from pruning padding, we evaluate all models on input sequences without padding to a maximum length (e.g., 256).
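For reference, R_n@k measures the fraction of test sessions whose true response is ranked within the top k of n scored candidates. A short sketch under our own interface assumptions (per-session lists of candidate scores and binary labels):

```python
def recall_at_k(session_scores, session_labels, k, n=10):
    """R_n@k: fraction of sessions whose true response appears in the
    top-k candidates out of n, ranked by model score."""
    hits = 0
    for scores, labels in zip(session_scores, session_labels):
        assert len(scores) == n and len(labels) == n
        top = sorted(range(n), key=lambda i: -scores[i])[:k]
        hits += int(any(labels[i] == 1 for i in top))
    return hits / len(session_scores)
```

R_10@1, the strictest of the reported variants, requires the true response to be ranked first among ten candidates.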

Overall Performance
Table 2 and Figure 3 show the overall comparison with the baselines. With ASE, the performance and efficiency of BERT and BERT-FP are greatly improved. Specifically, BERT-FP+ASE† performs slightly better than BERT-FP on Ubuntu and E-commerce and achieves a significant improvement of 2.0% in P@1 and 1.9% in R_10@1 on Douban. BERT-FP+ASE* achieves comparable performance at double speed on Douban. ASE also gives vanilla BERT significant performance improvements: 9.0% in R_10@1 at 1.4x speed and 5.4% in R_10@1 at 2.3x on E-commerce, and slightly better performance at double speed on Ubuntu and Douban. Details of BERT with ASE are given in the Appendix. Figure 3 compares the effect of combining BERT-FP with three different acceleration methods: ASE, PoWER-BERT, and L-adaptive. With ASE, BERT-FP achieves better results than with the other methods by a large margin, which demonstrates that extracting important tokens based on dual attention is effective for accelerating inference in multi-turn response selection. In contrast, both baselines show a large decline due to their incomplete adaptation to the task.

Discussions
Comparison between different content selection strategies. Intuitively, the latter utterances may be more helpful for multi-turn response selection. We compare several strategies, including the empirical methods (i.e., Last_k, First_k, and Rand_k), the attention-based method AM, and the dual-attention-based method DualA.
Figure 4 shows the results of these strategies with k = 3, 4, and 5 on Ubuntu. Based on the three simple empirical strategies, Last_k, First_k, and Rand_k, the model can already achieve good performance at a certain inference speed. Last_k performs much better than First_k and Rand_k, which validates our hypothesis that the latter utterances in the context are more helpful and more important for selecting appropriate responses. Most importantly, the performance-efficiency tradeoffs of our proposed dual-attention-based strategy are strictly better than those of the other strategies. This result shows that, for faster inference, DualA, a fine-grained token-level selection strategy, is more effective than utterance-level selection for response selection.
The effect of using only the k-th utterance from the end as the context. To understand the effect of utterances at different positions on response selection, we test the performance using only the k-th utterance from the end as context. From the validation set, we first filter out examples whose contexts are too short, keeping the examples whose contexts consist of more than 6, 8, 10, and 12 utterances on Ubuntu. Then, the k-th utterance from the end of the context and the candidate response are concatenated and fed to a trained model for classification. As the experimental results in Figure 5(a) show, the overall performance of the model is relatively low. Even for the last utterance of the context, i.e., the turn immediately preceding the response, the performance is still not high. However, model performance increases rapidly as the utterance position moves closer to the response under all four settings, which means that the closer the utterance is to the candidate response, the better the performance for response selection. This is in line with actual human conversation, where both parties usually respond to each other's most recent utterance.
The distribution of the selected token representations. Under the same retention configuration, the tokens selected by different strategies differ. To better observe which tokens are selected, we divide the dialogue context into three parts: the first, middle, and last thirds of the context. On the Ubuntu IRC V1 corpus, we set the same retention configuration for both strategies and, as the encoder layers deepen, count the distribution of the selected context tokens under AM and DualA.
As Figure 5(b) shows, under AM, which uses the total attention weight a token receives from other tokens to evaluate its importance, the proportion of tokens selected from the last third is slightly higher as the encoder layers deepen, while the first and middle thirds remain basically the same; overall, there is almost no difference in the distribution across the three parts. In contrast, Figure 5(c) shows that under DualA, which is based on the dual-attention between context and response, the percentage of tokens selected from the first third of the context drops sharply as the encoder layers deepen, while the middle and last thirds are largely retained. Only after the ninth encoder layer do the middle and last parts begin to decrease drastically, yet they still exceed the first third. This is consistent with the results in Figure 5(a). To a certain extent, this result shows that when response-to-context attention is used as the query, the response prefers to focus on the middle and last parts of the context; that is, tokens closer to the response provide more help in response selection, although the earlier parts are never discarded entirely.
Hyper-parameter tuning. According to Equation 4, the self-importance g_{r,self} and the mutual-importance g_{r,mutual} contribute differently to token selection. We experiment with the effect of different weights for g_{r,self} and g_{r,mutual} on performance. As shown in Figure 6, the horizontal axis is α/β, the weight ratio of self-importance to mutual-importance when selecting context tokens. As α/β increases, the tokens selected in the context change and the performance gradually improves, reaching its maximum at α/β = 0.25. Consistent with our findings in Figure 4, DualA consistently outperforms AM by a large margin. These results show consistent trends under different speedup ratios, i.e., selecting tokens based on dual-attention is more effective for the response selection task.
The effect of once-for-all self-distillation. After token selection, we compare model performance on Ubuntu with and without self-distillation. Different from traditional distillation methods, our once-for-all self-distillation distills the teacher's knowledge into the student by sampling different retention configurations during training. Figure 7 compares the performance with and without self-distillation. With self-distillation, performance is significantly improved under all retention configurations, especially at large speedup ratios. As the speedup ratio increases, i.e., more tokens are eliminated during inference, model performance starts to degrade, but the improvement from self-distillation also grows. Optimizing all retention configurations in a single training run avoids re-distillation when the configuration varies during actual deployment.
The flexibility of ASE. We demonstrate the flexibility of ASE by applying it on top of vanilla BERT; ASE can be easily integrated with any BERT-like model. We use the bert-base model from Huggingface and fine-tune it on the three benchmarks: Ubuntu, Douban, and E-commerce. Then we apply the dual-attention-based content selection method in Section 4.1.2 to search for the optimal retention configurations and perform self-distillation. Figure 8 shows that ASE boosts BERT performance by 2.0% at 1.1x on Ubuntu and 9.0% at 1.4x on E-commerce.

Conclusion
In this paper, we propose a new framework that progressively extracts important tokens and eliminates redundant tokens to accelerate inference for multi-turn response selection, identifying important tokens based on the dual-attention between the context and response. The experimental results empirically verify the effectiveness of this method. In the future, we plan to accelerate inference further by combining it with layer-wise reduction.

Limitations
During the configuration search stage, because this is a multi-objective optimization problem involving both performance and efficiency, we use an evolutionary algorithm for the search. Designing a robust and efficient optimization objective is not simple, and it affects the convergence of the search results.
Figure 2: (a) The averaged attention weights posed by the blue response part serve as token w's mutual-importance. (b) Between the encoder layers, tokens are selected and sent to the next layer while the rest are eliminated.

Figure 3 :
Figure 3: Model performance-efficiency comparison of BERT-FP equipped with different accelerating methods.

Figure 4 :
Figure 4: Comparison between different content selection strategies without self-distillation on Ubuntu.

Figure 5 :
Figure 5: (a) Effect of using only a single utterance for response selection. (b), (c) The distribution of selected tokens as the encoder layers deepen, based on (b) AM and (c) DualA. Selection strategies use the same configuration on Ubuntu.

Figure 7 :
Figure 7: The effect of once-for-all self-distillation. SD and w/o SD denote with and without self-distillation, respectively.
Algorithm 1: Model Training Steps
Input: PLM (i.e., BERT_base); datasets D_train and D_dev
1: Train BERT_base on D_train to get M(θ) using BERT-FP (Han et al., 2021);
2: Initialize retention set S;

Table 2 :
Table 2: Model comparison on the three benchmarks. BERT-FP is the previous SOTA model. ASE* and ASE† are two representative operating points of our models with different speedup ratios.