How to Represent Context Better? An Empirical Study on Context Modeling for Multi-turn Response Selection

Introduction
Recently, building dialogue systems for open-domain human-machine conversation has attracted more and more attention, due both to the availability of large-scale human conversation data and to powerful models learned with neural networks. Existing work on building a conversational system includes generation-based methods and retrieval-based methods. A generation-based model directly synthesizes a response with a natural language generation method (Shang et al., 2015; Serban et al., 2016), while a retrieval-based model replies to a human input by selecting a proper response from a pre-built index (Lowe et al., 2015; Humeau et al., 2019). In this work, we study the problem of multi-turn response selection for retrieval-based dialogues, since retrieval-based systems are superior in terms of response fluency and informativeness, and play an important role in industrial products.
Real-world dialogues usually comprise multiple turns, where a retrieval model should select the most proper response by measuring the matching degree between the multi-turn dialogue context and a number of response candidates. The key problem is how to make better use of multi-turn context information. Currently, two lines of research on representing the multi-turn dialogue context have emerged. One is to model each turn of utterance individually first and then aggregate a sequence of utterance-response matching features to get a final score (Wu et al., 2017; Zhou et al., 2018; Gu et al., 2019; Yang et al., 2018, 2020; Tao et al., 2019b), which is known as the representation-matching-aggregation paradigm. The other line is to concatenate all turns of utterances into a long sequence first and make them fully interact with each other by RNNs (Lowe et al., 2015; Zhou et al., 2016; Chen and Wang, 2019) or transformer layers (Humeau et al., 2019; Whang et al., 2020; Gu et al., 2020). In particular, recent models based on pre-trained language models (PLMs), such as BERT or SA-BERT (Gu et al., 2020), conduct multi-turn context modeling and response matching in a unified process.
These mainstream methods, whether fully concatenating all utterances or independently encoding each dialogue turn, represent the information of each dialogue element equally and ignore the characteristics of the multi-turn dialogue context, which may lead to sub-optimal context representations and response matching features. Recently, researchers have begun to notice the importance of explicitly modeling the multi-turn dialogue context based on its characteristics, including exploiting the natural sequential relationship between dialogue turns (Zhou et al., 2016) or using the last turn of the dialogue context to guide the modeling of previous turns (Zhang et al., 2018; Yuan et al., 2019). However, there is no systematic comparison that analyzes how to effectively model the multi-turn dialogue context with these characteristics in mind, and no framework that unifies those methods for retrieval-based dialogues.
In this paper, instead of configuring new architectures, we investigate how to improve the performance of existing matching models with better context modeling methods. Following this idea, we heuristically summarize three categories of turn-aware context modeling strategies, which model context messages from the perspectives of sequential relationship, local relationship, and a query-aware manner respectively. To compare those methods, we apply them to several representative response selection models through a Turn-Aware Context Modeling (TACM) layer, which allows different context modeling strategies to be flexibly applied to dialogue models.
To verify the effectiveness of the framework, we choose three representative multi-turn response selection models as our matching models, and conduct experiments on three public data sets: Ubuntu Dialogue Corpus (Lowe et al., 2015), Douban Conversation Corpus (Wu et al., 2017), and E-Commerce Dialogue Corpus (Zhang et al., 2018). Based on a series of experiments, we find that query-aware context modeling is the best strategy and that employing multiple context modeling strategies can consistently improve the performance of response selection. Besides, we also observe that our TACM layer can improve the capability of modeling long contexts. We hope our empirical comparison can shed light on future research along this line of work. Our contributions in this paper are four-fold:
• Three categories of turn-aware context modeling strategies, inspired by inherent characteristics of multi-turn dialogues, are summarized;
• A TACM layer is explored to flexibly adapt and unify these context modeling strategies within advanced response selection models;
• A systematic comparison of different context modeling strategies and their combinations with representative response selection models is conducted on three benchmarks;
• Consistent improvements are brought to various response matching models without heavy machinery, and they generalize easily to downstream dialogue applications.

Related Works
Retrieval-based models design a discriminative model to measure the matching degree between a human input and a response candidate for response selection. Early studies mainly focus on single-turn context-response matching (Wang et al., 2013; Hu et al., 2014; Wang et al., 2015). Recently, researchers have devoted themselves to the multi-turn scenario. Several methods concatenate all turns of utterances into a long sequence first and then make them fully interact with each other by RNNs (Lowe et al., 2015; Zhou et al., 2016; Chen and Wang, 2019) or transformer layers (Humeau et al., 2019; Gu et al., 2020). In addition to these methods, some researchers construct dialogue models with a representation-matching-aggregation paradigm. Such approaches encode each turn of utterance individually first and then aggregate a sequence of utterance-response matching features to get a final score. Representative methods include the sequential matching network (SMN) (Wu et al., 2017), the deep attention matching network (DAM) (Zhou et al., 2018), and the multi-hop selector network (MSN) (Yuan et al., 2019).
As an important problem in dialogue systems, multi-turn context modeling has raised great interest in recent years. Especially for generation-based methods, various models adopt a hierarchical encoder-decoder framework to model all context sentences (Serban et al., 2016, 2017; Xing et al., 2018; Chen et al., 2018). Tian et al. (2017) compare various methods for obtaining a global representation of the context. Zhang et al. (2019) propose ReCoSa, where attention weights between each context and response representation are computed and used in the subsequent decoding process. The problem is less explored in existing retrieval-based methods. Zhang et al. (2018) concatenate the last utterance to other turns and then use Gated Self-Attention to obtain utterance representations. Yuan et al. (2019) use multi-hop selectors to select useful information in the dialogue history. These methods only consider modeling one type of characteristic of the multi-turn dialogue context. Besides, there is no systematic comparison that analyzes how to model context effectively and no framework that unifies those methods. Therefore, we explore how to improve existing models with better context modeling methods in this paper. Specifically, we summarize three categories of turn-aware context modeling strategies and conduct an empirical study on context modeling for multi-turn response selection.

Problem Formalization
Given a data set $\mathcal{D} = \{(y, c, r)_z\}_{z=1}^{N}$, where $c = \{u_1, \ldots, u_{n_c}\}$ represents a conversation context with $n_c$ turns and $u_i$ the $i$-th turn, $r$ is a response candidate, and $y \in \{0, 1\}$ denotes a label with $y = 1$ indicating that $r$ is a proper response for $c$ and $y = 0$ otherwise. The goal of response selection is to learn a matching model $s(\cdot, \cdot)$ from $\mathcal{D}$. For any context-response pair $(c, r)$, $s(c, r)$ gives a score that reflects the matching degree between $c$ and $r$. According to $s(c, r)$, one can rank a set of candidates for response selection.
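As a toy illustration of this ranking view, the sketch below selects a response by sorting candidates under a score function. The word-overlap score is a hypothetical stand-in for a learned $s(\cdot, \cdot)$, not the paper's model.

```python
# Hypothetical sketch: response selection as ranking candidates by s(c, r).
def select_response(context, candidates, score_fn):
    """context: list of utterance strings; returns candidates ranked best-first."""
    return sorted(candidates, key=lambda r: score_fn(context, r), reverse=True)

# Toy stand-in for a learned matching model s(c, r): word overlap.
def toy_score(context, response):
    ctx_words = set(w for u in context for w in u.split())
    return len(ctx_words & set(response.split()))
```

In practice the score function would be a trained neural matching model; the ranking interface is the same.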

Matching with Turn-Aware Context Representation
Most representative context-response matching models follow a representation-matching-aggregation paradigm (Wu et al., 2017; Zhang et al., 2018; Zhou et al., 2018; Tao et al., 2019a; Wang et al., 2019; Yuan et al., 2019). The framework consists of (1) a representation layer that encodes the utterance at each turn individually based on its word-level representations, where each utterance does not explicitly receive contextual information from other turns; (2) a matching layer that lets the context and response interact based on their representations; and (3) an aggregation layer that incorporates the interaction features. The idea of Turn-Aware Context Modeling (TACM) is to embed a new layer before the representation layer of a specific response selection model, where various modeling strategies make each utterance interact with other turns, so that the subsequent individual encoding of each utterance can be aware of important contextual dialogue information from other utterances in the same session. Note that we mainly focus on the TACM layer in this paper, so the definition of the matching architecture follows the original models. Formally, given the $i$-th utterance $u_i = [w_{u_i,k}]_{k=1}^{n_{u_i}}$ in a context and its response candidate $r = [w_{r,k}]_{k=1}^{n_r}$, where $n_{u_i}$ and $n_r$ are the numbers of words in $u_i$ and $r$ respectively, we first represent $u_i$ and $r$ as sequences of word embeddings, namely $U_i^e = [e_{u_i,1}, e_{u_i,2}, \ldots, e_{u_i,n_{u_i}}]$ and $R^e = [e_{r,1}, e_{r,2}, \ldots, e_{r,n_r}]$, where $e \in \mathbb{R}^d$ denotes a $d$-dimensional word embedding.
Then, we propose a Turn-Aware Context Modeling (TACM) layer that takes the word embeddings of all turns of utterances $[U_i^e]_{i=1}^{n_c}$ and models interactions across turns so that each utterance fully interacts with the other turns. Through different categories of TACM modules, each utterance representation can absorb contextual information from other turns in different semantic aspects. Suppose we have $K$ sorts of TACM modules; the computation of the $k$-th module can be formalized as $\tilde{U}_i^k = \phi_k([U_j^e]_{j=1}^{n_c})$, where $\phi_k(\cdot)$ denotes the $k$-th TACM strategy. Now, we can obtain a set of dialogue context representations $\{\tilde{U}_i^k\}_{k=1}^{K}$, which serve as the input of the representation-matching-aggregation paradigm.
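A minimal sketch of this layer, under the assumption that each strategy $\phi_k$ is a callable mapping the list of per-turn embeddings to same-shaped turn-aware views (all names here are illustrative, not the paper's code):

```python
import numpy as np

# Minimal sketch of the TACM layer: K strategies phi_k each map the
# sequence of per-turn embeddings [U_i^e] to turn-aware views of the
# same shapes; every view then feeds the usual
# representation-matching-aggregation pipeline.
def tacm_layer(turn_embeddings, strategies):
    """turn_embeddings: list of n_c arrays, each of shape (n_{u_i}, d).
    strategies: K callables phi_k. Returns K views, each a list of n_c arrays."""
    views = []
    for phi in strategies:
        view = phi(turn_embeddings)
        assert len(view) == len(turn_embeddings)  # one output per turn
        views.append(view)
    return views

# The identity strategy recovers the original (non-turn-aware) model.
identity = lambda turns: [t.copy() for t in turns]
```

Concrete strategies such as "Rolling", "Window", "Highway", or "Weighted" would be plugged in as elements of `strategies`.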
Consistent with these models, $\forall i \in \{1, \ldots, n_c\}$, $\tilde{U}_i^k$ is encoded individually into an Intra-Utterance Representation (IUR) as $\hat{U}_i^k = f_{\mathrm{IUR}}(\tilde{U}_i^k)$, where $f_{\mathrm{IUR}}(\cdot)$ stands for the representation function, which can be an RNN (Wu et al., 2017), a self-attention module (Zhou et al., 2018), or even a fusion network of multiple types of representation functions (Tao et al., 2019a). Similarly, $R^e$ can also be processed into $\hat{R} = f_{\mathrm{IUR}}(R^e)$. Then, an Utterance-Response Matching (URM) layer follows, where $\forall k \in \{1, \ldots, K\}$, $\hat{U}_i^k$ interacts with $\hat{R}$ and is finally matched into matrices with several channels, formalized as $M_i = f_{\mathrm{URM}}(\{\hat{U}_i^k\}_{k=1}^{K}, \hat{R})$, where $f_{\mathrm{URM}}(\cdot, \cdot)$ represents the matching function, which can be a similarity function or an attention-based interaction function (Tao et al., 2019a), and $M_i \in \mathbb{R}^{(K \ast n_M) \times n_{u_i} \times n_r}$ with $n_M$ the number of channels of the matching matrices in the original response selection models.

Figure 2: Sketches of four types of turn-aware context modeling strategies. Q, K, and V denote the query sentence, the key sentence, and the value sentence respectively. $u_{\{1,2,3,4\}}$ represent the utterances in the context in chronological order. For convenience, we only draw four turns of utterances; in other words, $u_4$ is the most recent turn, also referred to as the query utterance. For the rolling representation, we only draw the forward rolling process.
Finally, an Aggregation (AGG) layer is employed to fuse or aggregate $\{M_i\}_{i=1}^{n_c}$, defined as $\hat{y} = f_{\mathrm{AGG}}(\{M_i\}_{i=1}^{n_c})$, where $\hat{y}$ is the final logit. The aggregation process $f_{\mathrm{AGG}}(\cdot)$ also depends on the specific multi-turn response selection model; it may be a 3D convolutional neural network (Zhou et al., 2018) or a 2D convolutional neural network followed by a recurrent neural network (Wu et al., 2017; Yuan et al., 2019) that models the dependencies among different turns on the interaction features rather than on the utterance representations.
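The whole pipeline can be sketched end-to-end with toy stand-ins for $f_{\mathrm{IUR}}$, $f_{\mathrm{URM}}$, and $f_{\mathrm{AGG}}$ (all choices below are illustrative simplifications; real models use RNN/attention encoders, CNN matchers, and learned aggregators):

```python
import numpy as np

# End-to-end sketch of representation-matching-aggregation over K
# turn-aware context views.
def match(context_views, response, f_iur, f_urm, f_agg):
    """context_views: K views, each a list of n_c arrays of shape (n_{u_i}, d).
    response: array of shape (n_r, d). Returns a scalar matching logit."""
    r_hat = f_iur(response)
    matching = []
    for i in range(len(context_views[0])):
        # Stack the K per-turn matching matrices as channels (M_i).
        m_i = np.stack([f_urm(f_iur(view[i]), r_hat) for view in context_views])
        matching.append(m_i)
    return f_agg(matching)

# Toy instantiations of the three functions.
f_iur = lambda x: x                                        # identity "encoder"
f_urm = lambda u, r: u @ r.T                               # dot-product matching matrix
f_agg = lambda ms: float(np.mean([m.sum() for m in ms]))   # scalar logit
```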

Turn-Aware Context Modeling Strategies
We consider the following three categories of strategies that cover sequential context modeling, local context modeling and query-aware context modeling, which are depicted succinctly in Figure 2.
Sequential Context Modeling: Due to the natural sequential relationship between dialogue turns, understanding a subsequent turn of utterance requires the dialogue information flow. Therefore, we propose a "Rolling" strategy to model such temporal relationships, directly injecting representations from previous or following turns of utterances into the current utterance, which is similar to a hierarchical transformer-based architecture. It can capture intra-sentence and inter-sentence connections in a structured and dynamic sequential manner.
As shown in Figure 2(a), at turn $i \in \{1, \ldots, n_c\}$, we compute the rolling representation with $f_{\mathrm{ATT}}(Q, K, V)$, a transformer layer (Vaswani et al., 2017), by letting the current utterance attend to the representations rolled forward (or backward) from the adjacent turns.

Local Context Modeling: As shown in Figure 2(b), $\forall i \in \{1, \ldots, n_c\}$, the window representation of the $i$-th utterance is calculated by $f_{\mathrm{ATT}}$ between the current utterance and its local context, $f_{\mathrm{ATT}}(U_i^e, C_i, C_i)$, where $C_i = [U_\alpha^e; \ldots; U_\beta^e]$ is the concatenation of all utterances around the current $i$-th utterance, $\alpha = \max(1, i - \gamma)$ and $\beta = \min(n_c, i + \gamma)$ denote the two sides of the window, and $\gamma$ is the offset, a hyper-parameter to be tuned.
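The window boundaries of the "Window" strategy can be sketched directly from the definitions of $\alpha$ and $\beta$ (turn indices are 1-based as in the paper; the attention step itself is omitted):

```python
import numpy as np

# Sketch of the local context used by the "Window" strategy: for turn i
# (1-indexed), the window spans alpha = max(1, i - gamma) .. beta = min(n_c, i + gamma).
def local_context(turns, i, gamma=1):
    """turns: list of n_c arrays of shape (n_{u_j}, d); i is 1-indexed.
    Returns C_i, the concatenation of utterances in the window around turn i."""
    n_c = len(turns)
    alpha = max(1, i - gamma)
    beta = min(n_c, i + gamma)
    return np.concatenate(turns[alpha - 1:beta], axis=0)
```

Boundary turns simply get a smaller window, so no padding is needed.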
Query-Aware Context Modeling: Apart from the above two strategies, we also exploit the importance of the last turn of utterance $u_{n_c}$, which is often considered the dialogue query, since most response candidates directly respond to it. For this purpose, we intuitively utilize the dialogue query to capture relevant utterance information in the conversation history. To measure how much each turn of utterance is needed to complement $u_{n_c}$, we propose two different strategies named "Highway" and "Weighted", illustrated in Figure 2(c,d).
Both of them can identify important utterances and capture the implicit relationship of the whole context. The difference is that the former calculates a specific weight for each entry of the representation vector (namely, at the word level), while the latter only assigns a weight to the utterance-level representation.
For the "Highway" representation, first, $\forall i \in \{1, \ldots, n_c\}$, the $i$-th turn of utterance $U_i^e$ is concatenated with the dialogue query $U_{n_c}^e$ to obtain a concatenated representation $U_i^c = [U_i^e; U_{n_c}^e]$, where $U_i^c \in \mathbb{R}^{(n_{u_i} + n_{u_{n_c}}) \times d}$. Then, inspired by Srivastava et al. (2015), the concatenated representation $U_i^c$ and the current utterance representation $U_i^e$ are fed into a Highway Network to fuse both features:
$$o_i = \sigma(W_g U_i^c + b_g), \quad U_i^r = \mathrm{GELU}(W_r U_i^c + b_r), \quad U_i^H = o_i \odot U_i^r + (1 - o_i) \odot U_i^e,$$
where $W_g$, $W_r$, $b_g$, $b_r$ are learnt parameters projecting $U_i^c$ into $\mathbb{R}^{n_{u_i} \times d}$, and $\odot$ denotes element-wise multiplication. GELU (Hendrycks and Gimpel, 2016) is an activation function. The gating unit $o_i \in \mathbb{R}^{n_{u_i} \times d}$ is learnt to regulate the flow of the query-aware information $U_i^r$; $U_i^H$ and $U_i^e$ have the same dimensions.
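A numerical sketch of the highway gating, under a simplifying assumption: instead of the paper's projection from the variable-length concatenation, the dialogue query is mean-pooled and broadcast to each word of $U_i^e$ so that all shapes are fixed (this pooling step, and the names below, are our assumptions, not the paper's exact operator):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Hedged sketch of the "Highway" fusion: a gate o blends query-aware
# information U_r with the original utterance representation U_e.
def highway_fuse(U_e, U_q, W_g, b_g, W_r, b_r):
    """U_e: (n_u, d) current utterance; U_q: (n_q, d) dialogue query.
    W_g, W_r: (2d, d); b_g, b_r: (d,). Returns an (n_u, d) fused representation."""
    q = U_q.mean(axis=0, keepdims=True)                     # pooled query, (1, d)
    U_c = np.concatenate([U_e, np.repeat(q, U_e.shape[0], axis=0)], axis=1)  # (n_u, 2d)
    o = sigmoid(U_c @ W_g + b_g)                            # gate, (n_u, d)
    U_r = gelu(U_c @ W_r + b_r)                             # query-aware info, (n_u, d)
    return o * U_r + (1.0 - o) * U_e                        # same shape as U_e
```

With the gate driven to zero, the fusion falls back to the original utterance representation, which is the defining property of a highway connection.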
For the "Weighted" representation, we first integrate each turn of dialogue information into an utterance-level representation vector $\bar{U}_i$. Then, we calculate the semantic similarity between each turn and the dialogue query through cosine similarity to obtain a relevance score for each utterance. Finally, all turns of utterances are weighted by their relevance scores to get a new weighted representation. At turn $i \in \{1, \ldots, n_c\}$, we can formulate this procedure as:
$$\bar{U}_i = \mathrm{MEAN}(f_{\mathrm{ATT}}(U_i^e, U_i^e, U_i^e)), \quad s_i = \cos(\bar{U}_i, \bar{U}_{n_c}), \quad U_i^W = s_i \cdot U_i^e,$$
where $\mathrm{MEAN}(\cdot)$ represents a mean pooling operation over self-attended word embeddings and $s_i$ is the weight scalar for the $i$-th turn. The weighted representation $U_i^W$ has the same dimension as $U_i^e$.
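The "Weighted" strategy can be sketched as follows; plain mean pooling stands in here for pooling over self-attended embeddings (an assumption made to keep the sketch self-contained):

```python
import numpy as np

# Sketch of the "Weighted" strategy: each turn is scaled by its cosine
# similarity to the dialogue query u_{n_c} (the last turn).
def weighted_strategy(turns):
    """turns: list of n_c arrays of shape (n_{u_i}, d).
    Returns the weighted representations, same shapes as the input."""
    pooled = [t.mean(axis=0) for t in turns]           # utterance-level vectors
    q = pooled[-1]                                     # pooled dialogue query
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = [cos(p, q) for p in pooled]               # s_i per turn
    return [s * t for s, t in zip(scores, turns)]      # U_i^W
```

A turn orthogonal to the query is scaled to (near) zero, while the query turn itself keeps weight close to one.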

Datasets and Evaluation Metrics
We evaluate our methods on three public data sets: Ubuntu Dialogue Corpus (Lowe et al., 2015), Douban Conversation Corpus (Wu et al., 2017), and E-commerce Dialogue Corpus (Zhang et al., 2018).
The first data set we adopt is the Ubuntu Dialogue Corpus (Lowe et al., 2015), a multi-turn English conversation data set constructed from chat logs of the Ubuntu forum. We use the version provided by Xu et al. (2017). The data contain 1 million context-response pairs for training, and 0.5 million pairs each for validation and test. In all three sets, positive responses are human responses, while negative ones are randomly sampled. The ratio of positive to negative is 1:1 in the training set, and 1:9 in both the validation set and the test set. Following Lowe et al. (2015), we employ recall at position k among n candidates ($R_n@k$) as evaluation metrics.
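For clarity, the $R_n@k$ metric can be sketched as follows (a standard formulation; variable names are ours):

```python
# R_n@k: fraction of contexts for which a positive candidate appears
# among the top-k of the n ranked candidates.
def recall_at_k(ranked_labels, k):
    """ranked_labels: list of per-context label lists, each sorted by
    model score (best first), with 1 marking a positive candidate."""
    hits = sum(1 for labels in ranked_labels if any(labels[:k]))
    return hits / len(ranked_labels)
```

With one positive among ten candidates per context (as in the Ubuntu test set), $R_{10}@1$ reduces to top-1 accuracy.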
The second data set is the Douban Conversation Corpus (Wu et al., 2017), a multi-turn Chinese conversation data set crawled from Douban groups.
Apart from the above two data sets, we also choose the E-commerce Dialogue Corpus (Zhang et al., 2018). The data consist of real-world conversations between customers and customer service staff on Taobao, the largest e-commerce platform in China. There are 1 million context-response pairs in the training set, and 10 thousand pairs each in the validation and test sets. Each context in the training and validation sets corresponds to one positive response candidate and one negative response candidate, while in the test set, the number of response candidates per context is 10 with only one of them positive. Human responses are treated as positive responses, and negative ones are automatically collected by ranking the response corpus based on conversation-history-augmented messages using Apache Lucene. Following Zhang et al. (2018), we employ $R_{10}@1$, $R_{10}@2$, and $R_{10}@5$ as evaluation metrics.

Referenced Models
Since the task of retrieval-based dialogue was proposed, many impressive models have emerged. We choose the following as reference baselines: the multi-view matching model (Multi-View) (Zhou et al., 2016), the deep utterance aggregation model (DUA) (Zhang et al., 2018), the multi-representation fusion network (MRFN) (Tao et al., 2019a), the interaction-over-interaction network (IOI) (Tao et al., 2019b), and BERT for response selection (Gu et al., 2020).

Selected Matching Models
Since our proposed TACM layer can be adapted to existing multi-turn context-response matching models, we choose the following three representative models to verify its effectiveness. SMN: Wu et al. (2017) first let each turn of utterance interact with the response and form a matching vector for the pair through CNNs. Then, all of the matching vectors are aggregated with an RNN into a matching score. We select this model as a representative of the representation-matching-aggregation framework, where $f_{\mathrm{IUR}}$ is an RNN encoder, $f_{\mathrm{URM}}$ is an inner-product similarity function, and $f_{\mathrm{AGG}}$ is a 2D CNN followed by an RNN. DAM: Zhou et al. (2018) construct representations of the utterances in the context and the response with stacked self-attention and cross-attention. We select this model as a representative context-response matching model based on the Transformer architecture (Vaswani et al., 2017), where $f_{\mathrm{IUR}}$ is an Attentive Module, $f_{\mathrm{URM}}$ is a similarity function over representations, and $f_{\mathrm{AGG}}$ is a 3D CNN.
MSN: Yuan et al. (2019) first utilize a multi-hop selector to select relevant utterances as the context. Then, the model matches the filtered context with the candidate response and obtains a matching score. We choose this model as the best-performing multi-turn context-response matching model without PLMs on the three benchmarks, where $f_{\mathrm{IUR}}$ is a multi-hop selector network, $f_{\mathrm{URM}}$ is an ensemble of inner-product and cosine similarity functions over self-attention and cross-attention representations, and $f_{\mathrm{AGG}}$ is a 2D CNN followed by an RNN.
It is worth noting that we do not adopt PLMs as the backbone in our main experiments, because they concatenate the multi-turn context and treat the problem as a single-turn scenario. Take BERT as an example: it conducts full interaction over all dialogue turns of utterances for context comprehension (Gu et al., 2020). This interaction is direct, but there may be redundant calculations for multi-turn context, resulting in a large number of parameters. Instead, we put forward a series of heuristic strategies to conduct turn-aware interactions for the multi-turn dialogue context. In subsequent experiments, we find that our model can achieve performance comparable to BERT with one third of the parameters.

Implementation Details
We implement all models with PyTorch (Paszke et al., 2017). Word embeddings are pre-trained with Word2Vec (Mikolov et al., 2013) on the training set of each corpus, and the dimension of word vectors is 200. For fair comparison, we limit the maximum number of utterances in each context to 10 and the maximum number of words in each utterance and response to 50, following Wu et al. (2017), Zhou et al. (2018), and Yuan et al. (2019). Truncation or zero-padding is applied to a context or a response candidate when necessary. All other settings, such as the kernel size of the CNN in the matching module and the dimension of the hidden states of the RNN in the aggregation layer, are consistent with the original papers. The batch size and initial learning rate are also consistent with the default settings of the baselines (SMN, DAM, MSN). We used their public code to reproduce their models, and the results were similar to those reported in the original papers. For more detailed settings of the baselines, please refer to Appendix A.1. Parameters were updated by Adam (Kingma and Ba, 2015). In the "Window" strategy, γ is set to 1. Early stopping on the validation data is adopted as a regularization strategy.
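The truncation and padding convention described here (and detailed in Appendix A.1: keep the last 10 turns of a context, keep the first 50 tokens of each utterance) can be sketched as:

```python
# Sketch of the preprocessing convention: contexts keep their most recent
# turns; utterances/responses keep their leading tokens and are zero-padded.
def clip_context(turns, max_turns=10):
    """Keep the last max_turns turns of a context."""
    return turns[-max_turns:]

def clip_utterance(token_ids, max_len=50, pad_id=0):
    """Keep the first max_len tokens, then zero-pad to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```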

Evaluation Results
Table 1 reports the evaluation results of training with turn-aware context modeling using the "Rolling", "Window", "Highway", and "Weighted" strategies. We can see that all modeling strategies consistently improve the original matching models on all three data sets. The improvement over the corresponding baselines is statistically significant (t-test with p-value < 0.05) on $R_{10}@1$ (the most important evaluation metric in retrieval-based chatbots) and many other metrics. In particular, as SMN and DAM both use non-turn-aware representations, the improvement also demonstrates the effectiveness of turn-awareness. Furthermore, we observe that as the performance of the original model increases (that is, from SMN to DAM to MSN), the improvement brought by the TACM layer gradually decreases. This may be due to the increasing complexity of the original model's utterance-response matching ($f_{\mathrm{URM}}$) and feature fusion ($f_{\mathrm{AGG}}$), which alleviates the missing semantic relationships among different turns of utterances. Besides, it is interesting that a simple SMN with TACM even performs better than DAM (which encodes the context with five self-attention layers) on the Ubuntu and Douban data, although DAM has a more complicated structure.
In addition, we are surprised to find that MSN+TACM achieves comparable or even better performance than BERT on most metrics while using fewer parameters. Note that the numbers of parameters of SMN+TACM, DAM+TACM, and MSN+TACM are 35.1M, 35.7M, and 39.6M respectively, almost 1/3 of BERT's (110M). Such results indicate that the heuristic TACM strategies are lighter and more effective than the full interaction across multiple transformer layers (as in BERT), which may contain many redundant semantic interactions among the dialogue turns, greatly increasing complexity. This experimental phenomenon suggests that a dynamic interaction strategy among dialogue turns in PLMs can be explored in future work.

Further Discussions
Ablation Study. We also conducted comprehensive ablation experiments to explore the improvement brought by the above four TACM strategies on the Ubuntu data with SMN, DAM, and MSN respectively, as demonstrated in Table 2, and report the parameter statistics when each model uses its best strategy in Table 3, where the training environment and hyper-parameters are kept strictly consistent. The results show that we can obtain a significant performance improvement with a considerable increase in model parameters. Comparing the model that uses the best single strategy with the model that uses all strategies, we find that exploiting all strategies further improves the matching model. Intuitively, the representation features obtained by the "Rolling" and "Window" strategies are somewhat redundant, since the effect of the "Window" strategy might be covered by that of "Rolling" because of the update mechanism of recurrent attention. Similarly, "Highway" and "Weighted" capture similar query-aware features, since the query-aware representation of the "Highway" strategy is based on word-level attention, which may cover the utterance-level weighting mechanism of the "Weighted" strategy. To verify this assumption, we conducted an additional group of experiments: for each matching model, we selected and equipped two strategies X and Y, where X is the better-performing strategy of "Rolling" and "Window", and Y is the better-performing strategy of "Highway" and "Weighted". We denote the model as X+TACM_top2 and show the results in Table 2. Interestingly, utilizing the two better strategies (such as "Rolling" and "Highway") leads to only a slight performance drop compared to X+TACM_all, though each representation is useful. In real applications, we can choose either configuration for multi-turn response selection.

Impact of Context Length
We further study how the number of turns influences the performance of different models when the TACM layer is incorporated. Figure 3 shows how the performance of the models changes with respect to different numbers of turns in contexts. We observe a similar trend for all models: performance first increases monotonically until the context length reaches a certain value (9 for all three matching models), and then drops as the context length keeps increasing. The reason might be that when only a few utterances are available in the context, the model cannot capture enough information for matching, but when the context becomes long enough, noise is introduced, as utterances in early history may be irrelevant to the query utterance. Although long contexts (Turn = 10) remain challenging, the gap between models with and without TACM is larger on long contexts than on short ones, indicating that our TACM layers improve the capability of modeling long contexts and yield a higher improvement in matching accuracy on them. Note that the performance gap between MSN and MSN with the TACM layer does not widen obviously as the number of turns increases. The reason might be that the architecture of MSN is complex and already introduces query-aware features for context-response matching. Nonetheless, MSN with TACM still significantly outperforms MSN, which confirms the effectiveness of our framework. We provide more empirical studies of TACM, including a comparison between PLM-based interaction and heuristic interaction, case visualization, and an analysis of hyper-parameter sensitivity, in Appendices A.2, A.3, and A.4 respectively.

Conclusion
This paper investigates how to improve the performance of existing matching models with better context modeling methods. Empirical results on three benchmarks indicate that query-aware context modeling is the best strategy and that employing multiple context modeling strategies can consistently improve the performance of existing response selection models. Additionally, our TACM layer improves the capability of modeling long contexts.

Limitations
Besides its merits, our framework still has a few limitations that could be explored in future work. On the one hand, although we try our best to summarize the existing context modeling strategies into three categories, there may still be hybrid or complex methods that cannot be directly categorized. On the other hand, although our methods have been shown to be effective for retrieval-based dialogue models, they also seem reasonable for generative approaches, which needs to be investigated in future work.
We hope our results could encourage future work on addressing these limitations to further explore context modeling for multi-turn response selection.

A Appendix

We are also curious about how to adapt turn-aware context modeling strategies to existing PLMs. Take BERT as an example: all utterances in the context and the response candidate are concatenated into a single token sequence with special tokens separating them, which converts multi-turn context understanding into a single-turn scenario and makes context interaction non-turn-aware (Gu et al., 2020). Thus, we cannot directly use the strategies introduced in Section 3.3. However, we can still borrow the idea and validate the effectiveness of the aforementioned context interaction patterns on PLMs by masking part of the input sequence at the turn level in each transformer layer. Figure 4 depicts the visualization of turn-aware masking. Specifically, for all strategies, [CLS] can attend to any other token, and [SEP] is treated consistently with the utterance it follows. To simplify our exposition, we operate on the attention matrix $A \in [0, 1]^{n_s \times n_s}$ of the self-attention mechanism, where $n_s$ is the length of the concatenated input sequence of BERT. To distinguish the query, key, and value of BERT's attention mechanism from the dialogue query $u_{n_c}$, we denote them as Q, K, and V. For Sequential Context Modeling, each turn of utterances can only see the previous turns (here we only test the forward rolling process) and the other turns are masked out. The value for the $i$-th Q word and $j$-th K/V word in the attention mask matrix is:
$$A_{ij} = \begin{cases} 1, & T(j) \le T(i) \\ 0, & \text{otherwise,} \end{cases}$$
where $T(i)$ and $T(j)$ stand for the turn ids of the $i$-th Q word and $j$-th K/V word respectively. For Local Context Modeling, each turn of utterances can only attend to the two adjacent turns of utterances (here we only consider γ = 1, consistent with the "Window" strategy).

Table 1 :
Evaluation results on three data sets. Numbers marked with * mean that the improvement is statistically significant compared with the corresponding baseline (t-test with p-value < 0.05).
Responses are randomly sampled. The ratio of positive to negative is 1:1 in the training and validation sets. In the test set, each context has 10 response candidates retrieved from an index, whose appropriateness regarding the context is judged by human annotators. The average number of positive responses per context is 1.18.

Figure 3: Performance of models (with or without TACM) across different lengths of contexts on Ubuntu.

Table 3 :
Experimental parameter statistics when the model uses the best strategy on Ubuntu data. ↑ stands for the growth rate.
A.1 Settings of the Experiments for Baselines

We used public code to reproduce all three baselines (SMN, DAM, MSN), and the results were similar to those reported in the original papers. Specifically, we limited the maximum number of utterances in each context to 10 and the maximum number of words in each utterance and response to 50. Following Wu et al. (2017), Zhou et al. (2018), and Yuan et al. (2019), we padded zeros if the number of turns in a context was less than 10; otherwise, we kept the last 10 turns. If the length of an utterance or a response candidate exceeded the limit, we kept only the first tokens, assuming that the most important part is given first; otherwise, we padded zeros at the end. Word embeddings were initialized by Word2Vec (Mikolov et al., 2013) run on the training data, and the dimension of word vectors is 200. The Adam algorithm (Kingma and Ba, 2015) was used for all baselines. For SMN, the window size of the CNN was (3, 3), the initial learning rate was 0.001, the batch size was 200, and the hidden sizes of the two GRUs were 200 and 50. For DAM, the number of stacked self-attention layers was 5, the learning rate was initialized as 1e-3 and gradually decreased during training, and the batch size was 256. For MSN, the dimension of the hidden states of the GRU was 300, the learning rate was also initialized as 1e-3 and gradually decreased during training, and the batch size was 200, 150, and 200 on the Ubuntu, Douban, and E-commerce corpora respectively.

Figure 4: Transformer attention masks for different turn-aware context modeling strategies in BERT; white indicates absence of attention. (a) Sequential Context Modeling; (b) Local Context Modeling; (c) Query-Aware Context Modeling. For convenience, we only draw five turns of utterances; in other words, $n_c = 5$ and $u_5$ is the query utterance. For Sequential Context Modeling, we only test the forward rolling process. For Local Context Modeling, we only consider γ = 1, consistent with the "Window" strategy.

Table 4: Performance of the BERT models with heuristic context interaction on Ubuntu data. * means the results are copied from Gu et al. (2020).