DialogConv: A Lightweight Fully Convolutional Network for Multi-view Response Selection

Current end-to-end retrieval-based dialogue systems are mainly based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models often suffer from slow inference or a huge number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture, called DialogConv, for response selection. DialogConv is built exclusively on top of convolution to extract the matching features of context and response. Dialogues are modeled in 3D views: DialogConv performs convolution operations on the embedding view, word view and utterance view to capture richer semantic information from multiple contextual views. On four benchmark datasets, compared with state-of-the-art baselines, DialogConv is on average about 8.5x smaller in size, and 79.39x and 10.64x faster on CPU and GPU devices, respectively. At the same time, DialogConv achieves competitive response selection effectiveness.


Introduction
An important challenge in building intelligent dialogue systems is the response selection problem, which aims to select an appropriate response from a set of candidates given a dialogue context. Such retrieval-based dialogue systems have attracted great attention from academia and industry due to the informative and fluent responses they produce (Tao et al., 2021).
Existing methods typically model the dialogue in one of three flat patterns: the Separate Pattern, the Concatenated Pattern, and the PrLM Pattern (Figure 1). The PrLM Pattern (i.e., Figure 1 (c)) uses special symbols to connect all utterances into a continuous sequence, similar to the Concatenated Pattern. While the PrLM Pattern has obtained state-of-the-art performance in response selection (Cui et al., 2020; Gu et al., 2020; Liu et al., 2021), this line of methods, which has the Transformer (Vaswani et al., 2017) as its de facto standard architecture, suffers from a large number of parameters and heavy computational cost. Very large models not only increase training costs, but also prevent researchers from iterating quickly. At the same time, slow inference hinders the development and deployment of dialogue systems in real-world scenarios.
Furthermore, these three patterns treat dialogue contexts as flat structures (Li et al., 2021). Methods based on such flat structures usually capture the sequential features of text by considering each word as a unit. However, previous work (Lu et al., 2019) revealed that given a multi-turn dialogue (e.g., Figure 2 (a)), the context of the dialogue can exhibit a composition of 3D stereo structures. Existing methods (Gu et al., 2019; Zhou et al., 2016; Gu et al., 2020) only extract features based on the flat structures, and cannot simultaneously capture complex features from such stereoscopic views.
In this paper, we propose a lightweight fully convolutional network model, called DialogConv, without any RNN or attention modules, for multi-view response selection ('fully' here means DialogConv is built exclusively on CNNs). Different from previous studies (Zhou et al., 2016; Gu et al., 2019, 2020; Li et al., 2021) which model the dialogue in a flat view, DialogConv models the dialogue context and response together in the 3D space of the stereo views, i.e., the embedding view, word view, and utterance view (as shown in Figure 2 (d)). In the embedding view, word-level features are refined through convolution operations on the plane formed by the word sequence dimension and the utterance dimension. In the word view, global conversation features are captured by concatenating all words into a continuous sequence, and the features of each utterance are refined by performing convolution on each utterance. In the utterance view, the dependency features between different local contexts are distilled by performing convolution across different utterances. In general, DialogConv can simultaneously extract features of different granularities from the stereo structure.
DialogConv is based entirely on CNNs, which require much fewer parameters and computing resources. DialogConv has on average 12.4 million parameters, which is about 8.5× smaller than other models. Its inference speed is on average about 79.39× faster on CPU devices and 10.64× faster on GPU devices than existing models. Moreover, DialogConv achieves competitive results on four benchmarks and performs even better when pretrained with contrastive learning. In summary, we make the following contributions: (i) we propose DialogConv, a lightweight fully convolutional architecture that models the dialogue in 3D stereo views (embedding, word, and utterance views); (ii) DialogConv is substantially smaller and faster than strong baselines on both CPU and GPU; and (iii) DialogConv achieves competitive response selection results on four benchmark datasets and improves further with contrastive pretraining.

Related Work

Retrieval-based Dialogue System
Most existing retrieval-based dialogue systems (Wu et al., 2017; Gu et al., 2019; Liu et al., 2021) focus on matching between the dialogue context and the response. These methods attempt to mine deep semantic features through sequence modeling, e.g., using attention-based pairwise matching mechanisms to capture interaction features between the dialogue context and the candidate response. However, previous research (Sankar et al., 2019; Li et al., 2021) shows that these methods fail to fully exploit the conversation history. In addition, methods based on recurrent neural networks suffer from slow inference due to the nature of recurrent structures. Although Transformer-based methods (Vaswani et al., 2017) get rid of this weakness of recurrent structures, they are usually plagued by a large number of parameters (Wu et al., 2019), making the training and inference of Transformer-based models computationally expensive.
In this paper, we propose a multi-view approach that models the dialogue context with a fully convolutional structure, yielding a lightweight model that is smaller and faster than most existing methods.

Convolutional Neural Networks (CNN)
For the past few years, CNNs have been the go-to model in computer vision. The main reason is that CNNs enjoy the advantage of parameter sharing and are better at modeling local structures.
A large number of excellent CNN-based architectures have been proposed (Krizhevsky et al., 2012; He et al., 2016; Dai et al., 2021). For text processing, convolutional structures are good at capturing local dependencies of text and are faster than RNNs (Hochreiter and Schmidhuber, 1997). Therefore, some studies (Wu et al., 2016; Lu et al., 2019; Yuan et al., 2019) employ convolutional structures to aggregate the matching features between dialogue contexts and responses. However, these works usually need to combine CNNs with attention mechanisms or an RNN skeleton. Furthermore, these studies treat the dialogue context as a flat structure. In this paper, we propose a novel fully convolutional architecture that extracts matching features from stereo views and can simultaneously extract features of different granularities from different views.

Problem Formulation
In this paper, an instance in the dialogue dataset is represented as $(C, y)$, where $C = (u_1, u_2, \ldots, u_{t-1}, r)$ consists of the dialogue context $(u_1, u_2, \ldots, u_{t-1})$ and the response $r$, $u_i$ is the $i$-th utterance, and $y \in \{0, 1\}$ is the class label of $C$. As the core of a retrieval-based dialogue system, response selection aims to build a discriminator $g(C)$ on $(C, y)$ that measures the matching degree between the dialogue context and the response.
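To make the formulation concrete, the following minimal sketch (illustrative only; the field names are our own, not from any released code) shows how one such instance could be represented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    """One (C, y) pair: t-1 context utterances plus one candidate response."""
    context: List[str]   # u_1, ..., u_{t-1}
    response: str        # candidate response r
    label: int           # y in {0, 1}; 1 means r is an appropriate response
```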

Fully Convolutional Matching
We propose a fully convolutional encoder for multi-view response selection. The multiple views include the embedding view, word view, and utterance view.
In the embedding view, the convolution operations are performed on the plane formed by the word sequence dimension and the utterance dimension, and word-level features can be extracted through nonlinear transformations between different embeddings.
In the word view, global dialogue context features are captured by convolving over a contiguous sequence that connects all words, and features of each utterance are obtained by performing convolution on each utterance. In the utterance view, DialogConv captures the dependency features between different local contexts composed of adjacent utterances. Figure 3 shows an overview of our proposed DialogConv, which consists of six layers: (i) embedding layer; (ii) local matching layer; (iii) context matching layer; (iv) discourse matching layer; (v) aggregation layer; (vi) prediction layer.
Symbol Definition: The embedding layer uses a pretrained word embedding model to map each word in $C$ to a vector space. We stack $C$ chronologically into a 3D tensor $G \in \mathbb{R}^{t \times \ell \times d}$, where $d$ is the dimension of the word embeddings, $\ell$ is the length of an utterance, and $t$ is the number of utterances including the response. $G$ is the input to DialogConv. We use $\mathrm{Conv2D}^{v}_{k \times s}$ and $\mathrm{Conv1D}^{v}_{w}$ to denote the convolution operations, where $\mathrm{Conv2D}^{v}_{k \times s}$ is a two-dimensional convolution with a kernel size of $k \times s$, $\mathrm{Conv1D}^{v}_{w}$ is a one-dimensional convolution with a kernel size of $w$, and $v$ denotes a specific view. We describe the details of the remaining layers in the following subsections.
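As a shape sanity check, the following PyTorch sketch builds $G$ and applies one embedding-view convolution; the sizes, and the choice of treating the $d$ embedding channels as Conv2d input channels, are our illustrative assumptions rather than the exact configuration:

```python
import torch
import torch.nn as nn

t, l, d = 10, 50, 200                    # hypothetical sizes
G = torch.randn(1, t, l, d)              # batch of stacked utterance embeddings

# Embedding view: embedding channels become Conv2d input channels and the
# kernel slides over the (t, l) plane formed by utterances and words.
conv2d_e = nn.Conv2d(in_channels=d, out_channels=d, kernel_size=(1, 1))
out = conv2d_e(G.permute(0, 3, 1, 2))    # (1, d, t, l)
print(out.shape)                         # torch.Size([1, 200, 10, 50])
```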

Local Matching Layer
The local matching layer is responsible for extracting features of each utterance. The local matching stage combines features from the embedding and word views. Firstly, we employ $1 \times 1$ convolutions in the embedding view and the word view, respectively:

$$G_1 = \sigma(\mathrm{Conv2D}^{e}_{1 \times 1}(G)), \qquad G_2 = \sigma(\mathrm{Conv2D}^{w}_{1 \times 1}(G_1)) \quad (1)$$

where $\sigma(\cdot)$ stands for the GELU activation function (Hendrycks and Gimpel, 2016). The $1 \times 1$ convolution attends to the information of the current element itself and does not consider the influence of the local context. The features of individual words are thus extracted from the embedding and word views by $1 \times 1$ convolution. Multi-scale convolution (Szegedy et al., 2015; Gao et al., 2019) has been shown to be effective for extracting local features. Therefore, we use a $k_1 \times s_1$ convolution in the word view and a $1 \times 1$ convolution in the embedding view to capture the matching features of each utterance:

$$G_3 = \sigma(\mathrm{Conv2D}^{w}_{k_1 \times s_1}(G_2)), \qquad G_4 = \sigma(\mathrm{Conv2D}^{e}_{1 \times 1}(G_3)) \quad (2)$$

Note that we focus on the features of a single utterance in the local matching layer.
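The sketch below shows one plausible reading of this layer (not the authors' implementation): tensors are laid out as (batch, d, t, ℓ), and the word-view kernel is taken as (1, s1) so that its window stays inside a single utterance:

```python
import torch.nn as nn

class LocalMatching(nn.Module):
    """Pointwise 1x1 refinement followed by a small window along the word
    axis of each utterance, in the spirit of Eqs. (1)-(2)."""
    def __init__(self, d, s1=3):
        super().__init__()
        self.act = nn.GELU()
        self.point_e = nn.Conv2d(d, d, kernel_size=1)   # embedding view, 1x1
        self.point_w = nn.Conv2d(d, d, kernel_size=1)   # word view, 1x1
        # (1, s1) keeps the receptive field within one utterance; the exact
        # k1 x s1 geometry in the paper may differ from this assumption.
        self.local_w = nn.Conv2d(d, d, kernel_size=(1, s1), padding="same")
        self.point_out = nn.Conv2d(d, d, kernel_size=1)

    def forward(self, g):                               # g: (b, d, t, l)
        g = self.act(self.point_w(self.act(self.point_e(g))))
        return self.act(self.point_out(self.act(self.local_w(g))))
```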

Context Matching Layer
The context matching layer is responsible for extracting matching features of the global dialogue context. Firstly, we flatten $G_4$ into a two-dimensional tensor $G_5 \in \mathbb{R}^{(t \times \ell) \times d}$. This is equivalent to concatenating all utterances in chronological order into one continuous sequence of words. Then, we convolve across the word sequence with a kernel size of $w_1$ in the embedding view and a kernel size of $w_2$ in the word view:

$$G_6 = \sigma(\mathrm{Conv1D}^{e}_{w_1}(G_5)), \qquad G_7 = f_{\mathrm{reshape}}(\sigma(\mathrm{Conv1D}^{w}_{w_2}(G_6)))$$

where $f_{\mathrm{reshape}}$ is a function that reshapes the output of the convolution back to a 3D tensor, so that $G_7 \in \mathbb{R}^{t \times \ell \times d}$. The features of the global dialogue context can be aggregated by nonlinear transformations between the words of the concatenated sequence. These global context features are the basis for extracting the dependency features between different local contexts.
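A corresponding sketch, under the same assumed (batch, d, t, ℓ) layout and with hypothetical kernel sizes w1 = w2 = 5:

```python
import torch.nn as nn

class ContextMatching(nn.Module):
    """Flatten (t, l) into one word sequence of length t*l, convolve across
    utterance boundaries, then reshape back (the role of f_reshape)."""
    def __init__(self, d, w1=5, w2=5):
        super().__init__()
        self.act = nn.GELU()
        self.conv_e = nn.Conv1d(d, d, kernel_size=w1, padding="same")
        self.conv_w = nn.Conv1d(d, d, kernel_size=w2, padding="same")

    def forward(self, g4):                 # g4: (b, d, t, l)
        b, d, t, l = g4.shape
        g5 = g4.reshape(b, d, t * l)       # concatenated word sequence
        g7 = self.act(self.conv_w(self.act(self.conv_e(g5))))
        return g7.reshape(b, d, t, l)      # back to the 3-D layout
```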

Discourse Matching Layer
The discourse matching layer is responsible for capturing the dependencies between different local contexts composed of adjacent utterances. Modeling dependency features is beneficial for capturing changes in implicit topics, intentions, etc. in the dialogue context, which is important for choosing the correct response. We employ orthogonal convolutions to extract dynamic dialogue flow features across utterances and capture discourse-level changes:

$$G_8 = \sigma(\mathrm{Conv2D}^{u}_{1 \times s_2}(G_7)), \quad G_9 = \sigma(\mathrm{Conv2D}^{u}_{s_2 \times 1}(G_8)), \quad G_{10} = \sigma(\mathrm{Conv2D}^{u}_{1 \times 1}(G_9))$$

where the $1 \times s_2$ convolution and the $s_2 \times 1$ convolution are called orthogonal convolutions because the directions of their convolution kernels are perpendicular. The $1 \times s_2$ convolution is responsible for forming a semantic flow based on the context-level features of a single utterance, and the $s_2 \times 1$ convolution extracts discourse structural features along the depth of the dialogue. Finally, we integrate the features of the utterances through the $1 \times 1$ convolution.
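A sketch of the orthogonal convolutions under the same assumptions (s2 = 3 is a hypothetical kernel size):

```python
import torch.nn as nn

class DiscourseMatching(nn.Module):
    """A (1, s2) kernel slides along words inside each utterance; the
    orthogonal (s2, 1) kernel slides across utterances at the same word
    position, following the depth of the dialogue; a 1x1 kernel integrates."""
    def __init__(self, d, s2=3):
        super().__init__()
        self.act = nn.GELU()
        self.along_words = nn.Conv2d(d, d, kernel_size=(1, s2), padding="same")
        self.across_utts = nn.Conv2d(d, d, kernel_size=(s2, 1), padding="same")
        self.mix = nn.Conv2d(d, d, kernel_size=1)

    def forward(self, g):                  # g: (b, d, t, l)
        g = self.act(self.along_words(g))
        g = self.act(self.across_utts(g))
        return self.act(self.mix(g))
```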

Aggregation Layer
The aggregation layer is responsible for obtaining high-level semantic information by integrating the matching features from the previous layers. First, we use max-pooling to obtain the sentence representations $G_{11} \in \mathbb{R}^{t \times d}$. Then, we employ two layers of convolution to extract matching features along the embedding dimension in the embedding view and along the depth of the dialogue in the utterance view, respectively:

$$G_{12} = \sigma(\mathrm{Conv1D}^{e}_{w_3}(G_{11})), \qquad G_{13} = \sigma(\mathrm{Conv1D}^{u}_{w_4}(G_{12}))$$

where $w_3$ and $w_4$ are the convolution kernel sizes. We again apply a max-pooling operation to $G_{13}$ to obtain the final context representation $O$.
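The aggregation step can be sketched as follows (again with an assumed (batch, d, t, ℓ) layout and hypothetical kernel sizes):

```python
import torch.nn as nn

class Aggregation(nn.Module):
    """Pool over words to get per-utterance vectors, mix them along the
    dialogue depth with 1-D convolutions, then pool again to obtain O."""
    def __init__(self, d, w3=3, w4=3):
        super().__init__()
        self.act = nn.GELU()
        self.conv_e = nn.Conv1d(d, d, kernel_size=w3, padding="same")
        self.conv_u = nn.Conv1d(d, d, kernel_size=w4, padding="same")

    def forward(self, g10):                # g10: (b, d, t, l)
        g11 = g10.max(dim=-1).values       # (b, d, t) utterance vectors
        g13 = self.act(self.conv_u(self.act(self.conv_e(g11))))
        return g13.max(dim=-1).values      # final representation O: (b, d)
```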

Self-supervised Pre-training
As a lightweight neural structure, DialogConv can be further improved by a pretraining strategy that uses a small corpus. While masked language model pretraining (Devlin et al., 2019; Lan et al., 2020) usually requires large-scale corpora, self-supervised contrastive learning can generally learn representations from a relatively small corpus. Therefore, we employ contrastive learning to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). Given a set of paired examples $\mathcal{D} = \{(x_i, x_i^+)\}$, where $x_i$ is the dialogue context $C$ and $x_i^+$ is the correct response, we adopt the previous contrastive learning framework (Liu and Liu, 2021) and employ a cross-entropy objective, where the negatives $x_{ij}^-$ include responses with $y = 0$ and in-batch negatives (Chen et al., 2017). The training objective is:

$$\mathcal{L}_i = -\log \frac{e^{\mathrm{sim}(x_i, x_i^+)/\tau}}{e^{\mathrm{sim}(x_i, x_i^+)/\tau} + \sum_{j} e^{\mathrm{sim}(x_i, x_{ij}^-)/\tau}} \quad (12)$$

where $\tau$ is a temperature hyperparameter, $x_{ij}^-$ represents the $j$-th negative example of $x_i$, $x_i^+$ represents the positive example of $x_i$, and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity.
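A minimal sketch of Eq. (12), assuming the context and candidate representations have already been encoded into fixed-size vectors:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(ctx, pos, negs, tau=0.007):
    """ctx: (b, d) context vectors; pos: (b, d) positive responses;
    negs: (b, n, d) negatives (y=0 responses plus in-batch negatives)."""
    cand = torch.cat([pos.unsqueeze(1), negs], dim=1)           # (b, 1+n, d)
    sim = F.cosine_similarity(ctx.unsqueeze(1), cand, dim=-1)   # (b, 1+n)
    labels = torch.zeros(ctx.size(0), dtype=torch.long, device=ctx.device)
    return F.cross_entropy(sim / tau, labels)  # positive sits at index 0
```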

Experiments and Results
The baselines are described in Appendix A.1.

Datasets
We conduct extensive experiments on four public datasets: (i) Ubuntu Dialogue (Ubuntu) (Lowe et al., 2015); (ii) Multi-Turn Dialogue Reasoning (MuTual) (Cui et al., 2020); (iii) Douban Conversation Corpus (Douban) (Wu et al., 2016); (iv) E-commerce Dialogue Corpus (ECD) (Zhang et al., 2018). Ubuntu consists of 1 million context-response pairs for training, 0.5 million pairs for validation, and 0.5 million pairs for testing. The ratio of positive to negative examples is 1:1 for training, and 1:9 for validation and testing. Douban consists of 1 million context-response pairs for training, 50k pairs for validation, and 10k pairs for testing. Response candidates are retrieved from Sina Weibo and labeled by human judges. ECD contains 1 million context-response pairs for training, 10k pairs for validation, and 10k pairs for testing, and covers five different types of conversations (e.g., commodity consultation, logistics express, recommendation, negotiation and chitchat) over more than 20 commodities. MuTual is the first human-labeled reasoning-based dataset for multi-turn dialogue; it contains 7,088 context-response pairs for training, 886 pairs for validation, and 886 pairs for testing. The ratio of positive to negative examples is 1:3 in the training, validation and test sets.

Evaluation Metrics
We follow previous research (Zhang and Zhao, 2021) and use the evaluation metric $R_n@k$ to measure model performance on Ubuntu, Douban and ECD; it calculates the proportion of true positive responses among the top-$k$ responses selected from a list of $n$ available candidates for a context. In addition, the traditional metrics MAP (Mean Average Precision) (Baeza-Yates and Ribeiro-Neto, 1999) and MRR (Mean Reciprocal Rank) (Voorhees et al., 1999) are employed on Douban. Following previous work (Liu et al., 2021), we use recall at position 1 of 4 candidates (R@1), recall at position 2 of 4 candidates (R@2) and MRR on the MuTual dataset. The Ubuntu, Douban and ECD test sets provide ten candidate responses, while MuTual provides four, which is why the datasets adopt different evaluation metrics.
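For concreteness, here is a minimal sketch of these ranking metrics (our illustration; it assumes the single positive candidate is stored at column 0, which matches Ubuntu and ECD but not Douban, where MAP must account for multiple positives):

```python
import numpy as np

def recall_at_k(scores, k):
    """scores: (num_contexts, n) model scores; positive at column 0."""
    ranks = (-scores).argsort(axis=1)
    return float(np.mean([0 in row[:k] for row in ranks]))

def mrr(scores):
    """Mean reciprocal rank of the positive candidate."""
    ranks = (-scores).argsort(axis=1)
    pos_rank = np.array([np.where(row == 0)[0][0] + 1 for row in ranks])
    return float(np.mean(1.0 / pos_rank))
```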
Implementation Details

Self-supervised Pre-training: We conduct small-scale pretraining through contrastive learning on the training sets of the downstream tasks, such as Ubuntu and Douban. Negative instances include not only the negative examples provided by the dataset, but also candidate responses from other instances in the same batch. We use the Stochastic Gradient Descent (SGD) optimizer (Bottou, 2012) in the self-supervised pretraining phase. We set the batch size to 128, the learning rate to 0.001, and the temperature $\tau$ to 0.007.

Fine-tuning: During the fine-tuning phase, we train DialogConv and the other models using the Adam optimizer (Kingma and Ba, 2015). The learning rates are initialized to 1e-3, 5e-4, 1e-4, 5e-5 and 1e-5 via a multi-step strategy. The batch size is set to 32 for the MuTual dataset and 64 for the other datasets. The values of the above hyperparameters are all tuned on the validation set.

Results of Effectiveness
Tables 1 and 2 report the test results of DialogConv and all compared models on the four datasets. While DialogConv does not always achieve the best performance, it attains near-optimal results in most cases. Furthermore, we calculate the significance level (p < 0.05) of DialogConv compared to BERT base (i.e., BERT-12-768), which shows that the results of DialogConv are credible. As shown in Table 1, DialogConv outperforms most classic models such as DUA and DAM, and achieves performance comparable to MRFN on the Ubuntu dataset. DialogConv also outperforms lightweight variants of BERT such as DBERT-6-768 (i.e., DistilBERT-6-768) and TBERT-6-768 (i.e., TinyBERT-6-768). When pretrained with contrastive learning, DialogConv performs close to BERT-12-768 and even outperforms it on R10@2. On the Douban dataset, DialogConv is 2.3% lower than the best result on R10@1; however, the pretrained DialogConv achieves near-optimal results. In Table 2, compared to BERT-12-768, DialogConv has a large advantage of 21.7% on R10@1 and 7.5% on R10@2 on ECD, and is much better than the other BERT variants. We discuss this phenomenon in the Result Analysis and Discussion section. DialogConv also outperforms classic retrieval-based dialogue models such as DAM and MSN, and is close to some lightweight BERT variants such as DBERT-6-768 and BERT-4-512. Compared to BERT-12-768, DialogConv is 2.6% lower on R@1 on MuTual. We believe this lower performance on MuTual is caused by a limitation of DialogConv itself, which we discuss in detail in the Result Analysis and Discussion section.

Model Efficiency
To measure the complexity of our base model, we report the actual inference time on CPU and GPU, as well as the number of parameters, in Table 3. DialogConv has a huge speed advantage over the other models, whether on CPU or GPU. For example, on the Ubuntu dataset, DialogConv is 4.19× to 115.67× faster on CPU and 2.39× to 11.64× faster on GPU than the other models. On all four benchmark datasets, the inference speed of DialogConv is on average 79.39× faster on CPU and 10.64× faster on GPU compared to the other models. Overall, the gain in inference speed ranges from 1.97× to 40.61× on GPU and from 3.47× to 697.00× on CPU. The CPU and GPU devices are described in the Implementation Details subsection above.
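For reference, inference latency of this kind is typically measured along the following lines (a sketch, not the paper's timing script; the warm-up and iteration counts are placeholders, and CUDA requires explicit synchronization because kernel launches return asynchronously):

```python
import time
import torch

def time_inference(model, batch, device, warmup=10, iters=100):
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):            # warm up caches / lazy init
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters   # seconds per batch
```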
The average number of parameters of DialogConv on the four benchmark datasets is 12.4 million, which is 2.8× larger than BERT-2-128, 1.1× larger than BERT-4-256, and comparable to TBERT-4-312. However, DialogConv has clear advantages in performance and inference time over these models. Compared to TBERT-6-768 and DBERT-7-768, the average number of parameters of DialogConv is 4.9× and 5.1× smaller, respectively. Compared with BERT-12-768, the average number of parameters of DialogConv on the four datasets is 8.5× smaller. Compared to the classic models DUA, DAM, IOI and MSN, DialogConv needs approximately 3.5× fewer parameters. Overall, DialogConv achieves promising results in both performance and inference time while relying on generally fewer parameters. The main reason is that the convolutional structure enjoys the advantage of parameter sharing, which gives DialogConv fewer parameters than models based on RNNs or attention mechanisms.

Ablation Study

We can observe that each submodule plays a vital role in DialogConv. Specifically, the local matching layer captures the features of each utterance by mixing the features from the embedding and word views. The context matching layer updates the matching features based on the entire dialogue context and response. The discourse matching layer extracts the dependencies between different local contexts composed of adjacent utterances. Comparatively, the local matching layer seems to have slightly less impact on model performance than the other layers. We conjecture that this layer can only extract local features to some extent, since convolution is better at capturing local features.

Result Analysis and Discussion
BERT-12-768 is the representative BERT base version among the BERT variants, so we use it as the basic comparison model. In Table 2, DialogConv has an absolute advantage of 21.7% on R10@1 and 7.5% on R10@2 compared with BERT-12-768 on ECD. We believe there are three main reasons for this phenomenon. First, DialogConv focuses on matching and can extract matching features from stereoscopic views. We visualize the convolution results of each layer of DialogConv as a heatmap (Figure 4 in Appendix A.2). According to the heatmap, DialogConv can capture key matching features between the dialogue context and the response. The local matching layer mainly focuses on features between words. This is because we use 1 × 1 convolutions in the conv@1 and conv@2 layers, so matching features appear between several overlapping words in these two layers. When we use larger convolution kernels, DialogConv starts to focus on matching features between phrases. A similar phenomenon can be observed in the context matching layer. We can see that after the local and global features are extracted by the discourse matching layer, some important features are clearly captured. Second, for ECD, the average overlap of keywords between response and context reaches about 40%, which helps DialogConv extract matching features from the stereo views. Third, ECD is a dataset in the e-commerce domain. The domain-specific performance of BERT-12-768 may be mediocre because its pretraining corpora are domain-agnostic.
In Table 2, DialogConv achieves relatively weak performance on MuTual. We believe the main reason is that MuTual is a human-labeled reasoning dataset for multi-turn dialogues, whereas DialogConv focuses on matching between the dialogue context and response and lacks reasoning ability. Therefore, DialogConv cannot make correct predictions on reasoning-oriented examples in MuTual. Figure 5 (in Appendix A.2) shows the convolutional heatmap visualization of DialogConv on MuTual. According to the heatmap, DialogConv erroneously focuses on the features of "and", "their" and "in" in the dialogue context, and does not consider "5" and "7" as key features.
DialogConv can effectively exploit the dependencies between different local contexts composed of adjacent utterances. To reveal its capabilities in this regard, we perturb the dialogue structure by randomly shuffling the utterances of the dialogue context and report the results in Table 6 and Table 7 (in the Appendix). We can see that the performance of DialogConv degrades to varying degrees on the four benchmark datasets. Specifically, the performance drops by 12.9% R10@1 on Ubuntu, 12% R10@1 on Douban, 17.7% R10@1 on ECD, and 7.4% R@1 on MuTual. We speculate that the dialogue structure encodes the dependencies between different local contexts, which are important for multi-turn response selection. When the dialogue structure is perturbed, the dependencies between local contexts are severely broken, resulting in the performance degradation of DialogConv.
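A minimal sketch of this perturbation probe (illustrative; the exact shuffling procedure used in the paper is not published, and the seed handling is an assumption):

```python
import random

def perturb_context(context_utterances, seed=0):
    """Randomly shuffle the context utterances, breaking the dialogue
    structure while keeping the bag of utterances unchanged."""
    rng = random.Random(seed)
    shuffled = list(context_utterances)   # copy; the response stays in place
    rng.shuffle(shuffled)
    return shuffled
```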

Conclusion
In this paper, we propose DialogConv, a multi-view lightweight architecture based exclusively on CNNs. DialogConv performs convolutions on the embedding, word, and utterance views to capture matching features. Experimental results show that DialogConv has fewer parameters, is faster, and requires less computing resources, while achieving competitive results on four benchmark datasets. DialogConv provides a valuable reference for deploying dialogue systems in real-world scenarios.

Limitations
Although our work achieves competitive results with less computing resources, we acknowledge some limitations of our study. Firstly, DialogConv focuses on matching, which results in insufficient reasoning ability. Therefore, DialogConv has much room for improvement in dialogue reasoning performance (on the MuTual dataset). Secondly, we did not explore the performance of a deep DialogConv. Our study mainly focuses on designing a lightweight model, leaving aside a potential heavy-duty DialogConv backed by large-scale training corpora. We will explore the performance potential of a deep DialogConv in future work.

Figure 1: Flat modeling. (a) is the separate pattern, (b) is the concatenated pattern and (c) is the PrLM pattern. Grey bars in (c) are embedded representations of special symbols.

Figure 2: Stereo view modeling. (a) An example of multi-turn dialogue; (b) Features from different views; (c) A schematic diagram of the stereo view; (d) Convolution on different views ((1) convolution in the embedding view; (2) convolution in the word view; (3) convolution in the utterance view).

Figure 3: Overview of our DialogConv. The conv@i symbol represents the i-th convolution operation.

Figure 4 and Figure 5 demonstrate example convolutional heatmap visualizations for each layer of DialogConv on the ECD and MuTual datasets, respectively. Table 5 shows the Chinese original and English translation of an example from ECD. We obtain the heatmaps in Figures 4 and 5 by visualizing the similarity matrix between the response and the dialogue context. The larger a value in the similarity matrix, the brighter the corresponding cell in the visualization, and the more important the corresponding word. According to Figure 4, DialogConv captures key features in the dialogue context and response such as "not" (不是 or 非), "quality" (质量), and "problem" (问题). We can conclude that DialogConv makes decisions based on the matching features between the dialogue context and responses. According to Figure 5, DialogConv mistakenly considers "their", "them", "see", and "i" as important features and ignores the key features "5" and "7" in the dialogue context.

Figure 4: An example visualization heatmap from ECD. The conv@i represents the i-th convolution operation in Figure 3. The horizontal axis represents the dialogue history, and the vertical axis represents the response. For the English translation, refer to Table 5.

Table 1: Results on the Ubuntu and Douban datasets. The first, second and third groups of models belong to the Concatenated Pattern, Separate Pattern and PrLM-based Pattern, respectively. DialogConv* represents the performance when pretraining with contrastive learning. Bold indicates the best result, and underline indicates the second best result. X represents the number of layers and Y the hidden size of the model in BERT-X-Y, DBERT-X-Y and TBERT-X-Y. TBERT stands for TinyBERT, and DBERT stands for DistilBERT. The '-' indicates no corresponding BERT version is available.

Table 2: Results on the ECD and MuTual datasets. The '-' indicates no corresponding BERT version is available.

Table 3: Comparison of model inference time and the scale of parameters. "m" ("s") stands for minutes (seconds). The numbers of parameters of the Chinese and English BERT differ because their vocabularies differ. The '-' indicates no corresponding BERT version is available.