Context-Aware Interaction Network for Question Matching

Impressive milestones have been achieved in text matching by adopting a cross-attention mechanism to capture pertinent semantic connections between two sentence representations. However, regular cross-attention focuses on word-level links between the two input sequences, neglecting the importance of contextual information. We propose a context-aware interaction network (COIN) to properly align two sequences and infer their semantic relationship. Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information when aligning two sequences, and (2) a gate fusion layer to flexibly interpolate aligned representations. We apply multiple stacked interaction blocks to produce alignments at different levels and gradually refine the attention results. Experiments on two question matching datasets and detailed analyses demonstrate the effectiveness of our model.


Introduction
Semantic text matching is among the most fundamental tasks in natural language processing. Given two sentences, the goal is to predict their semantic relationship. In this work, we focus in particular on question matching (QM) benchmarks.
Recently, the availability of large-scale annotated datasets has led to a proliferation of deep neural architectures for text matching (Williams et al., 2018; Chen et al., 2017; Wang et al., 2017). Most existing neural models fall into two categories, namely the sentence encoding and the sentence interaction approaches (Lan and Xu, 2018). The former encodes sentences as fixed-length vector representations, which are then consulted to make the final prediction. The latter considers interactions between two sequences to identify their semantic connections, which tends to yield better results.

Figure 1: The original attention mechanism (left) and the proposed context-aware attention (right). w_* represents the two sequences (more generally, they can be regarded as query and key); C_* denotes the contextual features.
Attention mechanisms are widely adopted in sentence-interaction approaches, relying on a word-by-word attention matrix to obtain alignment information between two sequences. This has proven fruitful in modeling sentence pair relationships (Parikh et al., 2016; Rocktäschel et al., 2015; Wang and Jiang, 2016). Nonetheless, when computing the cross-sentence attention, existing models mostly focus on word-level local matching and fail to fully account for the overall semantics: each value of the attention matrix is based on just two individual tokens from the sequences, without full consideration of the context. As shown in Figure 1, in the original attention mechanism, each token individually attends to the other tokens without accounting for important contextual information. However, accurate matching may require a deeper understanding of the two sentences along with pertinent linguistic patterns and constructions (Storks et al., 2019). Yang et al. (2019a) show that contextualizing the self-attention network may improve the original representations, but they do not consider the scenario of sentence pairs with cross-attention.
In this work, we aim to generalize the notion of cross-sentence attention by enabling it to incorporate rich contextual signals. We propose a COntext-aware Interaction Network (COIN) with a novel context-aware attention layer. This layer enables the model to consult contextual information while computing the cross-attention matrix that measures word relevance, yielding better contextualized alignments for semantic reasoning. We leverage self-alignment over each sequence to produce contexts that represent salient features for each token. The subsequent gate fusion layer enables the model to selectively integrate the aligned representations and to control to what extent the new information is passed to the following layers; similar to a skip connection, this mitigates the additional model complexity introduced by the deeper structure. Finally, an aggregation layer and a multi-head pooling layer are adopted to infer high-level semantic representations for the sequences, and the result is predicted based on these refined representations.
To validate the effectiveness of our method, we conduct extensive experiments on the Quora and LCQMC datasets, along with further analyses of the model components and a case study visualizing the alignments. The results show that by incorporating rich context into the cross-attention, our model outperforms state-of-the-art methods without requiring the huge number of parameters or the pre-training on extrinsic data that BERT models rely on.

Method
Question matching can be viewed as a classification task that seeks a label y ∈ Y = {DUPLICATE, NON-DUPLICATE} for a given sentence pair (S^a, S^b). Figure 2 illustrates our novel sentence interaction approach for this task. In the following, we describe the individual ingredients of this approach.

Input Representation Layer
The input representation layer converts each sentence into matrix representations with an embedding and encoding layer. We invoke word embeddings without additional lexical features and adopt a multi-layer convolutional encoder on top of the embedding layer. In addition, we concatenate the contextual representations with the original embeddings to produce better alignments in the following interaction blocks. This serves a similar purpose as skip connections to represent words at different levels (Wang et al., 2018).
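As a concrete illustration, the following is a minimal PyTorch sketch of such an input layer; the layer sizes, number of convolutional layers, and class name are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Embed tokens, encode them with a multi-layer convolutional encoder,
    and concatenate the encoded features with the original embeddings."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150,
                 num_layers=2, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        layers, in_dim = [], emb_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size,
                                 padding=kernel_size // 2), nn.ReLU()]
            in_dim = hidden_dim
        self.encoder = nn.Sequential(*layers)

    def forward(self, token_ids):                  # (batch, seq_len)
        emb = self.embedding(token_ids)            # (batch, seq_len, emb_dim)
        enc = self.encoder(emb.transpose(1, 2)).transpose(1, 2)
        return torch.cat([emb, enc], dim=-1)       # skip-connection-style concatenation

# Toy usage: a batch of two 5-token sentences.
layer = InputRepresentation(vocab_size=1000)
print(layer(torch.randint(1, 1000, (2, 5))).shape)  # torch.Size([2, 5, 450])
```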

Context-aware Interaction Block
Our proposed interaction block consists of a context-aware cross-attention and a gate fusion layer. Several such interaction blocks are stacked to obtain refined alignments.

Cross-Attention Layer
We first review the original cross-sentence attention before introducing our context-aware form of attention. Assume the two inputs of the current layer are H^a = (h^a_1, ..., h^a_m) and H^b = (h^b_1, ..., h^b_n), where m and n are the corresponding sequence lengths. The word-by-word attention matrix is first calculated as

E_{ij} = F_1(h^a_i)^T F_1(h^b_j),  (1)

where F_1 is a feed-forward neural network. The similarity matrix E is then used to compute aligned representations of each sequence as a weighted summation with regard to the other sentence:

\tilde{h}^a_i = \sum_{j=1}^{n} \frac{\exp(E_{ij})}{\sum_{k=1}^{n} \exp(E_{ik})} h^b_j,  \quad  \tilde{h}^b_j = \sum_{i=1}^{m} \frac{\exp(E_{ij})}{\sum_{k=1}^{m} \exp(E_{kj})} h^a_i.  (2)

Limitation. It is evident from Eq. 1 that each value of the attention matrix is governed by the parameters of the feed-forward layer applied to a single token pair, so the layer does not take advantage of valuable contextual signals.
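For illustration, here is a minimal PyTorch sketch of this standard cross-attention; the single hidden layer used for F_1 is an assumption for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_attention(h_a, h_b, f1):
    """Word-by-word cross-attention (Eqs. 1-2): score every token pair with a
    shared feed-forward projection, then align each sentence against the other
    via softmax-weighted sums."""
    e = torch.matmul(f1(h_a), f1(h_b).transpose(1, 2))       # (batch, m, n) matrix E
    aligned_a = torch.matmul(F.softmax(e, dim=2), h_b)       # S_a aligned to S_b
    aligned_b = torch.matmul(F.softmax(e, dim=1).transpose(1, 2), h_a)  # S_b aligned to S_a
    return e, aligned_a, aligned_b

# Toy usage with a one-layer feed-forward network as F_1.
d = 8
f1 = nn.Sequential(nn.Linear(d, d), nn.ReLU())
h_a, h_b = torch.randn(2, 5, d), torch.randn(2, 7, d)
_, tilde_a, tilde_b = cross_attention(h_a, h_b, f1)
print(tilde_a.shape, tilde_b.shape)  # torch.Size([2, 5, 8]) torch.Size([2, 7, 8])
```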

Context-Aware Cross-Attention Layer
We propose a novel context-aware cross-attention layer that incorporates contextual representations into the cross-attention. The goal is to enable the model to identify salient contextual features for each token and to consider these features when computing the cross-attention matrix E. Given C^a = (c^a_1, ..., c^a_m) and C^b = (c^b_1, ..., c^b_n) as contextual representations for the two sentences, we modify the attention mechanism from Eq. 1 so that it draws on these contextual vectors as additional inputs when computing the word-by-word attention matrix. By incorporating the contextual vectors, the model is able to take advantage of the full context and obtain better alignments.

Contextual Representations. To compute such representations of the contexts, we adopt a self-alignment layer over each sequence to aggregate pertinent contextual information. Each contextual vector is computed by attending to the input hidden states H = (h_1, ..., h_n) and taking a weighted summation, where the attention scores are governed by a trainable parameter W_c. Leveraging self-alignment to produce contextual signals also mirrors human behavior: when matching two sentences, people tend to first process each sentence while paying attention to its important content, and then compare the two sentences and connect relevant elements (words or phrases) with their contextual features to identify the relationship, rather than comparing individual words in isolation.
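Since the exact parameterization is not reproduced here, the following PyTorch sketch makes two explicit assumptions: the contextual vectors are obtained with a bilinear self-alignment governed by W_c, and each token state is concatenated with its contextual vector before the feed-forward projection that scores the attention matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareCrossAttention(nn.Module):
    """Sketch of context-aware cross-attention: self-aligned contextual vectors
    are consulted alongside the token states when computing the word-by-word
    attention matrix (assumed combination: concatenation)."""

    def __init__(self, dim):
        super().__init__()
        self.w_c = nn.Linear(dim, dim, bias=False)               # self-alignment weight W_c
        self.f1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def self_align(self, h):                                      # (batch, n, dim)
        scores = torch.matmul(self.w_c(h), h.transpose(1, 2))     # (batch, n, n)
        return torch.matmul(F.softmax(scores, dim=-1), h)         # contextual vectors C

    def forward(self, h_a, h_b):
        c_a, c_b = self.self_align(h_a), self.self_align(h_b)
        q = self.f1(torch.cat([h_a, c_a], dim=-1))                # token + context (S_a)
        k = self.f1(torch.cat([h_b, c_b], dim=-1))                # token + context (S_b)
        e = torch.matmul(q, k.transpose(1, 2))                     # context-aware matrix E
        aligned_a = torch.matmul(F.softmax(e, dim=2), h_b)
        aligned_b = torch.matmul(F.softmax(e, dim=1).transpose(1, 2), h_a)
        return aligned_a, aligned_b

layer = ContextAwareCrossAttention(dim=8)
a, b = layer(torch.randn(2, 5, 8), torch.randn(2, 7, 8))
print(a.shape, b.shape)  # torch.Size([2, 5, 8]) torch.Size([2, 7, 8])
```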

Gate Fusion Layer
Subsequently, a gate fusion layer compares the original sequences against the aligned representations and blends them together into new sequence representations. Specifically, we first compare the original representation H^a with the aligned one \tilde{H}^a from three perspectives and combine the comparison results with a non-linear transformation. A gated connection is then applied to let the model selectively integrate the aligned features, where the gate is computed with a sigmoid non-linearity σ and trainable parameters W_* and b_g. The same operation is conducted on sentence S^b, yielding the updated representations H^a and H^b. With these operations, the model can flexibly interpolate the aligned information by controlling the gate, which is especially useful when multiple interaction blocks are applied.
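A sketch of one plausible instantiation of this layer is given below; it assumes the three comparison perspectives are the aligned representation itself, the difference, and the element-wise product, which may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Sketch of the gate fusion layer: compare the original and aligned
    representations, fuse them with a non-linear transformation, then gate how
    much of the fused signal replaces the original (skip-connection style)."""

    def __init__(self, dim):
        super().__init__()
        self.compare = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU())
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h, aligned):
        # Assumed comparison perspectives: aligned, difference, element-wise product.
        fused = self.compare(torch.cat([h, aligned, h - aligned, h * aligned], dim=-1))
        g = torch.sigmoid(self.gate(torch.cat([h, fused], dim=-1)))   # gate in (0, 1)
        return g * fused + (1.0 - g) * h                               # gated interpolation

fusion = GateFusion(dim=8)
print(fusion(torch.randn(2, 5, 8), torch.randn(2, 5, 8)).shape)  # torch.Size([2, 5, 8])
```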

Aggregation Layer
To obtain high-level semantic representations for each sentence, we apply another convolutional neural network on top of the interaction blocks to obtain the aggregated sentence representations V^a and V^b, serving as the inputs for the prediction layer.

Pooling and Prediction Layer
We compute a weighted summation of the hidden states to obtain sentence vectors. To allow the model to represent each sequence in different representation subspaces, we adopt multi-head pooling following Liu and Lapata (2019). For each head z, we first transform the sequence into attention scores and values,

s^z_i = \frac{\exp(W^z_a h_i)}{\sum_k \exp(W^z_a h_k)},  \quad  v^z_i = W^z_u h_i,

where W^z_a ∈ R^{1×d} and W^z_u ∈ R^{d_h×d} are trainable parameters, with d_h = d/n_h as the dimensionality of each head and n_h as the number of heads. The pooling vector of head z is computed as

head^z = \sum_i s^z_i v^z_i,

where s^z_i and v^z_i denote the calculated attention scores and values. The pooling vectors of all heads are concatenated to form the final vector representation of each sequence, V^a and V^b. We combine V^a and V^b into an overall representation V by concatenating the results of several combination operations applied to the two vectors. Finally, the prediction layer takes the representation V and passes it through a fully-connected network component to predict the target scores.

Table 1: Results on LCQMC (Acc. % / F1 %) — ESIM (Chen et al., 2017): 82.0 / 84.0; BiMPM (Wang et al., 2017): 83.3 / 84.9; GMN (Chen et al., 2020): 84… / …

Experimental Setup

For LCQMC, following Li et al. (2019), we avoid word segmentation and instead use a randomly initialized character embedding matrix. The kernel size is 3 for convolutional layers with padding. We tune the dimensionality of the feed-forward layers from 150 to 300. The batch size is tuned from 32 to 128. Adam optimization is used with an initial learning rate of 0.001 and exponential decay. We use ReLU (Glorot et al., 2011) as the activation function in all feed-forward networks. To prevent over-fitting, dropout with a retention probability of 0.8 is applied. We apply 3 context-aware interaction blocks for Quora and 2 interaction blocks for LCQMC. For BERT (Devlin et al., 2019), we choose the BERT-base version (12 layers, 768 hidden dimensions, and 12 attention heads). Further training details are given in the appendix.
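Returning to the pooling and prediction layer above, here is a minimal PyTorch sketch of the multi-head pooling; the final combination of V^a and V^b shown at the end (concatenation with absolute difference and element-wise product) is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPooling(nn.Module):
    """Sketch of multi-head pooling (after Liu and Lapata, 2019): every head
    scores each position, projects head-specific values, and returns the
    attention-weighted sum; head outputs are concatenated."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.w_a = nn.Linear(dim, num_heads, bias=False)   # one score per head and position
        self.w_u = nn.Linear(dim, dim, bias=False)          # head-specific values

    def forward(self, h):                                    # (batch, n, dim)
        b, n, _ = h.shape
        scores = F.softmax(self.w_a(h), dim=1)               # softmax over positions
        values = self.w_u(h).view(b, n, self.num_heads, self.head_dim)
        pooled = (scores.unsqueeze(-1) * values).sum(dim=1)  # (batch, heads, head_dim)
        return pooled.reshape(b, -1)                          # (batch, dim)

pool = MultiHeadPooling(dim=8, num_heads=2)
v_a, v_b = pool(torch.randn(2, 5, 8)), pool(torch.randn(2, 7, 8))
# Assumed combination of the two sentence vectors before the prediction layer.
v = torch.cat([v_a, v_b, torch.abs(v_a - v_b), v_a * v_b], dim=-1)
print(v.shape)  # torch.Size([2, 32])
```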

Experimental Results
We compare our model against recent prior work, including state-of-the-art neural models and BERT-based methods. ESIM (Chen et al., 2017) and BiMPM (Wang et al., 2017) are two strong sentence-interaction baselines. GMN (Chen et al., 2020) is a neural graph matching network with multi-granular input information. DIIN (Gong et al., 2017) extracts semantic features from the interaction space. OSOA-DFN uses multiple original semantics-oriented attention mechanisms, and RE2 (Yang et al., 2019b) adopts richer features in the alignment process to improve performance. ESAN (Hu et al., 2020) is a sentence-interaction model with gated feature augmentation. For pretrained methods, we consider BERT (Devlin et al., 2019) and SBERT (Reimers and Gurevych, 2019). We also include ensemble results for our method, obtained as the majority vote over the predictions of 5 runs of the same model with different random parameter initializations.

Table 2: Results on Quora.

Model                                 Acc. (%)   Params
BiMPM (Wang et al., 2017)             88.2       1.6M
DIIN (Gong et al., 2017)              89.0       4.4M
CAFE (Tay et al., 2018)               88.7       4.7M
OSOA-DFN                              89.0       10.0M
RE2 (Yang et al., 2019b)              89.2       2.8M
ESAN (Hu et al., 2020)                89.3       3.9M
Enhanced-RCNN                         89.3       7.7M
COIN (ours)                           89.4       6.5M
BERT (Devlin et al., 2019)            90.1       109.5M
SBERT (Reimers and Gurevych, 2019)    90.6       109.5M
COIN (ensemble)                       90.7       32.5M

Results on LCQMC are listed in Table 1. Our single model achieves better accuracy and F1-score than all non-pretrained baselines, and the results of COIN are fairly comparable to BERT despite not being pretrained on any extrinsic data. In fact, our ensemble model (5 runs) even outperforms BERT.
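As a minimal illustration of the ensemble strategy, the majority vote over the labels predicted by the individual runs can be computed as follows (the data here are toy values).

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-example labels from several runs by majority vote."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Toy labels from 5 runs of the same model on 4 examples (1 = DUPLICATE).
runs = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1]]
print(majority_vote(runs))  # [1, 0, 1, 1]
```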
The results on Quora are given in Table 2. Our approach outperforms the non-pretrained baselines with 89.4% test accuracy, and our ensemble model again achieves better results than BERT and SBERT with far fewer parameters (32.5M vs. 109.5M). This underscores our model's suitability for real-world scenarios that demand lower computational complexity and a smaller model footprint.
Overall, the above results on two question matching datasets reflect our model's effectiveness at capturing semantic interactions between the sentences and properly inferring their relationship. In-depth analyses of the model's efficiency are given in the appendix.

Model Analysis
Effect of Model Components. In Table 3, we study the contribution of different model components. Without context in the cross-attention, the accuracy decreases by 0.5 and 0.6 percentage points, respectively. This confirms that, by incorporating the context, our model can better capture sentence relationships in the alignments. We then replace the gate fusion with a simplified fusion layer, where we feed the concatenation of the two representations to a feed-forward network, and observe a performance drop on both datasets. This shows the effectiveness of our context-aware interaction blocks. We then remove the aggregation layer, finding that the accuracy decreases to 89.2% and 84.9%. This confirms that the aggregation layer is useful for producing high-level representations for the final prediction. In the last ablation, we replace the multi-head pooling with max-pooling to produce the sentence vector, and the results decrease on both datasets.

Effect of Interaction Block Depth. Figure 3 plots the accuracy with varying numbers of interaction blocks. Evidently, a small number of interaction blocks may not suffice to fully capture the sentence relationships, and adding further blocks may improve the model's ability to reason across the sequences and boost performance. However, increasing the depth of interaction more than necessary harms the performance. Additionally, there is a trade-off between performance and efficiency, since adding more interaction blocks increases the number of parameters. For computational cost reasons, we use at most three interaction blocks in our experiments.

Case Study.
We analyze the context-aware interaction results by visualizing the attention to show how the model learns aligned features at different levels of interaction in Figure 4. We consider a sample from Quora with the target label DUPLICATE.
The left image shows the contextualized cross-attention in the first interaction block. Aided by the context, the model learns to correctly align the salient phrase "new macbook pro" across the inputs. The attention results in the third interaction block are visualized in the right image. As we can observe, the model refines the alignment results with a sharper distribution on the salient phrases than in the first interaction block, and the structured phrase "what do you think of" is also connected. The model thus predicts the relationship between the two sentences correctly. This corroborates our model's ability to gradually refine and adjust the attention scores in higher layers.

Conclusion
In this work, we propose a context-aware interaction network for question matching. We improve the cross-attention by incorporating contextual cues, and further leverage a gate fusion layer to flexibly integrate the aligned features. Experiments on two datasets validate the effectiveness of our architecture and show that accounting for the context enhances the original cross-attention.

A Experiment Details
Data Statistics. Statistics of the datasets are given in Table 4. For LCQMC, we follow the same data split as in the original work (Liu et al., 2018), and for Quora we use the same split as Wang et al. (2017).

Preprocessing. We apply a hard cut-off on the sentence length for both datasets by cropping or padding. Recent work has shown that character-based models typically outperform word-based models on Chinese NLP tasks (Li et al., 2019), so we apply character-based modeling for LCQMC. For Quora, we set the maximum length to 32, and for LCQMC we set it to 50. We mask the padding tokens during the experiments.
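A small Python sketch of this cut-off and masking step (the helper name and padding id are illustrative):

```python
def crop_or_pad(token_ids, max_len, pad_id=0):
    """Apply a hard length cut-off: crop longer sequences, pad shorter ones,
    and return a mask that marks the real (non-padding) positions."""
    ids = token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))
    mask = [1] * min(len(token_ids), max_len) + [0] * max(0, max_len - len(token_ids))
    return ids, mask

# The maximum length is 32 for Quora and 50 for LCQMC (character level).
ids, mask = crop_or_pad([42, 7, 15], max_len=8)
print(ids)   # [42, 7, 15, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```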
Embedding Details. For Quora, we use 300-dimensional GloVe Common Crawl 840B word embeddings (Pennington et al., 2014) and keep the weights fixed during training. For LCQMC, following Li et al. (2019), we avoid word segmentation and instead use a randomly initialized character embedding matrix. We set the dimensionality of the character embeddings to 200 and train the weights. For sentence preprocessing, we tokenize and lowercase all words. For efficiency and more generalizable results, we do not incorporate any additional lexical features in our experiments.
Training Details. The kernel size is 3 for convolutional layers with padding. We apply 2 layers of convolutional encoding and 1 layer of convolutional aggregation in all experiments. We tune the dimensionality of the feed-forward layers from 150 to 300 and the number of interaction blocks from 2 to 4. The batch size is tuned from 32 to 128. We use ReLU (Glorot et al., 2011) as the activation function in all feed-forward networks. To prevent over-fitting, dropout with a retention probability of 0.8 is applied. Cross-entropy serves as the loss function during training. Adam optimization is used with an initial learning rate of 0.001, with β_1 set to 0.9 and β_2 to 0.999, and exponential learning-rate decay is applied. Moreover, we add L2 regularization and set the threshold for gradient clipping to 5. We apply 3 context-aware interaction blocks for Quora and 2 interaction blocks for LCQMC. We implement our model using TensorFlow (Abadi et al., 2016) and train it on NVIDIA Tesla V100 and NVIDIA Tesla P4 GPUs. For BERT (Devlin et al., 2019), we choose the BERT-base version (12 layers, 768 hidden dimensions, and 12 attention heads) and fine-tune the model using the official implementation (https://github.com/google-research/bert). The Chinese pre-trained BERT is adopted from https://huggingface.co/bert-base-chinese. For SBERT (Reimers and Gurevych, 2019), we use the original implementation and add a softmax classifier on top of the output of the two Transformer networks, as in the original paper.
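For illustration, the optimization settings above translate roughly into the following PyTorch sketch (the actual implementation is in TensorFlow; the L2 strength, decay rate, and placeholder model are assumptions).

```python
import torch
import torch.nn as nn

# Placeholder classifier standing in for the full COIN model.
model = nn.Linear(600, 2)

# Adam with lr 0.001, beta_1 = 0.9, beta_2 = 0.999; L2 regularization via weight decay
# (assumed strength), exponential learning-rate decay (assumed rate), gradient clipping at 5.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
criterion = nn.CrossEntropyLoss()

# One toy training step on random data.
features, labels = torch.randn(32, 600), torch.randint(0, 2, (32,))
loss = criterion(model(features), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
scheduler.step()
```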

B Model Efficiency
Table 5: Model efficiency comparison on Quora.

Model   Parameter size   Time (s/batch)
COIN    6.5M             0.12 ± 0.03
BERT    109.5M           1.19 ± 0.06

Pretrained language models such as BERT (Devlin et al., 2019) have drawn much attention for their substantial gains across a range of different natural language processing tasks. However, BERT is fairly demanding in terms of the computational requirements. For additional analysis, we compare our model efficiency with BERT-base on Quora. We set the sentence lengths as 32 (64 for BERT after concatenating the two sequences). Both models need to make predictions for a batch of 8 sentence pairs on a MacBook Pro with Intel Core i7 CPUs. For BERT, we add a linear layer on top of the [CLS] token for classification, as in the original paper. We report the average and the standard deviation of processing 1,000 batches.
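The timing protocol can be sketched as follows; the stand-in model is only a placeholder to keep the example self-contained, not COIN or BERT.

```python
import statistics
import time

import torch
import torch.nn as nn

def benchmark(model, batch, n_batches=1000):
    """Measure per-batch CPU inference time (mean and standard deviation)."""
    model.eval()
    timings = []
    with torch.no_grad():
        for _ in range(n_batches):
            start = time.perf_counter()
            model(batch)
            timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Toy stand-in classifier and a batch of 8 sequences of length 32.
toy = nn.Sequential(nn.Embedding(1000, 300), nn.Flatten(), nn.Linear(32 * 300, 2))
mean, std = benchmark(toy, torch.randint(0, 1000, (8, 32)))
print(f"{mean:.4f}s ± {std:.4f}s per batch")
```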
As shown in Table 5, COIN contains far fewer parameters than BERT and is much faster in terms of the CPU inference speed. Additionally, our single model produces comparable results to BERT on both Quora and LCQMC. This shows that our proposed method is effective at tackling text matching tasks with substantially fewer parameters and high computational efficiency.