All Information is Valuable: Question Matching over Full Information Transmission Network

Question matching is the task of identifying whether two questions have the same intent. To better reason about the relationship between questions, existing studies adopt multiple interaction modules and perform multi-round reasoning via deep neural networks. In this process, two kinds of critical information are commonly employed: the representation information of the original questions and the interactive information between pairs of questions. However, previous studies tend to transmit only one kind of information, failing to utilize both kinds simultaneously. To address this problem, in this paper, we propose a Full Information Transmission Network (FITN) that transmits both representation and interactive information simultaneously. More specifically, we employ a novel memory-based attention that keeps and transmits the interactive information through a global interaction matrix. Besides, we apply an original-average mixed connection method to effectively transmit the representation information between different reasoning rounds, which helps to preserve the original representation features of questions along with the historical hidden features. Experiments on two standard benchmarks demonstrate that our approach outperforms strong baseline models.


Introduction
Question Matching (QM) aims to identify whether two questions have the same intent, and is widely applied in Question Answering (QA) scenarios such as community QA and intelligent customer service. Typically, QM is regarded as a semantic matching task (Hu et al., 2021). To correctly infer the relationship of a given question pair, two kinds of information should be considered: the representation information of questions, which captures the semantics of the texts, and the interactive information between questions, which contains critical hints for relationship reasoning.
To better detect the relationship between question pairs, a single round of reasoning is far from sufficient. Existing methods commonly resort to multiple interaction modules to perform deep reasoning, where each module is generally composed of an encoding layer (which can be omitted (Gong et al., 2018)) that updates the representation information of questions and an interaction layer that captures the interactive information between questions (Kim et al., 2019; Hu et al., 2021). In such a multi-round reasoning procedure, both the representation and interactive information from history rounds play a vital role in guiding future inference. However, previous studies either only transmit the representation information (Kim et al., 2019) or only the interactive information (Gong et al., 2018), failing to utilize both kinds of information simultaneously.
As shown in Figure 1 (i), when performing multi-round reasoning, if a model only transmits the representation information, the interactive information between questions is merely used to generate the representation of questions for future rounds. Consequently, the critical hints for relationship reasoning conveyed by the interactive information are abandoned and cannot be directly used in future inferences. On the other hand, if a model only transmits the interactive information, this is equivalent to conducting multi-round reasoning with only a single pass over the question pair, as shown in Figure 1 (ii). In that case, missing the representation information of the original questions may lead to understanding deviations and thus cascading errors. Therefore, as shown in Figure 1 (iii), to better perform reasoning between question pairs, a desirable solution should transmit both the representation and interactive information from historical rounds to the current round simultaneously.
To address the aforementioned problems, in this paper, we propose a Full Information Transmission Network (FITN) that learns to transmit both the representation and interactive information between rounds of reasoning. In particular, we propose a novel Memory-based Attention (Mem-Att) to transmit the interactive information between question pairs. In the Mem-Att, we maintain a global interaction matrix as a memory that keeps the interactive information, and perform inference on top of it. Compared with traditional attention mechanisms that calculate alignment scores directly, the proposed interaction matrix keeps richer interactive information and is more stable in the update process due to its redundancy. Thanks to the global interaction matrix, each round of inference benefits from the historical interactive information, making the whole reasoning procedure progressive.
Meanwhile, to effectively transmit the representation information of questions, we introduce a novel connection method, namely the Original-Average Mixed Connection (OA-mixed Connection). Instead of feeding only the hidden features from the last reasoning round, we regard both the hidden features and the original representation embeddings of questions as the input of the current round. Such a connection method enables our model to explicitly utilize the rich information of the original texts during inference. In addition, the OA-mixed Connection applies an average operation over the hidden features from the last two rounds to build the hidden part of the input to the current reasoning round. Compared with the residual connection (He et al., 2016) that treats the information in each round equally, the average connection pays more attention to the information from nearer rounds, and thus brings better discrimination ability.
We evaluate our proposed method on the Quora and LCQMC benchmarks. Experimental results show that FITN outperforms the non-pretrained baselines by considerable margins. Furthermore, compared with pre-trained models (small ones with parameter sizes comparable to FITN), our FITN also achieves better performance, which reveals the advantage of the proposed method under resource-constrained conditions. All these results illustrate the effectiveness of our method.
In sum, our major contributions are three-fold: • We propose the Full Information Transmission Network (FITN) that can better utilize the historical information, capturing both the representation and interactive information of questions for question matching.
• We propose the memory-based attention for keeping and transmitting the interactive information and the original-average mixed connection to fully utilize the original embedding features of texts and historical hidden features.
• We evaluate the proposed FITN on two benchmark datasets, where considerable improvements are gained over strong baseline models.

Methodology
In this section, we introduce our proposed Full Information Transmission Network (FITN) in detail. As shown in Figure 2, FITN comprises three modules: the embedding module, the interaction module and the prediction module. In FITN, we first embed each question in the embedding module, then perform inference in the interaction module and finally predict their relationship in the prediction module. We denote the two input questions as S = {s_1, s_2, ..., s_m} and T = {t_1, t_2, ..., t_n}, where s_i/t_j is the i-th/j-th token of question S/T and m/n is the token length of S/T.

Embedding Module
In the embedding module, we apply word embedding along with character embedding to embed the tokens of each question. The character embedding is randomly initialized and then processed by a convolutional neural network (CNN) with a max-pooling operation. Formally, the final representation e_{s_i} of token s_i is calculated as follows:

e_{s_i} = [Emb(s_i); ChConv(s_i)],   (1)

where [;] denotes the concatenation operation, Emb is the word embedding and ChConv is the character-level CNN. Each word in S and T is treated in the same way, so that S and T are represented as E_S ∈ R^{m×d_e} and E_T ∈ R^{n×d_e}.
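The token-embedding step of Eq. (1) can be sketched as follows. This is a minimal NumPy illustration, not the trained module: the embedding tables and the filter bank `W_conv` are random placeholders, and the character CNN is written out as an explicit sliding window with max-pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

d_word, d_char, d_ch_out, kernel = 300, 25, 50, 3
word_emb = rng.normal(size=(1000, d_word))             # toy word-embedding table
char_emb = rng.normal(size=(128, d_char))              # toy character-embedding table
W_conv = rng.normal(size=(kernel * d_char, d_ch_out))  # toy char-CNN filter bank

def ch_conv(char_ids):
    """1-D convolution over character embeddings followed by max-pooling."""
    chars = char_emb[char_ids]                          # (n_chars, d_char)
    windows = [chars[i:i + kernel].reshape(-1)          # sliding windows of width `kernel`
               for i in range(len(chars) - kernel + 1)]
    feats = np.tanh(np.stack(windows) @ W_conv)         # (n_windows, d_ch_out)
    return feats.max(axis=0)                            # max-pool over positions

def embed_token(word_id, char_ids):
    """e_{s_i} = [Emb(s_i); ChConv(s_i)], as in Eq. (1)."""
    return np.concatenate([word_emb[word_id], ch_conv(char_ids)])

e = embed_token(42, [ord(c) for c in "what"])  # d_e = 300 + 50 = 350
```

Stacking `embed_token` over all m tokens of a question yields the matrix E_S ∈ R^{m×d_e} used downstream.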

Interaction Module
The interaction module is the core of our FITN, composed of N same-structured blocks that perform N rounds of inference. Each block contains three components: the encoding layer, the memory-based attention layer and the original-average mixed connection layer. We denote I^l_S and I^l_T as the inputs of the l-th block, where I^0_S = E_S and I^0_T = E_T.

Encoding Layer
We encode the two questions with a Bi-LSTM encoder to extract the contextual representation of each token:

H^l_S = BiLSTM(I^l_S),   H^l_T = BiLSTM(I^l_T),

where H^l_S ∈ R^{m×d_h} and H^l_T ∈ R^{n×d_h} are the hidden representations of I^l_S and I^l_T in the l-th round, respectively.
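To make the shapes concrete, the following sketch encodes a question bidirectionally. For brevity a plain tanh RNN stands in for the Bi-LSTM, and all weights are random placeholders; only the interface (input I^l ∈ R^{m×d_in}, output H^l ∈ R^{m×d_h}) matches the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 350, 100   # embedding dim and total hidden size (d_h/2 per direction)

Wf = rng.normal(scale=0.1, size=(d_in, d_h // 2))   # forward-direction weights
Uf = rng.normal(scale=0.1, size=(d_h // 2, d_h // 2))
Wb = rng.normal(scale=0.1, size=(d_in, d_h // 2))   # backward-direction weights
Ub = rng.normal(scale=0.1, size=(d_h // 2, d_h // 2))

def rnn(X, W, U):
    """Simple recurrent pass; returns one hidden state per token."""
    h, out = np.zeros(W.shape[1]), []
    for x in X:
        h = np.tanh(x @ W + h @ U)
        out.append(h)
    return np.stack(out)

def bi_encode(X):
    """Concatenate forward and reversed-backward passes: H^l in R^{m x d_h}."""
    fwd = rnn(X, Wf, Uf)
    bwd = rnn(X[::-1], Wb, Ub)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

I_S = rng.normal(size=(7, d_in))   # a 7-token question
H_S = bi_encode(I_S)
```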

Memory-based Attention Layer
As shown in Figure 3, we maintain a global interaction matrix for keeping and transmitting the interactive information in the memory-based attention (Mem-Att) layer. The global interaction matrix is treated as a memory, which keeps all the historical interactive information and will be updated when getting the new one. For each pair of tokens, we keep an interactive vector instead of an attention score in the global interaction matrix. The interactive vector keeps richer information and is more stable in the update process due to its redundancy. In each round, we firstly update the global interaction matrix and then do attention based on this matrix. In this way, the interactive information in history can be transmitted into the current round and provides assistance on the soft-alignment and inference between the two questions.
Global Interaction Matrix Update The global interaction matrix is updated through two steps: current interaction matrix calculation and global interaction matrix combination.

Current Interaction Matrix Calculation
The current interaction matrix M^l ∈ R^{m×n×d_h} in the l-th round is calculated as follows: for each pair of tokens s_i and t_j in questions S and T, the interaction vector M^l_{i,j} ∈ R^{d_h} in M^l is obtained through element-wise multiplication:

M^l_{i,j} = H^l_{s_i} ⊙ H^l_{t_j}.

Global Interaction Matrix Combination After that, we combine the current interaction matrix M^l with the global interaction matrix M̃^{l-1} from the previous round, feeding their concatenation into a fully-connected layer with a non-linear activation function to obtain the global interaction matrix M̃^l ∈ R^{m×n×d_m} of the l-th round:

M̃^l_{i,j} = σ([M^l_{i,j}; M̃^{l-1}_{i,j}] w^l_m + b^l_m),

where [;] is vector concatenation, w^l_m ∈ R^{(d_h+d_m)×d_m} and b^l_m ∈ R^{d_m} correspond to the weight and bias, respectively.
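The two update steps above can be sketched in a few lines of NumPy. The weights are random placeholders and the round-0 memory is taken to be all-zeros (an assumption; the paper does not spell out the initialization), with tanh standing in for the non-linear activation σ.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d_h, d_m = 7, 5, 100, 100

H_S = rng.normal(size=(m, d_h))   # contextual representations of S and T
H_T = rng.normal(size=(n, d_h))
w_m = rng.normal(scale=0.1, size=(d_h + d_m, d_m))  # toy combination weights
b_m = np.zeros(d_m)

# Step 1: current interaction matrix -- element-wise product of every token pair
M_cur = H_S[:, None, :] * H_T[None, :, :]           # (m, n, d_h)

# Step 2: combine with the previous global matrix via a fully-connected layer
M_prev = np.zeros((m, n, d_m))                      # assumed empty memory at round 0
M_glob = np.tanh(np.concatenate([M_cur, M_prev], axis=-1) @ w_m + b_m)
```

Keeping a d_m-dimensional vector per token pair (rather than a single alignment score) is what gives the memory its redundancy and update stability.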
Attention over Interaction Matrix Next, we perform inference and alignment through the global interaction matrix. We first adopt a dense-pooling method to extract an attention map from the global interaction matrix: a fully-connected layer with a non-linear function converts each interaction vector into an attention value. Each element Att^l_{i,j} of the attention map Att^l ∈ R^{m×n} is calculated as:

Att^l_{i,j} = σ(M̃^l_{i,j} w^l_p + b^l_p),

where w^l_p ∈ R^{d_m×1} and b^l_p ∈ R correspond to the weight and bias, respectively. Then, the attentive representation A^l_{s_i} of s_i in the l-th round is the weighted sum of the H^l_{t_j}, where the weights are obtained by a softmax over Att^l_{i,j}:

A^l_{s_i} = Σ_{j=1}^{n} softmax_j(Att^l_{i,j}) · H^l_{t_j}.

Finally, we calculate the average and the difference between the attentive representation A_{S/T} and the contextual representation H_{S/T}, concatenate the results with the two representations themselves, and feed the concatenation into a fully-connected layer to obtain the outputs of the block:

O^l_S = [H^l_S; A^l_S; (H^l_S + A^l_S)/2; H^l_S − A^l_S] w^l_o + b^l_o,

where w^l_o and b^l_o are the weight and bias, respectively (O^l_T is obtained symmetrically).
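The dense-pooling attention and the fusion step can be sketched as follows. All weights are random placeholders, and the ordering of the fused features [H; A; avg; diff] is our reading of the text, not a layout confirmed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
m, n, d_h, d_m = 7, 5, 100, 100
H_S = rng.normal(size=(m, d_h))
H_T = rng.normal(size=(n, d_h))
M_glob = rng.normal(size=(m, n, d_m))   # stand-in for the global interaction matrix
w_p = rng.normal(scale=0.1, size=(d_m,))
b_p = 0.0

# dense pooling: one scalar attention value per token pair
att = np.tanh(M_glob @ w_p + b_p)       # (m, n)

# attentive representation of S: softmax-weighted sum over H_T
A_S = softmax(att, axis=1) @ H_T        # (m, d_h)

# fuse contextual and attentive features: originals, average, and difference
fused = np.concatenate([H_S, A_S, (H_S + A_S) / 2, H_S - A_S], axis=1)  # (m, 4*d_h)
```

A final fully-connected layer (omitted here) would project `fused` back down to the block's output dimension.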

Original-Average Mixed Connection Layer
Finally, we transmit the representation information through the original-average mixed connectivity pattern (OA-mixed connection) in this layer. The question representation fed into each round of inference can be divided into two parts: the original features from the initial embedding of the questions and the hidden features extracted from previous inference rounds. Both play a vital role in each round of inference: the original features lead the model to make inference in the right direction, while the hidden features contain deeper contextual and interactive information. Besides, the hidden features can be seen as an information enhancement of the original features. Formally, the whole process is:

I^l = [I_E; I^l_H],

where I^l ∈ R^{m×(d_e+d_h)} (l > 0) is the input of the l-th round, I_E is the initial embedding, and I^l_H is the hidden input of the l-th round, obtained by averaging the hidden outputs of the last two rounds:

I^l_H = (O^{l-1} + O^{l-2}) / 2,

where O^l denotes the hidden outputs of the interaction module in the l-th round before the average connection.
Here, instead of the residual connection, we apply the average connection to capture the hidden features. Compared with the residual connection, which treats the information in each round equally, the average connection pays more attention to the information from nearer rounds. Besides, the summation operation of the residual connection may cause the variance of the vectors in the hidden part to grow as the layers deepen. In comparison, the average connection balances the variance between the two parts of the question representation.
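The OA-mixed connection reduces to two array operations, sketched below with random placeholder tensors (how the first round, which lacks two previous outputs, is handled is not specified here).

```python
import numpy as np

rng = np.random.default_rng(4)
m, d_e, d_h = 7, 350, 100

I_E = rng.normal(size=(m, d_e))      # original embedding features (fixed across rounds)
O_prev = rng.normal(size=(m, d_h))   # hidden output of round l-1
O_prev2 = rng.normal(size=(m, d_h))  # hidden output of round l-2

# average connection over the last two rounds, then concatenate the originals
I_H = (O_prev + O_prev2) / 2
I_next = np.concatenate([I_E, I_H], axis=1)   # (m, d_e + d_h)
```

Note that averaging keeps the hidden part's scale roughly constant across rounds, whereas a residual sum would let it grow with depth.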

Prediction Module
The final representations of the two questions produced by the interaction module are the would-be inputs of an (N+1)-th block, I^{N+1}_S and I^{N+1}_T. To extract a fixed-size representation for each question, we apply a max-pooling operation over them:

V_S = max(I^{N+1}_S),   V_T = max(I^{N+1}_T),

where V_S, V_T ∈ R^{d_h+d_e} and max extracts the maximum value in each column of its input. Finally, we concatenate V_S and V_T into the feature vector V and feed V into a two-layer feed-forward network with one hidden layer and one softmax layer to make the prediction.
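The prediction module can be sketched end to end as follows; the classifier weights are random placeholders and tanh is an assumed hidden activation, so only the shapes and the pooling/classification structure reflect the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, d = 7, 5, 450   # d = d_e + d_h

I_S = rng.normal(size=(m, d))   # final representations from the interaction module
I_T = rng.normal(size=(n, d))

V_S, V_T = I_S.max(axis=0), I_T.max(axis=0)   # column-wise max-pooling
V = np.concatenate([V_S, V_T])                # (2d,) feature vector

# two-layer feed-forward classifier with one hidden layer and a softmax output
W1 = rng.normal(scale=0.1, size=(2 * d, 500))
W2 = rng.normal(scale=0.1, size=(500, 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(np.tanh(V @ W1) @ W2)   # probability over {match, non-match}
```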

Implementation Details
In the original FITN, we initialize the word embedding with 300d Fasttext vectors (Bojanowski et al., 2017) for the English task and 300d Word2Vec vectors trained on Baidu Encyclopedia (Qiu et al., 2018) for the Chinese task, respectively. We randomly initialize the character embedding with 25d vectors and extract a 50d character representation with the CNN. We then conduct three rounds of inference and set the hidden size of each layer in the interaction module to 100d. Finally, we use 500 hidden units for the 2-layer FFN in the prediction module. We apply the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e-3 and train for 100 epochs on the Quora dataset and 20 epochs on the LCQMC dataset. We run each experiment 5 times with 5 different random seeds, select checkpoints according to the best performance on the development set, and report the mean value with the standard deviation.

Experimental Results
The main experimental results are shown in Table 1. We first compare our FITN with non-pretrained models. In particular, we employ the following baselines: 1) DIIN (Gong et al., 2018); 2) DRCN (Kim et al., 2019); and 3) RE2 (Yang et al., 2019). We can see that our model outperforms all these baselines on the two benchmarks. More specifically, our model beats DIIN because we keep updating the representation information based on the historical representation information during iterations. Compared with DRCN, our FITN utilizes the historical interactive information for inference and in return acquires performance improvements with fewer inference rounds. Besides, the historical interactive information also benefits our model in deeper inference, so the performance of our model is unsurprisingly better than that of RE2. In addition, to further verify the effectiveness of our FITN under restricted computing resources, we compare it with 4 publicly available tiny pre-trained models, which are either distilled from large pre-trained models (BERT-tiny and BERT-mini (Turc et al., 2019), distilled from BERT-base (Devlin et al., 2019)) or directly pre-trained on large-scale datasets (ALBERT-tiny and ALBERT-base (Lan et al., 2020)). As shown in Table 1, our model achieves competitive or even better performance than pre-trained models of similar size, demonstrating that FITN can be a desirable choice over pre-trained models in resource-constrained scenarios.

Analysis
In this subsection, we firstly verify the effectiveness of our proposed Mem-Att and OA-mixed connection, then show the impact of inference rounds on model performance, and finally further analyze the Mem-Att by a statistical analysis and a case study.

Effectiveness of the Mem-Att
We compare Mem-Att with three attention mechanisms to verify its ability to maintain richer interactive information and to leverage historical interactive information to aid future inference: 1) the scaled dot-product attention (Dot-Att); 2) the scaled weighted dot-product attention (wDot-Att); and 3) the interactive attention (Inter-Att), a variant of Mem-Att that is based only on the current interaction matrix. The first two attentions are computed as:

Att_dot = S^T T / √d,   Att_wdot = S^T W T / √d,

where d is the dimension of the question representation, S ∈ R^{d×m}, T ∈ R^{d×n} and W ∈ R^{d×d}. The comparison results are shown in Table 2. In line with the intuition that the 3D interaction matrix keeps richer interactive information than a 2D attention map, the performance of the Inter-Att is unsurprisingly better than those of the wDot-Att and the Dot-Att, which demonstrates that richer interactive information helps the model conduct more proper inference. Furthermore, the performance of the Mem-Att is better than that of the Inter-Att, which reflects that the historical interactive information provides assistance to the current and future inference.
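The two baseline attention maps can be computed directly; the sketch below uses random stand-in representations and shows that both reduce to a single (m × n) score matrix, in contrast to the 3D interaction matrix of Mem-Att.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(6)
d, m, n = 100, 7, 5
S = rng.normal(size=(d, m))            # column-wise token representations of S
T = rng.normal(size=(d, n))            # column-wise token representations of T
W = rng.normal(scale=0.1, size=(d, d))

att_dot = S.T @ T / np.sqrt(d)         # scaled dot-product attention (Dot-Att)
att_wdot = S.T @ W @ T / np.sqrt(d)    # scaled weighted dot-product attention (wDot-Att)

weights = softmax(att_dot, axis=1)     # align each token of S against all of T
```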

Effectiveness of the OA-mixed Connection
To illustrate the advantage of the OA-mixed connection, we compare our method with the following connective patterns: 1) residual connection (He et al., 2016); 2) dense connection (Huang et al., 2017), and 3) direct connection that directly treats the output of the previous round as the input. As shown in Table 3, the direct connection unsurprisingly performs worst. These results show that the historical representation information provides benefits for the current round of inference and it is critical to design advanced connectivity patterns to effectively transmit important information between different reasoning rounds. Moreover, our OA-mixed connection beats both the residual connection and the dense connection. We attribute it to the fact that our method can preserve the entire information of original texts. Meanwhile, the average connection we proposed can help the model to focus more on the information conveyed by the surrounding reasoning rounds. All these bring rich information and helpful hints to determine the relationships between the question pair.

Impact of the Inference Rounds
In this part, we design a comparison experiment to demonstrate the impact of the number of inference rounds. We vary the number of inference rounds in our FITN from 1 to 5 and compare the performance on Quora's development and test sets. The comparison results are shown in Figure 4. As the number of inference rounds increases, our model's accuracy increases, verifying the utility of multi-round inference. However, the increase in accuracy gradually slows down as the number of rounds grows, and continuing to stack layers may not bring further significant improvements. We attribute this to the model capturing enough information within a limited number of inference rounds under the assistance of our proposed Mem-Att and OA-mixed connection, so there is no need to stack too many inference modules.

Analysis of the Mem-Att
To further analyze how the Mem-Att works, we compare our Mem-Att with the Dot-Att and conduct a statistical analysis along with a case study, verifying that the Mem-Att pays higher attention to the critical word pairs and that the inference in the Mem-Att is progressive.

Statistical Analysis
We conduct the statistical analysis on the development set of Quora and compare our Mem-Att with the Dot-Att. We calculate the mean value and the standard deviation of the attention distributions in each inference round to observe the distribution characteristics. Then, we calculate the Pearson correlation coefficient (Benesty et al., 2009) to quantify the relevance between the attention distributions of adjacent rounds. We take the average of the above metrics over all samples as the final metrics. As shown in Table 4, the standard deviation of the attention distributions in the Mem-Att is larger than that in the Dot-Att, whose distribution tends to be uniform. This demonstrates that our Mem-Att is sharper and pays more attention to specific token pairs. Besides, the Pearson correlation coefficient between the attention distributions of adjacent rounds is higher in the Mem-Att than in the Dot-Att, which indicates that the inference between adjacent rounds is more closely related: the inference procedure in the Mem-Att is progressive.
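The per-sample metrics above are straightforward to compute; the sketch below uses two synthetic attention maps standing in for adjacent rounds of one sample (real values would come from the model).

```python
import numpy as np

rng = np.random.default_rng(7)
att_r1 = rng.random((7, 5))                  # attention map of round l
att_r2 = att_r1 + 0.1 * rng.random((7, 5))   # attention map of round l+1

mean, std = att_r1.mean(), att_r1.std()      # distribution characteristics

# Pearson correlation between the flattened adjacent-round distributions
pearson = np.corrcoef(att_r1.ravel(), att_r2.ravel())[0, 1]
```

Averaging `mean`, `std`, and `pearson` over all development-set samples yields the final metrics reported in Table 4.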
Case Study Then, we take a pair of similar questions "What is the cost of a Snapdragon 2100 SoC ?" and "What is the Snapdragon 2100 SoC pricing ?" as an example and visualize the attention distributions of the Mem-Att and the Dot-Att in each round of inference. Here, the Mem-Att predicts right and the Dot-Att predicts wrong.
As shown in Figure 5, both the Dot-Att and the Mem-Att can align all pairs of identical words in the first inference round, while the Mem-Att focuses more on these word pairs than the Dot-Att. The distribution of the Mem-Att is more concentrated than that of the Dot-Att, which denotes that the Mem-Att has a clearer tendency in paying attention. As the number of inference rounds increases, the Mem-Att's distribution does not tend to become uniform. Furthermore, the change of the Mem-Att's distribution is continuous: the Mem-Att gradually deepens its focus on "cost" and "pricing". This demonstrates that the inference in the Mem-Att is progressive and that the Mem-Att can gradually align word pairs with similar semantics.

Related Work
Question Matching can be regarded as a semantic matching task, whose core lies in how to model the vector representation of texts (Reimers and Gurevych, 2019; Gao et al., 2021) and reason about the semantic relationship between text pairs. ESIM (Chen et al., 2017) encodes texts through BiLSTM or Tree-LSTM (Socher et al., 2013) and applies co-attention to extract fine-grained alignment information for inference. BiMPM (Wang et al., 2017) matches texts from multiple perspectives with multiple kinds of attention. For better inference, many studies employ deeper models. DIIN (Gong et al., 2018) applies a dense-net over the interaction matrix extracted from two texts for deep inference. DRCN (Kim et al., 2019) iterates the same block multiple times for multi-turn inference. TIM-W (Zhou et al., 2020) is based on deep mutual information estimation. ADIN (Liang et al., 2019) performs multiple rounds of asynchronous reasoning for the NLI task. In comparison, our FITN performs better due to its better utilization of historical information.
Thanks to the knowledge obtained from massive data, pre-trained models such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) can greatly improve the performance of semantic matching. However, their model complexity and reasoning time are greatly increased, making them unsuitable for resource-constrained scenarios. Enhanced-RCNN (Peng et al., 2020) compares itself with BERT in inference speed and accuracy: although its performance is relatively low, its inference speed is 10 times faster than BERT-base. Under resource-constrained conditions, directly using publicly available tiny pre-trained models is another solution. These models are commonly pre-trained on a large-scale corpus (like ALBERT-tiny and ALBERT-base (Lan et al., 2020)) or distilled from large pre-trained models (like BERT-tiny (Turc et al., 2019)). Compared with these tiny pre-trained models, our FITN achieves better performance.

Conclusion
In this paper, we study the task of question matching and propose a Full Information Transmission Network (FITN) that utilizes the historical representation information and the historical interactive information simultaneously. Specifically, the FITN employs a memory-based attention to keep and transmit the historical interactive information, and an original-average mixed connectivity pattern to transmit the representation information. Experimental results on two benchmarks show that our FITN takes advantage of both kinds of information and outperforms strong baselines by considerable margins.