VIRT: Improving Representation-based Text Matching via Virtual Interaction

Text matching is a fundamental research problem in natural language understanding. Interaction-based approaches treat the text pair as a single sequence and encode it through cross encoders, while representation-based models encode the text pair independently with siamese or dual encoders. Interaction-based models require dense computations and thus are impractical in real-world applications. Representation-based models have become the mainstream paradigm for efficient text matching. However, these models suffer from severe performance degradation due to the lack of interactions between the pair of texts. To remedy this, we propose a Virtual InteRacTion mechanism (VIRT) for improving representation-based text matching while maintaining its efficiency. In particular, we introduce an interactive knowledge distillation module that is only applied during training. It enables deep interaction between texts by effectively transferring knowledge from the interaction-based model. A light interaction strategy is designed to fully leverage the learned interactive knowledge. Experimental results on six text matching benchmarks demonstrate the superior performance of our method over several state-of-the-art representation-based models. We further show that VIRT can be integrated into existing methods as plugins to lift their performances.


Introduction
Text matching aims to model the semantic correlation between a pair of texts, which is a fundamental problem in various natural language understanding applications. For instance, in community question answering (CQA) (Zhou et al., 2011; Patra, 2017) systems, a key component is to find questions in the database that are similar to a user question via question matching (Gupta et al., 2018; Sharma et al., 2019). Similarly, a dialogue agent (Welleck et al., 2019) needs to make logical inferences (Conneau et al., 2017; Gao et al., 2021) between a user statement and some pre-defined hypotheses by predicting their entailment relations.
Recently, the wide use of deep pre-trained Transformers (Vaswani et al., 2017) has led to remarkable progress in text matching tasks (Raffel et al., 2020a; Ni et al., 2022; Tay et al., 2022). Two paradigms based on fine-tuned Transformer encoders are typically built: interaction-based models and representation-based models, as illustrated in Figure 1(a) & (b). Interaction-based models (e.g., BERT (Devlin et al., 2019)) jointly encode the text pair, which allows the two text sequences to attend to each other from the bottom layer to the top layer, resulting in effective matching signals. However, full interaction leads to high computational cost and large inference latency. In addition, text embeddings cannot be cached or pre-computed, which makes these models impractical in many real-world scenarios. For example, in an E-commerce search system, it takes dozens of days to score millions of query-product pairs with interaction-based models (Chen et al., 2020). Representation-based models (Khattab and Zaharia, 2020; Ni et al., 2022) encode the two texts independently with siamese or dual encoders (Cer et al., 2018; Reimers and Gurevych, 2019), which enables offline computation of text embeddings and thus significantly reduces the online latency. Unfortunately, independent encoding without any interaction fails to capture the correlation between the text pair, resulting in severe performance degradation.
To balance efficiency and efficacy, several works attempt to equip the siamese structure with late interaction modules. These late interactions are essentially light-weight interaction layers that fuse the two text embeddings from the individual encoders. A variety of late interaction strategies have been proposed, including MLP layers (Liu et al., 2021), cross-attention layers (Humeau et al., 2020) and Transformer layers (Cao et al., 2020), which obtain considerable improvements on different text matching tasks with reasonable costs. However, these interaction modules are added after the Siamese encoders, while interactions in the encoding process of the Siamese encoders are still ignored, leaving a large performance gap compared to the interaction-based models.
In this work, we propose a Virtual InteRacTion (VIRT) mechanism with interactive knowledge distillation for improving representation-based text matching while keeping its efficiency. Specifically, Siamese encoders learn interactive information between the pair of texts by mimicking the full interaction, with transferred knowledge from the interaction-based models as guidance. We implement the knowledge transfer as attention map distillation during training, which is removed during inference to keep the Siamese property, and is thus called "virtual interaction". Moreover, we design a VIRT-adapted interaction strategy after Siamese encoding to further leverage the learnt interactive knowledge. Our proposed VIRT is illustrated in Figure 1(c). Experimental results on six text matching benchmarks show the superior performance of VIRT over several state-of-the-art baselines. We summarize the main contributions of this work as follows:
• We propose a novel virtual interaction encoder for representation-based text matching, which effectively models the correlation between a pair of texts without additional inference cost. To the best of our knowledge, it is the first work that introduces interaction into the encoding process of Siamese encoders.
• We develop an interactive knowledge distillation module, which enables deep interaction by transferring knowledge from the interaction-based model. In addition, we design a VIRT-adapted interaction layer to further leverage the learnt interactive knowledge.
• Extensive experiments show that the proposed VIRT outperforms previous SOTA representation-based models while maintaining inference efficiency. The results also indicate that VIRT can be easily integrated into other representation-based text matching models to boost their performance.

Related Work
Text Matching Models Text matching models typically take two textual sequences as input and determine their semantic relationship. Early works perform keyword-based matching such as TF-IDF and BM25 (Pérez-Iglesias et al., 2009). These methods rely on manually defined discrete features, and thus usually fail to evaluate the semantic relevance of texts. With the development of deep learning, a large variety of neural models have been proposed for text matching, which use recurrent neural networks (Wu et al., 2017; Mitra et al., 2017; Yang et al., 2016) and convolutional neural networks (Hu et al., 2014) as the backbone, and encode textual sequences into semantic embeddings for fine-grained matches.
Recently, Transformer-based models (Bao et al., 2019; Li et al., 2020) leverage self-attention to achieve promising performance on several text matching tasks (Tang et al., 2021; Qu et al., 2021; Xiong et al., 2021). Generally, these models can be classified into interaction-based models (Logeswaran and Lee, 2018; Devlin et al., 2019) and representation-based models (Reimers and Gurevych, 2019). As a typical interaction-based model, BERT (Devlin et al., 2019) concatenates the text pair as the input and uses its [CLS] token embedding to predict the matching (Nogueira and Cho, 2019). In contrast, representation-based models utilize dual encoders to encode the pair of texts individually, which achieves high inference efficiency by pre-computing and storing all text embeddings in the database. However, there is usually a large performance degradation compared to interaction-based models. More recently, late interactions with light attention layers (Humeau et al., 2020; Khattab and Zaharia, 2020; Cao et al., 2020) have been introduced after dual encoders to balance efficiency and efficacy. However, rich interactive information between the text pair is still ignored during encoding.
Knowledge Distillation Knowledge distillation (Hinton et al., 2015; Tang et al., 2019) transfers knowledge from a teacher model with better quality to a less complex student model. Various works (Jiao et al., 2020; Sanh et al., 2019; Sun et al., 2019, 2020) have been proposed to compress BERT into a tiny structure with fewer Transformer layers and a smaller hidden size by distilling predicted logits and hidden states. Several recent distillation works are closely related to ours. DiPair (Chen et al., 2020) performs extra interaction through a light Transformer layer, and distills predicted logits from the interaction-based model. DeFormer (Cao et al., 2020) adopts multiple Transformer-based interaction layers and distills the representations as well as the predicted logits from the interaction-based model. However, these methods merely distill logits/representations from interaction-based models to the late interaction layer of representation-based models. In contrast, VIRT distills the attention map from the interaction-based model directly into the encoding process of Siamese encoders, which transfers interactive knowledge more effectively.

Preliminaries
Interaction-based Models Given two textual sequences $X = [x_1; \ldots; x_m]$ and $Y = [y_1; \ldots; y_n]$ as input, interaction-based models concatenate $X$ and $Y$ into $[X; Y]$ and encode $[X; Y]$ with a Transformer encoder (Devlin et al., 2019). Each Transformer layer consists of two residual sub-layers: a multi-head attention operation (MHA; Eq. 1a and Eq. 1b) and a feed-forward network (FFN; Eq. 1c).
Representation-based models are very efficient, especially for downstream retrieval tasks: (1) they do not need to conduct pairwise encoding, and (2) text embeddings for the corpus can be pre-computed. However, since there is no interaction between $X$ and $Y$ during encoding, fine-grained interactive information is lost in representation-based models, resulting in significant performance degradation.
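The layer equations referred to as Eq. 1a to Eq. 1c do not survive in this text. As a hedged reconstruction, a standard post-layer-norm Transformer layer that matches the description (an MHA sub-layer followed by an FFN sub-layer, each with a residual connection and LN) can be written in single-head form as below. The projection matrices $W_Q$, $W_K$, $W_V$, the intermediate symbol $A^{(l)}$, and the scaling by $\sqrt{d}$ are assumptions, and the multi-head split is omitted for brevity.

```latex
\begin{align}
M^{(l)} &= \mathrm{softmax}\!\left(\frac{Q^{(l)} {K^{(l)}}^{\top}}{\sqrt{d}}\right),
  \quad Q^{(l)} = H^{(l)} W_Q,\; K^{(l)} = H^{(l)} W_K,\; V^{(l)} = H^{(l)} W_V \tag{1a}\\
A^{(l)} &= \mathrm{LN}\!\left(H^{(l)} + M^{(l)} V^{(l)}\right) \tag{1b}\\
H^{(l+1)} &= \mathrm{LN}\!\left(A^{(l)} + \mathrm{FFN}\!\left(A^{(l)}\right)\right) \tag{1c}
\end{align}
```

Under this reading, Eq. 1a is the only place where tokens of $X$ and $Y$ exchange information, which is why the analysis that follows concentrates on the attention map.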

VIRT
The major weakness of representation-based models is lacking interaction when individually encoding two input sequences.
Essentially, the interaction-based models perform interaction through the attention mechanism, and compute a unified attention map using both X and Y. On the other hand, the representation-based models compute two disjoint attention maps from X and Y respectively. In the following sections, we first present the details of the difference between these two types of models in terms of the MHA operation. Next, we introduce the VIRT mechanism, which improves the representation-based models without extra inference cost.

MHA Analysis
The MHA operation in interaction-based models is illustrated by the blue attention map in Figure 2(b). Specifically, the input representations $H$ of the $l$-th layer in interaction-based models can be decomposed into the X-part and the Y-part, i.e., $H = [H_x; H_y]$, where $H_x \in \mathbb{R}^{m \times d}$, $H_y \in \mathbb{R}^{n \times d}$ and $d$ is the hidden dimension. Note that we omit the superscript $l$ here for simplicity of presentation. In the attention map computation, the query and key matrices can also be rewritten as the combination of the X-part and the Y-part, i.e., $Q = [Q_x; Q_y]$ and $K = [K_x; K_y]$. According to Eq. 1a, the final attention score before the softmax(·) operation (denoted as $S$) can be decomposed as the following partitioned matrix:
$$S = \frac{Q K^{\top}}{\sqrt{d}} = \frac{1}{\sqrt{d}} \begin{bmatrix} Q_x K_x^{\top} & Q_x K_y^{\top} \\ Q_y K_x^{\top} & Q_y K_y^{\top} \end{bmatrix} = \begin{bmatrix} S_{x \to x} & S_{x \to y} \\ S_{y \to x} & S_{y \to y} \end{bmatrix} \tag{2}$$
In particular, $S_{x \to x} \in \mathbb{R}^{m \times m}$ and $S_{y \to y} \in \mathbb{R}^{n \times n}$ are the MHA operations performed within X or Y only, which correspond to the MHA operations in representation-based models. $S_{x \to y} \in \mathbb{R}^{m \times n}$ and $S_{y \to x} \in \mathbb{R}^{n \times m}$ represent the interactions between X and Y in interaction-based models, which are responsible for enriching the representations with interactive information. However, these interactions are missing in representation-based models, as illustrated by the missing attention maps in Figure 2(b).
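The block decomposition of the attention scores can be checked numerically. The following is a minimal single-head NumPy sketch; the sizes, random inputs and the weight names `W_q` and `W_k` are illustrative assumptions rather than anything taken from the paper. It also assumes both model types receive the same layer inputs, which in practice holds only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 3, 4, 8  # lengths of X and Y, and hidden size (illustrative values)

# Hypothetical layer inputs and shared projection weights.
H_x = rng.normal(size=(m, d))
H_y = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

# Interaction-based model: one joint sequence H = [H_x; H_y].
H = np.concatenate([H_x, H_y], axis=0)
Q, K = H @ W_q, H @ W_k
S = Q @ K.T / np.sqrt(d)  # pre-softmax scores, shape (m + n, m + n)

# The four blocks of the partitioned score matrix.
S_xx, S_xy = S[:m, :m], S[:m, m:]
S_yx, S_yy = S[m:, :m], S[m:, m:]

# Representation-based model: each sequence only attends to itself.
S_x_only = (H_x @ W_q) @ (H_x @ W_k).T / np.sqrt(d)
S_y_only = (H_y @ W_q) @ (H_y @ W_k).T / np.sqrt(d)

# The diagonal blocks coincide with the Siamese scores;
# the off-diagonal blocks are what the Siamese encoders never compute.
print(np.allclose(S_xx, S_x_only), np.allclose(S_yy, S_y_only))
print(S_xy.shape, S_yx.shape)
```

The off-diagonal blocks `S_xy` and `S_yx` are exactly the cross-sequence scores that VIRT tries to recover through distillation.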
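Interactive Knowledge Transfer In order to bring the missing interaction back and bridge the performance gap, we let representation-based models mimic the interactions as:
$$M_{x \to y} = \mathrm{softmax}\!\left(\frac{Q_x K_y^{\top}}{\sqrt{d}}\right), \qquad M_{y \to x} = \mathrm{softmax}\!\left(\frac{Q_y K_x^{\top}}{\sqrt{d}}\right) \tag{3}$$
where $Q_x, K_x$ and $Q_y, K_y$ are the query and key matrices computed from $H_x$ and $H_y$, $d$ is the hidden dimension, $M_{x \to y}$ denotes the attention map generated by $H_x$ attending to $H_y$, and similarly for $M_{y \to x}$. These two additional attention maps represent the missing interactive signals in representation-based models, which are responsible for updating the representations. However, they are never calculated by the dual encoders in representation-based models, resulting in less effective text embeddings.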
To close the performance gap between representation-based and interaction-based models, we propose to align the missing attention maps with their counterparts that already exist in interaction-based models. Intuitively, the attention maps in the interaction-based models can guide the representations to evolve in an interaction-rich direction, as if they had interacted with each other during the encoding process. In this way, we distill the interaction knowledge and transfer it into the dual encoders without any extra computational cost at inference. That is why we call the mechanism "virtual interaction".
Concretely, we employ a trained interaction-based model as the teacher and distill its knowledge into a representation-based student model. In each layer, we obtain the attention maps $\tilde{M}_{x \to y}$ and $\tilde{M}_{y \to x}$ from the interaction-based model and use this interactive knowledge as supervision to guide the learning of the representation-based model. Formally, the goal is to minimize the $L_2$ distance across all layers between $(\tilde{M}_{x \to y}, \tilde{M}_{y \to x})$ and $(M_{x \to y}, M_{y \to x})$ (Eq. 4).
Note that the above distillation is only applied in the training stage to learn better dual encoders. This preserves the Siamese property of representation-based models without extra inference cost.
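The distillation step described above can be sketched in NumPy. This is only an illustration under stated assumptions: the student maps are computed as scaled dot-product cross attention, the per-layer distance is the unnormalised Frobenius norm, and the teacher maps are simply passed in as arrays. The exact normalisation constants of Eq. 4 and the way the teacher blocks are normalised are not recoverable from this text, and the function names `cross_attention_maps` and `virt_loss` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_maps(Q_x, K_x, Q_y, K_y):
    """Student maps: X attending to Y, and Y attending to X."""
    d = Q_x.shape[-1]
    M_xy = softmax(Q_x @ K_y.T / np.sqrt(d))  # shape (m, n)
    M_yx = softmax(Q_y @ K_x.T / np.sqrt(d))  # shape (n, m)
    return M_xy, M_yx

def virt_loss(student_maps, teacher_maps):
    """Sum over layers of the L2 (Frobenius) distances between map pairs."""
    total = 0.0
    for (s_xy, s_yx), (t_xy, t_yx) in zip(student_maps, teacher_maps):
        total += np.linalg.norm(t_xy - s_xy) + np.linalg.norm(t_yx - s_yx)
    return total

# Toy usage with random stand-ins for two layers of teacher and student.
rng = np.random.default_rng(0)
m, n, d, num_layers = 3, 4, 8, 2

def random_maps():
    return cross_attention_maps(rng.normal(size=(m, d)), rng.normal(size=(m, d)),
                                rng.normal(size=(n, d)), rng.normal(size=(n, d)))

teacher = [random_maps() for _ in range(num_layers)]
student = [random_maps() for _ in range(num_layers)]
print(virt_loss(student, teacher) > 0.0, virt_loss(teacher, teacher) == 0.0)
```

Because the loss only touches attention maps, it can be dropped at inference time without changing the student's forward pass, which is the property the text relies on.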

VIRT-Adapted Interaction
Through VIRT, interactive knowledge can be incorporated deeply into each encoding layer of the representation-based models. However, after Siamese encoding, the representations of the last layer, i.e., $H_x^{(L)}$ and $H_y^{(L)}$, still cannot see each other, and thus lack explicit interaction. To make full use of the learnt interactive knowledge, we further design a VIRT-adapted interaction strategy, which fuses $H_x^{(L)}$ and $H_y^{(L)}$ under the guidance of the attention map learnt by VIRT.
Specifically, we perform VIRT-adapted interaction between $H_x^{(L)}$ and $H_y^{(L)}$ following the process in Eq. 3. The generated attention maps are used to update the representations (Eq. 5), where Pool(·) denotes the mean pooling operation. Eq. 5 employs the same interaction strategy as VIRT, and further utilizes the learnt attention maps to update the representations explicitly. Finally, we utilize simple fusion to make predictions (Eq. 6), where (·, ·) is the concatenation operation and MLP denotes a multi-layer perceptron. The overall training objective is to minimize the combination of the task-specific supervision loss $\mathcal{L}_{task}$ and the distillation loss $\mathcal{L}_{virt}$:
$$\mathcal{L} = \mathcal{L}_{task} + \alpha \mathcal{L}_{virt} \tag{7}$$
where α is a hyper-parameter that weights the influence of virtual interaction. It is noteworthy that VIRT is a general strategy and can be used to enhance other representation-based matching models, as will be shown in the experiments.
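To make the data flow of the adapted interaction concrete, here is one plausible reading sketched in NumPy. The details are assumptions, not the paper's exact formulation: the cross attention reuses hypothetical projections `W_q`, `W_k`, `W_v`; the attended values are mean-pooled into one vector per text; fusion is a plain concatenation followed by a two-layer MLP with ReLU; and the total loss is the additive combination weighted by α.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adapted_interaction(H_x, H_y, W_q, W_k, W_v):
    """Fuse last-layer states with cross attention, then mean-pool (assumed form)."""
    d = H_x.shape[-1]
    M_xy = softmax((H_x @ W_q) @ (H_y @ W_k).T / np.sqrt(d))  # shape (m, n)
    M_yx = softmax((H_y @ W_q) @ (H_x @ W_k).T / np.sqrt(d))  # shape (n, m)
    u = (M_xy @ (H_y @ W_v)).mean(axis=0)  # X-side vector informed by Y, shape (d,)
    v = (M_yx @ (H_x @ W_v)).mean(axis=0)  # Y-side vector informed by X, shape (d,)
    return u, v

def predict(u, v, W1, b1, W2, b2):
    """Simple fusion: concatenate both vectors and apply a two-layer MLP."""
    hidden = np.maximum(0.0, np.concatenate([u, v]) @ W1 + b1)  # ReLU
    return hidden @ W2 + b2  # unnormalised class scores

def total_loss(task_loss, virt_loss, alpha=1.0):
    """Weighted sum of the task loss and the distillation loss."""
    return task_loss + alpha * virt_loss

# Toy usage with random stand-ins for the last-layer representations.
rng = np.random.default_rng(0)
m, n, d, hidden_size, num_classes = 3, 5, 8, 16, 3
H_x, H_y = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
u, v = adapted_interaction(H_x, H_y, W_q, W_k, W_v)
logits = predict(u, v,
                 rng.normal(size=(2 * d, hidden_size)), np.zeros(hidden_size),
                 rng.normal(size=(hidden_size, num_classes)), np.zeros(num_classes))
print(u.shape, v.shape, logits.shape)
```

Only one cross-attention step over the cached last-layer states is needed online in this sketch, which is consistent with the text's description of the interaction as light.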

Datasets
We conduct an extensive set of experiments on three types of datasets, including three sentence-sentence matching tasks (MNLI, QQP, RTE), one question answering task (BoolQ) and two real-world query-passage matching tasks (Q2P, Q2A).
An overview of all the datasets is provided in Table 1. (For GLUE and SuperGLUE, the results on development sets are reported since they do not distribute labels for test sets. For the Q2P and Q2A datasets, we construct development sets that do not overlap with the training sets.)
Table 2: Performance comparison on six datasets. Note that we only report the online part of inference latency, since the representation-based embeddings can be computed offline and online latency is the greater concern in real-world scenarios. Since models on these six datasets take a similar input setup, we report inference latency on BoolQ and omit the other five. Results are statistically significant with p-value < 0.001.
MNLI (Williams et al., 2018) is a large-scale entailment classification dataset. The objective is to predict the relationship between a pair of sentences as entailment, neutral, or contradiction.
RTE (Bentivogli et al., 2009) comes from a series of annual competitions on textual entailment. The objective is to predict whether a given hypothesis is entailed by a given premise.
QQP (Sharma et al., 2019) is a large-scale sentence similarity dataset with question pairs from Quora. The task is to determine whether the two questions have the same meaning.
BoolQ (Clark et al., 2019) [...] et al., 2016) containing 110K query-passage pairs. Given a (query, passage) pair, the goal is to predict whether the passage contains the answer to the query. The original dataset does not contain labeled negative samples. For each query, we sample the negative passage from the top-100 passages retrieved by BM25.
Q2A is our internal dataset containing a huge amount of query-advertisement pairs.All the data are crawled from a Chinese E-commerce website and manually annotated.Given a (query, advertisement) pair, the goal is to predict the relevance between the advertisement and the query.

Baselines
We adopt several state-of-the-art representation-based matching models as our baselines.
Siamese BERT (Devlin et al., 2019) [...] et al., 2020b). The output embeddings of the two sequences and their difference are concatenated to give the final predictions.

Experimental Setup
VIRT setup We use BERT-base (Devlin et al., 2019) as the encoder backbone of VIRT. The parameters are initialized with the pre-trained BERT-base model (uncased). We share all parameters between $\mathrm{Enc}_x(\cdot)$ and $\mathrm{Enc}_y(\cdot)$. We also take BERT-base as the interaction-based model, which is fine-tuned first and then used as the teacher model to transfer interaction knowledge to representation-based models. The pooling strategy of BERT-base at the prediction layer is fixed to mean pooling (instead of [CLS]), as we observe better performance on both BERT-base and all VIRT-enhanced representation-based models.
Implementation Details All baselines are initialized with pre-trained BERT-base parameters and fine-tuned to achieve the best results on the validation sets. It is worth noting that we fix the total number of Transformer layers for all models at 12 to make a fair comparison, though some of the baselines such as DiPair (Chen et al., 2020) use fewer layers for extreme efficiency at the cost of performance. The first 8 and first 16 output token embeddings of X and Y are picked out as DiPair's input, which is the best setting reported in its paper. The number of context vectors in Poly-encoders is 360. For MNLI and QQP, we use the standard partition and metrics of the GLUE benchmark. For RTE and BoolQ, we follow SuperGLUE. For Q2P and Q2A, we construct the datasets from MSMARCO Passage Ranking data and real-world E-commerce data, and use AUC-ROC as the evaluation metric. We split off 10% of the training set for tuning hyper-parameters in these tasks, and report results on the original development split. We implement all models with Tensorflow 1.15 on a Tesla V100 GPU (32GB memory). We set α to 1 and the batch size to 28. Training epochs for the six tasks are set to 5, 30, 5, 30, 5, 5 respectively. Sequence lengths of the two texts for the six tasks are set to (128, 128), (64, 328), (128, 128), (64, 328), (200, 200), (16, 256) respectively. The learning rate is set to 5e-5, with the warm-up ratio set to 0.1. All models are optimized by the Adam optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e-8. For measuring the online inference latency, we run the inference with the batch size set to 28. We repeat each experiment 10 times and report the metrics based on the average over these runs.

Main Results
The performance comparison of different methods is presented in Table 2. BERT-base shows its effectiveness as a powerful interaction-based model.
Siamese BERT has a significant performance decline compared with BERT. DeFormer, DiPair, Poly-encoder and Sentence-T5 achieve considerable improvement compared with Siamese BERT. Finally, VIRT achieves the best performance, outperforming all the representation-based baselines. It even obtains competitive results compared with the interaction-based BERT model. These results validate that VIRT is able to approximate the deep interaction modeling ability of the interaction-based models.
We further compare the inference latency on the BoolQ dataset across different models, which is also listed in Table 2. According to the results, all representation-based models show significant speedup compared with the interaction-based models. The speedup mainly benefits from the Siamese encoder, which enables embeddings to be computed offline. Siamese BERT achieves the fastest inference speed, yet suffers from a severe performance decline. DeFormer has relatively higher latency, due to the computational complexity of the extra interaction layers. DiPair truncates the sequence to a shorter length before the interaction layer, which produces an excellent speed-up in terms of online latency. Poly-encoder and Sentence-T5 considerably improve the performance, at the cost of slightly increased computation. Compared with all the baselines, our model achieves superior performance while keeping high efficiency. Note that the inference latency is computed based on the average of all example pairs in an online manner. However, representation-based methods are able to pre-compute the embeddings of the corpus offline, and therefore dramatically reduce the inference time for downstream applications.
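The offline/online split that drives these latency numbers can be illustrated with a small sketch. Everything here is a stand-in: `toy_encode` is a hypothetical character-hashing function used in place of a trained Siamese encoder, the corpus is invented, and the online scoring is reduced to a dot product, whereas VIRT would additionally apply its light interaction layer to the cached representations.

```python
import numpy as np

def toy_encode(text, dim=16):
    """Hypothetical stand-in for a trained Siamese encoder (NOT a real model)."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0  # deterministic character hashing
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Offline: encode every passage once and cache the embeddings.
corpus = [
    "wireless noise cancelling headphones",
    "stainless steel water bottle",
    "running shoes for men",
]
cached = np.stack([toy_encode(p) for p in corpus])  # shape (num_passages, dim)

def score(query):
    """Online: encode only the query, then do one matrix-vector product."""
    return cached @ toy_encode(query)

scores = score("wireless noise cancelling headphones")
print(scores.shape)
```

An interaction-based model cannot use this pattern, because every (query, passage) pair has to pass through the full encoder together at query time.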

Ablation Study
To understand the impact of different components in VIRT, we conduct an ablation study by removing each component and retraining the models. In particular, "w/o distillation loss" means removing the optimization goal of Eq. 4.
"w/o adapted interaction" means removing the adapted interaction in Eq. 5, and using simple fusion of the last-layer representations as in Eq. 6. "w/o both" means removing both strategies simultaneously. The results are shown in Figure 3. The drop in performance without distillation or adapted interaction indicates the effectiveness of these two components. For MNLI and RTE, the performance drop caused by removing adapted interaction is more severe. Our hypothesis is that MNLI and RTE are natural language inference tasks, which require more fine-grained matching signals and rely heavily on explicit interaction. For QQP, BoolQ and Q2A, adapted interaction has less effect. However, distillation still brings substantial improvement, which further validates the effectiveness of incorporating interaction.

Layer Importance
In this set of experiments, we apply VIRT to different selected layers in the dual encoder to understand the importance of the interaction [...] (3) VIRT-Skip: applying VIRT to 1-in-k layers. (4) VIRT-All: applying VIRT to all layers. The results on MNLI and BoolQ are shown in Figure 4. It is not surprising that VIRT-All achieves the best performance among all the compared settings, showing the importance of the interaction for all layers. We observe that VIRT-First performs better than VIRT-Last and VIRT-Skip when each activates 6 layers, which indicates that interaction knowledge from the bottom layers plays a crucial role. We also applied VIRT to the last layer only, following Wang et al. (2020), who claim that distilling the last layer is enough. However, we find that when the teacher model and the student model are heterogeneous, distilling only the last layer leads to severe performance degradation.
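The layer-selection settings can be expressed as a small helper that returns which encoder layers receive the distillation loss. This is a sketch under assumptions: the definitions of VIRT-First and VIRT-Last are not present in this text and are inferred from their names as the first and last `active` layers, VIRT-Skip is taken to pick the last layer of each evenly sized group, and the function name `virt_layers` is hypothetical.

```python
def virt_layers(strategy, num_layers=12, active=6):
    """Return the 1-based indices of layers that receive the VIRT loss."""
    if strategy == "all":
        return list(range(1, num_layers + 1))
    if strategy == "first":  # assumed: the bottom `active` layers
        return list(range(1, active + 1))
    if strategy == "last":   # assumed: the top `active` layers
        return list(range(num_layers - active + 1, num_layers + 1))
    if strategy == "skip":   # assumed: one layer out of every `stride` layers
        stride = num_layers // active
        return list(range(stride, num_layers + 1, stride))
    raise ValueError(f"unknown strategy: {strategy}")

for name in ("first", "last", "skip", "all"):
    print(name, virt_layers(name))
```

With 12 layers and 6 active layers, the first three settings each distill into the same number of layers, which is what makes the comparison between them meaningful.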

Impact of VIRT Distillation
To verify the generality and effectiveness of the proposed VIRT distillation, we further integrate it into the aforementioned representation-based models by applying the knowledge distillation to different baselines. The results are reported in Table 3. According to the results, VIRT distillation can be easily integrated into other representation-based text matching models to improve their performance. Note that the results in Table 3 are different from the results of "w/o adapted interaction" in the ablation study. In the ablation study, we always leverage the fusion layer from Eq. 6, which yields much better performance. Similar observations have been reported for Sentence-T5 (Ni et al., 2022).

Different Model Configurations
We apply VIRT (including VIRT distillation and VIRT-adapted interaction) to pre-trained models of different sizes to show its robustness to different numbers of encoder layers. We conduct experiments using BERT-Tiny (2/128), BERT-Mini (4/256), BERT-Small (4/512), BERT-Medium (8/512), BERT-Base (12/768) and BERT-Large (24/1024) on the MNLI dataset, where a/b means that the number of encoder layers is a and the dimension of the hidden representation is b. The results are reported in Table 4. It can be seen from the results that VIRT yields better performance on all sizes of the pre-trained models, which is consistent with the observations from the main results.
The experimental results are shown in Figure 5.
From the results, it is clear that VIRT with α = 1 achieves the best performance among all the α values, which illustrates that $\mathcal{L}_{virt}$ is as important as $\mathcal{L}_{task}$. We also observe that the performance of VIRT is relatively stable over a wide range of α, e.g., from 0.6 to 1.

Case Study
To show the effect of VIRT distillation in a more intuitive way, we visualize the attention matrices of different models. Specifically, we choose an example from the MNLI dataset and plot the corresponding attention matrices of the interaction-based model and the representation-based model with/without VIRT distillation. As shown in Figure 6(a)-6(c), the attention matrix with VIRT distillation is more consistent with that of the interaction-based model than the attention matrix of the model without VIRT. In particular, the interaction-based model aligns "peaceful" with "peace", which can be learnt by VIRT, whereas the representation-based model misses this information. As a result, the representation-based model without VIRT fails to predict the "neutral" relationship between the two sentences.

Conclusion
Representation-based models are widely used in text matching tasks due to their high efficiency, but they under-perform interaction-based models because they lack interaction. Previous works often introduce extra interaction layers, while the interaction in Siamese encoders is still missing. In this paper, we propose a virtual interaction (VIRT) mechanism, which approximates the interactive modeling ability by distilling the attention map from interaction-based models to the Siamese encoders of representation-based models, with no additional inference cost. The proposed VIRT, which employs knowledge distillation as well as an adapted interaction strategy, achieves state-of-the-art performance among existing representation-based models on several text matching tasks.

Limitations
Although the proposed VIRT mechanism enhances the performance of dual encoder architectures and achieves new SOTA results on several datasets, two limitations are discussed in this section. First, in comparison to vanilla dual encoder models such as Sentence-BERT, the training cost of VIRT is higher because it introduces the virtual interaction distillation computation (i.e., the computational cost of the distillation loss). Second, the performance of VIRT is highly correlated with the performance of the interaction-based teacher. A stronger teacher usually leads to a dual encoder student with higher performance.

Figure 1: Schematic diagrams illustrating paradigms of text matching. The figure contrasts existing approaches (sub-figures (a) and (b)) with the proposed model (sub-figure (c)).

In Eq. 1, $Q$, $K$ and $V$ are the query, key and value matrices, and LN(·) refers to the layer-normalization operation. The interaction-based models are able to encode interactive information into the representations of X and Y through the full attention mechanism.
Representation-based Models In contrast to interaction-based models, representation-based models encode X and Y individually through two independent Siamese Transformer encoders: $H_x^{(L)} = \mathrm{Enc}_x(X)$ and $H_y^{(L)} = \mathrm{Enc}_y(Y)$.

Figure 2: The proposed VIRT model architecture. (a) Interactive knowledge transfer procedure by distilling the attention map from the interaction-based model. (b) VIRT mechanism details.

Figure 3: Ablation analysis for different components on all datasets.

Figure 4: Ablation study of applying VIRT to different encoder layers on MNLI and BoolQ.

Figure 6: Visualization of the attention matrices. (a) The attention matrix of the interaction-based model. (b) The attention matrix of the representation-based model with VIRT distillation. (c) The attention matrix of the representation-based model without VIRT distillation.

Table 1: The detailed statistics and average text lengths are presented. Note that the average length of Chinese texts is measured in characters, and that of English texts in words.

Table 3: Performance gain of applying VIRT distillation to different representation-based models. ↑ represents the performance gain.

Table 4: Performance gain of applying VIRT distillation to models with different configurations.