Improving Semantic Matching through Dependency-Enhanced Pre-trained Model with Adaptive Fusion

Transformer-based pre-trained models like BERT have achieved great progress on Semantic Sentence Matching. Meanwhile, dependency prior knowledge has also shown general benefits in multiple NLP tasks. However, how to efficiently integrate dependency prior structure into pre-trained models to better model complex semantic matching relations is still unsettled. In this paper, we propose \textbf{D}ependency-Enhanced \textbf{A}daptive \textbf{F}usion \textbf{A}ttention (\textbf{DAFA}), which explicitly introduces dependency structure into pre-trained models and adaptively fuses it with semantic information. Specifically, \textbf{\emph{(i)}} DAFA first proposes a structure-sensitive paradigm to construct a dependency matrix for calibrating attention weights. \textbf{\emph{(ii)}} It then adopts an adaptive fusion module to integrate the obtained dependency information with the original semantic signals. \textbf{\emph{(iii)}} Moreover, DAFA reconstructs the attention calculation flow and provides better interpretability. Applied to BERT, our method achieves state-of-the-art or competitive performance on 10 public datasets, demonstrating the benefits of adaptively fusing dependency structure in the semantic matching task.


Introduction
Semantic Sentence Matching (SSM) is a fundamental technology in multiple NLP scenarios. The goal of SSM is to compare two sentences and identify their semantic relationship. It is widely utilized in recommendation systems (Zeng et al., 2021), dialogue systems (Yu et al., 2014), search systems (Li and Xu, 2014), and so on (Gao et al., 2018).
Across the rich history of semantic matching research, there have been two main streams of studies for solving this problem. One is to utilize a sentence encoder to convert sentences into low-dimensional vectors in the latent space, and apply a parameterized function to learn the matching scores between them (Conneau et al., 2017; Reimers and Gurevych, 2019). The other paradigm tends to align phrases and aggregate the integrated information into a prediction layer to acquire similarity and make a sentence-level decision (Chen et al., 2016; Tay et al., 2017). After the emergence of large-scale pre-trained language models (PLMs), recent work attempts to integrate external knowledge (Miller, 1995; Bodenreider, 2004) into PLMs. For example, SemBERT concatenates semantic role annotations to enhance BERT. UER-BERT (Xia et al., 2021) chooses to inject synonym knowledge. SyntaxBERT (Bai et al., 2021) integrates the syntax tree into transformer-based models. Meanwhile, leveraging external knowledge to enhance PLMs has been proven to be highly useful for multiple NLP tasks (Kiperwasser and Ballesteros, 2018; Bowman et al., 2016).
Although previous work has achieved great progress in SSM, existing models (e.g., BERT, RoBERTa) still cannot efficiently and explicitly utilize dependency structure to identify semantic differences, especially when two sentences are literally similar. To illustrate this, we display an example of a misjudgment by BERT (Devlin et al., 2018) in Figure 1. In the first sentence, the dependency between "exceeded" and "company" is obj, and between "exceeded" and "Apple" it is nsubj. Its dependency structure is completely opposite to that of the second sentence. Although the literal similarity of these two sentences is extremely high, their semantics are still quite different. Based on the above observations, we intuitively believe that the dependency structure needs to be considered in the process of semantic matching. From a deeper perspective, the MLM training approach of most existing PLMs statistically optimizes co-occurrence probability (Yang et al., 2017), whereas dependency structure can reflect the dependency relationships within the sentence and integrate prior linguistic knowledge to enhance interaction features. Combined with the actual attention alignment process, we believe that constructing a dependency matrix, strengthening the attention weight of matching dependencies and reducing the attention weight of mismatched dependencies, will further improve the performance of existing PLMs. Therefore, two systemic questions arise naturally:

Q1: How to construct a dependency matrix that contains dependency prior knowledge? Inconsistent dependency structures can lead to severe semantic divergence even between sentences with similar text. To capture the dependency structure features, we propose a structure-aware paradigm to construct the dependency matrix. Our paradigm utilizes three distinct modules: the dependency similarity between words, the matching of dependency subgraphs, and tf-idf weights.
Q2: How to integrate the introduced dependency signals provided by the dependency matrix? To maximize the benefits of the dependency knowledge, we integrate the dependency structure to calibrate our attention alignment. Therefore, our model can not only understand sentence semantics, but also further enhance structural alignment awareness. However, a hard fusion of dependency and semantic signals through a simple structure may break the original representation ability of PLMs, so how to inject the obtained dependency information softly remains a hard issue. In this paper, we propose an Adaptive Fusion module: (i) It first inter-aligns these two signals through distinct attentions, and generates vectors describing sentence matching details. (ii) Then, multiple gates are utilized to extract meaningful information adaptively. (iii) Moreover, the vectors are further scaled with a fuse-gate to reduce the possibility of noise introduced by dependency features. Eventually, this soft aggregation method can adaptively fuse the collected information and obtain the fusion signals.
Overall, our contributions are mainly as follows:
• We discuss in detail the methodology of further leveraging dependency and explicitly integrating dependency structure into PLMs.
• We propose a novel dependency calibration and fusion network, DAFA, which is a meaningful practice combining semantic and dependency information and provides better interpretability. DAFA leverages dependency structure to calibrate attention alignment and constructs a fusion module to adaptively integrate distinct features.
• To verify the effectiveness of DAFA, we conduct extensive experiments on 10 datasets in SSM and achieve state-of-the-art or competitive performance over other strong baselines, which proves the effectiveness of our method.

Approach
In this section, we introduce DAFA in detail and the overall architecture is presented in Figure 2.

How to Build the Dependency Matrix
To construct a dependency matrix that contains dependency knowledge, we propose a structure-aware paradigm with three modules: (1) we first use the similarity of dependency trees to build our matrix.
(2) Then, we introduce subgraph matching to further align the dependency substructures.
(3) Moreover, we also add tf-idf weights to reallocate more attention to keywords and their dependency.
We utilize trigrams to model dependency trees; one trigram unit denotes one branch. In the first sentence of Figure 1, {"exceeded", nsubj, "Apple"} is a trigram unit. Apart from literal similarity, two similar trigrams indicate closer semantics. We set n and m as the lengths of sentences A and B. A_i denotes the i-th word of A, and D_A^i denotes the trigram with A_i as the tail node; h_A^i, t_A^i, and r_A^i denote the head node, tail node, and dependency relation of D_A^i, respectively. The dependency similarity matrix M scores each trigram pair, where θ is a constant that determines the effect of a dependency type match. However, M may over-focus on dependency match and neglect the comparison of consecutive dependency trigrams. Therefore, by adopting the subgraph matching mechanism, we can also align the substructures of the two dependency trees to acquire the continuous dependency similarity S, where ν is a decay factor that prevents S from increasing extremely and α is a fixed score for the child nodes of a matching trigram pair. T_A^i denotes the set of child nodes of A_i and A_x^i is the x-th child. S recursively compares the child nodes of each matching pair.
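The trigram construction and pairwise scoring described above can be sketched in a few lines. This is a minimal illustrative reimplementation under our own assumptions (the parse representation and the exact scoring form are not specified in full by the paper); only θ = 2.0 is taken from Appendix B.1.

```python
# Hypothetical sketch of the trigram-based dependency matrix M.
THETA = 2.0  # effect of a dependency-type match (Appendix B.1 sets theta = 2.0)

def trigrams(heads, relations, words):
    """One trigram {head word, relation, tail word} per token (the tail node)."""
    tris = []
    for i, w in enumerate(words):
        h = heads[i]  # index of the head word, -1 for the root
        head_word = words[h] if h >= 0 else "<ROOT>"
        tris.append((head_word, relations[i], w))
    return tris

def trigram_score(t_a, t_b):
    """Score a trigram pair: word overlap, boosted when the relations match."""
    (ha, ra, ta), (hb, rb, tb) = t_a, t_b
    score = float(ha == hb) + float(ta == tb)
    if ra == rb:
        score *= THETA
    return score

def dependency_matrix(sent_a, sent_b):
    """M[i][j] compares the trigram ending at A_i with the one ending at B_j."""
    tri_a = trigrams(*sent_a)
    tri_b = trigrams(*sent_b)
    return [[trigram_score(ta, tb) for tb in tri_b] for ta in tri_a]

# Assumed parses for "Which company has Apple exceeded" vs.
# "Which company exceeded Apple" (the Figure 1 example).
a = ([1, 4, 4, 4, -1],
     ["det", "obj", "aux", "nsubj", "root"],
     ["Which", "company", "has", "Apple", "exceeded"])
b = ([1, 2, -1, 2],
     ["det", "nsubj", "root", "obj"],
     ["Which", "company", "exceeded", "Apple"])
M = dependency_matrix(a, b)
```

Note how the shared word "Apple" gets no θ boost here, because its relation flips from nsubj to obj between the two sentences — exactly the mismatch the matrix is meant to expose.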
However, our dependency matrix still ignores the difference in importance between words in the same sentence. Keywords and their dependency relationships should be allocated more attention. Therefore, we rely on tf-idf weights (Ramos et al., 2003) to obtain the informativeness scores of distinct sentence components and align the tf-idf weights of the tail nodes in two trigrams, where tf_a denotes the tf-idf weight of word a, and M_F ∈ R^{n×m} is our final dependency matrix.
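The tf-idf reweighting step can be sketched as follows. The tf-idf variant and the way the two tail-node weights are combined (here, a simple product) are our own assumptions; the paper only states that tail-node weights are aligned.

```python
import math

def tfidf_weights(docs):
    """Plain tf-idf per word per document (a minimal stand-in for the
    Ramos et al. (2003) weighting used in the paper)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in doc}
        weights.append({w: tf[w] * math.log((1 + n_docs) / (1 + df[w])) + 1.0
                        for w in doc})
    return weights

def apply_tfidf(M, words_a, words_b, corpus):
    """Rescale M[i][j] by the tf-idf weights of the two tail nodes A_i, B_j."""
    w = tfidf_weights(corpus + [words_a, words_b])
    wa, wb = w[-2], w[-1]
    return [[M[i][j] * wa[words_a[i]] * wb[words_b[j]]
             for j in range(len(words_b))] for i in range(len(words_a))]

# Rare content words ("apple", "banana") are upweighted relative to the
# ubiquitous function word "the".
M2 = apply_tfidf([[1.0, 1.0], [1.0, 1.0]],
                 ["the", "apple"], ["the", "banana"],
                 [["the", "cat"], ["the", "dog"]])
```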

How to Integrate Dependency Information
To better utilize the gained dependency information, we propose to inject our dependency matrix into the original transformer attention module and apply the dependency structure to calibrate the attention alignment. The attention module can be considered as a mapping from a query vector Q and a set of key-value vector pairs (K, V) to the attention distribution, and each layer has multiple parallel attention heads. By introducing M_F, the calculation flow of each head is modified, where d_k and d_v are the dimensions of K and V and d_seq is the input length. We shift every element of M_F by adding 1 and ensure that each element sits at the corresponding position in the sentence alignment. ⊙ is the element-wise dot product, and Dep ∈ R^{d_v×d_seq} denotes the dependency signals from the attention matrix calibrated by our dependency matrix.

However, simple concatenation and fusion seem to underestimate the deep interaction between these two signals and ignore the potential noise introduced by dependency structure: incorrect structural information may produce noisy outputs and yield wrong predictions. Therefore, to further improve the fault tolerance of our model and reduce error propagation, we propose an adaptive fusion module. As shown in Figure 3, (i) it first interacts and aligns the two signals flexibly with semantic-guided attention and dependency-guided attention. (ii) Then, it adopts multiple gate modules to selectively extract useful features. (iii) Finally, due to the possibility of noise, a filtration gate is utilized to adaptively filter out inappropriate information.
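The calibrated attention flow for one head can be sketched as follows. The exact placement of the shift-by-1 rescaling is an assumption based on the description above (the paper's released implementation may differ); the sketch uses plain Python lists for clarity rather than tensors.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def calibrated_head(Q, K, V, M_F):
    """One head of the reconstructed attention flow: the scaled scores are
    element-wise rescaled by (1 + M_F), yielding dependency signals Dep
    alongside the original semantic signals Sem."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    Sem = matmul([softmax(r) for r in scores], V)       # original attention
    calibrated = [[s * (1.0 + m) for s, m in zip(row, mrow)]
                  for row, mrow in zip(scores, M_F)]     # dependency-calibrated
    Dep = matmul([softmax(r) for r in calibrated], V)
    return Sem, Dep

# A diagonal M_F sharpens the attention toward dependency-aligned positions.
I = [[1.0, 0.0], [0.0, 1.0]]
Sem, Dep = calibrated_head(I, I, I, I)
```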
Firstly, we update the dependency signals through semantic-guided attention. We use s_i and d_i to denote the i-th feature of Sem and Dep, respectively. We provide each semantic vector s_i to interact with the dependency signal matrix Dep and obtain the new dependency feature d*_i. Then, based on d*_i, we can in turn acquire the new semantic feature s*_i through dependency-guided attention, where the attention projection weights and biases are trainable parameters of our model, and ⊕ denotes the concatenation of a signal matrix and a feature vector.
Secondly, to adaptively capture and fuse useful information from the new semantic and dependency features, we introduce our gated fusion modules, where the gate weights Wŝ_i and biases bŝ_i are parameters and d_hid is the size of the hidden layer. σ is the sigmoid activation and g_i is the gate that determines the transmission of the two distinct representations. In this way, we obtain the fusion feature v_i that adaptively fuses the new semantic and dependency signals.
Eventually, considering the potential noise problem, we propose a filtration gate to selectively leverage the fusion feature. When v_i tends to be beneficial, the filtration gate incorporates the fusion features together with the original features; otherwise, the fusion information is filtered out. Here the gate weights and biases are trainable parameters and l_i is our final dependency-enhanced semantic feature.
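The two gating steps above can be sketched together. The weight shapes and the exact gating form are our own assumptions — the surviving text omits the full equations — but the structure (a fusion gate g blending s* and d*, then a filtration gate f deciding how much of the fused feature survives) follows the description.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, b, x):
    # y = W x + b for a weight matrix given as a list of rows
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def adaptive_fusion(s_star, d_star, s_orig, params):
    """Gated fusion followed by a filtration gate (illustrative form).

    g_i blends the updated semantic (s*) and dependency (d*) features into
    v_i; the filtration gate f_i then decides how much of v_i survives next
    to the original semantic feature, yielding l_i.
    """
    Wg, bg, Wf, bf = params
    g = [sigmoid(x) for x in linear(Wg, bg, s_star + d_star)]
    v = [gi * si + (1 - gi) * di for gi, si, di in zip(g, s_star, d_star)]
    f = [sigmoid(x) for x in linear(Wf, bf, s_orig + v)]
    l = [fi * vi + (1 - fi) * si for fi, vi, si in zip(f, v, s_orig)]
    return l, f

# With all-zero parameters every gate is exactly 0.5, so the behaviour is
# easy to verify by hand: v = (s* + d*) / 2 and l = (v + s) / 2.
d = 2
zero = ([[0.0] * (2 * d) for _ in range(d)], [0.0] * d,
        [[0.0] * (2 * d) for _ in range(d)], [0.0] * d)
l, f = adaptive_fusion([1.0, 1.0], [0.0, 0.0], [2.0, 2.0], zero)
```

A trained filtration gate near 1 lets the fused (dependency-aware) feature dominate, while a gate near 0 falls back to the original semantic feature — matching the 0.91–0.93 vs. ~0.1 gate values reported in the case study.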

Experiments
The datasets, baselines and all details of our experiments are shown in the Appendix B.

Results
In our experiments, we implemented DAFA in the initial transformer layer of BERT.
Figure 3: The overall structure of Adaptive Fusion Attention (taking s_i as an example).

The Main Results of GLUE Datasets We first fine-tune our model on 6 GLUE datasets. Table 1 shows the performance of DAFA and other competitive models. It can be seen that non-pre-trained models perform obviously worse than PLMs, which benefit from strong context awareness and data fitting capabilities. When the backbone model is BERT-base and BERT-large, the average accuracy of DAFA improves by 2.23% and 2.37%, respectively, over vanilla BERT. Such a large improvement demonstrates the benefit of adaptively fusing dependency structure for mining semantics, and also proves that our framework can help BERT perform much better in SSM.
Moreover, some previous works such as SemBERT, UERBERT and SyntaxBERT also outperform vanilla BERT by injecting external knowledge, but DAFA still maintains the best performance. Specifically, our model outperforms SyntaxBERT, the top-performing previous model leveraging external knowledge, with an average relative improvement of 0.91% based on BERT-large. Especially on the QQP dataset, the accuracy of DAFA improves significantly by 2% over SyntaxBERT. There are two main reasons: ⋄ On the one hand, we use subgraph matching and keyword weights to enhance the ability of DAFA to capture dependency knowledge, so DAFA obtains interactive information that is more conducive to fusing fine-grained features. ⋄ On the other hand, for the latent noise introduced by external knowledge, our adaptive fusion module can selectively filter out inappropriate signals to suppress the propagation of noise, while previous work seems to have paid insufficient attention to this issue. However, we still notice that SyntaxBERT achieves slightly better accuracy on a few datasets; we consider this a result of the instability of noise.
The Results of Other Datasets Second, to verify the general performance of our method, we also conduct experiments on 4 other popular datasets. The results are shown in Table 2. DAFA still outperforms vanilla BERT and several representative models on almost all datasets. It is worth noting that although DAFA outperforms MT-DNN (Liu et al., 2019) on SNLI, it does not perform as well as MT-DNN on Scitail. This is because MT-DNN uses more model parameters and a large amount of cross-task training data, which gives it an advantage in this regard, but also requires more training time and computational cost. Besides, the data volume of Scitail is relatively small, which makes the variance of the model's predictions larger. Nevertheless, DAFA still shows very competitive performance on Scitail, which also suggests that our method can compensate for limited generalization ability with fewer parameters by endowing BERT with dependency structure awareness.
Overall, consistent conclusions can be drawn from these results. Compared to previous work, our dependency framework is highly competitive in judging semantic similarity, and the experimental results confirm our intuitions.

Ablation Study
To assess the contribution of each component in our approach, we conduct ablation experiments on the QQP dataset based on BERT-large. The experimental results are shown in Table 8.
Dependency Matrix Our dependency matrix is jointly constructed by three components: (a) We first remove the dependency tree similarity and subgraph matching modules respectively, and the model performance drops by nearly 0.7% and 0.5%. This suggests that simple dependency structure alignment can further describe the interactions between words to achieve better semantic similarity. (b) Subgraph matching aligns the dependency substructures and enriches the contextual representation by introducing finer-grained comparison information. (c) Besides, due to the different importance of words in a sentence, tf-idf can weight each word to readjust the attention distribution; the accuracy also drops slightly after the model removes tf-idf. These results demonstrate the effectiveness of each component of our dependency matrix.

Adaptive Fusion We also conduct multiple experiments to verify the effect of adaptively fusing the original semantic signals and the dependency signals. (a) We first remove the cross-attention module, and the performance drops to 91.2%. Cross-attention can capture the interaction information between the two signals, and this interaction information is crucial for semantic matching. (b) Moreover, we remove the multiple gate mechanisms, relying only on the attention modules to integrate our signals, and the accuracy drops to 91.4%. This shows that the model's ability to suppress noise is weakened without filter gates. We also replace the overall adaptive fusion module with simple averaging, and the performance drops sharply to 91.0%, indicating that soft aggregation and governance can better integrate semantic and dependency signals.
Sub-Module Analysis To further verify the contribution of each submodule to DAFA, we separately assemble the respective sub-modules in the dependency matrix computation and the adaptive fusion; the results are shown in Figure 4. First, we find that adding any of the sub-modules improves performance over the baseline. Second, the aggregation of dependency similarity and fusion attention provides the most significant improvements, which intuitively reflects their cornerstone roles as the core modules of the dependency matrix and adaptive fusion, respectively. Such results again confirm the necessity of each sub-module.
Overall, due to the efficient combination of each component, DAFA can adaptively fuse dependency-aligned features into pre-trained models and leverage their powerful contextual representation capability to better infer semantics.

Case Study
To visually demonstrate the validity of our approach, we also conduct a qualitative study using multiple cases in Table 7. S1 and S2 in the first two cases are literally similar and differ only in the dependency between words, but they express two quite different semantics. In the first example, BERT attempts to capture interaction information from these two sentences, but ignores the dependency between "New York" and "Seattle". It fails to distinguish the semantic difference and gives wrong prediction results. By adopting the dependency structure, our method can capture dependency dislocation information and comprehend fine-grained semantics. As the results show, DAFA gives correct predictions.
However, the injection of dependency structure may generate noise and interference. For example, in the third case, "got sick" and "was taken ill" express the same semantics, but their dependency trees diverge significantly and may mislead the model. By simply averaging the semantic and dependency signals, DAFA-avg instead gives wrong answers in the last two examples. Therefore, we propose an adaptive fusion module to reduce the possibility of noisy or useless signals. The filter gate reflects the degree to which the model adopts the dependency structure. In the first two examples, our model learns the important impact of dependency on semantics by adaptively fusing distinct information: DAFA automatically sets the filter gate to 0.91-0.93 and improves its perception of the dependency structure. However, in the last two examples, our adaptive fusion module identifies the latent noise in the dependency structure. To alleviate the possible misleading effect, DAFA correspondingly sets the filter gate to about 0.1, which weakens the model's sensitivity to dependency.
Eventually, as the results show, our DAFA makes correct predictions in all of the above cases and increases the fault tolerance of the model. Such results again demonstrate the effectiveness and necessity of our components.

Qualitative Analysis
Attention Distribution and Interpretability To verify the calibration effect of the dependency structure alignment as well as to perform a visual interpretability analysis, we display the attention distribution between two sentences that are literally similar but semantically distinct. The weights for one of our attention heads are shown in Figure 5.
Obviously, vanilla BERT is heavily influenced by identical words in the two sentences. It ignores deep semantic associations and instead over-focuses on shallow literal features, which may lead to wrong predictions. However, after being calibrated by our method, the attention weights not only capture the shared words in the sentences, but also pay more attention to the alignment between the dependency structures. For example, DAFA not only increases the weight between the shared "exceeded" in the two sentences, but also increases the weight between "Apple" in the first sentence and "company" in the second. This is because "Apple-exceeded" and "company-exceeded" form the subject-predicate dependency structures in the two sentences, respectively.
Meanwhile, attention modules are often used to explore the interpretability of a model (Clark et al., 2019; Hao et al., 2020; Lin et al., 2019). In Figure 5(a), we can observe that it is difficult to determine the deep interaction information between sentences from simple contextual features alone. However, dependency can align sentence structure at the word level. In Figure 5(b), DAFA significantly mitigates the strong influence of identical words and enhances the sensitivity to dependency structures. The calibrated attention distribution is more in line with human cognition and validates our methodology. Since dependency and semantics are linguistic expressions with different perspectives and granularities, their combination can further improve the model's ability to discern subtle semantic differences and reduce error propagation. In addition, the results of the ablation experiments also validate DAFA at the component level and are consistent with our predictions. Therefore, our method provides better interpretability.

Data Scenarios and Structural Analysis To further verify the generalization ability of our method, we conduct a range of experiments under different data scenarios on SNLI and Scitail. We use 1% to 100% of the training corpus to fine-tune our model, and then evaluate it on the test set. As illustrated in Figure 6(a), our approach obviously improves the performance of BERT in the data-sparse scenario (1%) and surpasses BERT at every amount of data (from 1% to 100%). This shows that dependency prior knowledge provides a highly salient performance gain when the training data is scarce, and further proves that DAFA can effectively enhance BERT under distinct data scenarios.
To explore which layer most requires the dependency structure, we implement DAFA in the initial transformer layer and in all 12 transformer layers of BERT, respectively. Our experiments use 1% to 100% of the MRPC data and the results are shown in Figure 6(b). The effect of our method (initial layer) significantly exceeds both vanilla BERT and the approach that implements DAFA in all 12 layers of BERT. The main reason is that BERT pays more attention to word-level features at the bottom layers and semantic features at the top layers.

Stability Study We also conducted extensive experiments on 4 datasets to explore the stability of our method. To minimize the effect of randomness in BERT training, performance levels are averaged across 10 different runs on the dev set. The performance distribution box plot is shown in Figure 7. The median and mean levels of our model surpass vanilla BERT on all 4 datasets, and the performance fluctuation of our method stays within ±1% around the mean, which indicates that our method is more stable than BERT across different data distributions.

Related Work
Semantic Sentence Matching is a fundamental task in NLP. In recent years, thanks to the appearance of large-scale annotated datasets (Bowman et al., 2015; Williams et al., 2017), neural network models have made great progress in SSM (Qiu and Huang, 2015; Wan et al., 2016), mainly falling into two categories. The first (Conneau et al., 2017; Choi et al., 2018) focuses on encoding sentences into corresponding vector representations without any cross-interaction and applies a classifier layer to obtain similarity. The second (Liang et al., 2019a; Chen et al., 2016; Xue et al., 2023) utilizes cross-features in an attention module to express word-level or phrase-level alignments, and aggregates this integrated information to acquire similarity.
Recently, the shift from neural network architecture engineering to large-scale pre-training has achieved outstanding performance in SSM and many other tasks. Meanwhile, leveraging external knowledge (Miller, 1995; Bodenreider, 2004) to enhance PLMs has been proven to be highly useful for multiple NLP tasks (Kiperwasser and Ballesteros, 2018). Therefore, recent work attempts to integrate external knowledge into pre-trained language models, such as AMAN, DABERT, UERBERT, SyntaxBERT, and so on (Liang et al., 2019b; Wang et al., 2022; Xia et al., 2021; Bai et al., 2021).
Dependency Syntax As a crucial form of prior knowledge, the dependency tree indicates the existence and type of linguistic dependency relations among words, which has shown general benefits in various NLP tasks (Bowman et al., 2016). Therefore, many approaches that adopt syntactic dependency information have been proposed and have attained great results (Duan et al., 2019). For example, Strubell et al. (2018) present linguistically-informed self-attention (LISA) in a multi-task learning framework. Sachan et al. (2020) investigate popular strategies for incorporating dependency structure into PLMs. Liu et al. (2023) used a grammar-guided dual-context architecture network (SG-Net) to achieve SOTA results on span-based answer extraction tasks.

Conclusions
In this paper, we propose Dependency-Enhanced Adaptive Fusion Attention (DAFA), which can adaptively utilize dependency alignment features for semantic matching. Building on the contextual representation capability of BERT, DAFA enables the model to learn more fine-grained comparison information and enhances the sensitivity of PLMs to dependency structure. The experimental results on 10 public datasets indicate that our approach achieves better performance than multiple strong baselines. Since DAFA is an end-to-end trainable component, we expect it to be applied to other large-scale pre-trained models in the future.

A.1 Effect of Different Training Data Volumes
We randomly select 10% to 100% of the data from the training set and conduct data-scenario analysis experiments on the eight other datasets. We show the results in Figure 8. For both BERT and DAFA, we train 5 times for each training scale of each dataset. The resulting curves reveal several interesting patterns. First, our proposed method outperforms vanilla BERT almost uniformly across all training data sizes. Second, on datasets such as RTE, STS-B, and SICK, dependency provides the most significant performance improvement when the training data is small. These findings suggest that if training data is scarce, it is wise to consider injecting dependency knowledge into BERT.

B.1 Implementation Details of Our Experiments
Implementation Details DAFA is based on BERT-base and BERT-large. For distinct targets, our hyper-parameters differ. We use the AdamW optimizer and set the learning rate in {1e-5, 2e-5, 3e-5, 8e-6}. We set warm-up to 0.1, L2 weight decay to 1e-8, and the constant θ to 2.0. The number of epochs is between 3 and 5, and the batch size is selected from {16, 32, 64}. We also set dropout to 0.1-0.3 and gradient clipping in {7.5, 10.0, 15.0}. Our experiments are performed on one A100 GPU. Besides, our dependency parser is the biaffine parser proposed by Dozat and Manning (2016). We use the original phrase-structure Penn Treebank (PTB) (Marcinkiewicz, 1994), converted by the Stanford Parser v3.3.0, to retrain a parser model. The dependency parser is not updated with our framework.

B.2 Datasets Statistics
The statistics of all 10 datasets are shown in Table 9.
• MRPC is a dataset of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in each pair are semantically equivalent. The task is binary: paraphrase or not paraphrase.
• QQP comes from the community Q&A website Quora. The goal is to predict whether the two questions in a provided pair have the same meaning.
• STS-B is a collection of sentence pairs extracted from news headlines, video titles, image captions, and natural language inference data. Each pair is annotated by humans with a similarity score from 0 to 5. The task is to predict these similarity scores; it is essentially a regression problem, though it can also be treated as a five-class sentence-pair classification task.
• MNLI-m/mm is a crowd-sourced collection of sentence pairs annotated with textual entailment information. Given a premise and a hypothesis, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral).
• QNLI is a question-answering dataset consisting of question-paragraph pairs, where the paragraph comes from Wikipedia and one sentence in the paragraph contains the answer to the question. The task is to judge whether a given sentence from the paragraph contains the answer to the question, posed as binary classification.
• RTE is a series of datasets from the annual textual entailment challenges, constructed from news and Wikipedia. All data are converted into two classes; for three-class data, neutral and contradiction are merged into not-entailment for consistency.
• SNLI (Bowman et al., 2015) is a popular dataset used for entailment classification (natural language inference). The task is to determine whether two sentences entail, contradict, or are neutral to each other.
• Scitail (Khot et al., 2018) is an entailment dataset created from multiple-choice science exams and web sentences. Each question and its correct answer choice are converted into an assertive statement to form the hypothesis.
• SICK (Marelli et al., 2014) is a dataset for semantic textual similarity estimation. The task is to assign a similarity score to each sentence pair.
• TwitterURL (Lan et al., 2017) is a collection of sentence-level paraphrases from Twitter, gathered by linking tweets through shared URLs. The goal is to discriminate duplicates from non-duplicates.