An Empirical Revisiting of Linguistic Knowledge Fusion in Language Understanding Tasks

Though linguistic knowledge emerges during large-scale language model pretraining, recent work attempts to explicitly incorporate human-defined linguistic priors into task-specific finetuning. Infusing language models with syntactic or semantic knowledge from parsers has shown improvements on many language understanding tasks. To further investigate the effectiveness of structural linguistic priors, we conduct an empirical study on tasks in the GLUE benchmark, replacing parsed graphs or trees with trivial ones that carry little linguistic knowledge (e.g., balanced trees). Encoding with trivial graphs achieves competitive or even better performance in fully-supervised and few-shot settings. This reveals that the gains may be attributable not to explicit linguistic priors but to the additional feature interactions introduced by fusion layers. We therefore call for attention to trivial graphs as necessary baselines when designing advanced knowledge fusion methods in the future.


Introduction
Recently, large-scale pretrained language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) have been shown to acquire linguistic knowledge from unlabeled corpora and achieve strong performance on many downstream natural language processing (NLP) tasks. Though probing analyses indicate that, to some extent, they can implicitly capture syntactic or semantic structures (Hewitt and Manning, 2019; Goldberg, 2019; Tenney et al., 2018; Hou and Sachan, 2021), whether they can further benefit from more explicit linguistic knowledge remains an open problem. Attempts have been made to inject syntactic biases into language model pretraining (Kuncoro et al., 2020; Wang et al., 2021; Xu et al., 2021b) or to infuse finetuning with semantic information (Zhang et al., 2020a; Wu et al., 2021), and positive results have been reported on downstream tasks.
However, concerns have been raised about the effectiveness and viability of such linguistic knowledge. On the one hand, the performance gains rely heavily on the availability of human-annotated dependency parses (Sachan et al., 2021) or oracle semantic graphs (Prange et al., 2022), which limits real-world applicability, and developing accurate semantic graph parsers remains challenging (Oepen et al., 2019; Bai et al., 2022). On the other hand, incorporating trees induced from pretrained language models (Wu et al., 2020) can outperform fusing dependency-parsed trees for aspect-level sentiment analysis (Dai et al., 2021). This discovery is in line with similar findings on trivial trees for tree-LSTM encoders in sequence modeling tasks (Shi et al., 2018). In this work, we push the envelope and answer the following two questions. Do the knowledge fusion methods of Wu et al. (2021) benefit even from trivial graphs that contain no linguistic information? If so, where might the performance gains come from?
With these questions in mind, we empirically revisit the effectiveness of linguistic knowledge fusion in language understanding tasks. Motivated by Shi et al. (2018), we compare the performance of original dependency-parsed trees against balanced trees for syntax fusion, and of parsed semantic graphs against sequential graphs for semantic fusion. To our surprise, trivial graphs outperform syntactic trees and semantic graphs in the fully-supervised setting and achieve competitive results in the few-shot setting. All the evidence again suggests that linguistic inductive bias might not be the major contributor to the consistent improvements over baselines. Additional analysis gives some clues that the likely sources are extra model parameters and the feature interactions introduced by fusion modules. This work encourages future research to include trivial graphs as necessary baselines when designing more advanced knowledge fusion methods for downstream tasks. Our experimental code is available at https://github.com/HKUST-KnowComp/revisit-nlu-linguistic-knowledge.

Study Design
In this section, we briefly introduce two linguistic graphs, i.e., syntactic dependency trees and semantic graphs. As a comparison, we manually construct two trivial graphs to infuse into task-specific finetuning.

Linguistic Graph
Graphs have intuitively represented various linguistic phenomena in natural language, including sentence structures (Chomsky, 1957) and meanings (Koller et al., 2019).

Syntactic Dependency Tree. Syntactic trees are among the most commonly-used linguistic structures and have long been shown useful for many NLP tasks. Syntactic dependency mainly models head-dependent relations between words. Dependency parsers parse a sentence into tree structures, which are further incorporated into LMs via syntax-aware attention (Nguyen et al., 2019) or graph neural networks (GNN; Sachan et al., 2021).

Semantic Graphs. Different from syntactic dependency, semantic graphs aim to map sentences to higher-order meaning representations with more complex structures. Semantics typically concerns predicate-argument relations, where predicates evoke relations of various arity and arguments fill semantic roles tied to each specific predicate. One example is shown in Figure 1, and semantic graphs have the following characteristics: 1) argument sharing leads to nodes whose in-degrees are greater than one; 2) some tokens do not contribute to meaning and do not appear in the graph; 3) there may exist multiple roots. These complex structures enable semantic graphs to capture information that is not explicit in single-rooted syntactic trees. Semantics can be formalized by different frameworks under particular linguistic assumptions; representative formalisms include AMR (Abstract Meaning Representation; Banarescu et al., 2013) and UCCA (Abend and Rappoport, 2013). Recently, Wu et al. (2021) proposed semantics-infused finetuning (SIFT), which infuses DM (DELPH-IN Minimal Recursion Semantics; Ivanova et al., 2012) graphs and achieves consistent improvements over RoBERTa (Liu et al., 2019) baselines on the GLUE benchmark (Wang et al., 2019).
DM graphs (Ivanova et al., 2012) define 59 types to characterize predicate-argument relationships. To investigate the effect of different semantic relations, we consider keeping only six common relation types that appear in most parsed graphs, yielding what we call skeleton graphs. These relations are ARG1, ARG2, ARG3, ARG4, compound, and BV. We are interested in whether downstream tasks would still benefit from the core semantics rather than the entire linguistic graphs.
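As a sketch, constructing a skeleton graph amounts to dropping every edge whose relation falls outside the six core types. The edge representation and function name below are our own illustrative choices, not taken from the original codebase:

```python
from typing import List, Tuple

# A labeled edge: (head token index, dependent token index, relation type).
LabeledEdge = Tuple[int, int, str]

# The six core DM relation types kept in skeleton graphs.
SKELETON_RELATIONS = {"ARG1", "ARG2", "ARG3", "ARG4", "compound", "BV"}

def to_skeleton(edges: List[LabeledEdge]) -> List[LabeledEdge]:
    """Keep only edges labeled with one of the six core relation types."""
    return [e for e in edges if e[2] in SKELETON_RELATIONS]
```

For instance, `to_skeleton([(0, 1, "ARG1"), (2, 3, "mwe")])` keeps only the ARG1 edge, discarding the `mwe` edge along with any other non-core relation.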

Trivial Graph
Though linguistic graphs convey useful structures, high-quality parsers are not easily available due to limited annotated graph banks (Oepen et al., 2019). When structural priors are unavailable, Shi et al. (2018) demonstrated that trivial trees, such as Gumbel trees, outperform syntactic trees when incorporated into tree-LSTM encoders (Tai et al., 2015).
However, infusing trivial linguistic graphs into pretrained transformer models has not been explored. In the same spirit, we create two types of trivial trees or graphs, which carry little linguistic inductive bias, to reproduce the knowledge fusion experiments of Wu et al. (2021).

Binary Balanced Tree. Compared with syntactic trees, binary balanced trees are shallower, making it possibly easier to propagate information from the leaves to the root. We assume GNN layers might benefit from this shallowness.

Sequential Bidirectional Graph. In the most natural and straightforward construction, tokens in the sentence are connected in sequential order, combining left-to-right and right-to-left chains. As a result, GNN layers only aggregate local information rather than the potentially long-range dependencies found in linguistic graphs.
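Both trivial constructions can be sketched as simple edge-list builders over token positions. This is a minimal illustration under our own conventions (directed head-to-dependent edges; the paper samples random relation labels since the trivial graphs have no gold labels), not the authors' actual implementation:

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]  # directed (head, dependent) over token positions

def balanced_tree_edges(n: int) -> List[Edge]:
    """Binary balanced tree over n tokens: the middle token of each span
    heads the roots of its left and right halves, giving O(log n) depth."""
    edges: List[Edge] = []
    def build(lo: int, hi: int) -> int:  # returns root index of span [lo, hi)
        mid = (lo + hi) // 2
        if lo < mid:
            edges.append((mid, build(lo, mid)))
        if mid + 1 < hi:
            edges.append((mid, build(mid + 1, hi)))
        return mid
    if n > 0:
        build(0, n)
    return edges

def sequential_graph_edges(n: int) -> List[Edge]:
    """Bidirectional chain: left-to-right plus right-to-left links."""
    return [(i, i + 1) for i in range(n - 1)] + [(i + 1, i) for i in range(n - 1)]

def attach_random_labels(edges: List[Edge], relations: List[str],
                         seed: int = 0) -> List[Tuple[int, int, str]]:
    """Trivial graphs have no gold edge labels, so sample relation types."""
    rng = random.Random(seed)
    return [(h, d, rng.choice(relations)) for h, d in edges]
```

Note the contrast the paper draws: the balanced tree keeps every leaf within O(log n) hops of the root, while the sequential graph restricts each GNN layer to strictly local, one-hop neighborhoods.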

Encoding Graph Structures
Structural information can be incorporated into pretrained transformer models via two typical strategies: adopting a GNN on top of the transformer outputs (Wu et al., 2021; Peng et al., 2021) or fusing structures into the transformer attention layers (Nguyen et al., 2019; Zhang et al., 2020b). Following Wu et al. (2021), in this work we use the former, where the linguistic effects are easily disentangled for analysis.
Formally, given an input sentence x_i = {w_1, w_2, ..., w_L} of length L and the corresponding graph G_i (either a linguistic graph or a trivial graph), we obtain the last hidden representation H from the pretrained transformer layers (Vaswani et al., 2017), which also serves as the node embedding initialization for a relational graph convolutional network (RGCN; Schlichtkrull et al., 2018) that encodes G_i. At each RGCN layer, a node's representation is updated by aggregating its neighbors' features with a relational bias. We max-pool over the final RGCN layer's output to obtain the graph representation O_g:

O_g = MaxPool(RGCN(H, G_i)).  (1)

The final classification feature is the concatenation of the [CLS] token embedding H_0 and the pooled graph representation O_g. Note that vanilla transformer-based models take only the [CLS] embedding as the classification feature. For sentence-pair tasks, the two graphs are separately encoded by the RGCN with inner-attention and then aggregated into one representation, following Wu et al. (2021).
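The encoding pipeline above can be sketched as follows. This is a simplified NumPy stand-in for the actual trained model (no basis decomposition, untrained weights, function names of our own choosing), intended only to make the message-passing and readout steps concrete:

```python
import numpy as np

def rgcn_layer(H: np.ndarray, edges, W_rel: dict, W_self: np.ndarray) -> np.ndarray:
    """One simplified RGCN layer: each node receives degree-normalized,
    relation-specific transforms of its in-neighbors plus a self-loop (ReLU)."""
    n, _ = H.shape
    out = H @ W_self                     # self-loop transform
    msgs = np.zeros_like(out)
    deg = np.zeros(n)
    for head, dep, rel in edges:         # edges: (head, dependent, relation)
        msgs[dep] += H[head] @ W_rel[rel]
        deg[dep] += 1
    deg[deg == 0] = 1.0                  # avoid division by zero for isolated nodes
    return np.maximum(out + msgs / deg[:, None], 0.0)

def classification_feature(H, edges, W_rel, W_self, num_layers: int = 2):
    """Max-pool the final RGCN layer into O_g (Eq. 1) and concatenate it
    with the [CLS] embedding H[0] as the classifier input."""
    X = H
    for _ in range(num_layers):
        X = rgcn_layer(X, edges, W_rel, W_self)
    o_g = X.max(axis=0)                  # max-pool over nodes -> O_g
    return np.concatenate([H[0], o_g])
```

A vanilla baseline would use only `H[0]`; the fusion model doubles the classifier input by appending the pooled graph representation, which is exactly the extra-feature-interaction channel the paper's analysis focuses on.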

Implementation Details
We use the GLUE benchmark (Wang et al., 2019), a general natural language understanding test suite containing eight datasets for text classification tasks (details listed in Appendix A). Following common practice, we report results on the development sets averaged over multiple seeds.
We directly adopt the parsed semantic graphs from Wu et al. (2021) for each dataset, produced by the top-ranked parser (Che et al., 2019) of the CoNLL 2019 shared task (Oepen et al., 2019). When constructing trivial graphs, whose edge labels are unavailable, we randomly sample labels from the relation list. To be comparable with Wu et al. (2021), we apply the same model architecture and tune the same set of hyperparameters, such as the learning rate, the number of RGCN layers, and the RGCN hidden dimension, when infusing graph structures into LMs.

Main results
Table 1 shows the main results of infusing linguistic and trivial graphs, from which we draw the following observations. First, sequential graphs and balanced trees achieve consistent improvements over their linguistic counterparts, and trivial graphs further improve over RoBERTa baselines by more than 1.0 averaged point. Moreover, the random relation types in the trivial graphs have no effect on performance, again suggesting that linguistic structures and relations are not the major contributors to the improvements. Second, skeleton graphs surprisingly improve over the whole linguistic graphs, which indicates that fine-grained sentential semantics might not be necessary for the language understanding tasks in GLUE. Note that skeleton graphs retain 75%-90% of the edges of the parsed graphs on average (detailed statistics in Appendix B).
To examine the impact of trivial graphs with different sizes of training data, especially in low-resource scenarios, we randomly sample 5%, 10%, 20%, and 50% of the training data. As shown in Figure 2, trivial graphs yield competitive or even better results on the CoLA and RTE datasets across different sizes.

Effect of Model Components
Given the strong performance of trivial graphs, we conduct an ablation study of the model components in Section 2.3 beyond graph structures. We first investigate whether pooling over transformer outputs can replace graph encoders, i.e., passing a zero-layer RGCN; the results of this ablation are shown in Table 2. Considering the sequential modeling ability of transformers, one question is whether additional transformer layers over the sequences could learn better graph representations and improve performance. The difference between the transformer and the RGCN is that the transformer operates over complete graphs while the RGCN takes the sequential graphs as input. We stack additional randomly-initialized transformer layers over the pretrained encoder outputs (with a number of parameters comparable to the RGCN). Training with these additional transformer layers yields results similar to those of the RGCN encoders. From this perspective, structural biases make little difference, and the gains from trivial graphs might result from additional token embedding features and their interactions via the fusion modules (RGCN or transformer).

More Discussions
Beyond the specific architecture discussed in §2.3, our study can be generalized to more knowledge fusion methods. For example, Zhang et al. (2020a) incorporated semantic role information by combining token embeddings and role type embeddings, which improved performance on the GLUE benchmark over BERT baselines. Similar experiments, such as replacing parsed role sequences with random sequences, are left for future work.
When it comes to entity knowledge-augmented methods, Raman et al. (2021) observed findings similar to ours: perturbed KGs can maintain the downstream performance of the original KG even though they differ significantly in semantics and graph structure. This again demonstrates that the way these methods use knowledge does not align with human priors. Both findings can guide future work on robust evaluation and explainability analysis of knowledge fusion methods.

Conclusion
Our study demonstrates that GLUE tasks can benefit from both trivial and linguistic graphs, indicating that the performance gains of previous fusion methods should not be attributed entirely to linguistic bias. We argue that comparisons merely between methods with and without knowledge fusion may not capture the whole picture; for example, without baselines that consider trivial graph structures, the quality of the fused knowledge may not be accurately assessed. More careful evaluations of the effectiveness claims in existing work (Sachan et al., 2021; Liu et al., 2020; Peng et al., 2021, inter alia) are encouraged in the same spirit. In addition, the tasks and evaluation benchmarks themselves are crucial for investigating when linguistic structures help. Our study contributes to the broader question of how to accurately evaluate models that integrate external knowledge, such as world or commonsense knowledge, for downstream tasks (Xu et al., 2021a; Zhu et al., 2022).

Limitations
We outline two limitations of our study. First, our comparison experiments require high-quality parsers to obtain accurate linguistic graphs. This assumption does not always hold, especially for complex semantic parsing, in which case it is difficult to quantify the effect of parsing quality on downstream task performance. Note that our constructed trivial graphs need no parser. Second, although we systematically evaluate on the GLUE benchmark, more diverse tasks such as structured prediction, more knowledge fusion methods, and more linguistic structures such as phrase structures (Kong et al., 2015) remain to be explored in future work.

Figure 1: An example of dependency tree (blue) and DM semantic graph (red).
Figure 2: Performance comparison of RoBERTa, SIFT, and SIFT (SEQ) with different training data sizes. The remaining datasets follow similar trends.