Better Combine Them Together! Integrating Syntactic Constituency and Dependency Representations for Semantic Role Labeling

Structural syntax knowledge has been proven effective for semantic role labeling (SRL), yet existing works mostly use only one single type of syntax, e.g., either the syntactic dependency or the constituency tree. In this paper, we explore the integration of heterogeneous syntactic representations for SRL. We first consider a TreeLSTM-based integration, collaboratively learning phrasal boundaries from the constituency structure and semantic relations from the dependency structure. We further introduce a label-aware GCN solution for simultaneously modeling the syntactic edges and labels. Experimental results demonstrate that by effectively combining the heterogeneous syntactic representations, our methods yield improvements on both span-based and dependency-based SRL. Our system also achieves new state-of-the-art SRL performance, while bringing explainable task improvements.


Introduction
Semantic role labeling (SRL) aims to disclose the predicate-argument structure of a given sentence. Such shallow semantic structures have been shown highly useful for a wide range of downstream tasks in natural language processing (NLP), such as information extraction (Fader et al., 2011; Bastianelli et al., 2013), machine translation (Xiong et al., 2012; Shi et al., 2016) and question answering (Maqsud et al., 2014; Xu et al., 2020). Based on whether an argument is recognized as a constituent phrasal span or as its syntactic dependency head token, prior works categorize SRL into two types: span-based SRL, popularized in the CoNLL05/12 shared tasks (Carreras and Màrquez, 2005; Pradhan et al., 2013), and dependency-based SRL, introduced in the CoNLL08/09 shared tasks (Surdeanu et al., 2008; Hajič et al., 2009). By adopting various neural network methods, both types of SRL have achieved significant performance in recent years (FitzGerald et al., 2015; He et al., 2017; Fei et al., 2021a). Syntactic features have been extensively verified to be highly effective for SRL (Pradhan et al., 2005; Punyakanok et al., 2008; Strubell et al., 2018). In particular, syntactic dependency features have received the majority of attention, especially for dependency-based SRL, considering their close relevance to the dependency structure (Roth and Lapata, 2016; Xia et al., 2019; Fei et al., 2021b). Most existing works focus on designing various methods for modeling dependency representations in SRL learning, such as TreeLSTM (Xia et al., 2019) and graph convolutional networks (GCN). On the other hand, some efforts encode constituency representations to facilitate span-based SRL (Marcheggiani and Titov, 2020).
Yet almost all syntax-based SRL methods use one standalone syntactic tree, i.e., either the dependency or the constituency tree. Constituency and dependency syntax actually depict the syntactic structure from different perspectives, and integrating these two heterogeneous representations can intuitively bring complementary advantages (Farkas et al., 2011; Yoshikawa et al., 2017; Zhou and Zhao, 2019). As exemplified in Figure 1, the dependency edges represent the inter-relations between arguments and predicates, while the constituency structure better locates the phrase boundaries of argument spans and then directs the paths to the predicate globally. Interacting these two structures can better guide the system to focus on the proper granularity of phrasal spans (as circled by the dotted box), while also ensuring the route consistency between predicate-argument pairs. Unfortunately, we find that there are very limited explorations of heterogeneous syntax integration in SRL. For instance, Li et al. (2010) manually craft two types of discrete syntax features for a statistical model, and more recently Fei et al. (2020a) implicitly distill two heterogeneous syntactic representations into one unified neural model.
In this paper, we present two innovative neural methods for explicitly integrating the two kinds of syntactic features for SRL. As shown in Figure 2, in our framework, the syntactic constituency and dependency encoders are built jointly as a unified block (i.e., a Heterogeneous Syntax Fuser, namely HeSyFu), and work closely with each other. In the first architecture of HeSyFu (cf. Figure 3), we take two separate TreeLSTMs as the structure encoders for the two syntactic trees. Based on our framework, we try to answer the following questions: Q1. Can the combination of constituency and dependency syntax really improve SRL?
Q2. If so, how large are such improvements for dependency-based and span-based SRL?
We further propose Const GCN and Dep GCN encoders to enhance the syntax encoding in HeSyFu, where the syntactic labels (i.e., dependency arc types and constituency node types) are modeled in a unified manner within a label-aware GCN, as illustrated in Figure 4. With this, we can dig deeper: Q3. How do the results differ when employing the TreeLSTM or the GCN encoder?
Q4. Can SRL be further improved by leveraging syntactic labels?
Q5. What kind of associations can be discovered between SRL structures and these heterogeneous syntactic structures?
To find the answers, we conduct extensive experiments on both span- and dependency-based SRL benchmarks (i.e., CoNLL05/12 and CoNLL09). The results and analyses show that: A1. combining the two types of syntax information is more helpful than using either one alone; A2. the improvement for span-based SRL is more obvious than for the dependency-based one; A3. GCN performs better than TreeLSTM; A4. syntactic labels are quite helpful for SRL; A5. SRL and both kinds of syntactic structures have strong associations that should be exploited for mutual benefit.
In our experiments, our SRL framework with the two proposed HeSyFu encoders achieves better results than the current best-performing systems, and yields more explainable task improvements.

Related Work
The SRL task, uncovering the shallow semantic structure (i.e., 'who did what to whom, where and when'), was pioneered by Gildea and Jurafsky (2000) and popularized by PropBank (Palmer et al., 2005) and FrameNet (Baker et al., 1998). SRL is typically divided into span-based and dependency-based variants on the basis of the granularity of arguments (i.e., phrasal spans or dependency heads). Earlier efforts focused on designing hand-crafted features with machine learning methods (Pradhan et al., 2005; Punyakanok et al., 2008; Zhao et al., 2009a,b). Later SRL works mostly employ neural networks with distributed features for task improvements (FitzGerald et al., 2015; Roth and Lapata, 2016; Strubell et al., 2018). Most high-performing systems model the task as a sequence labeling problem with the BIO tagging scheme for both types of SRL (He et al., 2017; Ouchi et al., 2018; Fei et al., 2020b,c).
On the other hand, syntactic features are a highly effective SRL performance enhancer, according to numerous empirical verifications in prior works (e.g., Swayamdipta et al., 2018), as SRL intuitively shares much underlying structure with syntax. Basically, dependency syntax features are more often injected into dependency-based SRL (Roth and Lapata, 2016; Kasai et al., 2019), while others consider constituency syntax for span-based SRL (Marcheggiani and Titov, 2020).
Actually, constituency and dependency syntax depict the structural features from different angles, while sharing close linguistic relevance. Related works have revealed the mutual benefits of integrating these two heterogeneous syntactic representations for various NLP tasks (Collins, 1997; Charniak, 2000; Charniak and Johnson, 2005; Farkas et al., 2011; Yoshikawa et al., 2017; Zhou and Zhao, 2019; Strzyz et al., 2019; Kato and Matsubara, 2019). Unfortunately, there are very limited explorations for SRL. For example, Li et al. (2010) construct discrete heterogeneous syntactic features for SRL, and more recent work by Fei et al. (2020a) leverages a knowledge distillation method to inject heterogeneous syntax representations from various tree encoders into one model for enhancing span-based SRL. In this work, we consider an explicit integration of these two syntactic structures via two neural solutions. To our knowledge, ours is the first attempt to perform a thorough investigation of the impact of heterogeneous syntax combination on the SRL task.
Various neural models have been proposed for encoding syntactic structures, such as attention mechanisms (Strubell et al., 2018), TreeLSTM (Xia et al., 2019) and GCN (Marcheggiani and Titov, 2020). In this work, we take advantage of the TreeLSTM and GCN models for encoding the constituency and dependency trees, as the two solutions of our HeSyFu encoders. It is worth noting that prior works use GCN to encode dependency and constituency trees (Marcheggiani and Titov, 2020); however, the syntactic labels there are not managed in a unified manner. We thus consider enhancing the syntax GCN by simultaneously modeling the syntactic labels within the structure.

Task Modeling
Following prior works (Tan et al., 2018; Marcheggiani and Titov, 2020), our system aims to identify and classify the arguments of a predicate into semantic roles, such as A0, A1, AM-LOC, etc. We denote the complete role set as $R$, and adopt the BIO tagging scheme. Given a sentence $s = \{w_1, \cdots, w_n\}$ and a predicate $w_p$, the model assigns each word $w_i$ a label $\hat{y}_i \in Y$, where $Y = (\{B, I\} \times R) \cup \{O\}$. Note that each semantic argument corresponds to a word span $\{w_j, \cdots, w_k\}$ ($1 \le j \le k \le n$).
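To make the label space concrete, the following minimal Python sketch builds $Y$ under the BIO scheme; the role inventory shown is illustrative, not the full PropBank set used in our experiments:

```python
from itertools import product

# Illustrative role inventory; the full set R follows the PropBank
# annotations of the respective datasets.
R = ["A0", "A1", "A2", "AM-LOC", "AM-TMP"]

# Y = ({B, I} x R) ∪ {O}
Y = [f"{b}-{r}" for b, r in product(["B", "I"], R)] + ["O"]

# A span argument {w_j, ..., w_k} with role A1 is tagged: B-A1, I-A1, ..., I-A1
print(len(Y), Y[:3])  # 11 ['B-A0', 'B-A1', 'B-A2']
```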

Framework
As illustrated in Figure 2, our SRL framework consists of four components: the input representations, the Transformer encoder, the heterogeneous syntax fuser layer and the CRF decoding layer.
Given an input sentence $s$ and a predicate word $w_p$ ($p$ is its position), the input representation $x_i$ is the concatenation ($\oplus$) of the word embedding $x^w_i$ and a binary predicate-indicator embedding $x^{(i=p)}_i$ marking the presence or absence of $w_p$:

$$x_i = x^w_i \oplus x^{(i=p)}_i . \quad (1)$$

Afterwards, we adopt Transformer (Vaswani et al., 2017) as our base encoder for yielding contextualized word representations. Transformer (Trm) works with a multi-head self-attention mechanism:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^{T}}{\sqrt{d_k}}\Big) V ,$$

where $Q$, $K$ and $V$ are linear projections of the input representations $x_i$. We simplify the flow as:

$$\{r_1, \cdots, r_n\} = \mathrm{Trm}(\{x_1, \cdots, x_n\}) .$$

Next, based on the hidden representations $r_i$, our heterogeneous syntax fuser (HeSyFu) layer, which will be elaborated in Section §4, integrates the constituency and dependency syntax and yields syntax-aware hidden representations:

$$\{s_1, \cdots, s_n\} = \mathrm{HeSyFu}(\{r_1, \cdots, r_n\}) .$$

Based on the syntax-aware hidden representations $s_i$, we use a CRF (Lafferty et al., 2001) to compute the probability of each candidate output sequence $y = \{y_1, \cdots, y_n\}$:

$$p(y \mid s) = \frac{1}{Z} \exp\Big( \sum_{i} \big( W_{y_i} \cdot s_i + T_{y_{i-1}, y_i} \big) \Big) ,$$

where $W$ and $T$ are the parameters and $Z$ is a normalization factor. The Viterbi algorithm is used to search for the highest-scoring tag sequence $\hat{y}$.
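The overall flow can be summarized in a compact PyTorch sketch; the module granularity and the CRF interface are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class SRLFramework(nn.Module):
    """A high-level sketch of the pipeline in Figure 2 (batch size 1 for
    brevity; assumes 2*dim is divisible by nhead). `hesyfu` and `crf` are
    stand-ins for the components described in the text."""
    def __init__(self, vocab_size: int, dim: int, hesyfu: nn.Module, crf: nn.Module):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pred_emb = nn.Embedding(2, dim)   # binary indicator of w_p (Eq. 1)
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.hesyfu = hesyfu   # heterogeneous syntax fuser (Section 4)
        self.crf = crf         # e.g., a linear-chain CRF with Viterbi decoding

    def forward(self, words, pred_pos, const_tree, dep_tree):
        # words: LongTensor [n]; pred_pos: index of the predicate word
        is_pred = (torch.arange(words.size(0)) == pred_pos).long()
        x = torch.cat([self.word_emb(words), self.pred_emb(is_pred)], dim=-1)
        r = self.encoder(x.unsqueeze(0)).squeeze(0)   # contextualized r_i
        s = self.hesyfu(r, const_tree, dep_tree)      # syntax-aware s_i
        return self.crf.decode(s)                     # highest-scoring y-hat
```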

Integration of Syntactic Constituency and Dependency Structure
We present two neural heterogeneous syntax fusers (a.k.a. HeSyFu): a TreeLSTM-based HeSyFu (cf. Figure 3) and a label-aware GCN-based HeSyFu (cf. Figure 4). HeSyFu is stacked for a total of L layers for full syntax interaction. We design the architecture with the constituency (denoted as const.) encoding in front of the dependency (denoted as dep.) encoding, based on the intuition that the boundary recognition helped by const. syntax should precede the semantic relation determination aided by dep. syntax.

TreeLSTM Heterogeneous Syntax Fuser
Our TreeLSTM-based HeSyFu (Tr-HeSyFu) comprises an N-ary TreeLSTM for const. trees and a Child-Sum TreeLSTM for dep. trees, motivated by Tai et al. (2015).
Constituency tree encoding The flow in TreeLSTM is bidirectional, i.e., bottom-up and top-down, for full information interaction. For each node $u$ in the tree, we denote the hidden state and memory cell of its $v$-th ($v \in [1, M]$) child as $h^{\uparrow}_{uv}$ and $c_{uv}$. The bottom-up pass computes the representation $h^{\uparrow}_u$ from its children hierarchically:

$$
\begin{aligned}
i_u &= \sigma\big(W^{(i)} r_u + \textstyle\sum_{v=1}^{M} U^{(i)}_v h^{\uparrow}_{uv} + b^{(i)}\big), \\
f_{uv} &= \sigma\big(W^{(f)} r_u + \textstyle\sum_{v'=1}^{M} U^{(f)}_{v,v'} h^{\uparrow}_{uv'} + b^{(f)}\big), \\
o_u &= \sigma\big(W^{(o)} r_u + \textstyle\sum_{v=1}^{M} U^{(o)}_v h^{\uparrow}_{uv} + b^{(o)}\big), \\
c_u &= i_u \odot \tanh\big(W^{(c)} r_u + \textstyle\sum_{v=1}^{M} U^{(c)}_v h^{\uparrow}_{uv} + b^{(c)}\big) + \textstyle\sum_{v=1}^{M} f_{uv} \odot c_{uv}, \\
h^{\uparrow}_u &= o_u \odot \tanh(c_u),
\end{aligned}
$$

where $W$, $U$ and $b$ are parameters, and $r_u$, $i_u$, $o_u$ and $f_{uv}$ are the input token representation, input gate, output gate and forget gates, respectively. Analogously, the top-down N-ary TreeLSTM calculates the representation $h^{\downarrow}_u$ in the same way. We concatenate the representations of the two directions: $h^{const}_u = h^{\uparrow}_u \oplus h^{\downarrow}_u$. Note that the constituent tree nodes include terminal word nodes and non-terminal constituent nodes, and we only take the representations (i.e., $h^{const}_i$) corresponding to the word nodes $w_i$ for further use.
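As a concrete illustration of the bottom-up N-ary cell above, here is a minimal PyTorch sketch; for brevity it uses one forget projection per child slot instead of the full cross-child matrices $U^{(f)}_{v,v'}$:

```python
import torch
import torch.nn as nn

class NAryTreeLSTMCell(nn.Module):
    """Bottom-up N-ary TreeLSTM cell after Tai et al. (2015) -- a simplified
    sketch with per-child-slot forget projections."""
    def __init__(self, dim: int, M: int):
        super().__init__()
        self.M = M
        self.W_iog = nn.Linear(dim, 3 * dim)           # projects node input r_u
        self.U_iog = nn.ModuleList(                    # per-child U_v projections
            nn.Linear(dim, 3 * dim, bias=False) for _ in range(M))
        self.W_f = nn.Linear(dim, dim)                 # forget-gate projections
        self.U_f = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(M))

    def forward(self, r_u, h_children, c_children):
        # h_children / c_children: lists of M [dim] tensors (zeros if absent)
        z = self.W_iog(r_u) + sum(self.U_iog[v](h_children[v])
                                  for v in range(self.M))
        i, o, g = z.chunk(3, dim=-1)
        i, o, g = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(g)
        c_u = i * g
        for v in range(self.M):                        # one forget gate per child
            f_v = torch.sigmoid(self.W_f(r_u) + self.U_f[v](h_children[v]))
            c_u = c_u + f_v * c_children[v]
        return o * torch.tanh(c_u), c_u                # h_u (bottom-up), c_u
```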
Dependency tree encoding Slightly different from the N-ary TreeLSTM for the const. tree, the nodes in the dep. tree encoded by the Child-Sum TreeLSTM are all word nodes. We also consider the bidirectional calculation here. The bottom-up TreeLSTM obtains $h^{\uparrow}_i$ of the word $w_i$ via:

$$
\begin{aligned}
\tilde{h}_i &= \textstyle\sum_{k \in C(i)} h^{\uparrow}_k, \\
i_i &= \sigma\big(W^{(i)} \tilde{r}_i + U^{(i)} \tilde{h}_i + b^{(i)}\big), \\
f_{ik} &= \sigma\big(W^{(f)} \tilde{r}_i + U^{(f)} h^{\uparrow}_k + b^{(f)}\big), \\
o_i &= \sigma\big(W^{(o)} \tilde{r}_i + U^{(o)} \tilde{h}_i + b^{(o)}\big), \\
c_i &= i_i \odot \tanh\big(W^{(c)} \tilde{r}_i + U^{(c)} \tilde{h}_i + b^{(c)}\big) + \textstyle\sum_{k \in C(i)} f_{ik} \odot c_k, \\
h^{\uparrow}_i &= o_i \odot \tanh(c_i),
\end{aligned}
$$

where $C(i)$ is the set of child nodes of $w_i$, and $\tilde{r}_i$ is the input token representation consulting the foregoing constituent output representation: $\tilde{r}_i = r_i + h^{const}_i$. Analogously we obtain the top-down $h^{\downarrow}_i$, and concatenate the two directions: $h^{dep}_i = h^{\uparrow}_i \oplus h^{\downarrow}_i$. To fully make use of the heterogeneous syntactic knowledge, we fuse the two resulting syntactic representations, applying a fusion gate to flexibly coordinate their contributions:

$$
g_i = \sigma\big(W^{(g)} [h^{const}_i \oplus h^{dep}_i] + b^{(g)}\big), \qquad
s_i = g_i \odot h^{const}_i + (1 - g_i) \odot h^{dep}_i . \quad (8)
$$
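The fusion gate of Eq. (8) can be sketched in a few lines of PyTorch; the exact gate parameterization is an assumption:

```python
import torch
import torch.nn as nn

class SyntaxFusionGate(nn.Module):
    # Softly interpolates the constituency and dependency representations,
    # as in Eq. (8).
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_const: torch.Tensor, h_dep: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([h_const, h_dep], dim=-1)))
        return g * h_const + (1.0 - g) * h_dep        # s_i
```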

Label-aware GCN-based Heterogeneous Syntax Fuser
Compared with TreeLSTM, GCN is more computationally efficient for performing the structural propagation among nodes, i.e., with O(1) complexity. On the other hand, it is also crucial to leverage the syntactic labels (i.e., dependency arc types and constituent phrasal types) in SRL learning. For example, within the dependency tree, information from neighboring nodes under distinct types of arcs can contribute to different degrees. However, we note that the current popular syntax GCNs (Marcheggiani and Titov, 2017, 2020) do not encode the dependency or constituent labels together with the nodes in a unified manner, which can describe the syntactic connecting attributes between neighboring nodes inaccurately. Based on these syntax GCNs, we propose novel label-aware constituency and dependency GCNs that explicitly formalize the structural edges together with the syntactic labels and normalize them jointly (note that constituent labels are attached to nodes, while dependency labels are attached to edges). As illustrated in Figure 4, our label-aware GCN-based HeSyFu (denoted as LG-HeSyFu) has an assembling architecture similar to the TreeLSTM-based HeSyFu, and its two outputs are finally fused via the gate mechanism in Eq. (8).

Constituency tree encoding
The constituent tree is modeled as a graph $G^{(c)} = (U^{(c)}, E^{(c)})$, where $U^{(c)}$ is the node set and $E^{(c)}$ is the edge set. We denote $e^{(c)}_{uv} = 1$ if there is an edge between node $u$ and node $v$, and $e^{(c)}_{uv} = 0$ otherwise; we enable the edges to be bidirectional. $\mu_u$ represents the constituent label of node $u$, such as S, NP, VP, etc., and we take a vectorial embedding $v^{(c)}_u$ for the node label $\mu_u$. Our constituent GCN (denoted as Const GCN) yields the node representation $h_u$ as:

$$h_u = \mathrm{ReLU}\Big( \sum_{v} \alpha_{uv} \big( W^{(c)} r^b_v + v^{(c)}_v + b^{(c)} \big) \Big),$$

where $r^b_v$ is the initial node representation of node $v$ obtained via the span-boundary bridging operation, i.e., adding the start and end token representations of the phrasal span: $r^b_v = r_{start} + r_{end}$. And $\alpha_{uv}$ is the constituent connecting distribution:

$$\alpha_{uv} = \frac{\exp(z^{(c)}_{uv})}{\sum_{v'} \exp(z^{(c)}_{uv'})},$$

where $z^{(c)}_{uv}$ is a connecting score computed from the edge indicator $e^{(c)}_{uv}$, the label embedding $v^{(c)}_v$ and $r^b_v$. This distribution $\alpha_{uv}$ encodes both the syntactic edge and label information, and thus comprehensively reflects the connecting strengths between neighbors. We then perform span-boundary inverse-bridging to restore the token node representation $h^{const}_i$ for each word $w_i$.

Dependency tree encoding
Likewise, the dependency tree is modeled as a graph $G^{(d)} = (U^{(d)}, E^{(d)})$, where $e^{(d)}_{ij} = 1/0$ denotes the existence of a dependency arc. $\pi^{\leftrightarrow}_{ij}$ represents the edge label between $w_i$ and $w_j$, which is also bidirectional. Besides the pre-defined dependency labels, we additionally add a 'self' label as the self-loop edge $\pi^{\leftrightarrow}_{ii}$, and a 'none' label representing the absence of an edge between $w_i$ and $w_j$. We use the embedding form $v^{(d)}_{ij}$ for $\pi^{\leftrightarrow}_{ij}$. The update in the dependency GCN (denoted as Dep GCN) is written as:

$$h_i = \mathrm{ReLU}\Big( \sum_{j} \beta_{ij} \big( W^{(d)} \tilde{r}_j + v^{(d)}_{ij} + b^{(d)} \big) \Big),$$

where $\tilde{r}_j = r_j + h^{const}_j$. And $\beta_{ij}$ is the neighbor connecting-strength distribution:

$$\beta_{ij} = \frac{\exp(z^{(d)}_{ij})}{\sum_{j'} \exp(z^{(d)}_{ij'})},$$

where $z^{(d)}_{ij}$ is a connecting score analogous to $z^{(c)}_{uv}$. The resulting $h_i$ can also be denoted as $h^{dep}_i$, which carries the dependency arc and label information in a more unified way.
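A minimal sketch of one label-aware Dep GCN layer follows; the tensor shapes and the concrete scoring function for $\beta_{ij}$ are our assumptions, but it illustrates the key idea that every ordered word pair carries a label embedding ('self' on the diagonal, 'none' where no arc exists), so edges and labels are normalized together in one distribution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAwareDepGCN(nn.Module):
    """One label-aware Dep GCN layer (a sketch). Every ordered pair (i, j)
    carries a label embedding v_ij, so edges and labels are normalized
    together in beta."""
    def __init__(self, dim: int, n_labels: int):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, dim)  # incl. 'self' and 'none'
        self.W = nn.Linear(dim, dim)
        self.score = nn.Linear(2 * dim, 1)            # assumed scoring function

    def forward(self, r: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # r: [n, dim] token reps (already fused with h_const);
        # labels: [n, n] LongTensor of pairwise label ids
        n = r.size(0)
        v = self.label_emb(labels)                              # [n, n, dim]
        pair = torch.cat([r.unsqueeze(0).expand(n, n, -1), v], dim=-1)
        beta = F.softmax(self.score(pair).squeeze(-1), dim=-1)  # [n, n]
        # aggregate neighbor reps and label embeddings with one distribution
        h = beta @ self.W(r) + (beta.unsqueeze(-1) * v).sum(dim=1)
        return F.relu(h)                                        # h_i == h_dep_i
```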

Setups
We conduct experiments on the span-based SRL datasets (CoNLL05 & CoNLL12) and the dependency-based SRL dataset (CoNLL09). Each dataset has its own training, development and test sets. We convert the constituency syntax annotations in CoNLL05&12 into dependency annotations following the standard of Stanford Typed Dependencies (v3.3.0), and obtain the constituency annotations for CoNLL09 from the PTB data. We adopt the CoNLL05 evaluation scripts to evaluate the performance, with precision (P), recall (R) and F1 score as the metrics, and conduct significance tests via Dan Bikel's evaluation comparer. The Transformer hidden size is 768. The hidden sizes of the TreeLSTM and GCN encoders are chosen from [250, 300, 350]. We adopt the Adam optimizer with an initial learning rate of 2e-5, and train the model with mini-batch sizes in [16, 24, 32] and an early-stop strategy. We also load pre-trained parameters from the RoBERTa language model (Liu et al., 2019) into our Transformer encoder for boosting the performance. The environment is an Intel i9 CPU and an NVIDIA RTX 3090Ti GPU.
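For reference, the reported hyperparameters can be summarized as a configuration sketch (the RoBERTa checkpoint name is an assumption):

```python
# Hyperparameters as reported above; bracketed values were tuned per dataset.
config = dict(
    transformer_hidden=768,
    syntax_hidden_candidates=[250, 300, 350],   # TreeLSTM / GCN encoders
    optimizer="Adam",
    learning_rate=2e-5,
    batch_size_candidates=[16, 24, 32],
    early_stopping=True,
    pretrained_lm="roberta-base",               # assumed checkpoint name
)
```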

Development Experiments
We first perform preliminary experiments based on the development sets.
Layer number of the syntax encoder From Figure 5 we see that either too many or too few HeSyFu layers brings no benefit to the overall performance. With L=2 for Tr-/LG-HeSyFu, the performance becomes universally the best.

Order of the heterogeneous syntax encoding
We design the architecture with constituency encoding before dependency encoding, as described earlier.
If we exchange this encoding order, we see from Table 2 that performance drops occur, and the drops are more severe on the span-based SRL data. This verifies the rationality of our model design.

Main Results
Our aim is to answer the research questions listed in Section §1, based on the main experimental results in Table 1 and Table 3. [Answer to Q1] Our first observation is that leveraging syntax knowledge, either dependency or constituency, benefits both span-based and dependency-based SRL, while the integration of the two heterogeneous syntax trees contributes the most, more than either standalone syntax. However, the improvements from this syntax integration differ slightly between span-based and dependency-based SRL.
[Answer to Q2] In particular, the improvements for span-based SRL are more notable than for dependency-based SRL, which can be seen by comparing 'Trm+RoBERTa' and 'LG-HeSyFu+RoBERTa' in the two tables. Our conjecture is that the constituent structure knowledge additionally helps the span boundary detection in span-based SRL, compared with dependency-based SRL. We also find that using only constituency syntax contributes more to span-based SRL, while dependency-based SRL benefits more from dependency syntax.
Looking into the specific results, within the scope of heterogeneous syntax integration methods, our systems (both Tr-HeSyFu and LG-HeSyFu) outperform Fei et al. (2020a), demonstrating the advantage of our heterogeneous syntax integration methods. Overall, our LG-HeSyFu model achieves new state-of-the-art performance on the used datasets, and with the help of the RoBERTa language model, this superiority is maintained.
[Answer to Q3] We also show that our LG-HeSyFu-based system consistently outperforms the Tr-HeSyFu-based one. Even LG-HeSyFu without the syntax label features remains better. It is also clear that the GCN-based encoders show consistently higher scores than the TreeLSTM-based ones, verifying the effectiveness of leveraging GCN for syntax encoding. [Answer to Q4] Meanwhile, the ablation of the syntax label information reveals its importance for SRL learning.

Analysis and Discussion
Correlations between SRL and syntax structures We explore the correlations between the SRL structure and the two syntax structures. We reach this by analyzing the SRL predictions together with the neighbor connecting weights, i.e., $\alpha_{uv}$ of Const GCN and $\beta_{ij}$ of Dep GCN. We visualize the results (on CoNLL05) in Figure 6.

Figure 6: Discovered correlations of (1) SRL vs. dependency structure, (2) SRL vs. constituency structure.

[Answer to Q5] We learn from the diversified visualizations that our framework has indeed captured the underlying inter-dependency between the SRL structures and the syntactic structures. By accurately modeling such correlations, our LG-HeSyFu system naturally yields prominent and meanwhile explainable SRL performance. Some interesting patterns can also be observed: actually, not all syntactic elements contribute to SRL learning. For example, the semantic roles A0, A1 and A2 relate more to the dependency edges nsubj and csubj, and more to the constituent phrase NP. We believe this can lay a crucial foundation for the direction of unsupervised semantic role labeling that relies on syntactic structures.
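The visualization itself is straightforward once the connecting weights are extracted from the trained model; a minimal sketch (the weight extraction and the role/label inventories are assumed):

```python
import matplotlib.pyplot as plt

# A sketch of the analysis behind Figure 6: heat-mapping learned connecting
# weights (alpha_uv from Const GCN, beta_ij from Dep GCN) aggregated per
# predicted role; `weights`, `row_names`, `col_names` are assumed arrays.
def plot_correlation(weights, row_names, col_names, title):
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap="Blues")
    ax.set_xticks(range(len(col_names)), labels=col_names, rotation=45)
    ax.set_yticks(range(len(row_names)), labels=row_names)
    ax.set_title(title)
    plt.tight_layout()
    plt.show()
```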
Span boundary detection We now investigate the influence of the heterogeneous syntax integration on the span boundary match (a correct match means both the start and end boundaries of an argument span are correct, regardless of its label) for span-based SRL, i.e., on the CoNLL05/12 data. From Figure 7 we learn that the heterogeneous syntax integration improves boundary detection over any standalone syntax, while the constituency syntax actually contributes more significantly than the dependency features. Our LG-HeSyFu is also more helpful than Tr-HeSyFu.
Label prediction We next evaluate the role label prediction, measuring only whether the labels of correctly extracted arguments further match the gold annotations. We show the F1 scores in Figure 8. Similar to the span boundary identification, the heterogeneous syntax integration contributes the most, more than any single syntax usage. Interestingly, the standalone dependency syntax shows more improvement on dependency-based SRL, while the phrasal constituency features benefit span-based SRL more.
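Both analysis metrics can be sketched compactly; the tuple format for arguments is an assumption:

```python
# A sketch of the two analysis metrics above: boundary match ignores the
# role label; label accuracy is then computed over boundary-matched spans.
def boundary_matches(pred, gold):
    # pred/gold: lists of (start, end, role) argument tuples for one sentence
    gold_spans = {(s, e) for s, e, _ in gold}
    return [(s, e, r) for s, e, r in pred if (s, e) in gold_spans]

def label_accuracy_on_matches(pred, gold):
    gold_map = {(s, e): r for s, e, r in gold}
    matched = boundary_matches(pred, gold)
    if not matched:
        return 0.0
    return sum(gold_map[(s, e)] == r for s, e, r in matched) / len(matched)
```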
Error breakdown To analyze which error types different syntax-aided SRL models tend to make, we follow prior works (He et al., 2017; Strubell et al., 2018), manually fixing the errors by incrementally applying oracle transformations on CoNLL05. The analysis is shown in Figure 9. Specifically, constituency syntax methods perform better than dependency-aided methods w.r.t. the span boundary errors ('Merge Spans', 'Split Spans' and 'Fix Span Boundary'). Most importantly, it is quite clear that our heterogeneous syntax integrated systems (Tr-HeSyFu and LG-HeSyFu) make fewer errors than the baseline standalone syntax-aware methods, demonstrating the necessity of combining both types of syntax.
Syntax distribution By observing the gate values $g_i$ (Eq. 8) we can analyze the distributions of dependency and constituency features required by span-based and dependency-based SRL. From Figure 10 we see that span-based SRL relies more on constituency features, while dependency-based SRL needs more dependency-aware features. This finding coincides with the foregoing quantitative analysis, as well as with our intuition.

Conclusion and Future Work
We investigated the integration of constituency and dependency syntax for the SRL task. We first introduced a TreeLSTM-based heterogeneous syntax fusing encoder, and further proposed an innovative label-aware syntax GCN encoder for the integration. Experimental results showed that combining the heterogeneous syntax brings better results on both span-based and dependency-based SRL than any standalone syntax knowledge. As future work, we will investigate other kinds of structural knowledge integration besides syntax, such as semantic dependency structures and Abstract Meaning Representation (AMR), and explore the possibility of extending our model to incorporate such structured information. Besides, integrating heterogeneous syntax knowledge into pre-trained language models will be a promising direction.