Semantic Dependency Parsing with Edge GNNs

Second-order neural parsers have obtained high accuracy in semantic dependency parsing. Inspired by the factor graph representation of second-order parsing, we propose edge graph neural networks (E-GNNs). In an E-GNN, each node corresponds to a dependency edge, and the neighbors are defined in terms of sibling, co-parent, and grandparent relationships. We conduct experiments on SemEval 2015 Task 18 English datasets, showing the superior performance of E-GNNs.


Introduction
Traditional syntactic dependency parsing, which has been well studied, aims to produce a tree structure for a given sentence. However, tree-structured representations are ill-suited to representing meaning, which motivated the proposal of semantic dependency parsing (SDP) (Oepen et al., 2014). SDP aims to produce a directed acyclic graph (DAG) instead of a tree, which allows it to represent more complex semantic relationships.
Graph-based methods have obtained high accuracy in SDP (Peng et al., 2017; Dozat and Manning, 2018; Wang et al., 2019). Notably, Wang et al. (2019) propose a second-order neural CRF parser and show superior performance compared to the first-order Biaffine Parser (Dozat and Manning, 2018). To optimize the intractable CRF objective, they leverage approximate inference algorithms such as loopy belief propagation (LBP), unrolling several inference steps as recurrent neural networks (Zheng et al., 2015) for end-to-end approximation-aware training (Gormley et al., 2015). However, their model suffers from the following two problems: (i) Second-order dependency parsing can be formulated in terms of factor graphs (Smith and Eisner, 2008), but the corresponding factor graph for second-order parsing is highly loopy (Fig. 1). It is known that on loopy graphs, LBP can easily get stuck in bad local optima, leading to sub-optimal results and thus undermining the parsing performance. (ii) First-order and second-order scores are produced based solely on contextualized word representations, which is deemed to be sub-optimal (Gan et al., 2022).
In this work, we propose edge GNNs (E-GNNs) to address the aforementioned limitations of Wang et al. (2019). Inspired by factor graph representations of second-order SDP where each variable node corresponds to a dependency edge (Fig. 1), we take edges as GNN nodes and define neighbors in terms of sibling, co-parent, and grandparent relationships. This has two benefits.
(i) Previous work suggests that GNNs outperform LBP on loopy graphs (Yoon et al., 2018; Satorras and Welling, 2021). Thus we can expect that using E-GNNs will improve the inference quality and thereby result in better parsing performance. (ii) E-GNNs are more expressive since they incorporate edge-level features instead of just word-level features as in Wang et al. (2019). Edge nodes propagate features among neighbors via GNN layers, iteratively refining their representations to be more context-aware and thereby capturing more information about long-range dependencies, as we show experimentally.
We conduct experiments on SemEval 2015 Task 18 English datasets of SDP, showing superior performance compared with Wang et al. (2019).

Model
Word representations. Given a sentence $w = w_0 \cdots w_n$ ($w_0$ is the root), we feed it into BERT (Devlin et al., 2019) to obtain contextualized word embeddings and apply mean-pooling to the last layer of BERT to obtain word-level embeddings $c = c_0 \cdots c_n$. We concatenate each $c_i$ with its POS tag and lemma embeddings to form $e_i$, and then feed $e_0 \cdots e_n$ into a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), whose output serves as the final word representation.
Initial edge representation. To obtain initial edge representations, we adopt low-rank bilinear pooling (Kim et al., 2017) to capture pairwise interactions between parent and child word representations. Here $\sigma$ is the activation function, for which we choose tanh in this work; $\circ$ is the Hadamard (element-wise) product; and $U$, $W$, $V$ are linear layers (bias terms are omitted for brevity).
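As a rough illustration of the low-rank bilinear pooling step, the sketch below computes one edge vector from a parent and a child word vector. The exact arrangement of the three linear layers is an assumption here: W and V project the parent and child, and U projects their gated product, following the general form of Kim et al. (2017). The function names, toy dimensions, and weights are hypothetical, and plain lists stand in for tensors.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def initial_edge_repr(h_parent, h_child, W, V, U):
    """Assumed form: e_ij = U (tanh(W h_i) * tanh(V h_j)), bias terms omitted."""
    p = [math.tanh(x) for x in matvec(W, h_parent)]  # projected parent vector
    c = [math.tanh(x) for x in matvec(V, h_child)]   # projected child vector
    gated = [a * b for a, b in zip(p, c)]            # Hadamard (element-wise) product
    return matvec(U, gated)                          # map to the edge dimension

# Toy example: word dimension 2, rank 2, edge dimension 1 (all values made up).
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, 1.0]]
e_ij = initial_edge_repr([0.2, -0.1], [0.4, 0.6], W, V, U)
```

Because the parent and child pass through different projections, the resulting vector for edge (i, j) generally differs from that of edge (j, i), which matters for directed dependency edges.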
E-GNN encoding. Inspired by second-order parsing, we take edges as GNN nodes and define neighbors in terms of sibling (sib), co-parent (cop), and grandparent (grd) relationships (Fig. 1). Since the grandparent relationship is not symmetric, we consider both grandpa (grdp) and grandson (grds) relationships. We write rel(i, j) for the neighbor set of edge (i, j) with respect to relationship rel ∈ {sib, cop, grdp, grds}. For each rel, we use a deep biaffine scoring function (Dozat and Manning, 2017) to compute the unnormalized attention scores from edge (i, j) to each neighbor (m, n) ∈ rel(i, j). Note that we do not need to compute scores for every pair of (i, j) and (m, n), which would take $O(n^4)$ time. We only need to compute scores for edges that are adjacent under a specific relationship, which takes only $O(n^3)$ time. The attention scores are normalized for each relation type, and we then compute the feature aggregated from neighbors. Next, we update the GNN node representations based on their representations from the previous iteration and the aggregated feature.
Training loss. After $l$ iterations of GNN updates, we obtain $e^l_{ij}$ for each edge. We use an MLP to map $e^l_{ij}$ into a $q$-dimensional vector $d_{ij}$, where $q$ is the label set size (including the special NULL label). We associate each edge with a label index, which is either the index of NULL if the edge does not exist in the gold SDP graph, or the index of the gold edge label. We denote this label index as $y^\star_{ij}$ and use cross-entropy to compute the loss.

Experiment

Setup
We conduct experiments on the SemEval 2015 Task 18 English datasets (Oepen et al., 2015). Sentences in the datasets are annotated with three formalisms: DM (Flickinger et al., 2012), PAS (Miyao and Tsujii, 2004), and PSD (Hajič et al., 2012). We use the standard data split of previous work (Martins and Almeida, 2014; Du et al., 2015), which contains 33,964 sentences in the training set, 1,692 sentences in the development set, 1,410 sentences in the in-domain (ID) test set, and 1,849 sentences in the out-of-domain (OOD) test set from the Brown Corpus (Francis and Kucera, 1982). We use bert-base-cased (Devlin et al., 2019) to obtain contextualized word embeddings. The number of GNN layers is set to 3. Other hyperparameters can be found in App. A. We report the labeled F1 scores (LF1) on the ID and OOD test sets for each formalism. The reported results are averaged over three runs with different random seeds.
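For reference, labeled F1 is commonly computed by treating each graph as a set of (head, dependent, label) triples and micro-averaging precision and recall over the corpus. The sketch below follows that common convention, which is assumed here rather than taken from the evaluation script used in the experiments; the function name and the toy graphs are hypothetical.

```python
def labeled_f1(gold_graphs, pred_graphs):
    """Micro-averaged labeled F1 over sets of (head, dependent, label) triples."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_graphs, pred_graphs):
        correct += len(gold & pred)  # edges with matching endpoints and label
        n_gold += len(gold)
        n_pred += len(pred)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One toy sentence: two of four predicted edges match the three gold edges.
gold = [{(0, 2, "ROOT"), (2, 1, "ARG1"), (2, 3, "ARG2")}]
pred = [{(0, 2, "ROOT"), (2, 1, "ARG1"), (2, 3, "ARG3"), (3, 1, "ARG1")}]
score = labeled_f1(gold, pred)
```

In the toy example, the edge (2, 3) is predicted with the wrong label, so it counts against both precision and recall, which is what distinguishes labeled from unlabeled F1.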

Main result
Table 1 shows the experimental results. We reimplement the LBP-based second-order parser of Wang et al. (2019) as the baseline (LBP hereafter for short), using the same neural encoder and the same settings (e.g., hyperparameters) as E-GNN for a fair comparison. As we can see, LBP already surpasses Pointer (Fernández-González and Gómez-Rodríguez, 2020), a strong model, by 0.2 and 0.4 average F1 points on the ID and OOD test sets, respectively. E-GNN further outperforms LBP by 0.2 average F1 points on both the ID and OOD test sets.

Ablation study
We conduct ablation studies on PAS. First, we study the importance of using different relationship types to define neighbors in GNNs. As we can see from Table 2, removing sib/cop/grd (both grdp and grds) results in LF1 drops of 0.17, 0.20, and 0.18, respectively, showing that all these relationships are beneficial to the final performance, which is consistent with the intuition behind second-order parsing. Second, we conduct an ablation study on the effect of the number of GNN layers. Table 2 shows that using 0/1/2 layers leads to F1 drops of 0.32/0.18/0.15, validating the importance of using a three-layer E-GNN.

Error analysis
Fig. 2 shows how LF1 changes with the length of dependency edges. When the edge length is small (1-5), LBP and E-GNN have almost identical performance. However, when the edge length is large (>10), E-GNN outperforms LBP by a large margin, especially when the edge length is larger than 20. We hypothesize that E-GNN can model long-range dependencies more effectively. Neural encoders such as BiLSTMs have difficulty capturing long-range dependencies, so a model that relies solely on word representations to produce first- and second-order edge scores, as in Wang et al. (2019), would have difficulty predicting long edges. In comparison, although the initial edge representation of E-GNN may also have difficulty capturing long-range dependencies, during the iterative GNN updates a long edge can gather information from all its neighbors, refining its representation to be more context-aware and thus capturing more long-range information.

Related work
Dependency parsing with GNNs. Ji et al. (2019) used GNNs for dependency parsing. However, they view words instead of edges as GNN nodes. Consequently, it is tricky to define neighbors and thus tricky to design node vector update schemes. In our model, we view edges as GNN nodes, so we can define neighbors and design node vector update schemes more naturally by following second-order dependency relationships. In addition, our model captures edge-level features and thus is more expressive.
Algorithmic alignment. One can view the GNN layers of our model as a learnable inference decoder that mimics the behavior of LBP. Xu et al. (2020) propose the concept of algorithmic alignment, finding that neural networks whose structures resemble classical algorithms for certain problems are easier to train and perform better. The design of our model follows the principle of algorithmic alignment, as E-GNN nodes resemble variable nodes in the factor graph of second-order parsing, and the message passing mechanism of the GNN resembles LBP inference steps. Other successful NLP models also comply with the algorithmic alignment principle. For example, DIORA (Drozdov et al., 2019) mimics the classical inside-outside algorithm in its network design and achieves good performance in unsupervised constituency parsing.

Conclusion
We proposed E-GNNs in the spirit of the factor graph representation of second-order dependency parsing. Experiments and ablation studies on SemEval 2015 Task 18 English datasets of SDP validated the effectiveness of E-GNNs.

Limitations
E-GNN needs $O(n^3)$ time to update edge representations in each GNN layer, while the Biaffine Parser only needs $O(n^2)$ time to score all edges. Besides, E-GNN needs to store $O(n^2)$ edge embeddings in each GNN layer, consuming more GPU memory than the Biaffine Parser.
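The cubic cost can be made concrete by enumerating neighbor sets. The sketch below assumes the usual second-order definitions (siblings share a parent, co-parents share a child, and grandparent relations chain two edges) and additionally excludes two-edge cycles; these details may differ from the definitions used in the model, and the function names are hypothetical.

```python
def neighbors(i, j, n):
    """Assumed neighbor sets of candidate edge (i, j), parent i -> child j.

    Positions run over 0..n with 0 as the root, which never acts as a child.
    """
    words = range(n + 1)

    def valid(p, c):
        return p != c and c != 0

    return {
        "sib": [(i, k) for k in words if k != j and valid(i, k)],   # same parent
        "cop": [(k, j) for k in words if k != i and valid(k, j)],   # same child
        "grdp": [(k, i) for k in words if k != j and valid(k, i)],  # k -> i -> j
        "grds": [(j, k) for k in words if k != i and valid(j, k)],  # i -> j -> k
    }

def count_messages(n):
    """Return (number of candidate edges, number of (edge, neighbor) pairs)."""
    edges = [(i, j) for i in range(n + 1) for j in range(1, n + 1) if i != j]
    total = sum(len(v) for e in edges for v in neighbors(*e, n).values())
    return len(edges), total

for n in (5, 10, 20):
    num_edges, num_pairs = count_messages(n)
    print(n, num_edges, num_pairs)
```

Each candidate edge has on the order of n neighbors per relation, so the number of attention scores per layer grows cubically with sentence length, while the number of stored edge embeddings grows quadratically.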

Figure 1: Factor graph representation of second-order semantic dependency parsing. The figure is taken from Wang et al. (2019). Edge node (i, j) represents an edge from $x_i$ to $x_j$. Three kinds of second-order relationships are shown in the figure: sibling, co-parent, and grandparent.

Figure 2: LF1 of different edge lengths on three semantic formalisms.

Table 2: Ablation study on the PAS ID test set.