Incorporating EDS Graph for AMR Parsing

AMR (Abstract Meaning Representation) and EDS (Elementary Dependency Structures) are two popular meaning representations in NLP/NLU. AMR is more abstract and conceptual, while EDS is lower-level, closer to the lexical structure of the given sentences. It is thus not surprising that EDS parsing is easier than AMR parsing. In this work, we consider using information from EDS parsing to improve the performance of AMR parsing. We adopt a transition-based parser and propose to add EDS graphs as additional semantic features, using a graph encoder composed of an LSTM layer and GCN layers. Our experimental results show that the additional information from EDS parsing indeed boosts the performance of the base AMR parser used in our experiments.


Introduction
Semantic parsing has long been considered a difficult task and an important step toward natural language understanding. A number of meaning representation formalisms have been proposed. Well-known ones include EDS (Elementary Dependency Structures), UCCA (Universal Conceptual Cognitive Annotation; Abend and Rappoport, 2013), and AMR (Abstract Meaning Representation; Banarescu et al., 2013). Among them, AMR abstracts furthest from surface tokens and tries to capture the meaning of a sentence using concepts that may not appear in the sentence. Viewed as a graph, an AMR encoding typically has fewer nodes than other meaning representations, and some nodes in the AMR graph cannot be anchored to tokens or strings of tokens in the sentence. EDS, in contrast, builds a meaning representation from lexical terms that are present in the sentence, and its nodes are anchored. AMR also has a much more fine-grained classification of named entities, with a total of 124 entity types (Lin and Xue, 2019). Not surprisingly, then, AMR parsers do not perform as well as those for EDS: current parsing accuracies for AMR are in the low 80s, while they can be in the high 90s for EDS. In this paper, we propose to use EDS to improve the performance of an AMR parser.

Figure 1: AMR and EDS graphs for "Imports were at $50.38 billion, up 19%.", sentence #20011008 from the WSJ Corpus, Penn Treebank (Marcus et al., 1993). Take node #3 in the AMR as an example: "percentage-entity" is the node label, "value" is the property of this node, and "19" is the specific value. For node #10 in the EDS, "<35:37>" indicates the span of the corresponding surface string; "card" is the node label, "CARG" (for "constant argument") is the property, and "19" is the value.
To see how information from EDS parsing can be of use to AMR, consider the sentence "Imports were at $50.38 billion, up 19%." from the Wall Street Journal Corpus, Penn Treebank (Marcus et al., 1993). Its graph encodings in AMR and EDS are shown in Figure 1. We mentioned that AMR is more abstract: in this example, the AMR graph is much smaller, and its nodes are labeled with conceptual entities. Nevertheless, EDS and AMR edges are labeled with the same semantic roles (e.g., ARG1, ARG2), indicating the relationship between a predicate and its arguments (Lin and Xue, 2019). There are also correspondences between their nodes: the AMR nodes "percentage-entity", "dollar", and "import-01" correspond to the EDS nodes "_percent_n_of", "_dollar_n_1", and "_import_n_of", respectively. For our task, the most important feature of EDS is anchoring: each node in the EDS graph has a corresponding span of text, so, conversely, we can find all related EDS nodes for each token based on the span indexes. This suggests that EDS parsing may serve as an intermediate step toward AMR parsing, which motivated this work.
To incorporate EDS parsing into an AMR parser, we propose an EDS encoder composed of an LSTM network that captures contextual information and a Graph Convolutional Network (GCN; Kipf and Welling, 2017) that extracts structural knowledge. We feed the EDS graph into the proposed encoder to produce token-level features, which are concatenated with the word embeddings of the tokens and participate in the AMR parsing process. To demonstrate the effectiveness of our approach, we use the AMR dataset from MRP 2019 and take as our baseline the HIT-SCIR system (Che et al., 2019), which was the best overall system at MRP 2019 and the 2nd best for AMR. Our experimental results show that our EDS-enhanced parsers clearly outperform the baseline model; in fact, some of our new models beat the best score of the officially submitted AMR parsers on this benchmark. We also observed that the biggest improvements occurred on the test data least similar to the training data.
The rest of this paper is organized as follows: Section 2 gives a brief overview of AMR parsers; Section 3 is concerned with the baseline system we adopt and our EDS-enhanced model. We present experimental settings and results in Section 4 and conclude in Section 5.

Related Work
We classify AMR parsing systems into grammar-based, graph-based, and transition-based ones. The grammar-based ones generate AMR graphs directly from grammar trees. Several early AMR parsing systems were of this type: Artzi et al. (2015) used combinatory categorial grammar (CCG) parsing to construct AMR, while Peng et al. (2015) made use of synchronous Hyperedge Replacement Grammar (SHRG). Generally speaking, grammar-based systems suffer from information loss during both grammar tree generation and AMR conversion. They predated the current deep learning approaches.
Modern AMR parsers use deep learning methods. Depending on how the eventual AMR graphs are generated, we can divide them into graph-based and transition-based. Both approaches are popular and their performances are competitive. Briefly, a graph-based system splits AMR parsing into two tasks, concept identification and edge prediction, and then combines them to generate a final AMR graph. The idea seems to appear first in Flanigan et al. (2014), and is used in Lyu and Titov (2018); Zhang et al. (2019a); Cai and Lam (2020); Zhou et al. (2020). A transition-based system, however, uses a sequence of transition actions to construct the graph incrementally; the systems of Wang et al. (2015) and Ballesteros and Al-Onaizan (2017) fall into this category.

As we mentioned, our work is about incorporating EDS information into AMR parsing. We note that Brandt et al. (2016) considered adding preposition semantic role labeling to an AMR parser but found that the extra information did not seem to help. Hershcovich and Arviv (2019) used a multi-task learning model but found that multi-task TUPA consistently falls behind the single-task one for AMR. Arviv et al. (2020) applied multi-task learning to EDS and UCCA parsing; however, EDS did not bring any benefit to UCCA parsing. Adding extra semantic information like EDS is not easy: it matters how EDS graphs are encoded and incorporated into AMR parsing. We conduct our work with the AMR dataset from MRP 2019 and pick one of the best performing systems there, HIT-SCIR (Che et al., 2019), as our baseline model. Our experimental results show that adding EDS information can indeed give a significant boost to the baseline model. We believe our method is general and can be applied to other AMR parsing systems.

Baseline: A Transition-based Parser
Our baseline model is the transition-based system HIT-SCIR (Che et al., 2019). However, in our experiments, we use BERT-base instead of BERT-large for word embeddings (Devlin et al., 2019) due to our constraints on computing resources. Nevertheless, when the BERT-base baseline model is enhanced with EDS information, it still outperforms the best AMR parser at MRP 2019.

Task Formalization
The main task of a transition-based model is to generate a sequence of actions to construct an AMR graph. The sequence of actions is predicted one at a time, and the graph is also constructed incrementally.
A state in HIT-SCIR is a tuple (S, L, B, E, V), where S is a stack holding processed words, L is a list holding tokens popped out of S that will be pushed back in the future, and B is a buffer holding tokens waiting to be processed. E is the set of labeled edges, and V is the set of graph nodes, including concept nodes and surface tokens. In the initial state, the stack and list are empty, the buffer holds all tokens of the input sentence, and E and V are empty.

Oracle An action sequence bridges the input sentence and the AMR graph, so the basic requirement for the transition-based method is alignments. Given a gold AMR graph and alignments, one can convert the graph to an action sequence for model training. For each state s, HIT-SCIR decides which action to apply; this is what we call the oracle parser. To handle parsing concept nodes from surface strings, HIT-SCIR extends the basic oracle following previous work. The transition inventory is the following (a minimal code sketch of the state and transitions appears after the list):

• MERGE combines the top two tokens in the buffer into a single token waiting to be converted to a concept node.
• CONFIRM X converts the top element of the buffer to a concept node X.
• NEW X generates a new node X and pushes it onto the buffer.
• ENTITY X does the same as CONFIRM X but adds the internal properties of entity X, such as the year of a date-entity.
• LEFT-EDGE X and RIGHT-EDGE X add an edge with label X between w_j and w_i, where w_i is the top element of the stack and w_j is the top element of the buffer. They can be performed only when the top of the buffer is a concept node.
• SHIFT is performed when no dependency exists between w_j and any word in S other than w_i; it pushes all words in the list and w_j into the stack S. It is only allowed when the top of the buffer is a concept node.
• REDUCE is performed only when w_i has a head and is not the head or child of any word in the buffer; it pops w_i off the stack.
• PASS is chosen when neither SHIFT nor REDUCE can be performed; it moves w_i to the front of the list.
• DROP pops the top of the buffer when it is a token.
• FINISH pops the root node and marks the state as terminal.
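To make the state and transition machinery concrete, the following is a minimal Python sketch under our reading of the transition inventory; it is not HIT-SCIR's implementation, and all names (State, initial_state, merge, right_edge) are hypothetical.

```python
# A minimal sketch of the parser state and two illustrative transitions.
# Not HIT-SCIR's code; all names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class State:
    stack: list = field(default_factory=list)   # S: processed items
    deque: list = field(default_factory=list)   # L: items to be pushed back later
    buffer: list = field(default_factory=list)  # B: items waiting to be processed
    edges: set = field(default_factory=set)     # E: labeled edges (head, label, dep)
    nodes: set = field(default_factory=set)     # V: concept nodes and surface tokens

def initial_state(tokens):
    # Empty stack and list; all input tokens in the buffer; E and V empty.
    return State(buffer=list(tokens))

def merge(state):
    # MERGE: combine the top two buffer items into one multi-token item.
    a, b = state.buffer[0], state.buffer[1]
    state.buffer[:2] = [a + " " + b]

def right_edge(state, label):
    # RIGHT-EDGE X: edge labeled X from stack top (w_i) to buffer top (w_j);
    # LEFT-EDGE X would reverse the direction.
    w_i, w_j = state.stack[-1], state.buffer[0]
    state.edges.add((w_i, label, w_j))
```

An oracle would replay such actions against a gold graph and its alignments to produce the action sequences used for training.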

Stack-LSTM HIT-SCIR follows Ballesteros and Al-Onaizan (2017) and uses Stack-LSTMs to model AMR states. The output vector of such an LSTM is taken at the stack pointer instead of at the rightmost position of the sequence.
The system models S, L, B, and the action history with multiple Stack-LSTMs, which support PUSH and POP operations. The parsing states from the multiple Stack-LSTMs are fed into the action oracle classifier at once. The probability of action $a$ under state $s$ is calculated as

$p(a \mid s) = \frac{\exp\big(g_a^\top \, \mathrm{STACK\_LSTM}(s) + b_a\big)}{\sum_{a' \in A} \exp\big(g_{a'}^\top \, \mathrm{STACK\_LSTM}(s) + b_{a'}\big)},$

where the set $A$ represents the actions listed in the previous paragraph, $\mathrm{STACK\_LSTM}(s)$ encodes the state $s$ into a vector, $g_a$ is the embedding of action $a$, and $b_a$ is the bias term for action $a$.
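As an illustration of this classifier, here is a hedged PyTorch sketch; the state encoder is abstracted away as a precomputed vector, and the module names and dimensions are ours, not the authors'.

```python
# A hedged sketch of the action classifier over the Stack-LSTM state encoding.
# `state_vec` stands in for STACK_LSTM(s); all names are illustrative.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.action_emb = nn.Parameter(torch.randn(num_actions, state_dim) * 0.01)  # g_a
        self.bias = nn.Parameter(torch.zeros(num_actions))                           # b_a

    def forward(self, state_vec):
        # scores[a] = g_a . STACK_LSTM(s) + b_a, normalized over the action set A
        scores = self.action_emb @ state_vec + self.bias
        return torch.softmax(scores, dim=-1)
```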
In our model, the items in S, L, and B are combined token embeddings that concatenate the original BERT word embeddings with the EDS encodings of the tokens, introduced in the following section.

EDS Incorporation
In order to incorporate the EDS annotation information into AMR parsing, we extend the EDS graph to include tokens. We feed the extended EDS graph to our proposed EDS encoder and obtain token-level EDS features. Afterward, we concatenate the token-level EDS features with the word embeddings and input them into the transition-based model.

EDS Extension
Each node in the EDS graph has an explicit many-to-many anchoring onto substrings of the input sentence, which means the EDS nodes related to each token can be found based on the nodes' spans. Therefore, we add a bottom layer consisting of the input sentence tokens; the updated embeddings of the token nodes in this layer can then be extracted as EDS features for each token.
In preprocessing, edges labeled contain are added between token nodes and original nodes whenever their string spans intersect. Figure 2 shows an example of an updated EDS graph for the sentence "Not this year.", #20010002 from WSJ. We use only EDS labels in our experiments. We show contain edges as dashed lines and original edges as solid lines. In Figure 2, the bottom four nodes are token nodes, whose embeddings are used as EDS features. For a token node $v_t$, the hidden state at layer $k$ is $h^{(k)}_{v_t}$; the calculation details are given in the following section. BERT splits each token into several pieces, and the system extracts the first piece as the token's word embedding, denoted $\mathrm{BERT}(t)$. Therefore, for each token $t$, we obtain its embedding from two parts: the BERT embedding $\mathrm{BERT}(t)$ and the final hidden state $h_{v_t}$ of the corresponding newly added node $v_t$. The concatenation of the two vectors, $[\mathrm{BERT}(t); h_{v_t}]$, is then pushed into the buffer, waiting for the next step of processing.
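The span-intersection step can be sketched as follows, assuming character-offset anchors for both EDS nodes and tokens; the function names are hypothetical, not from the authors' code.

```python
# A sketch of the EDS extension: add one node per token and a "contain"
# edge to every EDS node whose anchor span intersects the token's span.
def spans_intersect(a, b):
    return a[0] < b[1] and b[0] < a[1]

def add_token_layer(eds_anchors, eds_edges, token_spans):
    """eds_anchors: {node_id: (start, end)}; token_spans: [(start, end), ...]."""
    edges = list(eds_edges)
    for t, tok_span in enumerate(token_spans):
        token_node = f"tok{t}"
        for node_id, anchor in eds_anchors.items():
            if spans_intersect(anchor, tok_span):
                edges.append((node_id, "contain", token_node))
    return edges
```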

EDS Encoder
The emergence of neural networks has had a tremendous impact on many fields, including graph parsing systems. GCNs (Kipf and Welling, 2017; Marcheggiani and Titov, 2017) have emerged as the neural networks of choice for encoding graphs. Our proposed EDS encoder consists of an LSTM layer to capture contextual information and GCN layers to encode structural knowledge.
EDS represents the meaning of a sentence as a directed graph, where nodes represent logical predicates and edges represent labeled arguments. We write an EDS graph as $G = (V, E)$, where $V$ is the set of nodes, $E$ the set of labeled directed edges, and $L_V$, $L_E$ are vocabularies for node labels and edge labels, respectively.
To reinforce relations between nodes across layers, we add a self edge $(v_i, v_i)$ for every node in the graph and an inverted edge $(v_j, v_i)$ with label $inv\_e$ for each directed edge $(v_i, v_j)$ with label $e$, including the newly added contain edges. The extended graph therefore has node set $V' = V \cup T$, where $T$ is the set of token nodes, and edge set $E' = E \cup E_{\mathrm{contain}} \cup E_{\mathrm{self}} \cup I$, where $I$ is the set of inverted edges.
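A corresponding sketch of the edge augmentation, with edges as (source, label, target) triples (again illustrative, not the authors' code):

```python
# A sketch of the edge augmentation: a self loop for every node and an
# inverted edge labeled inv_e for every edge labeled e.
def augment_edges(node_ids, edges):
    out = list(edges)
    out += [(v, "self", v) for v in node_ids]               # self edges
    out += [(t, "inv_" + lbl, s) for (s, lbl, t) in edges]  # inverted edges
    return out
```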
The goal of our EDS encoder is to update the representation of each node in light of the whole EDS graph. First, we use a GCN to update node embeddings based on their neighbors. Directed edges in the EDS graph represent relations between nodes, so, as in Marcheggiani and Titov (2017), we make the GCN parameters label-specific. The hidden state of node $v$ at the $k$-th layer, $h_v^{(k)}$, is calculated as

$h_v^{(k)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \big(W^{(k)}_{L(u,v)}\, h_u^{(k-1)} + b^{(k)}_{L(u,v)}\big)\Big),$

where $N(v)$ represents the neighbor nodes of $v$ and ReLU is the rectified linear unit activation function. However, to reduce the number of parameters and simplify the calculation, we classify edges into three kinds: self edges, edges in the original direction (including contain), and inverted edges. Instead of using label-specific $W_{L(u,v)}$, we define $W^{(k)}_{L(u,v)} = V^{(k)}_{\mathrm{dir}(u,v)}$, where $\mathrm{dir}(u,v)$ specifies the kind of edge.
The EDS annotations in this experiment are automatically generated by an EDS parser, so accepting all information from the EDS graph is risky. To address this, we adopt a gating scheme and calculate a scalar gate for each edge-node pair:

$g^{(k)}_{u,v} = \sigma\big(h_u^{(k-1)} \cdot \hat{v}^{(k)}_{\mathrm{dir}(u,v)} + \hat{b}^{(k)}_{L(u,v)}\big),$

where $\sigma$ is the logistic sigmoid function and $\hat{v}^{(k)}_{\mathrm{dir}(u,v)}$ and $\hat{b}^{(k)}_{L(u,v)}$ are the weights and bias of the gate. The final form of the hidden state calculation is therefore

$h_v^{(k)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} g^{(k)}_{u,v}\big(V^{(k)}_{\mathrm{dir}(u,v)}\, h_u^{(k-1)} + b^{(k)}_{L(u,v)}\big)\Big).$
The GCN introduced so far learns effective representations of the graph structure, but it has the limitation that nodes are updated based only on their immediate neighbors at each GCN layer; nodes that are many hops apart in the graph are hard to relate with few GCN layers. Adding an LSTM layer compensates for this limitation: the hidden states of the LSTM, rather than the raw embeddings of the EDS nodes, are fed into the GCN layers, that is, $h^{(0)}_v$ is the LSTM output at node $v$ rather than the node embedding $x_v$. The structure of the EDS encoder is illustrated in Figure 3: the embeddings of the EDS nodes (hollow circles) are first fed into an LSTM layer; after the LSTM layer, the light gray circles carry contextual information; after several GCN layers, the dark gray circles, which hold edge and neighbor information, are the final hidden states.
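To make the encoder concrete, below is a minimal PyTorch sketch of one gated, direction-typed GCN layer in the spirit of the formulas above. For simplicity it uses per-direction (rather than per-label) biases and a slow edge-by-edge loop; none of the names come from the authors' code.

```python
import torch
import torch.nn as nn

class GatedDirGCNLayer(nn.Module):
    """One gated GCN layer with direction-typed weights, in the spirit of
    Marcheggiani and Titov (2017). Didactic sketch, not optimized."""
    DIRS = {"self": 0, "fwd": 1, "inv": 2}  # the three edge kinds

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(3, dim, dim) * 0.01)  # V_dir
        self.b = nn.Parameter(torch.zeros(3, dim))              # bias (per direction here)
        self.gate_w = nn.Parameter(torch.randn(3, dim) * 0.01)  # gate weights
        self.gate_b = nn.Parameter(torch.zeros(3))              # gate bias

    def forward(self, h, edges):
        # h: (num_nodes, dim) node states; edges: iterable of (src, dir_name, tgt)
        out = torch.zeros_like(h)
        for src, dname, tgt in edges:
            d = self.DIRS[dname]
            gate = torch.sigmoid(h[src] @ self.gate_w[d] + self.gate_b[d])  # scalar gate
            out[tgt] = out[tgt] + gate * (self.W[d] @ h[src] + self.b[d])
        return torch.relu(out)
```

In the full encoder, the input h would come from a BiLSTM over the node sequence rather than from the raw node embeddings, as described above.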

Experimental Setup
Our experiments were done using the AllenNLP toolkit (Gardner et al., 2018).

EDS Parser
In this study, we adopted the open-source LOGON distribution to generate EDS annotations. The LOGON package contains ERG parsers and the ERG-to-EDS converter. Compared to purely data-driven parsers, the general-purpose grammatical knowledge encoded in the ERG aids EDS parsing (Oepen and Flickinger, 2019). We used ERG release 1214 and ran LOGON in one-best mode. However, LOGON failed to parse some sentences (about 15% of the data) due to search-space limits or other reasons, so we used the EDS model of Che et al. (2019) to parse those sentences.
Baseline Model As we mentioned, we adopt HIT-SCIR as our baseline model, but use the smaller pre-trained model, BERT-base, for word embeddings due to GPU limitations. For alignments, the baseline model uses an enhanced rule-based aligner, TAMR, to generate transition actions for the AMR graph. More details on hyper-parameters can be found in Table 3 in the appendix. Our experiments were run on a GeForce RTX 2080 Ti GPU; during model training, each epoch took about 4 hours on one GPU.
Dataset We use the dataset from MRP 2019 so that we can compare our models with the officially submitted models there. The shared task has constraints on which additional data or pretrained models can be used for reasons of comparability and fairness. Our models meet the requirements as both our baseline model (HIT-SCIR) and the EDS parser that we use to generate EDS graphs for use by the baseline model satisfy them.
There are 56,240 sentences in the MRP 2019 AMR training set. The test set contains 1,998 sentences, among them 100 randomly selected sentences from the novel The Little Prince. MRP 2019 provided results for AMR parsing models on both the entire test set (called All Data) and the subset of sentences from The Little Prince (called Lpps). The reason for the special interest in the latter is that the sentences from the novel are presumably least similar to the training data, which are mostly from the WSJ corpus. Indeed, most models at MRP 2019 have lower scores on Lpps than on All Data. We will have more to say about this in the next section on our experimental results.
Metrics MRP 2019 used two metrics to evaluate the models: the standard SMATCH scorer included in the open-source mtool software (the Swiss Army Knife of Meaning Representation, https://github.com/cfmrp/mtool), and an MRP 2019-specific scorer that is similar to SMATCH but compares two meaning representation graphs (the ground truth and the model output) according to fine-grained attributes such as edges, node labels, and so on. We mainly use the SMATCH metric but also give the MRP metric for the Lpps test set. We refer the reader to the MRP 2019 shared task overview for more details on its datasets and metrics.
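To convey what SMATCH measures, here is a simplified sketch that computes the triple-overlap F1 between two graphs under a given node mapping; the real SMATCH additionally hill-climbs over candidate mappings to maximize this score, which we omit here.

```python
# Simplified SMATCH-style scoring: F1 over matched triples for a *given*
# node mapping. The real SMATCH also searches over mappings.
def triple_f1(gold_triples, sys_triples, node_map):
    """Triples are (head, relation, dependent); node_map sends system ids to gold ids."""
    mapped = {(node_map.get(h, h), r, node_map.get(d, d)) for h, r, d in sys_triples}
    matched = len(mapped & set(gold_triples))
    p = matched / len(sys_triples) if sys_triples else 0.0
    r = matched / len(gold_triples) if gold_triples else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```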

Results
Results on Different Structures Our SMATCH results are summarized in Table 1. To see the effects of different encodings of the EDS graphs, we tried five EDS-enhanced systems: three use only GCNs ([G1], [G2], [G3]), with one to three layers, and two use a single BiLSTM layer plus one or two GCN layers ([LG1], [LG2]). As can be seen from Table 1, LG1 achieves the highest F1 score, outperforming Amazon (Cao et al., 2019), the best overall AMR parser at MRP 2019.
We note that all five of our EDS-enhanced systems perform better than HIT-SCIR, our reference baseline model. Interestingly, the number of GCN layers matters, and more layers are not necessarily better. A possible reason for the GCN performance degradation in our work is over-smoothing, which was discussed in previous work (Li et al., 2018).
Among the three systems with GCN layers only ([G1], [G2], [G3]), the best is G2. When a BiLSTM is added, one GCN layer ([LG1]) is better than two ([LG2]). There may be a theoretical explanation for this, but we suspect it also has something to do with the dataset; for example, on All Data, the F1 score of BiLSTM plus one GCN layer is the same as that of BiLSTM plus two layers.
Results on Lpps Some interesting observations can be made on the Lpps test set. As we mentioned, this test set contains 100 random sentences from the book The Little Prince. These sentences are quite different from those in the MRP 2019 training set, so, not surprisingly, most models perform worse on this test set, the exception being Saarland (Donatelli et al., 2019), which performs better on this test set than on All Data. As for HIT-SCIR, its performance on Lpps is much worse than on All Data. What is worth noting is that our models, which are basically HIT-SCIR enhanced with EDS in various ways, boost its performance on Lpps significantly; our best model even beats Saarland on this test set. Compared to HIT-SCIR, it increases the F1 score by 4.6% (from .680 to .726) on Lpps but only 1.1% on All Data (from .725 to .736). So the extra EDS information really pays off on this test set.
SMATCH is a general tool for computing the overall differences between two answers. To give a more fine-grained comparison between two meaning representations, MRP has its own scorer, which computes what are called the Tops, Labels, Properties, Edges, and All scores, where the All scores are close to the SMATCH score. Table 2 gives the MRP F1 scores on the Lpps test set for our baseline model HIT-SCIR and our EDS-enhanced models. Again we see that our models improve the performance of the baseline model significantly on all subtasks, especially on Labels and Properties. On Tops, Labels, and Properties, the model LG1 performs best, with improvements of 5%, 6%, and 9%, respectively, whereas G2 performs best on the Edges F1 score, with about a 3% improvement.
Our models handle these "out of domain" cases better because we have accurate EDS parses for them. Consider the verb "look" in the sentence "I shall look as if I were suffering." The baseline model predicts the AMR node "look-01" for it, which is wrong; our model, with the extra information from EDS, correctly labels it "look-02". A possible reason the baseline model selects "look-01" is that "look-01" appears nearly twice as often as "look-02" in the training data: the former 198 times and the latter 103. However, the EDS subgraph for the phrase "sb. look as if" is node(pron)-edge(arg1)-node(look-v)-..., which provides the disambiguating context.
Results on gold EDS annotations Finally, we note that MRP provided gold EDS annotations for the Lpps test data. We tried our models using these gold EDS annotations on the Lpps test set and observed that this actually resulted in a minor reduction in F1 scores, around 0.0001 worse than the models using silver EDS annotations. The likely reason the gold annotations performed slightly worse is that the models were trained on actual EDS parsing results and so seem to have "adapted" to the bias of the EDS parser used.
As a footnote, we remark that we are aware of the new results on AMR parsers at MRP 2020, released in late November 2020; this work was done prior to MRP 2020. While MRP 2020 also used Lpps as a test set, the results there and our results here are not directly comparable, as they were obtained using different training sets. Furthermore, the main purpose of this paper is to incorporate EDS graphs into AMR parsing, so the comparison with the baseline model is more meaningful.

Conclusions
In this study, we incorporated EDS, a meaning representation that is more accessible than AMR, to improve the performance of AMR parsing. To encode EDS graphs for AMR parsing, we used both LSTM and GCN layers. As a case study, we enhanced a transition-based AMR parser with EDS graphs and showed that, on AMR benchmarks where the baseline model already performs well, our EDS-enhanced parsers can further improve its performance. The improvements are especially noticeable on the Lpps (The Little Prince) test set, where the baseline parser performs poorly and lags behind other AMR parsers at MRP 2019; in fact, on the Lpps test set, our EDS-enhanced parsers outperform even the best parser submitted there.
We also see some broader implications of this work. For us, the ultimate goal of semantic parsing is its use in downstream tasks such as question answering, reasoning, and knowledge extraction from text. Given that almost all meaning representations are graph-based, we believe our encoding of EDS graphs with LSTM and GCN layers can be applied in these downstream tasks, and we are currently exploring this as future work. Another insight from this work concerns possible connections among different meaning representations. We have demonstrated the usefulness of EDS graphs for AMR parsing; it is likely they can also be useful for other frameworks, and vice versa. More generally, whether there is a universal semantic parser that can take advantage of information from each framework is an interesting question worth investigating.