Levi Graph AMR Parser using Heterogeneous Attention

Coupled with biaffine decoders, transformers have been effectively adapted to text-to-graph transduction and have achieved state-of-the-art performance on AMR parsing. Many prior works, however, rely on the biaffine decoder for arc and/or label predictions, although most features used by the decoder may already be learned by the transformer. This paper presents a novel approach to AMR parsing that combines heterogeneous data (tokens, concepts, labels) as one input to a transformer to learn attention, and uses only the attention matrices from the transformer to predict all elements in AMR graphs (concepts, arcs, labels). Although our models use significantly fewer parameters than the previous state-of-the-art graph parser, they show similar or better accuracy on AMR 2.0 and 3.0.


Introduction
Abstract Meaning Representation (AMR) has recently gained considerable interest due to its capability of capturing abstract concepts (Banarescu et al., 2013). In the form of a directed acyclic graph (DAG), an AMR graph consists of nodes as concepts and edges as labeled relations. To build such a graph from plain text, a parser needs to predict concepts and relations in concert.
While significant research efforts have been devoted to improving concept and arc predictions, label prediction has remained relatively stagnant. Most previous models adapt the biaffine decoder for label prediction (Lyu and Titov, 2018; Zhang et al., 2019a; Cai and Lam, 2019; Lindemann et al., 2020). These models assign labels from the biaffine decoder to arcs predicted by another decoder, and can thus be misled by incorrect arc predictions during decoding.
Enhancing message passing between the decoders for arc and label predictions has been shown to be effective. Among these works, Cai and Lam (2020) propose an iterative method that exchanges embeddings between concept and arc predictions and feeds the enhanced embeddings to the biaffine decoder for label prediction. While this approach greatly improves accuracy, it complicates the network architecture without structurally avoiding error propagation from the arc prediction.
This paper presents an efficient transformer-based (Vaswani et al., 2017) approach that takes a mixture of tokens, concepts, and labels as input, and performs concept generation, arc prediction, and label prediction jointly, using only attention from the transformer without a biaffine decoder. Its compact structure (§3.3) enables cross-attention between heterogeneous inputs, providing a complete view of the partially built graph and a better representation of the current parsing state. A novel Levi graph decoder (§3.4) is also proposed that reduces the number of decoder parameters by 45% (from 5.5 million to 3.0 million) yet gives similar or better performance. To the best of our knowledge, this is the first text-to-AMR graph parser that operates on heterogeneous data and adopts no biaffine decoder.

Related Work
Recent AMR parsing approaches can be categorized into four classes: (i) transition-based parsing, which casts the parsing process into a sequence of transitions defined on an abstract machine, e.g., a transition system using a buffer and a stack (Wang et al., 2016; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Peng et al., 2017; Guo and Lu, 2018; Liu et al., 2018; Naseem et al., 2019; Fernandez Astudillo et al., 2020; Lee et al., 2020); (ii) seq2seq-based parsing, which transduces raw sentences into linearized AMR graphs in text form (Barzdins and Gosko, 2016; Konstas et al., 2017; van Noord and Bos, 2017; Peng et al., 2018; Xu et al., 2020; Bevilacqua et al., 2021); (iii) seq2graph-based parsing, which incrementally and directly builds a semantic graph by expanding graph nodes without resorting to any transition system (Cai and Lam, 2019; Zhang et al., 2019b; Lyu et al., 2020); and (iv) graph algebra parsing, which translates an intermediate grammar structure into AMR (Artzi et al., 2015; Groschwitz et al., 2018; Lindemann et al., 2019, 2020). Our work is most closely related to the seq2graph paradigm, while we extend the definition of a node to accommodate relation labels in a Levi graph. We generate a Levi graph, a linearized form originally used in seq2seq models for AMR-to-text generation (Beck et al., 2018; Guo et al., 2019; Ribeiro et al., 2019). Our Levi graph approach differs from seq2seq approaches in its attention-based arc prediction, where arcs are predicted directly by attention heads instead of by brackets in the target sequence.
Approach

Text-to-Graph Transducer

Figure 1 shows an overview of our text-to-graph transduction model. Let W = {w_0, w_1, ..., w_n} be the input sequence, where w_0 is a special token representing the target node and w_i is the i'th token. W is fed into a Text Encoder creating embeddings {e^w_0, e^w_1, ..., e^w_n}. In parallel, NLP tools produce several features for each w_i and pass them to a Feature Encoder to generate {e^f_0, e^f_1, ..., e^f_n}. The embeddings {e^w_i ⊕ e^f_i : i ∈ [0, n]} are fed to a Text Transformer, which generates E^t = {e^t_0, e^t_1, ..., e^t_n}.
(Footnote: seq2seq-based parsing is sometimes categorized under "translation-based methods," possibly due to the prevalence of seq2seq models in neural machine translation; we believe, however, that translation refers to transduction between languages, whereas AMR is neither a language nor an interlingua.)

E^t and E^v (the embeddings of the previously predicted nodes V) are fed into a Graph Transformer that predicts the target node as well as its relations to all nodes in V. The target node predicted by the Graph Transformer is then appended to V.
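The encoding and incremental node-prediction loop described above can be sketched as follows. This is a minimal numpy sketch under assumed toy dimensions; the encoders are stand-in linear maps, and `predict_node` is a placeholder for the Graph Transformer and its decoders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): n+1 input positions (w_0 is the special
# target-node token), d_w-dim text embeddings, d_f-dim feature embeddings.
n, d_w, d_f = 8, 16, 4

e_w = rng.normal(size=(n + 1, d_w))  # Text Encoder output e^w_i
e_f = rng.normal(size=(n + 1, d_f))  # Feature Encoder output e^f_i

# Each position feeds the concatenation e^w_i (+) e^f_i to the Text Transformer.
x = np.concatenate([e_w, e_f], axis=-1)

# Stand-in for the Text Transformer: any sequence encoder mapping x -> E^t.
E_t = x @ rng.normal(size=(d_w + d_f, d_w))  # e^t_0 ... e^t_n

# Incremental parsing: the Graph Transformer consumes E^t plus the embeddings
# E^v of nodes predicted so far; each newly predicted node is appended to V.
V = []
for step in range(3):
    new_node = f"concept_{step}"  # placeholder for predict_node(E_t, V)
    V.append(new_node)
```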

Concept + Arc-Biaffine + Rel-Biaffine
Our first graph transformer generates V = {v_1, ..., v_m}, where v_i is a concept in the target graph, and predicts both arcs and labels using a biaffine decoder.
α_j indicates the probability of w_j being aligned to the target node, and β⊕ is the embedding representing the target node. Let C be the list of all concepts in the training data and L be the list of lemmas of the tokens in W such that |W| = |L|. Given X = C ⊕ W ⊕ L, α and β⊕ are fed into a Node Decoder estimating the score of each x_i ∈ X being the target node:

p(x_i) = p^c · P(x_i | C) + p^w · P(x_i | W) + p^l · P(x_i | L)

where p^{c|w|l} is the gate probability of the target node being in C|W|L, respectively, estimated from β⊕ with a learned projection W_{c|w|l} ∈ R^{d×1}; P(x_i | C) is a distribution over the concept list C predicted from β⊕, while P(x_i | W) and P(x_i | L) copy the alignment probabilities α onto the tokens and their lemmas.

For arc and label predictions, the target embedding β⊕ is used to represent a head, and the embeddings of previously predicted nodes, {e^v_1, ..., e^v_m}, are used to represent dependents in a Biaffine Decoder, which creates two output layers, o^arc ∈ R^{1×m} and o^rel ∈ R^{1×m×|R|}, to predict the target node being a head of the other nodes, where R is the list of all labels in the training data (Dozat and Manning, 2017).
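The gated estimation of p(x_i) can be sketched as a pointer-generator-style mixture. This is a toy numpy sketch; the exact parameterization of the gates and of the distributions over C, W, and L is our assumption, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
d, n_concepts, n_tokens = 16, 50, 8  # toy sizes (assumptions)

beta = rng.normal(size=d)  # target-node embedding beta_plus
W_c, W_w, W_l = (rng.normal(size=d) for _ in range(3))

# Gate probabilities p^c, p^w, p^l from the three d x 1 projections of beta.
p_c, p_w, p_l = softmax(np.array([W_c @ beta, W_w @ beta, W_l @ beta]))

P_C = softmax(rng.normal(size=n_concepts))  # distribution over concept list C
alpha = softmax(rng.normal(size=n_tokens))  # token alignment, reused for W and L

# Score of each candidate x_i in X = C (+) W (+) L being the target node.
p_x = np.concatenate([p_c * P_C, p_w * alpha, p_l * alpha])
```

Because the three gate probabilities sum to one and each component distribution is normalized, p_x is a proper distribution over the joint candidate list X.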

Concept + Arc-Attention + Rel-Biaffine
Our second graph transformer is similar to the one in §3.2, except that it uses an Arc Decoder instead of the Biaffine Decoder for arc prediction. Given the attention matrices A = {α_1, ..., α_h} from the h heads in §3.2, α⊗ ∈ R^{1×(m+1)} is created by first applying dimension-wise max-pooling to A and then slicing out the last m+1 dimensions:

α⊗ = maxpool(α_1, ..., α_h)[-(m+1):]

Notice that values in α⊗ are derived from multiple heads; thus, they are not normalized. Each head is expected to learn different types of arcs. During decoding, any v_i ∈ V whose α⊗_i ≥ 0.5 is predicted to be a dependent of the target node. During training, the negative log-likelihood of α⊗ is optimized. (This model still uses the Biaffine Decoder for label prediction.)
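The construction of α⊗ can be sketched as below. This is a toy numpy sketch; the shapes and the assumption that the m+1 node positions follow the n+1 token positions in the input are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
h, n, m = 4, 8, 5  # heads, tokens, previously predicted nodes (toy sizes)

# Attention rows of the target node from each of the h heads over the full
# input: n+1 tokens followed by m+1 nodes (the last being the target itself).
A = rng.uniform(size=(h, (n + 1) + (m + 1)))

# Dimension-wise max-pooling over heads, then slice the last m+1 positions.
alpha_arc = A.max(axis=0)[-(m + 1):]  # alpha_arc plays the role of alpha (x)

# Values come from different heads, so they need not sum to one; any v_i with
# alpha_arc[i] >= 0.5 is decoded as a dependent of the target node.
dependents = [i for i in range(m + 1) if alpha_arc[i] >= 0.5]
```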
The target node, say v_t, may need to be predicted as a dependent of v_i, in which case the dependency is reversed (so v_t becomes the head of v_i) and the label is concatenated with the special tag _R (e.g., ARG0 becomes ARG0_R).
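The _R convention can be sketched with a pair of hypothetical helpers (only the _R suffix itself comes from the text; the function names and triple representation are ours):

```python
def encode_reversed(head, dep, label):
    """Flip an arc so the target node becomes the head, tagging the
    label with the special _R suffix."""
    return dep, head, label + "_R"

def decode_arc(head, dep, label):
    """Map an _R-tagged arc back to its original direction and label;
    untagged arcs pass through unchanged."""
    if label.endswith("_R"):
        return dep, head, label[:-2]
    return head, dep, label
```

Round-tripping an arc through both helpers recovers the original triple, which is what makes the reversal safe to use during decoding.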

Levi Graph + Arc-Attention
Our last graph transformer uses the Node Decoder for both concept and label generation and the Arc Decoder for arc prediction. In this model, v_i ∈ V can be either a concept or a label, such that the original AMR graph is transformed into a Levi graph (Levi, 1942; Beck et al., 2018) (Figure 3). Unlike the node sequence used as the output for the models in §3.2 and §3.3, which contains only the concepts of the AMR graph ordered by breadth-first traversal, the node sequence in this model is derived by inserting the label of each edge after its head concept during training. This concept-label alternation has two advantages over a strict topological order: (i) it can handle erroneous cyclic graphs, and (ii) it is easier to restore relations, as each label is connected to its closest concept. The heterogeneous nature of node sequences from Levi graphs allows our Graph Transformer to learn attention among three types of input (tokens, concepts, and labels), leading to more informed predictions.
Let V′ be the output sequence consisting of both predicted concepts and labels, and let C′ be the set of all concepts and labels in the training data. Compared to V and C in §3.2, V′ is about twice as large as V, because every concept has one or more associated labels that indicate relations to its heads. However, C′ is not much larger than C, because the number of distinct labels is insignificant compared to the number of concepts already in C. By replacing V|C with V′|C′, respectively, the Node Decoder in §3.2 can generate both concepts and labels. α⊗ in §3.3 then gives attention scores among concepts and labels, which the Arc Decoder uses to find arcs among them.
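The concept-label alternation can be sketched as follows. The helper reflects our reading of the linearization (after each concept, emit the labels of its edges to concepts generated earlier); the function name is ours, and the edge ordering is chosen so the output reproduces the node sequence shown in Figure 4.

```python
def to_levi_sequence(concepts, edges):
    """Linearize an AMR graph into a Levi node sequence: after each concept
    (in generation order), insert the labels of its edges whose other
    endpoint was generated earlier."""
    seen, seq = [], []
    for c in concepts:
        seq.append(c)
        for head, label, dep in edges:
            # emit a label once both of its endpoint concepts are available
            if (head == c and dep in seen) or (dep == c and head in seen):
                seq.append(label)
        seen.append(c)
    return seq

# Graph for "The boy wants the girl to believe him".
concepts = ["want", "believe", "boy", "girl"]
edges = [
    ("want", "ARG1", "believe"),
    ("believe", "ARG1", "boy"),  # "him" corefers with the boy
    ("want", "ARG0", "boy"),
    ("believe", "ARG0", "girl"),
]
seq = to_levi_sequence(concepts, edges)
# -> ['want', 'believe', 'ARG1', 'boy', 'ARG1', 'ARG0', 'girl', 'ARG0']
```

Each label lands immediately after its closest concept, which is what makes relation restoration straightforward at decoding time.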

Experimental Setup
All models are experimented on both AMR 2.0 (LDC2017T10) and AMR 3.0 (LDC2020T02). AMR 2.0 has been well explored by recent work, while AMR 3.0, the latest release and about 1.5 times larger than 2.0, has not yet been explored much. Detailed data statistics are shown in Table A.1.2. The training, development, and test sets provided in the datasets are used, and performance is evaluated with the SMATCH score (F1) as well as fine-grained metrics (Damonte et al., 2017). The same pre- and post-processing suggested by Cai and Lam (2020) are adopted. Section A.2 gives the hyperparameter configuration of our models.

Results
All our models are run three times, and their averages and standard deviations are reported in Table 1. Compared to CL20, which uses two transformers to decode arcs and concepts and then applies attention across them, our models use one transformer for the Node Decoder, achieving both objectives simultaneously. All models except for ND+BD+BD reach the same SMATCH score of 80% on AMR 2.0. ND+AD+LV shows a slight improvement over the others on AMR 3.0, indicating that it has greater potential to be robust on a larger dataset. Considering that this model uses about 3M fewer parameters than CL20, these results are promising. ND+BD+BD consistently shows the lowest scores, implying the significance of modeling concept generation and arc prediction coherently for structure learning. ND+AD+LV shows higher scores for SRL and Reent, whereas the other models show an advantage on Concept and NER on AMR 2.0, although the trend is less noticeable on AMR 3.0, implying that the Levi graph helps parse relations but not necessarily tag concepts.

Case Study
We study the effect of our two proposed improvements, the heterogeneous Graph Transformer and the Levi graph, from the view of attention in Figure 4. Figure 4a shows that the core verb "wants" is heavily attended to by every token, suggesting that our Graph Transformer successfully grasps the core idea. Figure 4b presents the soft alignment between nodes and tokens, which surprisingly overweights "boy", "girl", and "believe", possibly due to their semantic dominance. Figure 4c illustrates arc prediction, a lower triangular matrix obtained by zeroing out the upper triangle of the stacked α⊗. Its diagonal suggests that self-loops are crucial for representing each node.

Figure 4: Self- and cross-attention for the tokens "The boy wants the girl to believe him" and the nodes "want believe ARG1 boy ARG1 ARG0 girl ARG0": (a) token-to-token, (b) node-to-token, (c) node-to-node.


Conclusion

We presented two effective approaches that achieve comparable (or better) performance compared with the state-of-the-art parsers while using significantly fewer parameters. Our text-to-graph transducer enables self- and cross-attention in one transformer, improving both concept and arc prediction. With a novel Levi graph formalism, our parser demonstrates its advantage on relation labeling. An interesting direction for future work is to preserve the benefits of both approaches in one model. It is also noteworthy that our Levi graph parser can be applied to a broad range of labeled graph parsing tasks, including dependency trees and many others.