AMR Parsing with Causal Hierarchical Attention and Pointers

Translation-based AMR parsers have recently gained popularity due to their simplicity and effectiveness. They predict linearized graphs as free texts, avoiding explicit structure modeling. However, this simplicity neglects structural locality in AMR graphs and introduces unnecessary tokens to represent coreferences. In this paper, we introduce new target forms of AMR parsing and a novel model, CHAP, which is equipped with causal hierarchical attention and the pointer mechanism, enabling the integration of structures into the Transformer decoder. We empirically explore various alternative modeling options. Experiments show that our model outperforms baseline models on four out of five benchmarks in the setting of no additional data.


Introduction
Abstract Meaning Representation (Banarescu et al., 2013) is a semantic representation of natural language sentences typically depicted as directed acyclic graphs, as illustrated in Fig. 1a. This representation is both readable and broad-coverage, attracting considerable research attention across various domains, including information extraction (Zhang and Ji, 2021; Xu et al., 2022), summarization (Hardy and Vlachos, 2018; Liao et al., 2018), and vision-language understanding (Schuster et al., 2015; Choi et al., 2022). However, the inherent flexibility of graph structures makes AMR parsing, i.e., translating natural language sentences into AMR graphs, a challenging task.
The development of AMR parsers has been boosted by recent research on pretrained sequence-to-sequence (seq2seq) models. Several studies, categorized as translation-based models, show that fine-tuning pretrained seq2seq models to predict linearized graphs as if they are free texts (e.g., examples in Tab. 1.a-b) can achieve competitive or even superior performance (Konstas et al., 2017; Xu et al., 2020; Bevilacqua et al., 2021; Lee et al., 2023). This finding has spurred a wave of subsequent efforts to design more effective training strategies that maximize the potential of pretrained decoders (Bai et al., 2022; Cheng et al., 2022; Wang et al., 2022; Chen et al., 2022), thereby sidelining the exploration of more suitable decoders for graph generation. Contrary to preceding translation-based models, we contend that explicit structure modeling within pretrained decoders remains beneficial in AMR parsing. To our knowledge, the Ancestor parser (Yu and Gildea, 2022) is the only translation-based model contributing to explicit structure modeling, which introduces shortcuts to access ancestors in the graph. However, AMR graphs contain more information than just ancestors, such as siblings and coreferences, resulting in suboptimal modeling.
In this paper, we propose CHAP, a novel translation-based AMR parser distinguished by three innovations. Firstly, we introduce new target forms of AMR parsing. As demonstrated in Tab. 1.c-e, we use multiple layers to capture different semantics, such that each layer is simple and concise. Particularly, the base layer, which encapsulates all meanings except for coreferences (or reentrancies), is a tree-structured representation, enabling more convenient structure modeling than the graph structure of AMR. Meanwhile, coreferences are represented through pointers, circumventing several shortcomings of the variable-based coreference representation (see Sec. 3 for more details) used in all previous translation-based models. Secondly, we propose Causal Hierarchical Attention (CHA), the core mechanism of our incremental structure modeling, inspired by Transformer Grammars (Sartran et al., 2022). CHA describes a procedure of continuously composing child nodes into their parent nodes and encoding new nodes with all uncomposed nodes, as illustrated in Fig. 2. Unlike the causal attention in translation-based models, which allows a token to interact with all its preceding tokens, CHA incorporates a strong inductive bias of recursion, composition, and graph topology. Thirdly, deriving from transition-based AMR parsers (Zhou et al., 2021a,b), we introduce a pointer encoder for encoding histories and a pointer net for predicting coreferences, which is proven to be an effective solution for generalizing to a variable-size output space (Vinyals et al., 2015; See et al., 2017).

We propose various alternative modeling options of CHA and strategies for integrating CHA with existing pretrained seq2seq models, and investigate them via extensive experiments. Ultimately, our model CHAP achieves superior performance on two in-distribution and three out-of-distribution benchmarks. Our code is available at https://github.com/LouChao98/chap_amr_parser.

AMR Parsing
Most recent AMR parsing models generate AMR graphs via a series of local decisions. Transition-based models (Ballesteros and Al-Onaizan, 2017; Naseem et al., 2019; Fernandez Astudillo et al., 2020; Zhou et al., 2021a,b) and translation-based models (Konstas et al., 2017; Xu et al., 2020; Bevilacqua et al., 2021; Lee et al., 2023) epitomize local models as they are trained with teacher forcing, optimizing only next-step predictions, and rely on greedy decoding algorithms, such as greedy search and beam search. Particularly, transition-based models predict actions permitted by a transition system, while translation-based models predict AMR graph tokens as free texts. Some factorization-based models are also local (Cai and Lam, 2019, 2020), sequentially composing subgraphs into bigger ones. We discern differences in four properties among previous local models and our model in Tab. 2:

Trainability Whether additional information is required for training. Transition-based models rely on word-node alignment to define the gold action sequence.
Structure modeling Whether structures are modeled explicitly in the decoder. Transition-based models encode action histories like texts without considering graph structures. Besides, translation-based models opt for compatibility with pretrained decoders, prioritizing this over explicit structure modeling.
Pretrained decoder Whether pretrained decoders can be leveraged.
Variable-free Whether there are variable tokens in the target representation.Transition-based models, factorization-based models and ours generate coreference pointers, obviating the need of introducing variables.

Transformer Grammar
Transformer Grammars (TGs; Sartran et al., 2022) are a novel class of language models that simultaneously generate sentences and constituency parse trees, in the fashion of transition-based parsing.
The base layer of Tab. 1.c can be viewed as an example action sequence. There are three types of actions in TG: (1) the token "(" represents the action ONT, opening a nonterminal; (2) the token ")" represents the action CNT, closing the nearest open nonterminal; and (3) all other tokens (e.g., a and :arg0) represent the action T, generating a terminal. TG carries out top-down generation, where a nonterminal is allocated before its children. We will also explore a bottom-up variant in Sec. 3.4.
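For illustration, a minimal sketch of this action inventory (our own example, assuming a whitespace-tokenized base layer; not TG's actual implementation):

```python
def to_tg_actions(base_layer: str):
    """Map base-layer tokens to TG actions:
    "(" -> ONT (open nonterminal), ")" -> CNT (close nonterminal), else -> T (terminal)."""
    actions = []
    for tok in base_layer.split():
        if tok == "(":
            actions.append("ONT")
        elif tok == ")":
            actions.append("CNT")
        else:
            actions.append("T")
    return actions

print(to_tg_actions("( alpha :arg0 ( beta ) )"))
# ['ONT', 'T', 'T', 'ONT', 'T', 'CNT', 'CNT']
```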
Several studies have already attempted to generate syntax-augmented sequences (Aharoni and Goldberg, 2017; Qian et al., 2021). However, TG differentiates itself from prior research through its unique simulation of stack operations in transition-based parsing, which is implemented by enforcing a specific instance of CHA. A TG-like CHA is referred to as ⇓double in this paper, and we will present technical details in Sec. 3.3 along with other variants.

Structure Modeling
We primarily highlight two advantages of incorporating structure modeling into the decoder. Firstly, the sequential order and adjacency of the previous linearized forms mismatch the locality of real graph structures, making it hard for the Transformer decoder to understand graph data. Specifically, adjacent nodes in an AMR graph exhibit strong semantic relationships, but they could be distant in the linearized form (e.g., person and tour-01 in Fig. 1a). Conversely, tokens closely positioned in the linearized form may be far apart in the AMR graph (e.g., employ-01 and tour-01 in Fig. 1a). Secondly, previous models embed variables into the linearized form (e.g., b in Tab. 1.a and <R1> in Tab. 1.b) and represent coreferences (or reentrancies) by reusing the same variables. However, the literal value of variables is inconsequential. For example, in the PENMAN form, (a / alpha :arg0 (b / beta)) conveys the same meaning as (n1 / alpha :arg0 (n2 / beta)). Furthermore, the usage of variables brings up problems regarding generalization (Wong and Mooney, 2007; Poelman et al., 2022). For instance, in the SPRING DFS form, <R0> invariably comes first and appears in all training samples, while <R100> is considerably less frequent.

Multi-layer Target Form
As shown in Tab. 1.c, we incorporate a new layer, named the coreference (coref) layer, on top of the conventional one produced by a DFS linearization, named the base layer. The coref layer serves to represent coreferences, in which a pointer points from a mention to its nearest preceding mention, and the base layer encapsulates all other meanings. From a graph perspective, a referent is replicated into as many new nodes as it has references, and these copies are linked by newly introduced coref pointers, as illustrated in Fig. 1b. We argue that our forms are more promising because they keep meaningless tokens (i.e., variables) from cluttering up the base layer, yielding several beneficial byproducts: (1) it shortens the representation length; (2) it aligns the representation more closely with natural language; and (3) it allows the base layer to be interpreted as trees, a vital characteristic for our structure modeling. Tab. 1.d and e are two variants of Tab. 1.c. These three forms are designed to support different variants of CHA, which will be introduced in Sec. 3.3.

Top-down generation

We consider two modeling options of top-down generation, ⇓single (Fig. 3c) and ⇓double (Fig. 3b), varying on the actions triggered by ")". More precisely, upon seeing a ")", ⇓single executes a compose, whereas ⇓double executes an additional expand after the compose. For other tokens, both ⇓single and ⇓double execute an expand operation. Because the decoder performs one attention for each token, in ⇓double, each ")" is duplicated to represent compose and expand respectively, i.e., ")" becomes ")₁ )₂". We detail the procedure of generating M_CHA for ⇓single in Alg. 1. The procedure for ⇓double can be found in Sartran et al.'s (2022) Alg. 1.
The motivation for the two variants is as follows. In a strict leaf-to-root information aggregation procedure, which is adopted in many studies on tree encoding (Tai et al., 2015; Drozdov et al., 2019; Hu et al., 2021; Zhou et al., 2022), a parent node only aggregates information from its children, remaining unaware of other generated structures (e.g., beta is unaware of alpha in Fig. 2b). However, when new nodes are being expanded, utilizing all available information could be a more reasonable approach (e.g., gamma in Fig. 2c). Thus, an expand process is introduced to handle this task. With CHA, the situation is more flexible. Recall that all child nodes are encoded with the expand action, which aggregates information from all visible nodes, such that information of non-child nodes is leaked to the parent node during composition. ⇓single relies on the neural network's capability to encode all necessary information through this leakage, while ⇓double employs an explicit expand to allow models to directly revisit their histories.
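The following is a rough Python reconstruction of the ⇓single mask construction sketched in Alg. 1 (an assumption-based reading of the compose and expand operations; the exact procedure is the one given in the algorithm):

```python
import numpy as np

def cha_mask_single(tokens):
    """Sketch of M_CHA for the top-down ⇓single variant.
    tokens: base-layer tokens, e.g. ["(", "alpha", ":arg0", "(", "beta", ")", ")"].
    Returns an N x N 0/1 mask; mask[i, j] = 1 means position i may attend to j."""
    n = len(tokens)
    mask = np.zeros((n, n), dtype=int)
    stack = []  # positions of not-yet-composed nodes
    for i, tok in enumerate(tokens):
        if tok == ")":
            # COMPOSE: attend to the current subtree (back to its opening bracket),
            # then the composed subtree is represented by this single position.
            while stack and tokens[stack[-1]] != "(":
                mask[i, stack.pop()] = 1
            if stack:                      # the matching "("
                mask[i, stack.pop()] = 1
            mask[i, i] = 1
            stack.append(i)
        else:
            # EXPAND: attend to every uncomposed node, then push itself.
            for j in stack:
                mask[i, j] = 1
            mask[i, i] = 1
            stack.append(i)
    return mask
```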

Bottom-up generation
In the bottom-up generation, the parent node is allocated after all child nodes have been generated. This process enables the model to review all yet-to-be-composed tokens before deciding which ones should be composed into a subtree, in contrast to the top-down generation, where the model is required to predict the existence of a parent node without seeing its children. The corresponding target form, as illustrated in Tab. 1.e, contains no brackets. Instead, a special token ■ is placed after the rightmost child node of each parent node, with a pointer pointing to the leftmost child node. We execute the compose operation for ■ and the expand operation for other tokens. The generation of the attention mask (Fig. 3d) is analogous to ⇓single, but we utilize pointers in place of left brackets to determine the left boundaries of subtrees. The exact procedure can be found in Appx. B.
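A minimal sketch of the analogous bottom-up construction, assuming a boundary array that gives the pointed left-child position for each ■ token and −1 elsewhere (our own reconstruction, not the procedure in Appx. B):

```python
import numpy as np

def cha_mask_bottom_up(tokens, boundary):
    """Sketch of M_CHA for the bottom-up ⇑ variant."""
    n = len(tokens)
    mask = np.zeros((n, n), dtype=int)
    stack = []  # positions of not-yet-composed nodes
    for i, tok in enumerate(tokens):
        if tok == "■":
            # COMPOSE: pop every uncomposed node back to the pointed left boundary.
            while stack and stack[-1] >= boundary[i]:
                mask[i, stack.pop()] = 1
            mask[i, i] = 1
            stack.append(i)  # the new parent node replaces its children
        else:
            # EXPAND: attend to all uncomposed nodes, then push itself.
            for j in stack:
                mask[i, j] = 1
            mask[i, i] = 1
            stack.append(i)
    return mask
```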

Parsing Model
Our parser is based on BART (Lewis et al., 2020), a pretrained seq2seq model. We make three modifications to BART: (1) we add a new module in the decoder to encode generated pointers, (2) we enhance decoder layers with CHA, and (3) we use the pointer net to predict pointers.

Encoding Pointers
The target form can be represented as a tuple (t, p), where t and p are the sequences of the base layer and the coref layer, respectively, such that each p_i is the index of the pointed token. We define p_i = −1 if there is no pointer at index i.
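For instance, a hypothetical tokenization in which two mentions of beta corefer (indices are 0-based and for illustration only):

```python
# Base layer t and coref layer p for "( alpha :arg0 beta :arg1 beta )",
# where the second "beta" corefers with the first one.
t = ["(", "alpha", ":arg0", "beta", ":arg1", "beta", ")"]
p = [ -1,     -1,      -1,     -1,      -1,      3,  -1]  # index 5 points back to index 3
```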
In the BART model, t is encoded using the token embedding. However, no suitable module exists for encoding p. To address this issue, we introduce a multi-layer perceptron, denoted as MLP_p, which takes in the token and position embeddings of the pointed tokens and then outputs the embedding of p. Notably, if p_i = −1, the embedding is set to 0. All embeddings, including those of t, p, and positions, are added together before being fed into subsequent modules.
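A minimal sketch of such a module, assuming the decoder's token and position embedding tables are reused as plain index-based lookups (the MLP depth and wiring here are our own choices, not necessarily the released implementation):

```python
import torch
import torch.nn as nn

class PointerEmbedding(nn.Module):
    """Sketch of MLP_p: embeds the coref layer p from the token and position
    embeddings of the pointed-to positions; outputs 0 where p_i = -1."""
    def __init__(self, embed_tokens: nn.Embedding, embed_positions: nn.Embedding, d_model: int):
        super().__init__()
        self.embed_tokens = embed_tokens        # shared with the decoder
        self.embed_positions = embed_positions  # shared with the decoder
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        # zero-init the last layer so the model starts equivalent to plain BART
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, t: torch.LongTensor, p: torch.LongTensor) -> torch.Tensor:
        # t, p: (batch, seq_len); p[i] = -1 means "no pointer at position i"
        has_ptr = (p >= 0)
        safe_p = p.clamp(min=0)
        pointed_tok = self.embed_tokens(torch.gather(t, 1, safe_p))
        pointed_pos = self.embed_positions(safe_p)
        emb = self.mlp(torch.cat([pointed_tok, pointed_pos], dim=-1))
        return emb * has_ptr.unsqueeze(-1).to(emb.dtype)  # zero where no pointer
```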

Augmenting Decoder with CHA
We explore three ways to integrate CHA into the decoder layer, as shown in Fig. 4. The inplace architecture replaces the attention mask of some attention heads with M_CHA in the original self-attention module without introducing new parameters. However, this affects the normal functioning of the replaced heads, such that the pretrained model is disrupted.
Alternatively, we can introduce adapters into decoder layers (Houlsby et al., 2019). In the parallel architecture, an adapter is introduced in parallel to the original self-attention module. In contrast, in the pipeline architecture, the adapter is positioned after the original module. Our adapter performs a CHA-masked attention with its own query/key/value projection matrices W_Q, W_K, W_V, together with a down-projection FFN_1 and an up-projection FFN_2, mapping the input hidden states h_i to the output hidden states h_o.
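A sketch of the parallel adapter under these definitions (the ordering of FFN_1, the CHA-masked attention, and FFN_2, as well as the dimensions, are assumptions that follow the description above rather than the released code):

```python
import torch
import torch.nn as nn

class ParallelCHAAdapter(nn.Module):
    """Sketch: a small multi-head attention restricted by M_CHA, wrapped in
    down/up projections, whose output is added to that of the pretrained
    self-attention module (parallel architecture)."""
    def __init__(self, d_model=768, d_adapter=512, n_heads=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_adapter)   # FFN_1
        self.attn = nn.MultiheadAttention(d_adapter, n_heads, batch_first=True)
        self.up = nn.Linear(d_adapter, d_model)     # FFN_2
        nn.init.zeros_(self.up.weight)              # zero-init FFN_2 (see Setup)
        nn.init.zeros_(self.up.bias)

    def forward(self, h_i: torch.Tensor, m_cha: torch.Tensor) -> torch.Tensor:
        # h_i: (batch, seq, d_model); m_cha: (seq, seq), 1 = visible, 0 = masked
        # (a single mask shared across the batch, for simplicity of the sketch)
        x = self.down(h_i)
        attn_mask = (m_cha == 0)                    # True = not allowed to attend
        x, _ = self.attn(x, x, x, attn_mask=attn_mask)
        return self.up(x)

# Parallel architecture: h_o = self_attn(h_i) + adapter(h_i, m_cha)
# Pipeline architecture: h_o = adapter(self_attn(h_i), m_cha)
```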

Predicting Pointer
Following previous work (Vinyals et al., 2015; Zhou et al., 2021b), we reinterpret decoder self-attention heads as a pointer net. However, unlike the previous work, we use the average attention probabilities from multiple heads as the pointer probabilities instead of relying on a single head. Our preliminary experiments indicate that this modification results in a slight improvement.
A cross-entropy loss between the predicted pointer probabilities and the ground truth pointers is used for training.We disregard the associated loss at positions that do not have pointers and exclude their probabilities when calculating the entire pointer sequence's probability.
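A sketch of this computation (assumption-based; the choice of heads and the handling of padded positions may differ in the actual implementation):

```python
import torch

def pointer_loss(attn_probs: torch.Tensor, gold_pointers: torch.Tensor) -> torch.Tensor:
    """attn_probs: (batch, n_heads, seq, seq) attention probabilities of the
        decoder self-attention heads reinterpreted as a pointer net.
    gold_pointers: (batch, seq) with the index of the pointed token, or -1
        where there is no pointer (those positions are ignored)."""
    # average several heads instead of relying on a single one
    ptr_probs = attn_probs.mean(dim=1)                       # (batch, seq, seq)
    log_probs = torch.log(ptr_probs.clamp_min(1e-9))
    has_ptr = (gold_pointers >= 0).float()
    safe_gold = gold_pointers.clamp(min=0)
    nll = -log_probs.gather(-1, safe_gold.unsqueeze(-1)).squeeze(-1)
    # positions without pointers contribute neither to the loss nor to the
    # probability of the whole pointer sequence
    return (nll * has_ptr).sum() / has_ptr.sum().clamp(min=1)
```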

Training and Inference
We optimize the sum of the standard sequence generation loss and the pointer loss, L = L_gen + α · L_ptr, where α is a scalar hyperparameter.
For decoding, the probability of a hypothesis is the product of the probabilities of the base layer, the coref layer, and the optional struct layer. We enforce a constraint during decoding to ensure the validity of M_CHA: the number of ")" should not surpass the number of "(". We also enforce two constraints to ensure the well-formedness of pointers: (1) coreference pointers can only point to positions with the same token, and (2) left-boundary pointers in bottom-up generation cannot point to AMR relations (e.g., :ARG0).
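These checks can be applied incrementally while expanding hypotheses; a minimal sketch with hypothetical helper names:

```python
def pointer_is_valid(tokens, i, j, bottom_up=False):
    """Check a pointer emitted at position i that targets position j."""
    if bottom_up and tokens[i] == "■":
        # left-boundary pointers may not target AMR relations such as :ARG0
        return not tokens[j].startswith(":")
    # coreference pointers may only target an identical token
    return tokens[j] == tokens[i]

def brackets_balanced_so_far(tokens):
    """")" must never outnumber "(" in any prefix, so that M_CHA stays valid."""
    depth = 0
    for tok in tokens:
        depth += (tok == "(") - (tok == ")")
        if depth < 0:
            return False
    return True
```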

Setup
Datasets We conduct experiments on two in-distribution benchmarks: (1) AMR 2.0 (Knight et al., 2017), which contains 36,521, 1,368, and 1,371 samples in the training, development, and test sets, and (2) AMR 3.0 (Knight et al., 2020), which has 55,635, 1,722, and 1,898 samples in the training, development, and test sets, as well as three out-of-distribution benchmarks: (1) The Little Prince (TLP), (2) BioAMR, and (3) New3. Besides, we also explore the effects of using silver training data following previous work. To obtain silver data, we sample 200k sentences from the One Billion Word Benchmark (Chelba et al., 2014) and use a trained CHAP parser to annotate AMR graphs.

Metrics We report the Smatch score (Cai and Knight, 2013) and other fine-grained metrics (Damonte et al., 2017) averaged over three runs with different random seeds. All these metrics are invariant to different graph linearizations, and higher values indicate better performance. Additionally, to provide a more accurate comparison, we include the standard deviation (std dev) if Smatch scores are close.
Pre-/post-processing Owing to the sparsity of wiki tags in the training set, we follow previous work to remove wiki tags from AMR graphs in pre-processing, and use the BLINK entity linker (Wu et al., 2020) to add wiki tags in post-processing. In the post-processing, we also use the amrlib software to ensure graph validity.
Implementation details We use the BART-base model in analytical experiments and the BART-large model in comparison with baselines. We modify all decoder layers when using the BART-base model, while only modifying the top two layers when using the BART-large model. For the parallel and pipeline architectures, attention modules in adapters have four heads and a hidden size of 512. For the inplace architecture, four attention heads are set to perform CHA. We reinterpret four self-attention heads of the top decoder layer as a pointer net. The weight for the pointer loss α is set to 0.075. We use a zero initialization for FFN_2 and MLP_p, such that the modified models are equivalent to the original BART model at the beginning of training. More details are available in Appx. D.

Baselines SPRING (Bevilacqua et al., 2021) is a BART model fine-tuned with an augmented vocabulary and improved graph representations (as shown in Tab. 1.b). Ancestor (Yu and Gildea, 2022) enhances the decoder of SPRING by incorporating ancestral information of graph nodes. BiBL (Cheng et al., 2022) and AMRBART (Bai et al., 2022) augment SPRING with supplementary training losses. LeakDistill (Vasylenko et al., 2023) trains a SPRING using leaked information and then distills it into a standard SPRING.
All these baselines are translation-based models.Transition-based and factorization-based models are not included due to their inferior performance.

Results on Alternative Modeling Options
Structural modeling We report the results of different CHA options in Tab. 3. ⇓double exhibits slightly better performance than ⇑ and ⇓single. Besides, we find that breaking structural localities, i.e., (1) allowing parent nodes to attend to nodes other than their immediate children (row 3, −0.13) and (2) allowing non-parent nodes to attend to nodes that have been composed (row 2, −0.07), negatively impacts performance. We present the attention masks of these two cases in Appx. A.2.
Architecture In Tab. 4, we can see that the inplace architecture has little improvement over the baseline, w/o CHA. This suggests that changing the functions of pretrained heads can be harmful. We also observe that the parallel architecture performs slightly better than the pipeline architecture. Based on the above results, we present CHAP, which adopts the parallel adapter and uses ⇓double.

Main Results
Tab. 5 shows results on in-distribution benchmarks.
In the setting of no additional data (such that LeakDistill is excluded), CHAP outperforms all previous models by 0.3 Smatch on AMR 2.0 and 0.5 on AMR 3.0. Regarding fine-grained metrics, CHAP performs best on five metrics for AMR 3.0 and three for AMR 2.0. Compared to previous work that uses alignment, CHAP matches LeakDistill on AMR 3.0 but falls behind it on AMR 2.0. One possible reason is that alignment as additional data is particularly valuable for the relatively small training set of AMR 2.0. We note that the contribution of LeakDistill is orthogonal to ours, and we can expect enhanced performance by integrating their method with our parser. When using silver data, the performance of CHAP on AMR 2.0 can be significantly improved, achieving similar performance to LeakDistill. This result supports the above conjecture. However, on AMR 3.0, the gain from silver data is marginal, as in previous work, possibly because AMR 3.0 is sufficiently large to train a model based on BART-large.
In the out-of-distribution evaluation, CHAP is competitive with all baselines on both TLP and Bio, as shown in Tab. 6, indicating CHAP's strong generalization ability thanks to the explicit structure modeling.

Ablation Study
An ablation study is presented in Tab. 7. The first four rows demonstrate that, for a model based on BART-base, when we exclude the encoding of pointers, the CHA adapters, and the separation … The two rightmost columns in Tab. 7 present the ablation study results on OOD benchmarks. We find that the three proposed components contribute variably across different benchmarks. Specifically, CHA consistently enhances generalization, while the other two components slightly reduce performance on TLP.

Conclusion
We have presented an AMR parser with new target forms of AMR graphs, explicit structure modeling with causal hierarchical attention, and the integration of pointer nets. We discussed and empirically compared multiple modeling options of CHA and ways of integrating CHA with the Transformer decoder. Ultimately, CHAP outperforms all previous models on in-distribution benchmarks in the setting of no additional data, indicating the benefit of structure modeling.

Limitations
We focus exclusively on training our models to predict AMR graphs from natural language sentences. Nevertheless, various studies suggest incorporating additional training objectives and strategies, such as BiBL and LeakDistill, to enhance performance. These methods can also be applied to our model.
There exist numerous other paradigms for semantic graph parsing, including Discourse Representation Structures. In these tasks, prior research often employs the PENMAN notation as the target form. These methodologies could also potentially benefit from our innovative target form and structure modeling. Although we do not conduct experiments on these tasks, results garnered from a broader range of tasks could provide more compelling conclusions.

… tokens, such as :arg0. Unlike other translation-based models, we do not add predicates (e.g., have-condition-91) into the vocabulary because they have negligible effects on performance according to our preliminary experiments.
We train our models for 50,000 steps, using a batch size of 16. This amounts to approximately 22 epochs on AMR 2.0 and around 15 epochs on AMR 3.0. We use an AdamW optimizer (Loshchilov and Hutter, 2019), accompanied by a cosine learning rate scheduler (Loshchilov and Hutter, 2017) with a warm-up phase of 5,000 steps. The peak learning rate is set to 5×10⁻⁵ for base models and 3×10⁻⁵ for large models. We use one NVIDIA TITAN V to train models based on BART-base, costing about 6 hours, and one NVIDIA A40 to train models based on BART-large, costing about 15 hours.
Figure 1: AMR of "Employees liked their city tour."

Figure 2: Demonstration of Causal Hierarchical Attention. We draw the aggregation on graphs performed at four steps in (a)-(d) and highlight the corresponding token for each step in (e) with green boxes. The generation order is depth-first and left-to-right: alpha → beta → delta → epsilon → gamma → zeta → eta → theta. The node of interest at each step is highlighted in blue, gathering information from all solid nodes. Gray dashed nodes, on the other hand, are invisible.
Figure 3: The target forms and the attention mask of the three variants of CHA for the tree in (a). (a) shows an example tree, where ♦ denotes a non-leaf node. Orange cells represent the compose operation, while blue cells with plaid represent the expand operation. White cells are masked out. The vertical and horizontal axes represent attending and attended tokens, respectively.

Figure 4: Three architectures for applying CHA to pretrained decoder layers. Residual connections and layer norms are omitted.

Figure 5: The target forms and the attention mask of two variants of ⇓single. Black squares mean that the cell is changed.

Figure 6: The structure modeled in the base layer. (a) The full tree in Fig. 2 without auxiliary tokens. (b) The tree related to gamma in Fig. 2 in the ⇓single form. (c) The tree related to gamma in Fig. 2 in the ⇑ form.

Table 1: Graph representations. PM and S-DFS denote the PENMAN form and the SPRING DFS form, the DFS-based linearization proposed by Bevilacqua et al. (2021), respectively. (c)-(e) are our proposed representations: (c) is for ⇓single (Fig. 3c), (d) is for ⇓double (Fig. 3b), and (e) is for ⇑ (Fig. 3d). Red pointers, which represent coreferences, constitute the coref layer. Blue pointers, which point to the left boundaries of subtrees, constitute the auxiliary structure layer. Texts, which encapsulate all other meanings, constitute the base layer.

Table 2: Strengths and shortcomings of different types of models.

Table 3: The influence of different CHA.

Table 4: The influence of different architectures.

Table 5: Fine-grained Smatch scores on in-domain benchmarks (columns: Extra Data, Smatch, NoWSD, Wiki., Conc., NER, Neg., Unlab., Reent., SRL). Bold and underlined numbers represent the best and the second-best results, respectively. "A" in the Extra Data column denotes alignment. *Std dev is 0.04.

Table 6: Test results on out-of-distribution benchmarks. The scores represented in grey cells derive from a model trained on AMR 2.0, whereas the remaining scores come from a model trained on AMR 3.0. −: New3 is part of AMR 3.0, so these settings are excluded from OOD evaluation. α: Std dev on TLP is 0.14. β: Std dev on Bio is 0.46.

Table 7: Ablation study results.