Probabilistic, Structure-Aware Algorithms for Improved Variety, Accuracy, and Coverage of AMR Alignments

We present algorithms for aligning components of Abstract Meaning Representation (AMR) graphs to spans in English sentences. We leverage unsupervised learning in combination with heuristics, taking the best of both worlds from previous AMR aligners. Our unsupervised models, however, are more sensitive to graph substructures, without requiring a separate syntactic parse. Our approach covers a wider variety of AMR substructures than previously considered, achieves higher coverage of nodes and edges, and does so with higher accuracy. We will release our LEAMR datasets and aligner for use in research on AMR parsing, generation, and evaluation.


Introduction
Research with the Abstract Meaning Representation (AMR; Banarescu et al., 2013), a broad-coverage semantic annotation framework in which sentences are paired with directed acyclic graphs, must contend with the lack of gold-standard alignments between words and semantic units in the English data. A variety of rule-based and statistical algorithms have sought to fill this void, with improvements in alignment accuracy often translating into improvements in AMR parsing accuracy (Pourdamghani et al., 2014; Naseem et al., 2019; Liu et al., 2018). Yet current alignment algorithms still suffer from limited coverage and less-than-ideal accuracy, constraining the design and accuracy of parsing algorithms. Where parsers use latent alignments (e.g., Lyu and Titov, 2018; Cai and Lam, 2020), explicit alignments can still facilitate evaluation and error analysis. Moreover, AMR-to-text generation research and applications using AMR stand to benefit from accurate, human-interpretable alignments.
We present Linguistically Enriched AMR (LEAMR) alignment, which achieves full graph coverage via four distinct types of aligned structures: subgraphs, relations, reentrancies, and duplicate subgraphs arising from ellipsis. This formulation lends itself to unsupervised learning of alignment models. Advantages of our algorithm and released alignments include: (1) much improved coverage over previous datasets, (2) increased variety of the substructures aligned, including alignments for all relations and alignments for diagnosing reentrancies, (3) alignments made between spans and connected substructures of an AMR, and (4) broader identification of spans, including named entities and verbal and prepositional multiword expressions.
Contributions are as follows:
• A novel all-inclusive formulation of AMR alignment in terms of mappings between spans and connected subgraphs, including spans aligned to multiple subgraphs; mappings between spans and inter-subgraph edges; and characterization of reentrancies. Together these alignments fully cover the nodes and edges of the AMR graph (§3).
• An algorithm combining rules and EM to align English sentences to AMRs without supervision (§5), achieving higher coverage and quality than existing AMR aligners (§7).
• A corpus with automatic alignments for LDC2020 and Little Prince data as well as a few hundred manually annotated sentences for tuning and evaluation (§4).
We release this dataset of alignments for over 60,000 sentences along with our aligner code to facilitate more accurate models and greater interpretability in future AMR research.

Related Work
The main difficulty presented by AMR alignment is that it is a many-to-many mapping problem, with gold alignments often mapping multiple tokens to multiple nodes while preserving AMR structure. Previous systems use various strategies for aligning. They also differ in which types of AMR substructures are aligned (nodes, subgraphs, or relations) and in what these are aligned to (individual tokens, token spans, or syntactic parses). Two main alignment strategies remain dominant, though they may be combined or extended in various ways: rule-based strategies as in Flanigan et al. (2014), Flanigan et al. (2016), Liu et al. (2018), and Szubert et al. (2018), and statistical strategies using Expectation-Maximization as in Pourdamghani et al. (2014).
JAMR. The JAMR system (Flanigan et al., 2014, 2016) aligns token spans to subgraphs using iterative application of an ordered list of 14 rules which include exact and fuzzy matching. JAMR alignments form a connected subgraph of the AMR by the nature of the rules being applied. A disadvantage of JAMR is that it lacks a method for resolving ambiguities, such as repeated tokens, or for learning novel alignment patterns.
ISI. The ISI system (Pourdamghani et al., 2014) produces alignments between tokens and nodes and between tokens and relations via an Expectation-Maximization (EM) algorithm in the style of IBM Model 2 (Brown et al., 1988). First, the AMR is linearized; then EM is applied using a symmetrized scoring function of the form P(a | t) + P(t | a), where a is any node or edge in the linearized AMR and t is any token in the sentence. Graph connectedness is not enforced for the elements aligning to a given token. Compared to JAMR, ISI produces more novel alignment patterns, but also struggles with rare strings such as dates and names, where a rule-based approach is more appropriate.
Extensions and Combinations. TAMR (Tuned Abstract Meaning Representation; Liu et al., 2018) uses the JAMR alignment rules, along with two others, to produce a set of candidate alignments for the sentence. Then, the alignments are "tuned" with a parser oracle to select the candidates that correspond to the oracle parse that is most similar to the gold AMR.
Some AMR parsers (Naseem et al., 2019; Fernandez Astudillo et al., 2020) use alignments which are a union of alignments produced by the JAMR and ISI systems. The unioned alignments achieve greater coverage, improving parser performance.
Syntax-based. Several alignment systems attempt to incorporate syntax into AMR alignments. Chen and Palmer (2017) perform unsupervised EM alignment between AMR nodes and tokens, taking advantage of a Universal Dependencies (UD) syntactic parse as well as named entity and semantic role features. Szubert et al. (2018) and Chu and Kurohashi (2016) both produce hierarchical (nested) alignments between AMR and a syntactic parse. Szubert et al. use a rule-based algorithm to align AMR subgraphs with UD subtrees. Chu and Kurohashi use a supervised algorithm to align AMR subgraphs with constituency parse subtrees.
Word Embeddings. Additionally, Anchiêta and Pardo (2020) use an alignment method designed to work well in low-resource settings using pretrained word embeddings for tokens and nodes.
Graph Distance. Wang and Xue (2017) use an HMM-based aligner to align tokens and nodes. They include in their aligner a calculation of graph distance as a locality constraint on predicted alignments. This is similar to our use of projection distance as described in §5.
Drawbacks of Current Alignments. Alignment methods vary in terms of components of the AMR that are candidates for alignment. Most systems either align nodes (e.g., ISI) or connected subgraphs (e.g., JAMR), with incomplete coverage. Most current systems do not align relations to tokens or spans, and those that do (such as ISI) do so with low coverage and performance. None of the current systems align reentrancies, although Szubert et al. (2020) developed a rule-based set of heuristics for identifying reentrancy types. Table 1 summarizes the coverage and variety of prominent alignment systems.

[Figure 1: AMR and alignments for the sentence "Most of the students want to visit New York when they graduate." Panels: Subgraph Alignments, Relation Alignments, Reentrancy Alignments, Duplicate Subgraphs. Alignments are differentiated by colors: blue (subgraphs), green (duplicate subgraphs), and orange (relations). Relations that also participate in reentrancy alignments are bolded.]
[…] annotation conventions can be opaque with respect to the words or surface structure of the sentence, e.g., by unifying coreferent mentions and making explicit certain elided or pragmatically inferable concepts and relations. Previous efforts toward general tools for AMR alignment have considered mapping tokens, spans, or syntactic units to nodes, edges, or subgraphs (§2). Other approaches to AMR alignment have targeted specific compositional formalisms (Groschwitz et al., 2018; Beschke, 2019; Blodgett and Schneider, 2019).
We advocate here for a definition of alignment that is principled (achieving full coverage of the graph structure) while being framework-neutral and easy to understand, by aligning graph substructures to shallow token spans on the form side, rather than using syntactic parses. We do use structural considerations to constrain alignments on the meaning side, but by using spans on the form side, we ensure the definition of the alignment search space is not at the mercy of error-prone parsers.
Definitions. Given a tokenized sentence w and its corresponding AMR graph G, a complete alignment assumes a segmentation of w into spans s, each containing one or more contiguous tokens; and puts each of the nodes and edges of G in correspondence with some span in s. A span may be aligned to one or more parts of the AMR, or else is null-aligned. Individual alignments for a sentence are grouped into four layers: subgraph alignments, duplicate subgraph alignments, relation alignments, and reentrancy alignments. These are given for an example in figure 1.
All alignments are between a single span and a substructure of the AMR. A span may be aligned in multiple layers which are designed to capture different information. Within the subgraph layer, alignments are mutually exclusive with respect to both spans and AMR components. The same holds true within the relation layer. Every node will be aligned exactly once between the subgraph and duplicate subgraph layers. Every edge will be aligned exactly once between the subgraph and relation layers, and may additionally have a secondary alignment in the reentrancy layer.
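The coverage constraints above can be made concrete with a small data structure. The following is a minimal sketch, not the released LEAMR format: the `Alignment` class, the layer names as strings, and the `coverage_violations` helper are all hypothetical names introduced here for illustration, and the toy graph is a simplified two-node AMR.

```python
from dataclasses import dataclass

NODE_LAYERS = ("subgraph", "duplicate")   # each node appears exactly once across these
EDGE_LAYERS = ("subgraph", "relation")    # each edge appears exactly once across these


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one alignment between a span and AMR components."""
    layer: str                      # "subgraph", "duplicate", "relation", or "reentrancy"
    span: tuple                     # (first_token, last_token), inclusive
    nodes: frozenset = frozenset()  # AMR variables covered by this alignment
    edges: frozenset = frozenset()  # (source, relation, target) triples


def coverage_violations(alignments, nodes, edges):
    """List every node or edge that is not aligned exactly once."""
    problems = []
    for n in sorted(nodes):
        hits = sum(n in a.nodes for a in alignments if a.layer in NODE_LAYERS)
        if hits != 1:
            problems.append(("node", n, hits))
    for e in sorted(edges):
        hits = sum(e in a.edges for a in alignments if a.layer in EDGE_LAYERS)
        if hits != 1:
            problems.append(("edge", e, hits))
    return problems


# Simplified toy graph for "students want": (w / want-01 :ARG0 (p / person))
nodes = {"w", "p"}
edges = {("w", ":ARG0", "p")}
alignments = [
    Alignment("subgraph", (0, 0), nodes=frozenset({"p"})),
    Alignment("subgraph", (1, 1), nodes=frozenset({"w"})),
    Alignment("relation", (1, 1), edges=frozenset({("w", ":ARG0", "p")})),
]
```

Reentrancy alignments are deliberately left out of the check, since they are secondary alignments of edges that are already covered in the relation layer.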

Subgraph Layer
Alignments in this layer generally reflect the lexical semantic content of words in terms of connected,¹ directed acyclic subgraphs of the corresponding AMR. Alignments are mutually exclusive (disjoint) on both the form and meaning sides.

Duplicate Subgraph Layer
A span may be aligned to multiple subgraphs if one is a duplicate of the others, with a matching concept. This is often necessary when dealing with ellipsis constructions, where there is more semantic content in the AMR than is pronounced in the sentence and thus several identical parts of the AMR must be aligned to the same span. In this case, a single subgraph is chosen as the primary alignment (whichever is first based on depth-first order) and is aligned in the subgraph alignment layer, and any others are represented in the duplicates alignment layer. For example, verb phrase ellipsis, as in I swim and so do you, would involve duplication of the predicate swim, with distinct ARG0s. Similarly, in figure 1, Most of the students involves a subset-superset structure where the subset and superset correspond to separate nodes. Because student is represented in AMR like person who studies, there are two 2-node subgraphs aligned to student, one with the variables p and s, and the duplicate with p2 and s2. The difficulty that duplicate subgraphs pose for parsing and generation makes it convenient to put these alignments in a separate layer.
¹ Nodes aligned to a span must form a connected subgraph with two exceptions: (1) duplicate alignments are allowed and are separated into subgraph and duplicate layers; (2) a span may be aligned to two terminal nodes that have the same parent. For example, never aligns to :polarity - :time ever, two nodes and two edges which share the same parent.

Relation Layer
This layer includes alignments between a span and a single relation (such as when → :time) and alignments mapping a span to its argument structure (such as give → :ARG0 :ARG1 :ARG2). All edges in an AMR that are not contained in a subgraph fit into one of these two categories.
English function words such as prepositions and subordinators typically function as connectives between two semantically related words or phrases, and can often be identified with the semantics of AMR relations. But many of these function words are highly ambiguous. Relation alignments make their contribution explicit. For example, when in figure 1 aligns to a :time relation.
For spans that are aligned to a subgraph, incoming or outgoing edges attached to that subgraph may also be aligned to the span in the relation layer. These can include core or non-core roles as long as they are evoked by the token span. For example, figure 1 contains visit → :ARG0 :ARG1.

Reentrancy Layer
A reentrant node is one with multiple incoming edges. In figure 1, for example, p appears three times: once as the ARG0 of w (the wanter), once as the ARG0 of v (the visitor), and once as the ARG0 of g (the graduate). The p node is labeled with the concept person; in the PENMAN notation used by annotators, each variable's concept is only designated on one occurrence of the variable, the choice of occurrence being, in principle, arbitrary. These three ARG0 relations are aligned to their respective predicates in the relation layer. But there are many different causes of reentrancy, and AMR parsers stand to benefit from additional information about the nature of each reentrant edge, such as the fact that the pronoun they is associated with one of the ARG0 relations.
The reentrancy layer "explains" the cause of each reentrancy as follows: for the incoming edges of a reentrant node, one of these edges is designated as PRIMARY. This is usually the first mention of the entity in a local surface syntactic attachment, e.g. the argument of a control predicate like want doubles as an argument of an embedded clause predicate. The remaining incoming edges to a reentrant node are aligned to a reentrancy trigger and labeled with one of 8 reentrancy types: coref, repetition, coordination, control, adjunct control, unmarked adjunct control, comparative control, and pragmatic. These are illustrated in table 2. These types, adapted from Szubert et al.'s (2020) classification, correspond to different linguistic phenomena leading to AMR reentrancies: anaphoric and non-anaphoric coreference, coordination, control, etc. The trigger is the word that most directly signals the reentrancy phenomenon in question. For the example in figure 1, the control verb want is aligned to the embedded predicate-argument relation and typed as CONTROL, while the pronoun they serves as the trigger for the third instance of p in when they graduate.

Validation
To validate the annotation scheme we elicited two gold-standard annotations for 40 of the test sentences described in §4 and measured interannotator agreement.² Interannotator exact-match F1 scores were 94.54 for subgraphs, 90.73 for relations, 76.92 for reentrancies, and 66.67 for duplicate subgraphs (details in appendix A).

Released Data
We release a dataset³ of the four alignment layers reflecting correspondences between English text and various linguistic phenomena in gold AMR graphs: subgraphs, relations (including argument structures), reentrancies (including coreference, control, etc.), and duplicate subgraphs.
Automatic alignments cover the ≈60,000 sentences of the LDC2020T02 dataset (Knight et al., 2020) and ≈1,500 sentences of The Little Prince.
We manually created gold alignments for evaluating our automatic aligner, split into a development set (150 sentences) and a test set (200 sentences).⁴ The test sentences were annotated from scratch; the development sentences were first automatically aligned and then hand-corrected. We stress that no preprocessing apart from tokenization is required to prepare the test sentences and AMRs for human annotation. We also release our annotation guidelines as a part of our data release.
[Figure 2: AMR for the sentence "The house₁ on the left is bigger than the house₂ on the right."]

LEAMR Aligner
We formulate statistical models for the alignment layers described above (subgraphs, duplicate subgraphs, relations, and reentrancies) and use the Expectation-Maximization (EM) algorithm to estimate probability distributions without supervision, with a decoding procedure that constrains aligned units to obey structural requirements. In line with Flanigan et al. (2014, 2016), we use rule-based preprocessing to align some substructures using string-matching, morphological features, etc. Before delving into the models and algorithm, we motivate two important characteristics:
Structure-Preserving. Constraints on legal candidates during alignment ensure that at any point only connected substructures may be aligned to a span. Thus, while our aligner is probabilistic like the ISI aligner, it has the advantage of preserving the AMR graph structure.
Projection Distance. The scores calculated for an alignment take into account a distance metric designed to encourage locality: tokens that are close together in a sentence are aligned to substructures that are close together in the AMR graph. We define the projection distance dist(n1,n2) between two neighboring nodes n1 and n2 to be the signed distance in the corresponding sentence between the span aligned to n1 and the span aligned to n2. This motivates the model to prefer alignments whose spans are close together when aligning nodes which are close together, which is particularly useful when a word occurs twice with identical subgraphs. Thus, our aligner relies on more information from the AMR graph structure than other aligners (note that the ISI system linearizes the graph).
Further details are given in §5.2.
⁴ Our test set consists of sentences from the test set of Szubert et al. (2018) but with AMRs updated to the latest release version. This test set contains a mix of English sentences drawn from the LDC data and The Little Prince (some sampled randomly, others hand-selected) as well as several sentences constructed to illustrate particular phenomena.
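To illustrate projection distance, here is a minimal sketch. The text specifies only that the distance is signed and measured in the sentence; measuring it as the difference between span start positions, and the sign convention (n2 minus n1), are assumptions made here for illustration. The function name and the node labels `h` and `l` are hypothetical.

```python
def projection_distance(node_to_span, n1, n2):
    """Signed token offset from the span aligned to n1 to the span aligned to n2.

    Spans are (first_token, last_token) pairs; the offset between span starts
    is an assumed convention, not necessarily the one used by the aligner.
    """
    start1, _ = node_to_span[n1]
    start2, _ = node_to_span[n2]
    return start2 - start1


tokens = "The house on the left is bigger than the house on the right .".split()

# Hypothetical neighboring nodes: h (a house concept) and l (left).
near = {"h": (1, 1), "l": (4, 4)}   # h aligned to the first "house"
far = {"h": (9, 9), "l": (4, 4)}    # h aligned to the second "house"

d_near = projection_distance(near, "h", "l")
d_far = projection_distance(far, "h", "l")
```

With a distance model that favors small offsets, the first candidate (offset 3) would be preferred over the second (offset -5), which is the behavior the paragraph above describes for repeated words.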

Overview
Algorithm 1 illustrates our base algorithm in pseudocode. The likelihood for a sentence decomposes over per-span alignment scores (a product in real space, equivalently a sum in log space): we write the score of a full set of a sentence's subgraph alignments A as
$$\mathrm{score}(A \mid G, w) = \prod_{i=1}^{N} \mathrm{score}(\langle g_i, s_i \rangle \mid G, w) \qquad (1)$$
where s_1, …, s_N are the N aligned spans in the sentence w, and g_1, …, g_N are the sets of subgraphs of the AMR graph G aligned to each span. For the relations model and the reentrancies model, each g_i consists of relations rather than subgraphs. Henceforth we assume all alignment scores are conditioned on the sentence and graph and omit w and G for brevity. The score(⋅) component of eq. (1) is calculated differently for each of the three models detailed below.
Alignment Pipeline. Alignment proceeds in the following phases, with each phase depending on the output of the previous phase:
1. Preprocessing: Using external tools we extract lemmas, parts of speech, and coreference.
2. Span Segmentation: Tokens are grouped into spans using a rule-based procedure (appendix B).
3. Align Subgraphs & Duplicate Subgraphs: We greedily identify subgraph and duplicate subgraph alignments in the same alignment phase (§5.2).
4. Align Relations: Relations not belonging to a subgraph are greedily aligned in this phase, using POS criteria to identify legal candidates (§5.3).
5. Align Reentrancies: Reentrancies are aligned in this phase, using POS and coreference in criteria for identifying legal candidates (§5.4).
The three main alignment phases use different models with different parameters; they also have their own preprocessing rules used to identify some alignments heuristically (appendices C to E).⁵ In training, parameters for each phase are iteratively learned and used to align the entire training set by running EM to convergence before moving on to the next phase. At test time, the pipeline can be run sentence-by-sentence.
Decoding. The three main alignment phases all use essentially the same greedy, substructure-aware search procedure. This searches over node-span candidate pairs based on the scoring function modeling the compatibility between a subgraph (or relation) g and span s, which we denote score(⟨g,s⟩). For each unaligned node (or edge), we identify a set of legal candidate alignments using phase-specific criteria. The incremental score improvement of adding each candidate (either extending a subgraph/set of relations already aligned to the span, or adding a completely new alignment) is calculated as ∆score = score(⟨g_0 ∪ {n},s⟩) − score(⟨g_0,s⟩), where g_0 is the current aligned subgraph, s is the span, and n is an AMR component being considered. Of the candidates for all unaligned nodes, the node-span pair giving the best score improvement is then greedily selected to add to the alignment. This is repeated until all nodes have been aligned (even if the last ones decrease the score). The procedure is detailed in algorithm 1 for subgraphs; the relations phase and the reentrancies phase use different candidates (respectively: unaligned edges; reentrant edges), different criteria for legal candidates, and different scoring functions.
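The greedy search described above can be sketched in a few lines. This is a simplified illustration, not the released implementation: it keeps at most one subgraph per span (no duplicate layer), and the `legal_candidates` and `score` callables, the `AFFINITY` table, and the toy candidate rule are all hypothetical stand-ins for the phase-specific criteria and models.

```python
def greedy_align(nodes, spans, legal_candidates, score):
    """Repeatedly add the (node, span) pair with the largest score improvement.

    legal_candidates(n, alignments) -> iterable of spans allowed for node n
    score(node_set, span)           -> compatibility score of the pair
    """
    alignments = {s: frozenset() for s in spans}
    unaligned = set(nodes)
    while unaligned:
        best = None
        for n in sorted(unaligned):
            for s in legal_candidates(n, alignments):
                current = alignments[s]
                delta = score(current | {n}, s) - score(current, s)
                if best is None or delta > best[0]:
                    best = (delta, n, s)
        if best is None:      # no remaining node has any legal candidate
            break
        _, n, s = best        # selected even if delta is negative
        alignments[s] = alignments[s] | {n}
        unaligned.remove(n)
    return alignments


# Toy scoring: additive log-style affinities between concepts and span strings.
AFFINITY = {
    ("want-01", "want"): 2.0, ("want-01", "students"): -1.0,
    ("person", "students"): 1.0, ("person", "want"): 0.5,
}
SPANS = ["students", "want"]


def toy_score(node_set, span):
    return sum(AFFINITY[(n, span)] for n in node_set)


def toy_candidates(node, alignments):
    # Toy rule: only spans that are still unaligned are legal.
    return [s for s in SPANS if not alignments[s]]


result = greedy_align(["want-01", "person"], SPANS, toy_candidates, toy_score)
```

Here the pair (want-01, want) is chosen first because it gives the largest improvement, after which person can only go to the remaining span.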

Aligning Subgraphs
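The score assigned to an alignment between a span and subgraph is calculated as
$$\mathrm{score}(\langle g, s \rangle) = P_{\text{align}}(g \mid s; \theta_1) \cdot \Big( \prod_{i=1}^{|D|} P_{\text{dist}}(d_i; \theta_2) \Big)^{1/|D|} \cdot \mathrm{IB}(g, s) \qquad (2)$$
where g is a subgraph, s is a span, d_i is the projection distance of g with its ith neighboring node, |D| is the number of such distances, and θ_1 and θ_2 are model parameters which are updated after each iteration. The subgraph g is represented in the model as a bag of concept labels and (parent concept, relation, child concept) triples.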
The distributions P_align and P_dist are inspired by IBM Model 2 (Brown et al., 1988), and can be thought of as graph-theoretic extensions of translation (align) and alignment (dist) probabilities. IB stands for inductive bias, explained below.
Legal Candidates. For each unaligned node n, the model calculates a score for spans of three possible categories: 1) unaligned spans; 2) spans aligned to a neighboring node (in this case, the aligner considers adding n to an existing subgraph if the resulting subgraph would be connected); 3) spans aligned to a node with the same concept as n (this allows the aligner to identify duplicate subgraphs; candidates in this category receive a score penalty because duplicates are quite rare, so they are generally the option of last resort).
Limiting the candidate spans in this way ensures only connected, plausible substructures of the AMR are aligned. To form a multinode subgraph alignment t_1 → n1 :rel n2, the aligner could first align n1 to an unaligned span t_1, then add n2, which is a legal candidate because t_1 is aligned to a neighboring node of n2 (ensuring a connected subgraph).
Distance. We model the probability of the projection distance P_dist(d; θ_2) using a Skellam distribution, which is the distribution of the difference D = N_1 − N_2 of two Poisson-distributed random variables and can be positive or negative valued. Parameters are updated based on alignments in the previous iteration. For each aligned neighbor n_i of a subgraph g, we calculate P_dist(dist(g, n_i); θ_2) and take the geometric mean of the probabilities as P_dist.
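The distance component can be illustrated with a small standard-library sketch. It assumes that θ_2 consists of the two Poisson rates (here `mu1` and `mu2`), evaluates the Skellam probability by a truncated sum over the two Poisson variables rather than with a Bessel function, and returns 1.0 when a subgraph has no aligned neighbors; all three choices, and the function names, are assumptions made for illustration.

```python
import math


def poisson_logpmf(n, mu):
    """log P(N = n) for N ~ Poisson(mu), with mu > 0."""
    return -mu + n * math.log(mu) - math.lgamma(n + 1)


def skellam_pmf(k, mu1, mu2, terms=200):
    """P(N1 - N2 = k) for independent N1 ~ Poisson(mu1), N2 ~ Poisson(mu2).

    Computed as a truncated sum over N2; N1 = N2 + k must be non-negative.
    """
    first = max(0, -k)
    total = 0.0
    for n2 in range(first, first + terms):
        total += math.exp(poisson_logpmf(n2 + k, mu1) + poisson_logpmf(n2, mu2))
    return total


def p_dist(distances, mu1, mu2):
    """Geometric mean of Skellam probabilities over all projection distances."""
    if not distances:
        return 1.0  # assumed neutral value when no neighbor is aligned yet
    logs = [math.log(skellam_pmf(d, mu1, mu2)) for d in distances]
    return math.exp(sum(logs) / len(logs))


# Hypothetical rates; small distances receive more probability than large ones.
close = p_dist([1, -1], mu1=2.0, mu2=2.0)
distant = p_dist([7, -8], mu1=2.0, mu2=2.0)
```

Using the geometric mean keeps the distance term on a comparable scale regardless of how many neighbors a subgraph has.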
Algorithm 1 Procedure for greedily aligning all nodes to spans using a scoring function that decomposes over (span, subgraph) pairs. (Scores are expressed in real space but the implementation is in log space.)
    …
    for n ∈ unaligned_nodes do
        candidate_spans ← get_legal_alignments(n, alignments)
        for span, i_subgraph ∈ candidate_spans do    ▷ either there is an edge between n and the indicated subgraph already aligned to span, or i_subgraph would be a new subgraph consisting of n
            current_aligned_nodes ← alignments[span][i_subgraph]    ▷ ∅ if this would be a new subgraph
            new_aligned_nodes ← current_aligned_nodes ∪ {n}
            ∆score ← get_score(span, new_aligned_nodes, alignments) − get_score(span, current_aligned_nodes, alignments)    ▷ change from adding n into a subgraph aligned to span; get_score queries score(⟨g, s⟩) and multiplies by λ_dup if i_subgraph > 1
            ∆scores.add( …
Null alignment. The aligner models the possibility of a span being unaligned using a fixed heuristic that depends on rank(s), where rank assigns 1 to the most frequent word, 2 to the 2nd most frequent, etc. Thus, the model expects that very common words are more likely to be null-aligned and rare words should almost always be aligned.⁶
Factorized Backoff. So that the aligner generalizes to unseen subgraph-span pairs, where P_align(g | s) = 0, we use a backoff factorization into components of the subgraph. In particular, the factors are empirical probabilities of (i) an AMR concept given a span string in the sentence, and (ii) a relation and child node concept given the parent node concept and span string. These co-occurrence probabilities p are estimated directly from the training sentence/AMR pairs (irrespective of latent alignments). The product is scaled by a factor λ. E.g., for a subgraph n1 :rel1 n2 :rel2 n3, where c_n is the concept of node n, we have
$$P_{\text{backoff}}(g \mid s) = \lambda \cdot p(c_{n1} \mid s) \cdot p(\text{:rel1}, c_{n2} \mid c_{n1}, s) \cdot p(\text{:rel2}, c_{n3} \mid c_{n1}, s)$$
Inductive bias. Lastly, to encourage good initialization, the score function includes an inductive bias which does not depend on EM-trained parameters. This inductive bias is based on the empirical probability of a node occurring in the same AMR with a span in the training data. We calculate inductive bias as an average of exponentiated PMIs, $\frac{1}{N} \sum_i \exp(\mathrm{PMI}(n_i, s))$, where N is the number of nodes in g, n_i is the ith node contained in the subgraph, and PMI(n_i, s) is the pointwise mutual information of n_i and s.
⁶ We allow several exceptions. For punctuation, words in parentheses, and spans that are coreferent to another span, the probability is 0.5. For repeated spans, the probability is 0.1.
Aligning Duplicate Subgraphs. On rare occasion a span should be aligned to multiple subgraphs (§3.2). To encourage the model to align a different span where possible, there is a constant penalty λ_dup for each additional subgraph aligned to a span beyond the first. Thus the score for a span s and its k subgraphs g_1, …, g_k is computed as:
$$\mathrm{score}(\langle g_1, \ldots, g_k, s \rangle) = \lambda_{\text{dup}}^{\,k-1} \prod_{i=1}^{k} \mathrm{score}(\langle g_i, s \rangle)$$
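The duplicate penalty and the inductive bias can be sketched as follows. This is an illustration under stated assumptions: the penalty is applied multiplicatively in real space, once per subgraph beyond the first; PMI is estimated from sentence-level co-occurrence counts of span strings and concepts; and the function names and the two-sentence toy corpus are hypothetical.

```python
import math
from collections import Counter


def span_score(subgraph_scores, lam_dup):
    """Score of a span with one or more subgraphs, penalizing each extra one."""
    return math.prod(subgraph_scores) * lam_dup ** (len(subgraph_scores) - 1)


def build_pmi(pairs):
    """pairs: one (span_strings, concepts) tuple per training sentence/AMR pair."""
    total = len(pairs)
    span_count, node_count, joint_count = Counter(), Counter(), Counter()
    for spans, concepts in pairs:
        spans, concepts = set(spans), set(concepts)
        span_count.update(spans)
        node_count.update(concepts)
        joint_count.update((c, s) for c in concepts for s in spans)

    def pmi(concept, span):
        joint = joint_count[(concept, span)]
        if joint == 0:
            return float("-inf")   # never co-occur: exp(PMI) becomes 0
        # log[ p(c, s) / (p(c) p(s)) ] with all probabilities as count / total
        return math.log(joint * total / (node_count[concept] * span_count[span]))

    return pmi


def inductive_bias(concepts, span, pmi):
    """Average of exponentiated PMIs between the span and each node concept."""
    return sum(math.exp(pmi(c, span)) for c in concepts) / len(concepts)


corpus = [
    (["dog", "barks"], ["dog", "bark-01"]),
    (["cat", "sleeps"], ["cat", "sleep-01"]),
]
pmi = build_pmi(corpus)
```

In this toy corpus, a one-node subgraph {dog} paired with the span "dog" gets a bias of about 2, while adding the unrelated concept cat halves it.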

Aligning Relations
For a given relation alignment between a span and a collection of edges, we calculate a score as follows:
$$\mathrm{score}(\langle a, s \rangle) = P_{\text{align}}(a \mid s; \theta_3) \cdot \Big( \prod_{d \in D_1} P_{\text{dist}}(d; \theta_4) \Big)^{1/|D_1|} \cdot \Big( \prod_{d \in D_2} P_{\text{dist}}(d; \theta_5) \Big)^{1/|D_2|}$$
where a is the argument structure (the collection of aligned edges), s is a span, D_1 is the projection distances of each edge and its parent, D_2 is the projection distances of each edge and its child, and θ_3, θ_4, and θ_5 are model parameters. The collection of edges a is given a normalized label which represents the relations contained in the alignment (distinguishing incoming versus outgoing relations, and normalizing inverse edges).
[Table 3: Main results on the test set. N represents the denominator of exact alignment recall. There are 2860 gold spans in total, 41% of which are null-aligned and 0.6% of which are aligned to multiple subgraphs. 95% of the spans consist of a single token, and 49% of spans are aligned to a single subgraph consisting of a single node.]
Legal Candidates. There are two kinds of candidate spans for relation alignment. First, previously unaligned spans⁷ (with no relation or subgraph alignments), e.g. prepositions and subordinating conjunctions such as in → :location or when → :time. Second, any spans aligned to the relation's parent or child in the subgraph layer: this facilitates alignment of argument structures such as give → :ARG0 :ARG1 :ARG2. Additionally, we constrain certain types of edges to only align with the parent and others to only align with the child.
Distance. For relations there are potentially two distances of interest: the projected distance of the relation from its parent and the projected distance of the relation from its child. We model these separately as parent distance and child distance with distinct parameters. To see why this is useful, consider the sentence "Should we meet at the restaurant or at the office?", where each at token should be aligned to a :location edge. In English, prepositions like at precede an object and follow a governor. Thus parent distance tends to be to the left (negative valued) while child distance tends to be to the right (positive valued).
⁷ We constrain these to particular parts of speech: prepositions (IN), infinitival to (TO), possessives (POS), and possessive pronouns (PRP$). Additionally, only spans that are between the spans aligned to the parent and any descendant of child nodes of the relation (and are not between the child's aligned span and any of its descendants' spans) are allowed. This works well in practice for English.

Aligning Reentrancies
Legal Candidates. There are 8 reentrancy types (§3.4). For each type, a rule-based test determines if a span and edge are permitted to be aligned. The 8 tests use part of speech, the structure of the AMR, and subgraph and relation alignments. A span may be aligned (rarely) to multiple reentrancies, but these alignments are scored separately.

Experimental Setup
Sentences are preprocessed with the Stanza library (Qi et al., 2020) to obtain lemmas, part-of-speech tags, and named entities. We identify token spans using a combination of named entities and a fixed list of multiword expressions (details are given in appendix B). Coreference information, which is used to identify legal candidates in the reentrancy alignment phase, is obtained using […]. Lemmas are used in each alignment phase to normalize representation of spans, while parts of speech and coreference are used to restrict legal candidates in the relation and reentrancy alignment phases. We tune hyperparameters, including penalties for duplicate alignments and our factorized backoff probability, on the development set.

Results
Table 3 describes our main results on the 200-sentence test set (§4), reporting exact-match and partial-match alignment scores as well as span identification F1 and coverage.⁹ The partial alignment evaluation metric is designed to be more forgiving of arbitrary or slight differences between alignment systems. We argue that this metric is more comparable across alignment systems. It assigns partial credit equal to the product of Jaccard indices, $\frac{|N_1 \cap N_2|}{|N_1 \cup N_2|} \cdot \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$, for nodes (or edges) and tokens respectively. This partial credit is calculated for each gold alignment and the closest matching predicted alignment with nodes (or edges) N_1 and N_2 and tokens T_1 and T_2. Coverage is the percentage of relevant AMR components that are aligned.
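As a worked illustration of the partial-credit metric, the sketch below computes the product of the two Jaccard indices and picks the closest predicted alignment for a gold alignment. Representing an alignment as a (components, tokens) pair, treating two empty sets as a perfect match, and the function names are assumptions made here.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def partial_credit(gold, pred):
    """Product of Jaccard indices over AMR components and over tokens.

    gold and pred are (components, tokens) pairs, where components are
    nodes (or edges) and tokens are token indices.
    """
    return jaccard(gold[0], pred[0]) * jaccard(gold[1], pred[1])


def best_partial_credit(gold, predictions):
    """Credit for a gold alignment against its closest predicted alignment."""
    return max((partial_credit(gold, p) for p in predictions), default=0.0)


gold = ({"p", "s"}, {3})
predictions = [({"p"}, {3}), ({"p", "s"}, {3, 4, 5})]
credit = best_partial_credit(gold, predictions)
```

The first prediction misses one of two nodes but matches the tokens (0.5 × 1), while the second matches the nodes but covers three tokens instead of one (1 × 1/3), so the first is taken as the closest match.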
Our aligner shows improvements over previous aligners in terms of coverage and accuracy even when using a partial credit metric for evaluation. We demonstrate greater coverage, including coverage of phenomena not aligned by previous systems. Table 4 shows detailed results for relation subtypes and reentrancy subtypes. Here, we see room for improvement. In particular, ISI outperforms our system at aligning single relations. Our reentrancy aligner lacks a baseline to compare to, but the breakdown of results by type suggests there are several categories of reentrancies where scores could be improved.
Qualitative Analysis. A number of errors from our subgraph aligner resulted from unseen multiword expressions in our test data that our span preprocessing failed to recognize and our aligner failed to align. For example, the expression "on the one hand" appears in test and should be aligned to contrast-01. The JAMR aligner suffers without a locality bias; we notice several cases where it misaligns words that are repeated in the sentence. The ISI aligner generally does not align very frequent nodes such as person, thing, country, or name, resulting in generally lower coverage. It also frequently aligns disconnected nodes with the same concept to one token instead of separate tokens. While our relation aligner yields significantly higher coverage, we do observe that the model is overeager to align relations to extremely frequent prepositions (such as to and of), resulting in lower precision of single relations in particular.
[Table 5: Results when the aligner is trained without projection distance probabilities (−distance) and without the subgraph inductive bias (−inductive bias), as well as a relation aligner with access to gold (instead of trained) subgraphs.]
⁹ A previous draft of this work reported lower scores on relations before a constraint was added to improve the legal candidates for relation alignment.
Ablations. Table 5 shows that projection distance is valuable, adding 1.20 points (exact align F1) for subgraph alignment and 0.57 points for relation alignment. Despite showing anecdotal benefits in early experiments, the inductive bias does not aid the model in a statistically significant way. Using gold subgraphs for relation alignment produces an improvement of over 5 points, indicating the scope of error propagation for the relation aligner.

Conclusions
We demonstrate structure-aware AMR aligners that combine the best parts of rule-based and statistical methods for AMR alignment. We improve on previous systems in terms of accuracy and particularly in terms of alignment coverage and variety of AMR components to be aligned.
Table 6 illustrates interannotator agreement for each of the four alignment layers.

B Identifying Spans
As a preprocessing step, sentences have their tokens grouped into spans based on three criteria, outlined in detail below:
1. Named entity spans identified by Stanza.
2. Spans matching multiword expressions (MWEs) from a fixed list of ≈1600:
(a) 143 prepositional MWEs from STREUSLE (Schneider and Smith, 2015)
(b) 348 verbal MWEs from STREUSLE
(c) 1095 MWEs taken from gold AMRs in LDC train data (any concept which is a hyphenated compound of multiple words, e.g., alma-mater or white-collar) that are not present in the above lists
(d) ≈12 hand-added MWEs
3. Any sequence of tokens which is an exact match to a name in the gold AMR (e.g., "United Kingdom" and (n/name :op1 "United" :op2 "Kingdom")) is also treated as a span.
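The list-based part of this step (criteria 2 and 3) can be sketched as a lookup over token sequences. This is only an illustration under stated assumptions: `group_spans`, `KNOWN_SEQUENCES`, and `max_len` are hypothetical names, matching is assumed to be case-insensitive, greedy, and left-to-right with a preference for the longest match, and the named entity spans from Stanza are omitted.

```python
def group_spans(tokens, known_sequences, max_len=5):
    """Group tokens into (start, end) spans, end exclusive.

    At each position the longest listed sequence is preferred; any token
    that starts no listed sequence becomes a single-token span.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_len = 1  # default: a single-token span
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in known_sequences:
                match_len = n
                break
        spans.append((i, i + match_len))
        i += match_len
    return spans


# Hypothetical lookup set: one MWE plus one name taken from the gold AMR.
KNOWN_SEQUENCES = {("on", "the", "one", "hand"), ("united", "kingdom")}
tokens = "On the one hand , the United Kingdom agreed".split()
spans = group_spans(tokens, KNOWN_SEQUENCES)
```

With this input, "On the one hand" and "United Kingdom" each become one span, and every other token forms its own span.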

C.1 Token matching
We use five phases of rule-based alignment which attempt to align particular spans to particular AMR subgraphs:
1. Exact token matching: If there is a unique full string correspondence between a span and a name or number in the AMR, they are aligned.
2. Exact lemma matching: If there is a unique correspondence between an AMR concept and the lemma of a span (which in the case of a multiword span is the sequence of lemmas of the tokens joined by hyphens), they are aligned.
3. Prefix token matching: A span with a prefix match of length 6, 5, or 4 is aligned if it uniquely corresponds to an AMR named entity.
4. Prefix lemma matching: A span with a prefix match of length 6, 5, or 4 of its lemma is aligned if it uniquely corresponds to an AMR concept.
5. English rules: Several hand-written rules for matching English strings to specific subgraphs are used to match constructions such as dates, currency, and some frequent AMR concepts with many different ways of being expressed, such as and and -.
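The lemma-based phases (exact lemma matching followed by prefix lemma matching) can be sketched as below. This is a simplified illustration, not the released implementation: the function names are hypothetical, concepts are represented as plain label strings, and stripping a trailing sense number such as -01 before comparison is an assumption.

```python
import re


def concept_key(concept):
    """Strip a trailing sense number, e.g. 'decide-01' -> 'decide' (assumption)."""
    return re.sub(r"-\d+$", "", concept)


def unique_match(span_key, concepts, prefix_len=None):
    """Return the single matching concept, or None if zero or several match."""
    if prefix_len is None:
        hits = [c for c in concepts if concept_key(c) == span_key]
    else:
        if len(span_key) < prefix_len:
            return None
        prefix = span_key[:prefix_len]
        hits = [c for c in concepts if concept_key(c).startswith(prefix)]
    return hits[0] if len(hits) == 1 else None


def align_span(lemmas, concepts):
    """Try exact lemma matching, then prefix matching of length 6, 5, and 4."""
    key = "-".join(lemmas)  # multiword spans join token lemmas with hyphens
    match = unique_match(key, concepts)
    for prefix_len in (6, 5, 4):
        if match is not None:
            break
        match = unique_match(key, concepts, prefix_len)
    return match


concepts = ["decide-01", "person", "white-collar"]
exact = align_span(["decide"], concepts)
multiword = align_span(["white", "collar"], concepts)
by_prefix = align_span(["decision"], concepts)
ambiguous = align_span(["run"], ["run-01", "run-02"])
```

The uniqueness requirement is what keeps these rules conservative: when two nodes carry a matching concept, as in the last call, the span is left unaligned at this stage.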

D.1 Token matching
Some relations take the form :prep-X or :conj-X where X is a preposition or conjunction in the sentence. We use exact match to align these relations as a preprocessing step. The relations :poss and :part may be automatically aligned to 's or of if the correspondence is unique within a sentence.
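A minimal sketch of these two rules follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: a :prep-X or :conj-X relation is left unaligned when X does not occur exactly once, and "unique" for :poss and :part is read as exactly one candidate token ('s or of) in the sentence. Multiword prepositions are not handled.

```python
def align_prep_conj(relation, tokens):
    """Return the index of token X for :prep-X / :conj-X, or None.

    Assumption: X must occur exactly once in the sentence.
    """
    for prefix in (":prep-", ":conj-"):
        if relation.startswith(prefix):
            word = relation[len(prefix):]
            positions = [i for i, t in enumerate(tokens) if t.lower() == word]
            return positions[0] if len(positions) == 1 else None
    return None


def align_poss_part(relation, tokens):
    """Return the index of the only 's / of token for :poss or :part, or None."""
    if relation not in (":poss", ":part"):
        return None
    positions = [i for i, t in enumerate(tokens) if t.lower() in ("'s", "of")]
    return positions[0] if len(positions) == 1 else None


tokens = ["He", "spoke", "against", "the", "plan", "of", "the", "mayor"]
prep_index = align_prep_conj(":prep-against", tokens)
poss_index = align_poss_part(":poss", tokens)
```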

E Rule-based Reentrancy Alignment Preprocessing
Primary edges are identified as a preprocessing step before aligning reentrancies with the following rules: Any relation which is aligned to the same span as its token (any incoming edge which is a part of a span's argument structure) is automatically made the primary edge. Otherwise, for each edge pointing to a node, we identify the spans aligned to the parent and child nodes in the subgraph layer. Whichever edge has the shortest distance between the span aligned to the parent and the span aligned to the child is identified as the primary edge. In the event of a tie, the edge whose parent is aligned to the leftmost span is identified as the primary edge. Primary reentrancy edges are always aligned to the same span the edge is aligned to in the relation layer of alignments.
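The selection rules above can be sketched as follows. This is one reading of the procedure under explicit assumptions: each span is represented by a single integer position, distance is the absolute difference between positions, the first rule is interpreted as "the edge's relation-layer span equals the span aligned to the node it points to", and the dictionary keys are hypothetical.

```python
def primary_edge(edges, child_span):
    """Pick the primary edge among the incoming edges of one node.

    edges: list of dicts with keys 'name', 'parent_span', 'relation_span'.
    child_span: position of the span aligned to the node in the subgraph layer.
    """
    # Rule 1 (assumed reading): an edge aligned to the node's own span wins.
    for edge in edges:
        if edge["relation_span"] == child_span:
            return edge
    # Rule 2: shortest parent-to-child distance; ties go to the leftmost parent.
    return min(
        edges,
        key=lambda e: (abs(e["parent_span"] - child_span), e["parent_span"]),
    )


edges = [
    {"name": ":ARG1", "parent_span": 5, "relation_span": 5},
    {"name": ":ARG0", "parent_span": 1, "relation_span": 1},
]
chosen = primary_edge(edges, child_span=3)
```

In this example both parents are two positions away from the child span, so the tie-break selects the edge whose parent is aligned further left.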