A Differentiable Relaxation of Graph Segmentation and Alignment for AMR Parsing

Abstract Meaning Representation (AMR) is a broad-coverage semantic formalism that represents sentence meaning as a directed acyclic graph. To train most AMR parsers, one needs to segment the graph into subgraphs and align each such subgraph to a word in a sentence; this is normally done during preprocessing, relying on hand-crafted rules. In contrast, we treat both alignment and segmentation as latent variables in our model and induce them as part of end-to-end training. As marginalizing over the structured latent variables is infeasible, we use the variational autoencoding framework. To ensure end-to-end differentiable optimization, we introduce a differentiable relaxation of the segmentation and alignment problems. We observe that inducing segmentation yields substantial gains over using a 'greedy' segmentation heuristic. The performance of our method also approaches that of a model that relies on the segmentation rules of Lyu and Titov (2018), which were hand-crafted to handle individual AMR constructions.

An AMR graph can be regarded as consisting of multiple concept subgraphs, which can be individually aligned to sentence tokens (Flanigan et al., 2014). In Figure 1, each dashed box represents the boundary of a single semantic subgraph. Red arrows represent the alignment between subgraphs and tokens. For example, '(o / opine-01 :ARG1 (t / thing))' refers to a combination of the predicate 'opine-01' and a filler of its semantic role ARG1. Intuitively, this subgraph needs to be aligned to the token 'opinion'. Similarly, '(b / boy)' should be aligned to the token 'boy'. Given such an alignment and segmentation, it is straightforward to construct a simple parser: parsing can be framed as tagging input tokens with subgraphs (including empty subgraphs), followed by predicting relations between the subgraphs. The key obstacle to training such an AMR parser is that the segmentation and alignment between AMR subgraphs and words are latent, i.e. not annotated in the data.
Most previous work adopts a pipeline approach to handling this obstacle: it relies on a pre-learned aligner (e.g., Pourdamghani et al., 2014) to produce the alignment and applies a rule system to segment the AMR graph (Flanigan et al., 2014; Werling et al., 2015; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Peng et al., 2015; Artzi et al., 2015; Groschwitz et al., 2018). While Lyu and Titov (2018) jointly optimize the parser and the alignment model, rules handling specific constructions still needed to be crafted to segment the graph. These segmentation rules are relatively complex (e.g., the rules of Lyu and Titov (2018) targeted 40 different AMR subgraph types) and language-dependent. AMR was never intended to be used as an interlingua (Banarescu et al., 2013; Damonte and Cohen, 2018), and AMR banks for individual languages substantially diverge from English AMR. For example, Spanish AMR represents pronouns and ellipsis differently from the English one (Migueles-Abraira et al., 2018). As new AMR sembanks in languages other than English are being developed (Anchiêta and Pardo, 2018; Song et al., 2020), domain-specific AMR extensions are being created (Bonn et al., 2020; Bonial et al., 2020), and extra constructions are being introduced into AMR (Bonial et al., 2018), eliminating the need for rules by learning graph segmentation from scratch is becoming an important problem.
We propose to optimize a graph-based parser that treats the alignment and graph segmentation as latent variables. The graph-based parser consists of two parts: concept identification and relation identification. The concept identification model generates the AMR nodes, and the relation identification component decides on the labeled edges. During training, both components rely on the latent alignment and segmentation, which are induced simultaneously. Importantly, at test time, the parser simply tags the input with subgraphs and predicts the relations, so there is no test-time overhead from using the latent-structure apparatus. An extra benefit of this approach, in contrast to encoder-decoder AMR models (Konstas et al., 2017; van Noord and Bos, 2017; Cai and Lam, 2020), is its transparency, as one can readily see which input token triggers each subgraph.
To develop our parser, we frame the alignment and segmentation problems as choosing a generation order of concept nodes, as we explain in Section 2.2. As marginalization over the latent generation orders is infeasible, we adopt the variational auto-encoder (VAE) framework (Kingma and Welling, 2014). Intuitively, a trainable neural module (the encoder in the VAE) is used to sample a plausible generation order (i.e., a segmentation plus an alignment), which is then used to train the parser (the decoder in the VAE). As one cannot 'differentiate through' a sample of discrete variables to train the encoder, we introduce a differentiable relaxation which makes our objective end-to-end differentiable.
We experiment on the AMR 2.0 and 3.0 datasets. We compare to a greedy segmentation heuristic, inspired by Naseem et al. (2019), which produces a segmentation deterministically and provides a strong baseline for our segmentation induction method. We also use a version of our model with segmentation induction replaced by a hand-crafted rule-based segmentation system from previous work; it can be thought of as an upper bound on how well induction can work. On AMR 2.0 (LDC2016E25), we found that our VAE system obtained a competitive Smatch score of 76.1, reducing the gap between using the segmentation heuristic (75.2) and the rules exploiting prior knowledge about AMR (76.8). On AMR 3.0 (LDC2020T02), the VAE system gets even closer to the rule-based system (75.5 vs 75.7), possibly because the rules were designed for AMR 2.0. Our main contributions are:
• we frame the alignment and segmentation problems as inducing a generation order, and provide a continuous relaxation of this discrete optimization problem;
• we empirically show that our method outperforms a strong heuristic baseline and approaches the performance of a complex hand-crafted rule system.
Our method makes very few assumptions about the nature of the graphs, so it may be effective in other tasks that can be framed as graph prediction (e.g., executable semantic parsing, Liang 2016, or scene graph prediction, Xu et al. 2017).
2 Casting Alignment and Segmentation as Choosing a Generation Order

Preliminaries
We start by introducing the basic concepts and notation. We refer to the words in a sentence as x = (x_0, …, x_{n−1}), where n is the sentence length. The concepts (i.e. labeled nodes) are v = (v_0, …, v_m), where m is the number of concepts. In particular, v_m = ∅ denotes a dummy terminal node; its purpose will become clear in Section 2.2, where we define the generative model. We refer to all nodes, except for the terminal node (∅), as concept nodes. A relation between 'predicate concept' i and 'argument concept' j is denoted by E_ij; it is set to ∅ if j is not an argument of i. We use E to denote all edges (i.e. relations) in the graph, and we refer to the whole AMR graph as G = (v, E).
Our goal is to associate each input token with a (potentially empty) subset of the concept nodes in the AMR graph, while making sure that we get a partition of the node set. In other words, each node in the original AMR graph belongs to exactly one subset. In this way, we deal with both segmentation and alignment. Each subset uniquely corresponds to a vertex-induced subgraph (i.e., the subset of nodes together with any edges whose endpoints are both in this subset). For this reason, we will refer to the problem as graph decomposition3 and to each subset as a subgraph. We will explain how we deal with the edges of the AMR graph in Section 3.2.

Generation Order
We choose a subset of nodes for each token by assigning an order in which the nodes are selected for each subset. In Figure 2, dashed red arrows point from every node to the subsequent node to be selected. For example, given the word 'opinion', the node 'opine-01' is chosen first, and it is followed by another node, 'thing'. After this node, we have an arrow pointing to the node ∅, signifying that we finished generating nodes aligned to the word 'opinion'. We refer to these red arrows as a generation order.

3 We slightly abuse the terminology as, in graph theory, graph decomposition usually refers to a partition of edges, rather than nodes, of the original graph.
A generation order determines a graph decomposition. To recover it from a generation order, we assign connected nodes (excluding the terminal node) to the same subgraph. Then, a subgraph will be aligned to the token that generated those nodes. In our example, 'opine-01' and 'thing' are connected, and, thus, they are both aligned to the word 'opinion'. The alignment is encoded by arrows between tokens and concept nodes, while the segmentation is represented by arrows between concept nodes.
From a modeling perspective, the nodes will be generated with an autoregressive model, which is easy to use at test time (Figure 3). From each token, a chain of nodes is generated until the stop symbol ∅ is predicted. It is more challenging to see how to induce the order and train the autoregressive model at the same time; we will discuss this in Sections 3 and 4.

Constraints. While in Figure 2 the red arrows determine a valid generation order, in general, the arrows have to obey certain constraints. Formally, we denote alignment by A ∈ {0, 1}^{n×(m+1)}, where A_ki = 1 means that for token k we start by generating node i. As a token can only point to one node, we have a constraint Σ_i A_ki = 1. Similarly, for a segmentation S ∈ {0, 1}^{m×(m+1)} we have a constraint Σ_j S_ij = 1. Setting S_ij = 1 indicates that node i is followed by node j. In Figure 2, we have A_03 = A_10 = A_23 = A_33 = A_42 = 1 and S_01 = S_13 = S_23 = 1; the rest is 0. Now, we have the full generation order as their concatenation O = [A; S] ∈ {0, 1}^{(n+m)×(m+1)}. As each node can only be generated once (except for ∅), we have a joint constraint: ∀j ≠ m, Σ_l O_lj = 1. Furthermore, the graph defined by O should be acyclic, as it represents the generative process. We denote the set of all valid generation orders as O. In the following sections, we will discuss how this generation order is used in the model and how to infer it as a latent variable while enforcing the above constraints.
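The constraints above can be checked mechanically. A minimal NumPy sketch (our illustration; the matrices encode the Figure 2 example, with n = 5 tokens, m = 3 concept nodes, and column m as the terminal ∅):

```python
import numpy as np

def is_valid_generation_order(A, S):
    """Check the discrete constraints on O = [A; S] from Section 2.2."""
    n, m_plus_1 = A.shape
    m = m_plus_1 - 1
    O = np.concatenate([A, S], axis=0)          # (n + m) x (m + 1)
    if not np.all(O.sum(axis=1) == 1):          # every row points to exactly one successor
        return False
    if not np.all(O[:, :m].sum(axis=0) == 1):   # each concept node generated exactly once
        return False
    # Acyclicity: the node-to-node part S[:, :m] must be nilpotent
    # (T^m = 0 iff there is no directed cycle among concept nodes).
    T = S[:, :m]
    reach = np.eye(m, dtype=int)
    for _ in range(m):
        reach = reach @ T
    return not reach.any()

# Figure 2 example: tokens 0..4, concept nodes 0..2, terminal column 3.
A = np.zeros((5, 4), dtype=int)
S = np.zeros((3, 4), dtype=int)
A[0, 3] = A[1, 0] = A[2, 3] = A[3, 3] = A[4, 2] = 1
S[0, 1] = S[1, 3] = S[2, 3] = 1
print(is_valid_generation_order(A, S))  # True
```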

Our Model
Formally, we aim at estimating P_θ(v, E | x), the likelihood of an AMR graph given the sentence. Our graph-based parser is composed of two parts: concept identification P_θ(v | x, O) and relation identification P_θ(E | x, O, v). The concept identification model generates concept nodes, and the relation identification model assigns relations between them. Both require the latent generation order O at training time. Overall, we have the following objective:

P_θ(v, E | x) = Σ_{O ∈ O} P_θ(O) P_θ(v | x, O) P_θ(E | x, O, v),   (2)

where P_θ(O) is a prior on the generation orders, discussed in Section 4.2. To efficiently optimize this objective end-to-end, as will be discussed in Section 4, we need to ensure that both the concept and relation identification models admit a relaxation, i.e., they should be well-defined for real-valued O.
In the following subsections, we go through concept identification, relation identification, and their corresponding relaxations.

Concept Identification
As shown in Figure 4, our neural model first encodes the sentence with a BiLSTM, producing token representations h^token_k (k ∈ {0, …, n−1}), and then generates nodes autoregressively at each token with another LSTM.
In training, we need to be able to run the models with any potential generation order and compute P_θ(v | x, O). If we take the order defined in Figure 2, node 1 ('thing') is predicted relying on the corresponding hidden representation; we refer to this representation as h^node_1, where 1 is the node index. With the discrete generation order defined by the red arrows in Figure 2, h^node_1 is just the LSTM state of its parent (i.e. 'opine-01'). However, to admit relaxations, our computation should be well-defined when the generation order O is soft (i.e. attention-like). In that case, h^node_1 will be a weighted sum of LSTM representations of other nodes and input tokens, where the weights are defined by O. Similarly, the termination symbol ∅ for the token 'opinion' is predicted from its hidden representation; we refer to this representation as h^tail_1, where 1 is the position of 'opinion' in the sentence. With the hard generation order of Figure 2, h^tail_1 is just the LSTM state computed after choosing the preceding node (i.e. 'thing'). In the relaxed case, it will again be a weighted sum with the weights defined by O.
Formally, the probability of the concept identification step can be decomposed into the probability of generating m concept nodes and n terminal nodes (one for each token):

P_θ(v | x, O) = Π_{i=0}^{m−1} P_θ(v_i | h^node_i) Π_{k=0}^{n−1} P_θ(∅ | h^tail_k).   (3)

Representation h^node_i is computed as the weighted sum of the LSTM states of preceding nodes as defined by O (recall that O = [A; S]):

h^node_i = Σ_{j=0}^{m−1} S_ji LSTM(h^node_j, v_j) + Σ_{k=0}^{n−1} A_ki h^token_k.   (4)

Note that the preceding node can be either a concept node (then the output of the LSTM, consuming the preceding node, is used) or a word (then we use its contextualized encoding). The first term in Equation 4 corresponds to the former situation, and the second one to the latter. Note that this expression is 'recursive': each node's representation h^node_i is computed based on the representations of all the nodes h^node_j, i, j ∈ {0, …, m−1}. Iterating the assignment defined by Equation 4 for a valid discrete generation order (i.e., a DAG, like the one given in Figure 2) will converge to a stationary point. Crucially, in this discrete case, the stationary point will be equal to the result of applying the autoregressive model (as used at test time, see Figure 4). The stationary point will be reached after T steps, where T is the number of nodes in the largest subgraph.4 This 'message passing' process is fully differentiable and, importantly, well-defined for a relaxed generation order where A_ki and S_ji are non-binary. The equivalence between the train-time message passing and the test-time autoregressive computation with discrete O prevents a gap between training and testing, as long as the optimization converges to a near-discrete solution.
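The fixed-point iteration of Equation 4 can be sketched as follows. This is our simplified illustration: the LSTM cell is replaced by a generic differentiable update `f`, and embeddings are plain vectors; for a discrete, acyclic order the iteration reproduces the autoregressive states.

```python
import numpy as np

def message_passing(A, S, h_token, node_emb, f, T):
    """Iterate the (relaxed) assignment of Equation 4:
        h_node[i] <- sum_j S[j, i] * f(h_node[j], v[j]) + sum_k A[k, i] * h_token[k]
    A: (n, m+1) alignment, S: (m, m+1) segmentation, h_token: (n, d),
    node_emb: (m, d) node embeddings, f: update function, T: number of steps
    (the depth of the deepest subgraph suffices in the discrete case)."""
    m = S.shape[1] - 1
    d = h_token.shape[1]
    h_node = np.zeros((m, d))
    for _ in range(T):
        upd = np.stack([f(h_node[j], node_emb[j]) for j in range(m)])  # (m, d)
        # first term: weighted sum over preceding nodes; second: over tokens
        h_node = S[:, :m].T @ upd + A[:, :m].T @ h_token
    return h_node
```

With a hard order, the first node of each subgraph ends up with its token's encoding and each subsequent node with the update applied to its parent's state, exactly as the test-time autoregressive model would compute.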
The representations h^tail_k, needed for the terms P_θ(∅ | h^tail_k) in Equation 3, are computed as:

h^tail_k = Σ_{j=0}^{m−1} B_jk LSTM(h^node_j, v_j) + A_km h^token_k,   (5)

where B_jk = 1 denotes that concept node j is the last concept node before generating ∅ for token k, and B_jk = 0 otherwise (the second term handles tokens that generate ∅ directly, i.e. empty subgraphs). E.g., in Figure 2, we have B_11 = B_42 = 1, and the others are 0. Again, in the discrete case, the result will be exactly equivalent to what is obtained by running the corresponding autoregressive model (as at test time, Figure 4), but the computation is also well-defined and differentiable in the relaxed case, where the B_jk are real-valued.
While it is clear how S_ji and A_ki in Equation 4 and B_jk in Equation 5 are defined for discrete O = [A; S], we show how they can be defined for relaxed (non-binary) O in Appendix B. The MLPs used to compute P_θ(v_i | h^node_i) and P_θ(∅ | h^tail_i) are also defined there.

Relation Identification
Similarly to Lyu and Titov (2018), we use an arc-factored model for relation identification (i.e. predicting AMR edges):

P_θ(E_ij | x, O, v) ∝ exp(MLP_θ(g_i • g_j)),  g_i = h^node_i • Σ_{k=0}^{n−1} A^∞_ki h^token_k,

where • denotes concatenation, h^node_i is defined in Section 3.1, and A^∞_ki determines whether node i is in the subgraph aligned to token k or not. Note that this is different from A_ki, which encodes that node i is the first node in the subgraph (e.g., in Figure 2, A_11 = 0 but A^∞_11 = 1). In the continuous case, as used during training, A^∞_ki can be thought of as an alignment probability and can be computed from O (see Appendix C).

Estimating Latent Generation Order
We show how to estimate the latent generation order jointly with the parser, as also illustrated in Figure 5.

Variational Inference
In Equation 2, marginalization over O is intractable due to the neural parameterization of P_θ(v | x, O) and P_θ(E | x, O, v). Instead, we resort to the variational auto-encoder (VAE) framework (Kingma and Welling, 2014). VAEs optimize a lower bound on the marginal likelihood:

log P_θ(v, E | x) ≥ E_{Q_φ(O|G,x)}[log P_θ(v | x, O) + log P_θ(E | x, O, v)] − KL(Q_φ(O|G, x) || P_θ(O)),   (8)

where KL is the Kullback-Leibler divergence, and Q_φ(O|G, x) (the encoder, aka the inference network) is a distribution parameterized with a neural network. The lower bound is maximized with respect to both the original parameters θ and the variational parameters φ. The distribution Q_φ(O|G, x) can be thought of as an approximation to the intractable posterior distribution P_θ(O|G, x).

Stochastic Softmax
In order to estimate the gradient with respect to the encoder parameters φ, we use the perturb-and-MAP framework (Papandreou and Yuille, 2011; Hazan and Jaakkola, 2012), specifically the stochastic softmax (Paulus et al., 2020), which is a generalization of the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2017) to the structured case.
With stochastic softmax, instead of sampling O directly, we independently compute logits W ∈ R^{(n+m)×(m+1)} for all the potential edges in the generation order and perturb them:

W̃ = W + ε = F_φ(G, x) + ε,  ε ∼ G(0, 1),   (9)

where F_φ is a neural module computing the logits (see Section 4.2.2), G(0, 1) is the standard Gumbel distribution, and ε ∈ R^{(n+m)×(m+1)}. Then, the perturbed logits W̃ are fed into a constrained convex optimization problem:

O(W̃, τ) = argmax_{O ∈ conv(O)} ⟨W̃, O⟩ + τ H(O),   (10)

where conv(O) is a linear programming (LP) relaxation of the constraints discussed in Section 2.2, in which we permit continuous-valued O, and H(O) is the entropy of O. Importantly, this LP relaxation is 'tight' and ensures that O(W̃, 0) is a valid generation order.5 Now, as we will show in the next section, the solution to this optimization O(W̃, τ) can be obtained with a differentiable computation; thus, we write:

O_φ(ε, G, x) = O(F_φ(G, x) + ε, τ).   (11)

The entropy regularizer, weighted by τ > 0 ('the temperature'), ensures differentiability with respect to W̃ and, thus, with respect to φ, as needed to train the encoder.
We still need to handle the KL term in Equation 8. We define the prior probability P_θ(O) implicitly by setting W = 0 in the stochastic softmax framework. Even then, KL(Q_φ(O|G, x) || P_θ(O)) cannot be easily computed. Following Mena et al. (2018), we upper bound it by replacing it with KL(G(W, 1) || G(0, 1)), which is available in closed form.
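The paper does not spell out the closed form; for unit-scale Gumbels it can be derived elementwise as W + e^{−W} − 1 (our derivation, from the density G(x; μ, 1) = exp(−(x−μ) − e^{−(x−μ)})). A sketch:

```python
import numpy as np

def gumbel_kl(W):
    """Elementwise KL(Gumbel(W, 1) || Gumbel(0, 1)), summed over all logits.
    Derivation sketch: E_p[log p - log q] = W + E_p[e^{-x}] (1 - e^{W}),
    and E_p[e^{-x}] = e^{-W}, giving W + e^{-W} - 1 (nonnegative, 0 at W = 0)."""
    W = np.asarray(W, dtype=float)
    return np.sum(W + np.exp(-W) - 1.0)

print(gumbel_kl(np.zeros((3, 4))))  # 0.0 -- the prior corresponds to W = 0
```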

Bregman's Method
To solve the optimization problem in Equation 11, we iterate the following steps:

O^(0) = exp(W̃ / τ),
O^(t+1/2)_lj = O^(t)_lj / Σ_{j′} O^(t)_{lj′}   (row normalization),
O^(t+1)_lj = O^(t+1/2)_lj / Σ_{l′} O^(t+1/2)_{l′j} for j ≠ m,  O^(t+1)_lm = O^(t+1/2)_lm   (column normalization).

Intuitively, the initialization exponentiates the logits W̃, without taking the constraints into account, and then alternating optimization is used to 'fit' the constraints on columns and rows.
See Appendix I for a proof, based on the proof of the Bregman method (Bregman, 1967). In practice, we take T = 50 and set O_φ(ε, G, x) = O^(T). Importantly, this algorithm is highly parallelizable and amenable to a batched implementation on GPU. We compute the gradients with unrolled optimization.
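The alternating-normalization iterations can be sketched as follows (a simplified NumPy version of this Sinkhorn-style procedure; it assumes only the terminal column m is exempt from the column constraint and omits masking):

```python
import numpy as np

def relaxed_order(W_tilde, tau=1.0, T=50):
    """Approximately solve the entropy-regularized problem by alternating
    Bregman projections: exponentiate the perturbed logits, then repeatedly
    renormalize the concept-node columns (each node generated once; the
    terminal column is unconstrained) and the rows (each row picks one
    successor)."""
    O = np.exp(W_tilde / tau)                            # unconstrained init
    m = W_tilde.shape[1] - 1
    for _ in range(T):
        O[:, :m] /= O[:, :m].sum(axis=0, keepdims=True)  # column constraints
        O /= O.sum(axis=1, keepdims=True)                # row constraints
    return O

rng = np.random.default_rng(0)
O = relaxed_order(rng.normal(size=(7, 4)), tau=0.5)
```

Lower temperatures push the relaxed solution toward the vertices of the polytope, i.e. toward discrete generation orders.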

Neural Parameterization
We introduce the neural modules used for estimating the logits W = F_φ(G, x), as well as the masking mechanism that both ensures acyclicity and enables the use of the copy mechanism. We have

W = [h^token; RelGCN(G)] [RelGCN(G); h^end]^T + W^mask,

where RelGCN is a relational graph convolutional network (Schlichtkrull et al., 2018) that takes an AMR graph G and produces embeddings of its nodes informed by their neighbourhood in G, h^end ∈ R^{1×d} is the trainable embedding of the terminal node, and h^token ∈ R^{n×d} is the BiLSTM encoding of the sentence from Section 3.1.
The masking also consists of two parts, an alignment mask and a segmentation mask: W^mask = [A^mask; S^mask]. If a node is copy-able from at least one token, the alignment mask prohibits alignments from other tokens by setting the corresponding components A^mask_ij to −∞. Acyclicity is ensured by setting S^mask so that generation orders with cycles get negative infinity in Equation 11. While there may be more general ways to encode acyclicity (Martins et al., 2009), we simply perform a depth-first search (DFS) from the root node6 and permit an edge from node i to node j only if i precedes j (not necessarily immediately) in the traversal. In other words, S^mask_ij is set to −∞ for edges (i, j) violating this constraint. The rest of the components of S^mask are set to 0. Note that this masking approach does not require changes to the optimization method.
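The DFS-based acyclicity mask can be sketched as follows (our illustration, not the authors' code; the graph is given as an adjacency list, node 0 is assumed to be the root, and the terminal column is omitted):

```python
import numpy as np

def dfs_segmentation_mask(adj, m):
    """Build S_mask over concept nodes: S_mask[i, j] = 0 if i precedes j in a
    DFS preorder from the root (node 0), else -inf. Any set of edges that
    respects a fixed traversal order cannot form a directed cycle."""
    order, seen, stack = [], set(), [0]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        stack.extend(reversed(adj.get(u, [])))  # visit children left-to-right
    pos = {u: t for t, u in enumerate(order)}
    S_mask = np.full((m, m), -np.inf)
    for i in range(m):
        for j in range(m):
            if pos.get(i, m) < pos.get(j, m):
                S_mask[i, j] = 0.0
    return S_mask
```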

Parsing
While we relied on the latent-variable machinery to train the parser, we do not use it at test time. In fact, the encoder Q_φ(O|G, x) is discarded after training. At test time, the first step is to predict a set of concept nodes for every token using the concept identification model P_θ(v | x, O) (as shown in Figure 4). Note that the token-specific autoregressive models can be run in parallel across tokens. The second step is predicting relations between all the nodes, relying on the relation identification model P_θ(E | x, O, v).

Experiments
We experiment on LDC2016E25 (AMR 2.0) and LDC2020T02 (AMR 3.0). The evaluation is based on Smatch (Cai and Knight, 2013) and the evaluation tool of Damonte et al. (2017). We compare our generation-order induction framework to pre-set segmentations, i.e., segmentations produced in a preprocessing step. We vary the segmentation method while keeping the rest of the model identical to our full model (i.e., the same autoregressive model and the learned alignment). We provide ablation studies for our induction framework. We further provide visualizations of the induced generation orders, along with extra details, in the Appendix.

Rule-based Segmentation
We introduce a hand-crafted rule-based segmentation method, which relies on rules designed to handle specific AMR constructions. In particular, we use the hand-crafted segmentation system of Lyu and Titov (2018), or, more specifically, its re-implementation by Zhang et al. (2019a). Arguably, this can be thought of as an upper bound for how well an induction method can do. This fixed segmentation can be incorporated into our latent-generation-order framework, so that the alignment between concept nodes and the tokens will still be induced. This is achieved by fixing S, while still inducing A.
Greedy Segmentation We provide a greedy segmentation strategy that serves as a deterministic baseline. Many nodes are aligned to tokens with the copy mechanism; we force the unaligned nodes to join their neighbors. This is very similar to the forced alignment of unaligned nodes used in the transition-based parser of Naseem et al. (2019). Again, this segmentation can be incorporated into our latent-generation-order framework by fixing S and inducing A. See Appendix E for extra details about the strategy.

Results
In Table 1, we compare our models with recent AMR parsers (Xu et al., 2020a; Cai and Lam, 2019, 2020; Zhang et al., 2019a; Naseem et al., 2019; Lindemann et al., 2020; Lee et al., 2020), as well as Lyu and Titov (2018), which we build on, and van Noord and Bos (2017), the earliest model which does not exploit any rules. Overall, our model ('full') performs competitively, but lags behind the scores reported by some of the very recent parsers.7 However, except for a no-rule version of Cai and Lam (2020), all these models either use rules (Lee et al., 2020) (see Section 7) or specialized pretraining (Xu et al., 2020a). Both our VAE model and the rule-based segmentation achieve high concept identification scores (Damonte et al., 2017). The relation identification component is, however, weaker than, e.g., that of Cai and Lam (2020). This may not be surprising: following Lyu and Titov (2018), we score edges independently, whereas Cai and Lam (2020) perform iterative refinement, which is known to boost performance on relations (Lyu et al., 2019). Also, we use BiLSTM encoders, which, while cheaper to train and easier to tune, are likely weaker than the Transformer encoders used by Astudillo et al. and Lee et al. While these modifications, along with extra pre-training techniques and data augmentation, may further boost the performance of our model, we believe that our model is strong enough for our purposes, i.e. demonstrating that an informative segmentation can be induced without relying on any rules.
Indeed, our approach beats the greedy baseline and approaches the rule-based system. The performance gap between the rule-based system and VAE is smaller on AMR 3.0 (0.2 Smatch), possibly because the rules were developed for AMR 2.0.

Alignment Analysis
We analyzed the alignments induced by our full model and by the model which uses rule-based segmentation. The alignments were evaluated at the level of individual concepts: if a subgraph was aligned to a token, all its concepts were considered aligned to that token. The evaluation was done on 40 sentences. The alignment error rates were 12%, 15% and 14% for the full model, the greedy method and the rule-based method, respectively. This suggests that our method is able to induce relatively accurate alignments, and that joint induction of alignments with segmentation may be beneficial or, at the very least, not detrimental to alignment quality.
Ablations To reconfirm that it is important to learn the segmentation and alignment, rather than to sample them randomly, we perform further ablations. In our parameterization, discussed in Section 4.2.2, it is possible to set A^raw = 0 and/or S^raw = 0, which corresponds to sampling from the prior during training (i.e. quasi-uniformly, while respecting the constraints defined by masking) rather than learning them. We consider 4 options, from sampling everything uniformly to learning everything (as in our method). The results are summarized in Table 3. As expected, the full model performs best, demonstrating that it is important to learn both alignments and segmentation. Interestingly, both 'segmentation learned' and 'alignment learned' obtain reasonable performance, but the 'nothing learned' model fails badly.

Related Work

Some recent parsers (…, 2020), while not relying on graph recategorization rules, use a rule system to 'pack' and 'unpack' nodes. Strong results have also been obtained without any explicit segmentation and alignment, relying on sequence-to-sequence models (Xu et al., 2020b; Cai and Lam, 2020); still, the rules appear useful even with these strong models (Cai and Lam, 2020).

More generally, outside of AMR parsing, differentiable relaxations of latent structure representations have received attention in NLP (Kim et al., 2017; Liu and Lapata, 2018), including previous applications of the perturb-and-MAP framework (Corro and Titov, 2019). From the more general perspective of inducing a segmentation of a linguistic structure, our work is related to tree-substitution grammar induction (Sima'an et al., 1995; Cohn et al., 2010), the DOP paradigm (Bod et al., 2003) and unsupervised semantic parsing (Poon and Domingos, 2009; Titov and Klementiev, 2011), though the methods used in that previous work are very different from ours.

Conclusions
To eliminate the hand-crafted segmentation systems used in previous AMR parsers, we cast alignment and segmentation as generation-order induction. We propose to treat this generation order as a latent variable in a VAE framework. Our method outperforms a simple segmentation heuristic and approaches the performance of a method using rules designed to handle specific AMR constructions. Importantly, while the latent-variable modeling machinery is used in training, the parser is very simple at test time: it tags the input words with AMR concept nodes using autoregressive models and then predicts relations between the nodes independently of each other.
Vanilla sequence-to-sequence models are known to struggle with out-of-distribution generalization (Lake and Baroni, 2018; Bahdanau et al., 2019). In future work, it would be interesting to see if this holds for AMR and whether more constrained and structured methods such as ours can better deal with this challenging but realistic setting.

A Decoding
We need to model the identification of the root node of the AMR graph. We specify root identification as:

P_θ(root = i | x, v) ∝ exp(h_root^T h^node_i),

where h_root is a trainable vector. Inspired by Zhang et al. (2019a), who rely on AMR graphs being closely related to dependency trees, we first decode the AMR graph as a maximum spanning tree, with the log-probability of the most likely arc label as the edge weight. The reentrancy edges are added afterwards if their probability is larger than 0.5. We add at most 5 reentrancy edges, based on the empirical findings of Szubert et al. (2020).

B Concept Identification Detail
Now, we specify P_θ(v_i | h^node_i) and P_θ(∅ | h^tail_i) with a copy mechanism. Formally, we have a small set of candidate nodes V(x_i) for each token x_i, and a shared set of candidate nodes V^share, which contains v^copy. The candidate set, however, depends on the token, yet we are learning a latent alignment. During training, we therefore consider the union of candidate nodes from all possible tokens: V̄(v_i) = ∪_{j: v_i ∈ V(x_j)} V(x_j). We abuse notation slightly and denote the embedding of node i by v_i. At training time, for node v_i, the probability P_θ(v_i | h^node_i) is a softmax over the candidates in V̄(v_i) ∪ V^share, where the score S(v, h^node_i) of a candidate node given the hidden state is produced with a standard one-layer feedforward neural network NN, and [[…]] (the indicator function) restricts the candidates to the admissible set. To use pre-trained word embeddings (Pennington et al., 2014), the representation of v is decomposed into a primitive category embedding C(v)8 and a surface lemma embedding L(v). The score function is then a biaffine function of these embeddings and the hidden state. For the terminal nodes, P_θ(∅ | h^tail_i) is computed analogously. At test time, we perform greedy decoding to generate nodes from each token in parallel until either the terminal node is predicted or T nodes have been generated.

8 AMR nodes have a primitive category, including string, number, frame, concept and special nodes (e.g. polarity).

C Computing A^∞

The segmentation matrix S can be thought of as a Markov transition matrix that passes the alignment down along the generation order, but keeps the alignment mass at a node that will generate ∅. We truncate the transition at T = 4, as we do not expect a subgraph to contain more than 4 nodes.
To obtain A^∞, we observe that A^∞ should obey the following self-consistency equation:

A^∞ = A^∞ S_{:,:m} + A.   (23)

This means that node j is generated from token k iff either node j is directly generated from token k, or node i is generated from token k and node i generates node j. A^∞ can be computed by initializing A^∞ = A and repeating the assignment in Equation 23 for T = 4 steps. Intuitively, the A^∞ alignment is passed down along the generation order, while continually receiving alignment mass from the first-node alignment A. As a result, all nodes get assigned an alignment. As an alternative motivation, the above algorithmic assignment works as a truncated power-series expansion of the solution of the self-consistency equation, A^∞ = A [I − S_{:,:m}]^{−1}.
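The truncated propagation can be sketched directly in NumPy (our illustration, with T = 4 as in the text and A restricted to the m concept columns):

```python
import numpy as np

def alignment_closure(A, S, T=4):
    """Truncated iteration of Equation 23: A_inf <- A_inf @ S[:, :m] + A,
    starting from A_inf = A. Each step passes alignment mass one hop down
    the generation order while re-adding the direct (first-node) alignment."""
    m = S.shape[1] - 1
    A_c = A[:, :m].astype(float)
    A_inf = A_c.copy()
    for _ in range(T):
        A_inf = A_inf @ S[:, :m].astype(float) + A_c
    return A_inf
```

Run on the discrete Figure 2 matrices, this assigns 'opine-01' and 'thing' to the token 'opinion' and 'boy' to the token 'boy', i.e. every concept node receives exactly one token.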

D Ablation on Stochastic Softmax
Our full model uses the Straight-Through (ST) gradient estimator and the Free Bits trick with λ = 10 (Kingma et al., 2017).9 We analyze different variations of the stochastic softmax: (1) the soft stochastic softmax, i.e. the original one with the entropic regularizer (see Section 4.2); (2) the rounded stochastic softmax, which selects the highest-scored next node for each token;10 and (3) the structured ST version used in our full model. All these models use Free Bits (λ = 10), while for 'no free bits' λ = 0. As we can see in Table 4, there is a substantial gap between using structured ST and the two other versions. This illustrates the need for exposing the parsing model to discrete structures in training. Also, the Free Bits trick appears crucial, as it prevents (partial) posterior collapse in our model. We inspected the logits after training and observed that, without free bits, the learned W are very small, in the [−0.01, +0.01] range.

E Greedy Segmentation
We present a greedy segmentation strategy that serves as a deterministic baseline; it can be used in the same way as the rule-based segmentation by setting S^mask. Many nodes are aligned to tokens via the copy mechanism, and we force the unaligned nodes to join their neighbors. This is very similar to the forced alignment of unaligned nodes used in the transition-based parser of Naseem et al. (2019). We traverse the AMR graph in the same way as when producing the masking (Section 4.2.2). During the traversal, we greedily combine subgraphs until one of the constraints is violated: (1) the combined subgraph would have more than T = 4 nodes; (2) the combined subgraph would contain more than one copy-able node. We present the algorithm recursively (see Algorithm 1). The variable z_i indicates whether node i is copy-able, and T = 4 is the maximum subgraph size; n denotes the current subgraph size; z indicates whether the current subgraph contains a copy-able node; k is the last node in the current subgraph, which is used to generate future nodes of the subgraph. The condition n + n' ≤ T ∧ z + z' ≤ 1 determines whether we combine the current subgraph rooted at node i and the subgraph rooted at node j.

Input: graph G, node index i
Result: segmentation S, size n, copy indicator z, last node k
S ← ∅; n ← 1; z ← z_i; k ← i; mark i as visited;
for each child j of i do
    if j not visited then
        S', n', z', k' ← Greedy(G, j);
        S ← S + S';
        if n + n' ≤ T ∧ z + z' ≤ 1 then
            S_kj = 1; n ← n + n'; z ← z + z'; k ← k';
        end
    end
end
Algorithm 1: Greedy Segmentation

Running the algorithm on an AMR graph from the root index yields the entire segmentation. This greedy method does not require any expert knowledge about AMR, so it serves as a natural baseline.

10 Such rounding does not provide any guarantee of being a valid generation order, but serves as a baseline. In general, a threshold function (at 0.5) can be applied if the constraints have no structure.
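A minimal Python sketch of Algorithm 1, assuming the graph is given as an adjacency list and `copyable` is a per-node indicator (both data structures are our assumptions; the paper's actual implementation may differ):

```python
def greedy_segment(graph, copyable, i, visited=None, T=4):
    """Greedy(G, i): returns (S, n, z, k), where S is the set of merge
    edges (k, j) joining subgraphs, n the current subgraph size,
    z whether it contains a copy-able node, and k its last node."""
    if visited is None:
        visited = set()
    visited.add(i)
    S, n, z, k = set(), 1, int(copyable[i]), i
    for j in graph.get(i, ()):
        if j not in visited:
            S2, n2, z2, k2 = greedy_segment(graph, copyable, j, visited, T)
            S |= S2
            # combine only if the size and copy-ability constraints hold
            if n + n2 <= T and z + z2 <= 1:
                S.add((k, j))          # S_kj = 1: merge j's subgraph
                n, z, k = n + n2, z + z2, k2
    return S, n, z, k
```

Calling `greedy_segment(graph, copyable, root)` returns the merge edges defining the full segmentation.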

F Visualizing Generation Order
In Figures 6, 7, 8, 9 and 11, we visualize generation orders produced by different variants of the stochastic softmax. As we can see, the standard stochastic softmax indeed produces a soft latent structure, which may result in a large train/test gap. Furthermore, the rounding strategy does not satisfy the constraint that every concept node is generated from exactly one token or concept node (e.g., [poster-01] is generated twice, while [thing] is never generated). Meanwhile, the straight-through stochastic softmax produces a valid generation order; in Appendix J, we show this validity formally. It is worth noting that our learned generation order differs from the rule-based one. When producing the rule-based segmentation, '(t2 / thing :ARG0-of (e / express-01))' took precedence over '(t2 / thing :ARG2-of (p / poster-01))' due to the order over traversal edges. The learned model, however, figured out that the poster is the thing.
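The failure mode of independent rounding can be illustrated numerically. In this toy example (scores are ours, purely illustrative), rows are possible generators and columns are concept nodes; per-row argmax rounding lets two generators pick the same node, so one node is generated twice and another never:

```python
import numpy as np

# rows = possible generators, columns = concept nodes to be generated;
# rounding picks the argmax in each row independently
W = np.array([[0.9, 0.1],
              [0.8, 0.2]])
picks = W.argmax(axis=1)
# both generators pick node 0: it is "generated" twice, node 1 never,
# violating the one-parent constraint on the generation order
```

A structured argmax over valid generation orders would instead assign each column exactly one generator.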

H Pre- and Post-processing
We follow Lyu and Titov (2018) for pre- and post-processing. We use CoreNLP (Manning et al., 2014) for tokenization and lemmatization. The copy-able dictionary is built with rules based on string matching between lemmas and concept node strings, as in Lyu and Titov (2018). For post-processing, wiki tags are added after the named entities are produced in the graph, via a look-up table built from the training set or provided by CoreNLP. As a heuristic for co-reference resolution, we also collapse nodes that represent the same pronoun.
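To make the string-matching step concrete, here is a hypothetical sketch of building a copy-able dictionary from (lemma, concept) pairs. The matching heuristic shown (lemma equals the concept name with any sense suffix stripped) is our assumption for illustration, not the exact rule set of Lyu and Titov (2018):

```python
def build_copy_dict(pairs):
    """Map each lemma to concepts whose name matches it once the
    trailing sense suffix is stripped, e.g. ('opine', 'opine-01').
    `pairs` is an iterable of (lemma, concept) from the training set."""
    copy_dict = {}
    for lemma, concept in pairs:
        base = concept.rsplit('-', 1)[0] if '-' in concept else concept
        if lemma.lower() == base.lower():
            copy_dict.setdefault(lemma, set()).add(concept)
    return copy_dict
```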

I Proof of Proposition 1
We prove Proposition 1 based on the Bregman method (Bregman, 1967). Bregman's method solves convex optimization under a set of linear equality constraints:

min_x F(x)  s.t.  Ax = b,

where F is strongly convex and continuously differentiable. Note that A here is not our alignment, but a matrix that encodes the constraints. Two important ingredients are the Bregman divergence

D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩,

and the Bregman projection

P_{ω,F}(y) = argmin_{x ∈ ω} D_F(x, y),

where ω is a constraint set. Bregman's method then proceeds as follows:

pick y_0 ∈ {y ∈ Ω | ∇F(y) = uA, u ∈ R^m};
repeat: for each constraint set ω, set y_{t+1} = P_{ω,F}(y_t);
Algorithm 2: Bregman's method for solving convex optimization over linear constraints

Intuitively, Bregman's method iteratively performs alternating projections with respect to each constraint. After each projection, the objective F is lowered by the construction of the Bregman projection. Such alternating projections eventually converge and, with careful initialization, solve the optimization problem.
Proof of Proposition 1. We show Proposition 1 by showing that the algorithm defined by Equations 13, 14, 15 and 16 implements Bregman's method; Proposition 1 then follows from Theorem 1. We now instantiate Bregman's method for our optimization problem 11. For simplicity, we focus on the linear-algebraic structure and do not strictly follow standard matrix notation. The variable is O, and picking the initial point with ∇F(O) = uA corresponds to the initialization step in our Equation 13. Then, we iterate through the constraints and perform Bregman projections. Consider first the column normalization constraints ∀j < m, Σ_{i=0}^{n+m−1} O_ij = 1. Since D_F(x, y) ≥ 0 and D_F(x, y) = 0 ⟺ x = y, variables not involved in the projected constraint are kept unchanged. To simplify notation, we extend the domain of F to parts of the variable, e.g., F(O_{:,j}) = Σ_i f_ij(O_ij). Focusing on column j, the Bregman projection for our entropy-based F is

argmin_{x : Σ_i x_i = 1} D_F(x, O_{:,j}) = Softmax(log O_{:,j}),  (30)

i.e., a renormalization of the column. Since these constraints are mutually non-overlapping and the non-focused variables are kept unchanged while iterating over them, projecting onto them sequentially is equivalent to computing all column projections in parallel, which is exactly our column normalization step 14. Similarly, we can derive the row normalization step 16. Therefore, our algorithm is an implementation of Bregman's method, and Proposition 1 follows from Theorem 1.
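The resulting procedure can be sketched in a few lines of NumPy. This is a simplification of the paper's algorithm, under two stated assumptions: all columns and rows of a square matrix are normalized (in the paper, only columns j < m carry a constraint), and F is entropy-based, so each Bregman projection is a plain renormalization, i.e. Softmax(log O) over the constrained slice:

```python
import numpy as np

def alternating_projections(W, n_iters=200):
    """Bregman's method for an entropy-based F: initialize so that
    grad F(O) = W (i.e. O = exp(W)), then alternately project onto
    the column and row sum-to-one constraints by renormalizing."""
    O = np.exp(W)
    for _ in range(n_iters):
        O = O / O.sum(axis=0, keepdims=True)  # column normalization
        O = O / O.sum(axis=1, keepdims=True)  # row normalization
    return O
```

For a square positive matrix this is the classical Sinkhorn iteration, converging to a doubly stochastic matrix.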

J Generation Order is Discrete by LP
If O(W, 0) is integral-valued, it belongs to O by definition. In general, there is no guarantee that a linear program over the relaxed space yields an integral solution. In our case, however, we have the following result, which is intuitively a generalization of a classical result about perfect matchings in bipartite graphs (Conforti et al., 2014). To prove it, we need the following theorems from integer linear programming.

Theorem 2 (Conforti et al. 2014, pages 130, 133). Let A be a q × p integral matrix. For all integral vectors d, l, u and all c ∈ R^p, max{⟨c, x⟩ : Ax = d, l ≤ x ≤ u} is attained by an integral vector x if and only if A is totally unimodular.

Note that this theorem claims neither that every solution is integral, nor that the solution is unique; one should understand this limitation as covering degenerate choices of c. However, a totally unimodular matrix does characterize the convex hull of its integral points. To prove this, we need an additional lemma.
Lemma 1 (Conforti et al. 2014, page 21). Let S ⊆ R^n and c ∈ R^n. Then sup{⟨c, s⟩ : s ∈ S} = sup{⟨c, s⟩ : s ∈ Conv(S)}. Furthermore, the supremum of ⟨c, s⟩ is attained over S if and only if it is attained over Conv(S),
where Conv(S) is the convex hull of S. Now we have the following proposition:

Proposition 3. Let A be a q × p integral matrix. For all integral vectors d, l, u and all c ∈ R^p such that {x ∈ {0,1}^p | Ax = d, l ≤ x ≤ u} is a finite set, {x ∈ R^p | Ax = d, l ≤ x ≤ u} = Conv({x ∈ {0,1}^p | Ax = d, l ≤ x ≤ u}) if and only if A is totally unimodular.
In other words, we know the LP relaxation is the convex hull.
Proof. By Theorem 2, A being totally unimodular is equivalent to the maximum being attained by an integral solution. Clearly, the LP relaxation contains the convex hull, so we only need to show that the LP relaxation does not contain any additional points. Suppose the LP relaxation contains a point x′ that is not in the convex hull. Since we restrict our discussion to a finite set of integral points, both {x′} and the convex hull are closed sets. By the separation theorem, there exists a vector c such that ⟨c, x′⟩ > sup{⟨c, s⟩ : s ∈ Conv({x ∈ {0,1}^p | Ax = d, l ≤ x ≤ u})}. By Lemma 1, the maximum of ⟨c, x⟩ over the LP relaxation is then not attained by any integral vector, contradicting Theorem 2.
Theorem 3 (Conforti et al. 2014, pages 133, 134). A 0, ±1 matrix A with at most two nonzero elements in each column is totally unimodular if and only if the rows of A can be partitioned into two sets, red and blue, such that the sum of the red rows minus the sum of the blue rows is a vector whose entries are all 0 or ±1 (i.e., A admits a row-bicoloring).
Our O plays the role of the column vector x, and the constraints are represented by a matrix A. In particular, we view O as a column vector but still access its entries as O_ij. (Alternatively, one could use a vector x with x_{i(m+1)+j} = O_ij, but this gets clumsy.) The matrix is A ∈ {0, ±1}^{(m+(m+n)) × ((n+m)(m+1))}, where A_{:,ij} denotes the constraints involving O_ij. The first m rows of A correspond to ∀j < m, Σ_{i=0}^{n+m−1} O_ij = 1, and the remaining m + n rows correspond to ∀i, Σ_{j=0}^{m} O_ij = 1. Therefore, A_{k,ij} = δ_{j,k} for all k < m, j < m and all i; A_{k,ij} = δ_{i,k−m} for all k ≥ m and all i, j; and A_{k,ij} = 0 otherwise, where δ_{j,k} = [[j == k]]. The linear constraints in standard form are AO = 1.
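As a quick sanity check of this construction (toy sizes n = 2, m = 3; the code and variable names are ours), we can build A explicitly and verify that each variable O_ij participates in at most two constraints and that the sum of the first m rows minus the sum of the remaining rows has entries in {0, −1}:

```python
import numpy as np

n, m = 2, 3  # toy sizes, purely illustrative
n_rows, n_cols = m + (n + m), (n + m) * (m + 1)
A = np.zeros((n_rows, n_cols))

def col(i, j):
    """Flatten the index of O_ij into a column index of A."""
    return i * (m + 1) + j

for k in range(m):                 # first m rows: column constraints (j < m)
    for i in range(n + m):
        A[k, col(i, k)] = 1
for k in range(m, n_rows):         # remaining n + m rows: row constraints
    i = k - m
    for j in range(m + 1):
        A[k, col(i, j)] = 1

nonzeros = (A != 0).sum(axis=0)                 # constraints per variable
diff = A[:m].sum(axis=0) - A[m:].sum(axis=0)    # red minus blue row sums
```

Variables with j < m touch one column and one row constraint (two nonzeros); variables with j = m touch only a row constraint (one nonzero), so the difference is 0 for j < m and −1 for j = m.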
Lemma 2. The A defined above is totally unimodular.
Proof. First, we show that A admits a row-bicoloring. We color the first m rows red and the remaining n + m rows blue. The sum of the red rows is R_ij = Σ_{k=0}^{m−1} A_{k,ij} = Σ_{k=0}^{m−1} δ_{j,k} = [[j < m]], and the sum of the blue rows is B_ij = Σ_{k=m}^{2m+n−1} A_{k,ij} = Σ_{k=m}^{2m+n−1} δ_{i,k−m} = 1. Therefore, R_ij − B_ij = −[[j == m]] ∈ {0, ±1}, and A admits a row-bicoloring. Since A has only 0, ±1 entries, and each variable in O participates in at most two constraints (incoming and outgoing), A is totally unimodular by Theorem 3. Now we prove Proposition 2.
Proof. A is totally unimodular, and we have c = W, l = 0, u = 1; by Theorem 2, the LP solutions contain an integral vector. Since the Gumbel distribution has a positive and differentiable density, by (Paulus et al., 2020, Proposition 3), argmax_{O ∈ O} ⟨W, O⟩ yields a unique solution with probability 1. Clearly, this solution is the only integral solution among the LP solutions. Now, suppose a non-integral solution also existed. By Proposition 3, the LP feasible region is the convex hull of the integral points, so a non-integral optimum would be a convex combination of integral points that are themselves optimal; hence another integral optimal solution would exist, contradicting the uniqueness of the integral solution. Therefore, O(W, 0) yields a unique integral solution with probability 1.
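For very small instances, total unimodularity can also be checked by brute force: every square submatrix must have determinant in {−1, 0, 1}. The sketch below (our own utility, feasible only for tiny matrices since the cost grows combinatorially) checks this condition up to a given submatrix order:

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular_bruteforce(A, max_order=3):
    """Check that every square submatrix of A up to size max_order has
    determinant in {-1, 0, 1}. This is a necessary condition; a full
    TU check would need to cover all orders up to min(A.shape)."""
    q, p = A.shape
    for k in range(1, max_order + 1):
        for r in combinations(range(q), k):
            for c in combinations(range(p), k):
                d = round(np.linalg.det(A[np.ix_(r, c)]))
                if d not in (-1, 0, 1):
                    return False
    return True
```

For instance, the constraint matrix of the n = 1, m = 1 case passes the check, while a 3 × 3 odd-cycle incidence matrix (determinant ±2) fails it.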