Learning compositional structures for semantic graph parsing

AM dependency parsing is a method for neural semantic graph parsing that exploits the principle of compositionality. While AM dependency parsers have been shown to be fast and accurate across several graphbanks, they require explicit annotations of the compositional tree structures for training. In the past, these were obtained using complex graphbank-specific heuristics written by experts. Here we show how they can instead be trained directly on the graphs with a neural latent-variable model, drastically reducing the amount and complexity of manual heuristics. We demonstrate that our model picks up on several linguistic phenomena on its own and achieves comparable accuracy to supervised training, greatly facilitating the use of AM dependency parsing for new sembanks.


Introduction
It is generally accepted in linguistic semantics that meaning is compositional, i.e. that the meaning representation for a sentence can be computed by evaluating a tree bottom-up. A compositional parsing model not only reflects this insight, but has practical advantages such as in compositional generalisation (e.g. Herzig and Berant 2020), i.e. systematically generalizing from limited data.
However, in developing a compositional semantic parser, one faces the task of figuring out what exactly the compositional structures, i.e. the trees that link the sentence and the meaning representation, should look like. This is challenging even for expert linguists; for instance, Copestake et al. (2001) report that 90% of the development time of the English Resource Grammar (Copestake and Flickinger, 2000) went into the development of the syntax-semantics interface.
Compositional semantic parsers which are learned from data face an analogous problem: to train such a parser, the compositional structures must be made explicit. However, these structures are not annotated in most sembanks. For instance, the AM (Apply-Modify) dependency parser of Groschwitz et al. (2018) uses a neural model to predict AM dependency trees, compositional structures that evaluate to semantic graphs. Their parser achieves high accuracy and parsing speed (Lindemann et al., 2020) across a variety of English semantic graphbanks. To obtain an AM dependency tree for each graph in the corpus, they use hand-written graphbank-specific heuristics. These heuristics cost significant time and expert knowledge to create, limiting the ability of the AM parser to scale to new sembanks.
In this paper, we drastically reduce the need for hand-written heuristics for training the AM dependency parser. We first present a graphbank-independent method to compactly represent the relevant compositional structures of a graph in a tree automaton. We then train a neural AM dependency parser directly on these tree automata. Our code is available at github.com/coli-saar/am-parser.
We evaluate the consistency and usefulness of the learned compositional structures in two ways. We first evaluate the accuracy of the trained AM dependency parsers, across four graphbanks, and find that it is on par with an AM dependency parser that was trained on the hand-designed compositional structures of . We then analyze the compositional structures which our algorithm produced, and find that they are linguistically consistent and meaningful. We expect that our methods will facilitate the design of compositional models of semantics in the future.

[Figure 1: AM dep-trees and graphs for the fairy that begins to glow. (a) AM dep-tree with word alignments. The dashed lines connect tokens to their graph constants, and arrows point from heads to arguments, labeled by the operation that puts the graphs together. We usually write our example AM dep-trees without alignments as in (b). We include node names where helpful, as in (c), where e.g. b is labeled begin.]

Peng et al. (2015) and Chen et al. (2018) use CCG and HRG based grammars to parse AMR and EDS (Flickinger et al., 2017). They use a combination of heuristics, hand-annotated compositional structures and sampling to obtain training data for their parsers, in contrast to our joint neural technique. None of these approaches use slot names that carry meaning; to the best of our knowledge this work is the first to learn them from data. Fancellu et al. (2019) use DAG grammars for compositional parsing of Discourse Representation Structures (DRS). Their algorithm for extracting the compositional structure of a graph is deterministic and graphbank-independent, but comes at a cost: for example, rules for heads require different versions depending on how often the head is modified, reducing the reusability of the rule.
Maillard et al. (2019) and Havrylov et al. (2019) learn compositional, continuous-space neural sentence encodings using latent tree structures. Their tasks are different: they learn to predict continuous-space embeddings; we learn to predict symbolic compositional structures. Similar observations hold for self-attention (Vaswani et al., 2017; Kitaev and Klein, 2018).

AM dependency parsing
Compositional semantic graph parsing methods do not predict a graph directly, but rather predict a compositional structure which in turn determines the graph. Groschwitz et al. (2018) represent the compositional structure of a graph with AM dependency trees (AM dep-trees for short) like the one in Fig. 1a. It describes the way the meanings of the words (the graph fragments in Fig. 2) combine to form the semantic graph in Fig. 1c, here an AMR (Banarescu et al., 2013). The AM dep-tree edges are labeled with graph-combining operations, taken from the Apply-Modify (AM) algebra (Groschwitz et al., 2017; Groschwitz, 2019).
Graphs are built out of fragments called graph constants (Fig. 2). Each graph constant has a root, marked with a rectangular outline, and may have special node markers called sources (Courcelle and Engelfriet, 2012), drawn in red, which mark the empty slots where other graphs will be inserted.
In Fig. 1a, the APP_O operation plugs the root of G-glow into the O source of G-begin. Because G-begin and G-glow both have an S-source, APP_O merges these nodes, creating a reentrancy, i.e. an undirected cycle, and yielding Fig. 1d, which is in turn attached at S to the root of G-fairy by MOD_S. APP fills a source of a head with an argument while MOD uses a source of a modifier to connect it to a head; both operations keep the root of the head.
Types. The [S] annotation at the O-source of G-begin in Fig. 2 is a request as to what the type of the O argument of G-begin should be. The type of a graph is the set of its sources with their request annotations, so the request [S] means that the source set of the argument must be {S}. Because this is true of G-glow, the AM dependency tree is well-typed; otherwise the tree could not be evaluated to a graph. Thus, the graph constants lexically specify the semantic valency of each word as well as reentrancies due to e.g. control.
If a graph has no sources, we say it has the empty type [ ]; if a source in a graph printed here has no annotation, it is assumed to have the empty request (i.e. its argument must have no sources).
The parser of Groschwitz et al. (2018) predicts scores for graph constants and edges respectively. Computing the highest scoring well-typed AM dep-tree is NP-hard; we use their fixed-tree approximate decoder here.

Decomposition algorithm
The central challenge of compositional methods lies in the fact that the compositional structures are not provided in the graphbanks. Existing AM parsers (Groschwitz et al., 2018; Lindemann et al., 2020) use hand-built heuristics to extract AM dep-trees for supervised training from the graphs in the graphbank. These heuristics require extensive expert work, including graphbank-specific decisions for source allocations and graphbank- and phenomenon-specific patterns to extract type requests for reentrancies. In this section we present a simpler yet more complete method for obtaining the basic structure of an AM dep-tree for a given semantic graph G (for decomposing the graph), with much reduced reliance on heuristics. We will learn meaningful source names jointly with training the parser in §5 and §6.
Notation. We treat graphs as a quadruple $G = \langle N_G, r_G, E_G, L_G \rangle$, where the nodes $N_G$ are arbitrary objects (in the examples here we use lowercase letters), $r_G \in N_G$ is the root, $E_G \subseteq N_G \times N_G$ is a set of directed edges, and $L_G$ is the labelling function for the nodes and edges. For example in Fig. 3a, the node g is labeled "glow". The node identities are not relevant for graph identity or evaluation measures, but allow us to refer to specific nodes during decomposition. We formalize AM dep-trees as similar quadruples. Note that our example graphs are all AMRs, but our algorithms apply unchanged to all graphbanks.

Basic transformation to AM dep-trees
Let us first consider the case where the semantic graph G has no reentrancies, like in Fig. 3a. The first step in obtaining the AM dep-tree for G is to obtain the basic shape of the constants. We let each graph constant contain exactly one labeled node. Each edge belongs to the constant of exactly one node. The edges in the constant of a node are called its blob (Groschwitz et al., 2017); the blobs partition the edge set of the graph. For example, the blobs of the AMR in Fig. 3a are g plus the 'ARG0' edge, t plus the 'mod' edge, and f . We normalise edges so that they point away from the node to whose blob they belong, like in Fig. 3b, where the 'mod' edge is reversed and grouped with the node t to match P-tiny in Fig. 4. We add an -of suffix to the label of reversed edges. From here on, we assume all graph edges to be normalised this way.
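The normalisation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalise_edges` and the `owner` callback (which stands in for the graphbank-specific blob heuristics) are hypothetical, and the toy rule "a 'mod' edge belongs to its target" is chosen merely to reproduce the example above.

```python
from typing import Callable, List, Tuple

Edge = Tuple[str, str, str]  # (source node, target node, edge label)


def normalise_edges(edges: List[Edge], owner: Callable[[Edge], str]) -> List[Edge]:
    """Make every edge point away from the node whose blob it belongs to."""
    normalised = []
    for src, tgt, label in edges:
        blob_node = owner((src, tgt, label))
        if blob_node == src:
            # Edge already points away from its blob node.
            normalised.append((src, tgt, label))
        elif blob_node == tgt:
            # Reverse the edge and mark the reversal with an -of suffix.
            normalised.append((tgt, src, label + "-of"))
        else:
            raise ValueError(f"blob node {blob_node!r} is not an endpoint of the edge")
    return normalised


def toy_owner(edge: Edge) -> str:
    """Hypothetical blob heuristic: 'mod' edges belong to the modifier (target)."""
    src, tgt, label = edge
    return tgt if label == "mod" else src


# Toy graph: g (glow) -ARG0-> f (fairy), f -mod-> t (tiny).
edges = [("g", "f", "ARG0"), ("f", "t", "mod")]
result = normalise_edges(edges, toy_owner)
print(result)
```

After this step, the blob of each node is simply its set of outgoing edges, which is what the later steps rely on.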
Heuristics for this partition of edges into blobs are simple yet effective. Thus, this is the only part of this method where we still rely on graphbank-specific heuristics. (We use the same blob heuristics as  in our experiments).
Once the decision of which edge goes in which blob is made, we obtain canonical constants, which are single node constants using placeholder source names and the empty request at every source; see e.g. P-glow in Fig. 4 (P for 'placeholder'). Placeholder source names are graph-specific source names: for a given argument slot in a constant, let n be the node that eventually fills it in G; we write n for the placeholder source in that slot. For example in the AM dep-tree in Fig. 3c the source f in P-glow (Fig. 4) gets filled by node f in the AMR in Fig. 3b. These placeholder sources are unique within the graph, allowing us to track source names through the AM dep-tree. When we restrict ourselves to the canonical constants, in a setting without reentrancies, the compositional structure is fully determined by the structure of the graph:
Lemma 4.1. For a graph G without reentrancies, given a partition of G into blobs, there is exactly one AM dep-tree $C_G$ with canonical constants that evaluates to G.
We call this AM dep-tree the canonical AM tree $C_G$ of G. Fig. 3c shows the canonical AM tree for the graph in Fig. 3b, using the canonical constants in Fig. 4. The canonical AM tree uses the same nodes and root as G, and essentially the same edges, but all edges point away from the root, forming a tree. Each node is labeled with its canonical constant. Each edge $n \to m \in E_C$ is labeled APP_m if the corresponding edge in the graph has the same direction, and is labeled MOD_n if there is instead an edge $m \to n$ in G.
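As a rough illustration of the edge-labelling rule for the canonical AM tree, the following Python sketch walks a reentrancy-free graph outwards from its root. The function name `canonical_am_tree` and the plain string labels such as `"APP_f"` are assumptions made for this example; graph constants are omitted and only the tree edges are produced.

```python
from collections import deque
from typing import Dict, List, Tuple


def canonical_am_tree(root: str, edges: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return tree edges (head, operation, dependent) for a reentrancy-free graph.

    `edges` are normalised graph edges (source, target). A tree edge n -> m is
    labelled APP_m if the graph edge also runs n -> m, and MOD_n if the graph
    edge runs m -> n.
    """
    neighbours: Dict[str, List[Tuple[str, bool]]] = {}
    for src, tgt in edges:
        neighbours.setdefault(src, []).append((tgt, True))   # followed forwards
        neighbours.setdefault(tgt, []).append((src, False))  # followed backwards

    tree = []
    visited = {root}
    queue = deque([root])
    while queue:
        n = queue.popleft()
        for m, same_direction in neighbours.get(n, []):
            if m in visited:
                continue  # this is the edge we arrived by
            visited.add(m)
            label = f"APP_{m}" if same_direction else f"MOD_{n}"
            tree.append((n, label, m))
            queue.append(m)
    return tree


# Normalised toy graph: g -> f (ARG0) and t -> f (mod-of), rooted at g.
tree = canonical_am_tree("g", [("g", "f"), ("t", "f")])
print(tree)
```

The placeholder source in each operation label is just the name of the node that fills the slot, which is why the labels can be read directly off the graph.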

Reentrancies and types
Finding AM dep-trees for graphs with reentrancies, like in Fig. 6a, is more challenging. To solve the problem in its generality, we first unroll the graph as in Fig. 6b, introducing REF-nodes for nodes that are encountered a second time. We then obtain a canonical AM-tree $C_U$ for the unrolled graph U as in §4.1 (see Fig. 6c), but REF-n nodes fill n-sources; e.g. x has an incoming APP_f edge here. $C_U$ evaluates to U, not to G; we obtain an AM dep-tree that evaluates to G through a process called resolving the reentrancies, which removes all REF-nodes and instead expresses the reentrancies with the AM type system.
Fig. 6e shows the result T of applying this resolution process to $C_U$ in Fig. 6c. In T, the s and g sources of the graph P-and (see Fig. 5) each have a request [f] that signals that the f sources of P-sparkle and P-glow are still open when these graphs combine with P-and, yielding the partial result in Fig. 6d. Since identical sources merge in the AM algebra, Fig. 6d has a single f-source slot. Into this slot, P-fairy is inserted to yield the original graph G in Fig. 6a, and we have obtained the reentrancy without using a REF-node. f is now a child of a in T; we call a the resolution target of f, RT(f). In general the resolution target of a node n is the lowest common ancestor of n and all nodes labeled REF-n.
Thus, to resolve the graph, we (a) add the necessary type requests to account for sources remaining open until they are merged at the resolution target and (b) make each node a dependent of its resolution target and remove all REF-nodes. Algorithm 1 describes this procedure. It uses the idea of an n-resolution path, which is a path between a node n or a REF-n node and its resolution target. In Fig. 6c, there are two f-resolution paths: one in blue between f and its resolution target a, and one in green between the REF-f node x and its resolution target a.
[Algorithm 1: Reentrancy resolution (fragment). 4: Pick a y ∈ R s.t. there is no x ∈ R, x ≠ y, with y on an x-resolution path; 5: for p ∈ y-resolution paths: ...]
Further, τ(n) is the type of the graph constant in T for a node n and β(n) is the type of the result of evaluating the subtree below n in T.
In the example, Algorithm 1 iterates over all edges in both resolution paths (Line 6; the order of these iterations does not impact the result). For the two bottom edges, the subtree rooted at f evaluates to a constant with empty type, so no actual changes are made here (β(y) can be non-trivial from resolution paths handled previously). For the two upper edges $a \xrightarrow{\text{APP}_s} s$ and $a \xrightarrow{\text{APP}_g} g$, Line 10 applies, adding f to the requests at s and g in the constant at a. In Line 11, f gets moved up to become a child of its resolution target a and in Line 12 the REF-f node x gets removed, yielding T in Fig. 6e. Algorithm 1 is correct in the following precise sense:
Theorem 1. Let G be a graph, let U be an unrolling of G, let $C_U$ be the canonical AM-tree of U, and let T be the result of applying Algorithm 1 to $C_U$. Then T is a well-typed AM dep-tree that evaluates to G iff for all $y \in N_G$ and all y-resolution paths p in $C_U$,
1. the bottom-most edge $n \to m$ of p (i.e. m is y or labeled REF-y) does not have a MOD label, and
2. if $n \xrightarrow{\text{MOD}} m \in p$ with $n, m \neq y$, then there is a directed path in G from n to y.
Condition (1) captures the fact that moving MOD edges in the graph changes the evaluation result (the modifier would attach at a different node) and Condition (2) the fact that modifiers are not allowed to add sources to the type of the head they modify.
Algorithm 1 does not yield all possible AM dep-trees; in Appendix B, we present an algorithm that yields all possible AM dep-trees (with placeholder sources) for a graph. However, we find in practice that Algorithm 1 almost always finds the best linguistic analysis; i.e. reasons to deviate from Algorithm 1 are rare (we estimate that this affects about 1% of nodes and edges in the AM dep-tree). We leave handling these rare cases to future work.

Unrolling the graph
To obtain an unrolled graph U , we use Algorithm 2. The idea is to simply expand G through breadth-first search, creating REF-nodes when we encounter a node a second time. We use separate queues F and B for forward and backward traversal of edges, allowing us to avoid traversing edges backwards wherever possible, since that would yield MOD edges in the canonical AM-tree C U , which can be problematic for the conditions of Theorem 1. And indeed, we can show that whenever there is an unrolled graph U satisfying the conditions of Theorem 1, Algorithm 2 returns one.
[Algorithm 2: Unrolling (fragment). Input: Graph G. 1: F, B ← empty FIFO queues; 2: U ← empty graph; 3: add $r_G$ to U, add outgoing edges of $r_G$ to F and incoming edges of $r_G$ to B; 4: while F ∪ B ≠ ∅: ...]
Algorithm 2 does not specify the order in which the incident edges of each node n are added to the queues, leaving an element of choice. However, we find that nearly all of these choices are unified later in the resolution process; meaningful choices are rare. For example in Fig. 6b, f and x may be switched, but Algorithm 1 always yields the AM dep-tree in Fig. 6e. In practice, we execute Algorithm 2 with arbitrary queueing order, and follow it with Algorithm 1. The AM dep-tree we obtain is guaranteed to be a decomposition of the original graph whenever one exists:
Theorem 2. Let G be a graph partitioned into blobs. If there is a well-typed AM dep-tree T, using that blob partition, that evaluates to G, then Algorithm 2 (with any queueing order) and Algorithm 1 yield such a tree.
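The unrolling idea can be sketched as follows. The full body of Algorithm 2 is not reproduced in this text, so the loop below is one possible reading of the prose description, not the authors' exact procedure: it prefers the forward queue, falls back to the backward queue only when the forward queue is empty, and creates a REF-node whenever an edge leads to a node that is already in U. The function name `unroll`, the REF-node names and the edge labels of the toy graph are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, target node, edge label)


def unroll(root: str, edges: List[Edge]) -> Tuple[List[Edge], Dict[str, str]]:
    """Unroll a graph breadth-first; return (edges of U, REF-node -> original node)."""
    out_edges: Dict[str, List[Edge]] = {}
    in_edges: Dict[str, List[Edge]] = {}
    for edge in edges:
        out_edges.setdefault(edge[0], []).append(edge)
        in_edges.setdefault(edge[1], []).append(edge)

    forward, backward = deque(), deque()  # the queues F and B
    visited, handled = {root}, set()
    unrolled: List[Edge] = []
    ref_of: Dict[str, str] = {}

    def enqueue(node: str) -> None:
        forward.extend(out_edges.get(node, []))
        backward.extend(in_edges.get(node, []))

    enqueue(root)
    while forward or backward:
        is_forward = bool(forward)  # traverse backwards only if we have to
        edge = forward.popleft() if is_forward else backward.popleft()
        if edge in handled:
            continue
        handled.add(edge)
        src, tgt, label = edge
        new = tgt if is_forward else src
        if new in visited:
            # Node encountered a second time: stand in a fresh REF-node.
            ref = f"REF{len(ref_of) + 1}"
            ref_of[ref] = new
            new = ref
        else:
            visited.add(new)
            enqueue(new)
        unrolled.append((src, new, label) if is_forward else (new, tgt, label))
    return unrolled, ref_of


# Toy graph with one reentrancy: both s and g point to f.
edges = [("a", "s", "op1"), ("a", "g", "op2"), ("s", "f", "ARG0"), ("g", "f", "ARG0")]
unrolled, ref_of = unroll("a", edges)
print(unrolled, ref_of)
```

In this toy run no edge is traversed backwards, so the canonical AM-tree of the unrolled graph would contain only APP edges.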

Tree automata for source names
We have now seen how, for any graph G, we obtain a unique AM dependency tree T. This tree represents the compositional structure of G, but it still contains placeholder source names. We will now show how to automatically choose source names. These names should be consistent across the trees for different sentences; this yields reusable graph constants, which capture linguistic generalizations and permit more accurate parsing. But the source names must also remain consistent within each tree to ensure that the tree still evaluates correctly to G; for instance, if we replace the placeholder source f in P-glow in Fig. 6e by O, but we replace f in P-and by S, then the AM dep-tree would not be well-typed because the request is not satisfied. We therefore proceed in two steps. In this section, we represent all internally consistent source assignments compactly with a tree automaton. In §6, we then learn to select globally reusable source names jointly with training the neural parser.
Tree automata. A (bottom-up) tree automaton (Comon et al., 2007) is a device for compactly describing a language (set) of trees. It processes a tree bottom-up, starting at the leaves, and nondeterministically assigns states from a finite set to the nodes. A rule in a tree automaton has the general shape $f(q_1, \ldots, q_n) \to q$. If the automaton can assign the states $q_1, \ldots, q_n$ to the children of a node π with node label f, this rule allows it to assign the state q to π. The automaton accepts a tree if it can assign a final state to the root node. Tree automata can be seen as a generalisation of parse charts.
General construction. Given an AM dependency tree T with placeholders, we construct a tree automaton that accepts all well-typed variants of T with consistent source assignments. More specifically, let S be a finite set of reusable source names; we will use S = {S, O, M} here, evoking subject, object, and modifier. The automaton will keep track of source name assignments, i.e. of partial functions φ from placeholder source names into S. Its rules will ensure that the functions φ assign source names consistently.
We start by binarizing T into a binary tree B, whose leaves are the graph constants in T and whose internal nodes correspond to the edges of T; the binarized tree for the dependency tree in Fig. 7a is shown in Fig. 7b. We then construct a tree automaton $A_B$ that accepts binarized trees which are isomorphic to B, but whose node labels have been replaced by graph constants and operations with reusable source names. The states of $A_B$ are of the form $\langle \pi, \phi \rangle$, where φ is a source name assignment and π is the address of a node in B. Node addresses $\pi \in \mathbb{N}^*$ are defined recursively: the root has the empty address $\epsilon$, and the i-th child of a node at address π has address πi. The final states are all states with $\pi = \epsilon$, indicating that we have reached the root.
Rules. The automaton $A_B$ has two kinds of rules. Leaf rules choose injective source name assignments for constants; there is one rule for every possible assignment at each constant. That is, for every graph constant H at an address π in B, the automaton $A_B$ contains all rules of the form $G \to \langle \pi, \phi \rangle$, where φ is an injective map from the placeholder sources in H to S, and G is the graph constant identical to H except that each placeholder source s in H has been replaced by φ(s).
For example, the automaton for Fig. 7b contains a rule of this form for the constant P-begin. Note that this rule uses the node label G-begin with the reusable source names, not the graph constant P-begin in B with the placeholders.
In addition, operation rules percolate source assignments from children to parents. Let APP_x for some placeholder source x be the operation at address π in B. Then $A_B$ contains all rules of the form $\text{APP}_{\phi_1(x)}(\langle \pi 0, \phi_1 \rangle, \langle \pi 1, \phi_2 \rangle) \to \langle \pi, \phi_1 \rangle$ as long as $\phi_1$ and $\phi_2$ are identical where their domains overlap, i.e. they assign consistent source names to the placeholders. The rule passes $\phi_1$ on to its parent. The assignments in $\phi_2$ are either redundant, because of overlap with $\phi_1$, or they are no longer relevant because they were filled by operations further below in the tree. The MOD case works out similarly.
In the example, $A_B$ contains such a rule for the assignments $\phi_b$ and $\phi_g$ because they agree on f. A complete accepting run of the automaton is shown in Fig. 7c.
The automaton $A_B$ thus constructed accepts the binarizations of all well-typed AM dependency trees with sources in S that match T.
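To make the two kinds of rules concrete, here is a small Python sketch of how the leaf assignments and the consistency check of the APP rules could be enumerated. It is an illustration only: the function names are hypothetical, assignments are plain dictionaries, node addresses are left out, and the placeholder sets used in the example ({f, g} for one constant, {f} for the other) are assumptions rather than a transcription of Fig. 7.

```python
from itertools import permutations
from typing import Dict, Iterable, List, Optional, Tuple

SOURCES = ("S", "O", "M")  # the reusable source names
Assignment = Dict[str, str]  # placeholder source -> reusable source name


def leaf_assignments(placeholders: Iterable[str], sources=SOURCES) -> List[Assignment]:
    """All injective maps from a constant's placeholder sources to `sources`."""
    ordered = sorted(placeholders)
    return [dict(zip(ordered, names)) for names in permutations(sources, len(ordered))]


def compatible(phi1: Assignment, phi2: Assignment) -> bool:
    """True if the two assignments agree wherever their domains overlap."""
    return all(phi2[p] == name for p, name in phi1.items() if p in phi2)


def app_rule(x: str, phi1: Assignment, phi2: Assignment) -> Optional[Tuple[str, Assignment]]:
    """Operation label and assignment passed to the parent, or None if no rule exists."""
    if x not in phi1 or not compatible(phi1, phi2):
        return None
    return f"APP_{phi1[x]}", phi1


head_options = leaf_assignments({"f", "g"})  # one leaf rule per injective assignment
phi_b = {"f": "S", "g": "O"}
ok = app_rule("g", phi_b, {"f": "S"})       # child agrees with the head on f
clash = app_rule("g", phi_b, {"f": "O"})    # child disagrees on f: no rule
print(len(head_options), ok, clash)
```

Since every leaf rule fixes an injective assignment and every operation rule only checks agreement on shared placeholders, the number of rules stays small for the three-source inventory used here.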

Joint learning of compositional structure and parser
As a final step, we train the neural parser of Groschwitz et al. (2018) directly on the tree automata. For each position i in the sentence, the parser predicts a score $c(G, i)$ for each graph constant G, and for each pair i, j of positions and operation $\ell$, it predicts an edge score $c(i \xrightarrow{\ell} j)$.
The tree automata are factored the same way, in that they have one rule per graph constant and per dependency edge. As a result, we get a one-to-one correspondence between parser scores and automaton rules when aligning automata rules to words via the words' alignments to graph nodes.
We thus take the neural parser scores as rule weights c (r) for rules r in the automaton. In a weighted tree automaton, the weight of a tree is defined as the product of the weights of all rules that built it. The inside score I of the tree automaton is the sum of the weights of all the trees it accepts. Computing this sum naively would be intractable, but the inside score can be computed efficiently with dynamic programming. Our training objective is to maximize the sum of the log inside scores of all automata in the corpus.
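The dynamic program for the inside score can be sketched as a memoised recursion over states. This is a minimal illustration assuming an acyclic automaton given as a list of `(head_state, child_states, weight)` triples; the representation and the name `inside_score` are choices made for this example, not the data structures of the released code.

```python
from functools import lru_cache
from math import prod
from typing import Hashable, Iterable, List, Tuple

Rule = Tuple[Hashable, Tuple[Hashable, ...], float]  # (head state, child states, weight)


def inside_score(rules: List[Rule], final_states: Iterable[Hashable]) -> float:
    """Sum of the weights of all trees accepted by an acyclic weighted tree automaton."""
    by_head = {}
    for head, children, weight in rules:
        by_head.setdefault(head, []).append((children, weight))

    @lru_cache(maxsize=None)
    def inside(state: Hashable) -> float:
        # Sum over rules that produce `state`; multiply in the children's inside scores.
        return sum(
            weight * prod(inside(child) for child in children)
            for children, weight in by_head.get(state, [])
        )

    return sum(inside(q) for q in final_states)


# Two alternative leaf rules for state "a", one for "b", one binary rule at the root.
rules = [
    ("a", (), 2.0),
    ("a", (), 3.0),
    ("b", (), 5.0),
    ("root", ("a", "b"), 0.5),
]
total = inside_score(rules, ["root"])
print(total)  # 0.5 * (2 + 3) * 5 = 12.5
```

The two accepted trees have weights 5.0 and 7.5; the recursion obtains their sum without enumerating them, which is what keeps the computation tractable for larger automata.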
The arithmetic structure of computing the inside scores is complex and varies from automaton to automaton, which would make batching difficult. We solve this with the chain rule as follows:
$$\nabla_\theta \log I = \frac{1}{I} \sum_{r} \alpha(r) \, \nabla_\theta c(r),$$
where θ are the parameters of the neural parser, which determine c(r), and α(r) is the outer weight of the rule r (Eisner, 2016), i.e. the total weight of trees that use r divided by c(r). The outer weight can be efficiently computed with the inside-outside algorithm (Baker, 1979). This occurs outside of the gradient, so we do not need to backpropagate into it. Since the scores c(r) are direct outputs of the neural parser, their gradients can be batched straightforwardly.
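The role of the outer weight can be checked numerically on a toy example. The sketch below lists the trees of a tiny automaton explicitly (feasible only at toy scale; in practice the inside-outside algorithm is used instead) and computes the derivative of log I with respect to each rule weight as α(r)/I. All names and weights are hypothetical, and each rule is assumed to occur at most once per tree.

```python
from math import log, prod
from typing import Dict

# Each tree is given by the rules it uses (every rule at most once per tree).
TREES = [("r_root", "r_a1", "r_b"), ("r_root", "r_a2", "r_b")]


def inside(c: Dict[str, float]) -> float:
    """Inside score: sum over trees of the product of their rule weights."""
    return sum(prod(c[r] for r in tree) for tree in TREES)


def outer(c: Dict[str, float], rule: str) -> float:
    """Outer weight: total weight of the trees that use `rule`, divided by c[rule]."""
    return sum(prod(c[r] for r in tree) for tree in TREES if rule in tree) / c[rule]


def grad_log_inside(c: Dict[str, float]) -> Dict[str, float]:
    """d log I / d c(r) = alpha(r) / I for every rule r."""
    total = inside(c)
    return {r: outer(c, r) / total for r in c}


c = {"r_root": 0.5, "r_a1": 2.0, "r_a2": 3.0, "r_b": 5.0}
grad = grad_log_inside(c)

# Finite-difference check for one rule weight.
eps = 1e-6
bumped = dict(c, r_a1=c["r_a1"] + eps)
numeric = (log(inside(bumped)) - log(inside(c))) / eps
print(grad["r_a1"], numeric)
```

In a neural setting the values α(r)/I would be treated as constants and multiplied with the gradients of the parser's scores, so the automaton-specific arithmetic stays outside the batched backward pass.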

Setup
We evaluate parsing accuracy on the graphbanks DM, PAS, and PSD from the SemEval 2015 shared task on Semantic Dependency Parsing (SDP, Oepen et al. (2015)) and on the AMRBank LDC2017T10 (Banarescu et al., 2013). We follow  in the choice of neural architecture, in particular using BERT (Devlin et al., 2019) embeddings, and in the choice of decoder, hyperparameters and pre- and postprocessing (we train the model of §6 for 100 instead of 40 epochs, since it is slower to converge than supervised training). When a graph G is non-decomposable using our blob partition, i.e. if there is no well-typed AM dep-tree T that evaluates to G, and so the condition of Theorem 2 does not hold, then we remove that graph from the training set. (This does not affect coverage at evaluation time.) This occurs rarely, affecting e.g. about 1.6% of graphs in the PSD training set. Like , we use the heuristic AMR alignments of Groschwitz et al. (2018). These alignments can yield multi-node constants. In those cases, we first run the algorithm of Section 4 to obtain an AM tree with placeholder source names, and then consolidate those constants that are aligned to the same word into one constant, effectively collapsing segments of the AM tree into a single constant. We then construct the tree automata of Section 5 as normal.

Results
We consider three baselines. Each of these chooses a single tree for each training instance from the tree automata and performs supervised training. The random trees baseline samples a tree for each sentence from its automaton, uniformly at random. In the random weights baseline, we fix a random weight for each graph constant and edge label, globally across the corpus, and select the highest-scoring tree for each sentence. The EM weights baseline instead optimizes these global weights with the inside-outside algorithm.
[Table caption (partial): FG'20 is Fernández-González and Gómez-Rodríguez (2020).]
Table 1 compares the baselines and the joint neural method. Random trees perform worst: consistency across the corpus matters. The difference between random weights and EM is surprisingly small, despite the EM algorithm converging well. The joint neural learning outperforms the baselines on all graphbanks; we analyze this in §8. We also experimented with different numbers of sources, finding 3 to work best for DM, PAS and AMR, and 4 for PSD (all results in Appendix C).
Table 2 compares the accuracy of our joint model to  and to the state of the art on the respective graphbanks. Our model is competitive with the state of the art on most graphbanks. In particular, our parsing accuracy is on par with , who perform supervised training with hand-crafted heuristics. This indicates that our model learns appropriate source names.
Graphbank-specific pre- and postprocessing. The pre- and postprocessing steps of  we use still rely on two graphbank-specific heuristics that directly relate to AM dependency trees: in PSD, it includes a simple but effective step to make coordination structures more compatible with the specific flavor of application and modification of AM dependency trees. In AMR it includes a step to remove some edges related to coreference (a non-compositional source of reentrancy).
We include in brackets the results without those two preprocessing steps. The drop in performance for PSD indicates that while for the most part our method is graphbank-independent, not all shapes of graphs are equally suited for AM dependency-parsing and some preprocessing to bring the graph 'into shape' can still be important. For AMR, keeping the co-reference based edges leads to AM trees that resolve those reentrancies with the AM type system. That is, the algorithm 'invents' ad-hoc compositional explanations for a non-compositional phenomenon, yielding graph constants with type annotations that do not generalize well. The corresponding drop in performance indicates that extending AM dependency parsing to handle coreference will be an important future step when parsing AMR; some work in that direction has already been undertaken (Anikina et al., 2020).

Linguistic Analysis
As AM parsing is inherently interpretable, we can explore linguistic properties of the learned graph constants and trees. We find that the neural method makes use of both syntax and semantics.
We compute for each sentence in the training set the best tree from its tree automaton, according to the neural weights of the best performing epoch. We then sample trees from this set for hand-analysis (see Appendix A), to examine whether the model learned consistent sources for subjects and objects. We find that while the EM method uses highly consistent graph constants and AM operations, the neural method, which has access to the strings, sacrifices some graph constant and operation consistency in favour of syntactic consistency.
Syntactic Subjects and Objects. In the active sentence The fairy charms the elf, the phrase the fairy is the syntactic subject and the elf the syntactic object. In the passive The elf is charmed (by the fairy), the phrase the elf is now the syntactic subject, even though in both sentences, the fairy is the charmer and the elf the charmee. Similarly, the fairy is the syntactic subject in the intransitive sentence The fairy glows.
Intra-Phenomenon Consistency. For both the EM and neural method, we found completely consistent source allocations for active transitive verbs in all four sembanks. These source allocations were also the overwhelming favourite graph constants for two-argument predicates (72-92%), and the most common sources used by Apply operations (94-98%). For example, in AMR, the graph constant template in Fig. 8a appears 26,653 times in the neural parser output. 74% of these used sources x = S_1 and y = S_2 (from S = {S_1, S_2, S_3}). All active transitive sentences in our sample used this source allocation, so we call this the active graph constant (e.g. G-charm in Fig. 2) and refer to the sources S_1 and S_2 as S and O respectively, for subject and object. All four sembanks showed this kind of consistency; when we refer to S and O sources below, we mean whichever two sources displayed the same behaviour as S_1 and S_2 in AMR.
All four graphbanks are also highly consistent in their modifiers: classical modifiers such as adjectives are nearly universally adjoined with one consistent source, which we refer to as M, and MOD_M is the overwhelming favourite (90-99%) for MOD operations.
Cross-Phenomenon Consistency. We call a parser syntactically consistent if its syntactic subjects fill the S slot, regardless of their semantic role. A syntactically consistent parser would acquire the AMR in Fig. 8c from the active sentence by the analysis in Fig. 8b, and from the passive sentence by the analysis in Fig. 8d, with the passive constant G-charmP from Fig. 2.
The neural parser is syntactically consistent: in all sembanks, it uses the same source S for syntactic subjects in passives as for actives. EM, conversely, prefers to use the same graph constants for active and passives, flipping the APP edges to produce syntactically inconsistent trees as in Fig. 8e. Single-argument predicates are also syntactically consistent in the neural model, using S for subjects and O for objects, while EM picks one source. The heuristics in  have passive constants, but use them only when forced to, e.g. when coordinating active and passive.
Finally, we compute the entropy of the graph constants for the best trees of the training set as $H = -\sum_G f(G) \ln f(G)$, where $f(G)$ is the frequency of constant G in the trees. The entropies are between 2 and 3 nats, but are consistently lower for EM than the neural method, by 0.031 to 0.079 nats. Considering that the neural method achieves higher parsing accuracies, using the most common graph constants and edges possible evidently is not always optimal for performance. The syntactic regularities exploited by the neural method may contribute to its improved performance.
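For reference, an entropy of this kind, measured in nats, can be computed from a list of constant occurrences as in the short Python sketch below. The function name and the toy constant lists are hypothetical; relative frequencies are taken over all occurrences in the given list.

```python
from collections import Counter
from math import log
from typing import Hashable, Iterable


def constant_entropy(constants: Iterable[Hashable]) -> float:
    """Entropy in nats of the relative frequencies of the given graph constants."""
    counts = Counter(constants)
    total = sum(counts.values())
    # Natural logarithm, so the result is in nats.
    return -sum((n / total) * log(n / total) for n in counts.values())


uniform = constant_entropy(["G-charm", "G-glow", "G-fairy", "G-elf"])
skewed = constant_entropy(["G-charm"] * 3 + ["G-glow"])
print(uniform, skewed)
```

A lower value means that the trees reuse a smaller set of constants more heavily, which is the sense in which the EM trees are more consistent than the neural ones.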

Conclusion
In this work, we presented a method to obtain the compositional structures for AM dependency parsing that relies much less on graphbank-specific heuristics written by experts. Our neural model learns linguistically meaningful argument slot names, as shown by our manual evaluation; in this regard, our model learns to do the job of the linguist. High parsing performance across graphbanks shows that the learned compositional structures are also well-suited for practical applications, promising easier adaptation of AM dependency parsing to new graphbanks.

To sample trees, we compute for each sentence in the training set the best tree from its tree automaton, according to the neural weights of the best performing epoch. This ensures the AM trees evaluate to the correct graph. We then sample trees from this set for hand-analysis.
To get relevant sentences, we sampled 5-to-15-word sentences with graph constants from the following six categories:
Transitive verbs: graph constants with a labeled root and two arguments with edges labelled as in . As explained in the main text, we define the active constants as those with the most common source allocation, and the passive constants as those with the active source allocation flipped. We sampled both active and passive source allocations.
Verbs with one argument: Graph constants just like the transitive ones but lacking one of the arguments. There are four of these, given both source allocations.
Generally these graph constants are used for more than just verbs; for each of the six categories we sampled until we had ten relevant sentences. We visualised the AM trees and categorised the phenomena, for example active or passive verbs, nominalised verbs, imperatives, relative clauses, gerund modifiers, and so forth.
To answer the question of whether the parser used consistent constants for active and passive transitive sentences, we sampled until we had ten sentences with active or passive main verbs. For the single-argument verbs, we also looked at nominalised verbs, modifiers, and so forth. (Sampling and visualisation scripts will be available together with the rest of our code on GitHub.)

B An algorithm to obtain all AM dep-trees for a graph
Let G be a graph partitioned into blobs. Let U G be the set of unrolled graphs for G that can be obtained by Algorithm 2 by varying the queue order. Let further M G be the set of results of Algorithm 3 below for every input AM dep-tree T = C U for U ∈ U G and every choice of set M as specified in the algorithm. Algorithm 3 switches the order of two nodes m and k, making k the head of the subtree previously headed by m. This change of head is only possible when the incoming edge of m is labeled MOD (for APP, the change of head changes the evaluation result). It also requires a MOD edge between m and k; an APP edge with this type of swap would lead to a non-well-typed graph.
Finally, let $R_G$ be the set of results of Algorithm 4 for every input AM dep-tree $T \in M_G$ and any valid choice of R and RT (valid as described in the algorithm). Algorithm 4 is like Algorithm 1 for reentrancy resolution, but can have resolution targets RT(n) that are higher in the tree than the lowest common ancestor of n and the REF-n nodes. Further, Algorithm 4 uses the same methodology to also move nodes that do not need resolution to become descendants of a 'resolution target' higher in the tree (i.e. R here can now also contain nodes for which no REF node exists).
Then the following Theorem 1 holds:

Table 3: Common hyperparameters used in all experiments (the random trees, random weights and EM weights baselines use 40 epochs since they converge faster). For a complete description of the neural architecture, see  and its supplementary materials.