SMATCH++: Standardized and Extended Evaluation of Semantic Graphs

The Smatch metric is a popular method for evaluating graph distances, as is necessary, for instance, to assess the performance of semantic graph parsing systems. However, we observe some issues in the metric that jeopardize meaningful evaluation. E.g., opaque pre-processing choices can affect results, and current graph-alignment solvers do not provide us with upper-bounds. Without upper-bounds, however, fair evaluation is not guaranteed. Furthermore, adaptations of Smatch for extended tasks (e.g., fine-grained semantic similarity) are spread out and lack a unifying framework. For better inspection, we divide the metric into three modules: pre-processing, alignment, and scoring. Examining each module, we specify its goals and diagnose potential issues, for which we discuss and test mitigation strategies. For pre-processing, we show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs. For safer and enhanced alignment, we show the feasibility of optimal alignment in a standard evaluation setup, and develop a lossless graph compression method that shrinks the search space and significantly increases efficiency. For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects. Our code is available at https://github.com/flipz357/smatchpp.


Introduction
Semantic graphs such as meaning representations (MRs) aim at capturing the meaning of a text. Typically, these graphs are rooted, directed, acyclic, and labeled. Vertices denote semantic entities, and edges represent semantic relations (e.g., instrument, cause, etc.). A prominent MR framework is Abstract Meaning Representation (AMR), proposed by Banarescu et al. (2013), which is anchored in a propositional knowledge base (Palmer et al., 2005).
However, SMATCH measurement is non-trivial and lacks specification. For instance, SMATCH involves an NP-hard optimization problem of structural graph alignment, which distinguishes it from most metrics used in other evaluation tasks. In practice, a solution to this problem is found by employing a hill-climber. However, a hill-climber terminates at local optima, and it cannot inform us about a score upper-bound. In the end, this means that we lack information about the quality of the returned solution, potentially lowering our trust in the final evaluation. To mitigate this issue, we would like to study the possibility of an optimal solution, or a solution with a tight upper-bound. There are also other issues on which we lack understanding. E.g., we do not know to what extent different pre-processing choices may affect the evaluation results, and we lack a specification of SMATCH's popular fine-grained sub-graph metrics (Damonte et al., 2017), where it is unclear how sub-graphs should be best extracted and compared.
Paper structure and contributions First, we describe and generalize the SMATCH metric (§3), and summarize recent SMATCH variants in one framework. Then we break the metric down into three modules (§4), which lets us better distribute our attention over its key components. For each module, we discuss specification of goals and mitigation of issues. In the pre-processing module (§5), we motivate graph standardization to allow safer matching of equivalent MR graphs with different structural choices. In the optimization module (§6), we test strategies for solving the alignment problem with optimality guarantees. In the scoring module (§7), we discuss standardized and extended scoring of fine-grained semantic aspects, such as causality, tense, and location.

Related work
Metric standardization An inspiration for us is the work of Post (2018), who proposes the popular SACREBLEU framework for fairer comparison of machine translation systems with a standardized BLEU metric (Papineni et al., 2002). Specifically, SACREBLEU ships BLEU together with a specified tokenizer; prior to this, BLEU differences between systems could depend on different tokenization protocols. Facing the challenging problem of graph evaluation, a main contribution of our work is that we i) analyze weak spots in the current evaluation setup and ii) discuss ways of mitigating these issues, aiming at best evaluation practices. Cai and Lam (2019) introduce a variant of SMATCH (Cai and Knight, 2013) that penalizes dissimilar structures if they are situated in proximity of the graph root, motivated by their assumption that 'core-semantics' are located near the root of MR graphs. Furthermore, Opitz et al. (2020) introduce a SMATCH variant that performs a graded match of semantic concepts (e.g., cat vs. kitten), aiming at extended use-cases beyond parsing evaluation, where MRs of different sentences need to be compared. Similarly, Wein and Schneider (2022) adapt an embedding-based variant of SMATCH for cross-lingual MR comparison. We show that the different SMATCH adaptations can be viewed through the same lens with a generalized notion of triple match. Furthermore, Damonte et al. (2017) propose fine-grained SMATCH metrics that measure MR agreement in different aspects, such as semantic roles, coreference, or polarity. We diagnose and mitigate issues in the aspectual assessment, and show how to extend the measured aspects.

MR metrics
Conceptually different MR metrics have been proposed by Anchiêta et al. (2019) and Song and Gildea (2019), who aim at increased efficiency using structure extraction via breadth-first traversals, or Opitz et al. (2021), who compare MRs of different sentences with Wasserstein Weisfeiler-Leman kernels (Weisfeiler and Leman, 1968; Togninalli et al., 2019). Since significant parts of this paper are independent of SMATCH-specific scoring, other MR metrics can benefit from our work.

SMATCH: Overview and generalization
We introduce SMATCH and define a generalized SMATCH, so that we can summarize recent SMATCH variants in one framework.
Preliminary I: MR graph If not mentioned otherwise, we view an MR graph a as a set of triples, where a triple has one of two types. Unary triples have the structure <x, :rel, c>, where the source x is a variable and the target c is a descriptive label that shows the type or an attribute of x, depending on the edge label :rel. Using variables such as x, we can (co-)refer to different events and entities and capture complex events. Binary triples have the structure <x, :rel, y>, where both the source x and the target y are variables.

Preliminary II: SMATCH The idea of SMATCH is to measure the structural similarity of graphs via the amount of triples that are shared by a and b.
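The triple view above can be sketched in a few lines of Python. This is a minimal illustration, not the smatchpp implementation; the variable names (s, c) and labels are invented for the example.

```python
# Toy MR for "the cat sleeps", viewed as a set of triples (variable names
# s, c and all labels are illustrative, not taken from the paper).
mr = {
    ("s", ":instance", "sleep-01"),  # unary: the type of variable s
    ("c", ":instance", "cat"),       # unary: the type of variable c
    ("s", ":arg0", "c"),             # binary: relates two variables
}

def variables(mr):
    """Variables are all triple sources (every variable carries an
    :instance triple, so it appears as a source at least once)."""
    return {src for src, _, _ in mr}

def binary_triples(mr):
    """Binary triples are those whose target is itself a variable."""
    vs = variables(mr)
    return {t for t in mr if t[2] in vs}
```

Representing graphs as plain sets of triples is convenient because the overlap computation that SMATCH builds on then reduces to a set intersection.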
To obtain a meaningful score, we must know an alignment map: vars(a) ↔ vars(b) that tells us how to map a variable in the first MR to a variable in the second MR. In this alignment, every variable from a can have at maximum one partner in b (and vice versa). Let the application of a map to a graph a be denoted as a^map := {t^map : t ∈ a}, where t^map of a triple t = <x, :rel, y> is set to t^map = <map(x), :rel, map(y)> for binary triples, and t^map = <map(x), :rel, c> for unary triples.
Under any alignment map, we can calculate an overlap score f. In original SMATCH, f is the size of the triple overlap of a and b:

f(a, b, map) = |a^map ∩ b|. (1)

Ultimately, we are interested in a maximizer

map⋆ = argmax_map f(a, b, map). (2)

Finding map⋆ lies at the heart of SMATCH, and we will dedicate ourselves to it later in §6. For now, we assume that we have map⋆ at our disposal. Therefore, we can calculate precision P = f(a, b, map⋆)/|a| and recall R = f(a, b, map⋆)/|b| to obtain a final F1 evaluation score: 2PR/(P + R). With such a score, we can assess the similarity of MRs, and compare and select parsing systems.
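The scoring step can be sketched as follows, assuming graphs are sets of triples and the alignment is given as a plain dict. This is an illustrative simplification, not the smatchpp code.

```python
def smatch_scores(a, b, mapping):
    """Precision/recall/F1 from triple overlap under a variable mapping.

    a, b: sets of (source, relation, target) triples.
    mapping: dict from vars(a) to vars(b). Targets of unary triples are
    descriptive labels and pass through unchanged (assuming labels and
    variable names do not collide)."""
    a_mapped = {(mapping.get(s, s), r, mapping.get(t, t)) for s, r, t in a}
    f = len(a_mapped & b)                 # Eq. (1): size of triple overlap
    p = f / len(a) if a else 0.0          # precision: overlap / |a|
    r = f / len(b) if b else 0.0          # recall:    overlap / |b|
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Given an optimal map⋆, these three numbers are exactly the main scores returned to the user in §7.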
Generalizing SMATCH In SMATCH, two triples are said to match if they are identical under a mapping. I.e., we match with match(t, t′) := I[t = t′] that returns 1 if two triples t and t′ are the same, and zero else (we omit the map for simplicity). Recently, SMATCH has been adapted and tailored to different use-cases. E.g., SMATCH has been extended to incorporate word embeddings (Opitz et al., 2020; Wein and Schneider, 2022) to match <x, :instance, c> triples for studying cross-lingual MRs or MRs of different sentences. On the other hand, Cai and Lam (2019) propose a root-distance bias, based on the assumption that 'core-semantics' lie in the proximity of an MR's root.
We find that we can summarize such variants in one framework. We achieve this by introducing a scaled triple matching function

match(t, t′) := w_{t,t′} · sim(t, t′),

where sim is a (possibly graded) triple similarity and w_{t,t′} is an importance weight. For matching concepts with embeddings, we can use an embedding similarity sim(c, c′) on the descriptive concept labels, together with the importance weight w_{t,t′} = 1 ∀t, t′. (Consider, e.g., <x, :instance, cat> extracted from one sentence vs. <y, :instance, kitten> extracted from another sentence: a graded match is required to properly assess the similarity of the concepts.) For root-distance biased SMATCH as proposed by Cai and Lam (2019), we set w_{t,t′} such that we discount triple matches that are distant to the root. Our generalization does not change or constrain the original SMATCH. Instead, our goal was to define a more general framework of SMATCH-type metrics that unifies recently proposed SMATCH variants and shows possibilities for further extension. For the following studies, we set SMATCH++ to basic SMATCH, which is recovered by setting, ∀t, t′: w_{t,t′} = 1 and sim(t, t′) := I[t = t′].
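The generalized match can be sketched as a higher-order function. The graded similarity value below (0.8 for cat vs. kitten) is an invented toy number, not an embedding similarity from the cited work.

```python
def generalized_match(t, t_prime, sim, weight):
    """Scaled triple match: weight(t, t') * sim(t, t'). Basic SMATCH is
    recovered with a constant weight of 1 and exact-match similarity."""
    return weight(t, t_prime) * sim(t, t_prime)

# Basic SMATCH settings: w = 1 and sim = indicator of triple identity.
exact = lambda t, tp: 1.0 if t == tp else 0.0
unit = lambda t, tp: 1.0

# Toy graded concept similarity for :instance triples (values invented;
# assumes the triples are already expressed under a shared mapping).
toy_sim = {("cat", "kitten"): 0.8}
def graded(t, tp):
    if t == tp:
        return 1.0
    if t[:2] == tp[:2]:                       # same source and relation
        return toy_sim.get((t[2], tp[2]), 0.0)
    return 0.0
```

Swapping `graded` for an embedding similarity, or `unit` for a root-distance discount, recovers the variants discussed above without changing the surrounding machinery.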

A modular view on SMATCH
To set the stage for inspection, we break SMATCH down into three modules: i) pre-processing, ii) alignment, and iii) scoring. In particular, pre-processing covers any graph reading and processing in advance of the alignment. Alignment revolves around the search mechanism used for finding an optimal mapping map⋆. Scoring involves calculating the final scores and statistics that are returned to a user. For each module, we will specify its goals, assess potential weak spots, and discuss mitigation.
Module I: Pre-processing

Module goal and current implementation
MRs are typically stored and distributed in a 'Penman' string format, which can serialize any rooted and directed graph into a string.The goal of this module is to project two serialized textual MRs onto two sets of triples, as outlined in Figure 1.
The target domain of this projection should be a standardized MR graph space, where format divergences that do not impact graph semantics are eliminated. Original SMATCH performs pre-processing as follows: i) lower-case strings, ii) de-invert edges (e.g., <x, :relation-of, y> → <y, :relation, x>). However, while these steps seem sensible, more steps can be undertaken to enhance evaluation.

Two structures, one meaning: reification
Some MR guidelines, including the AMR guidelines, allow meaning-preserving structural graph translations (Banarescu et al., 2019; Goodman, 2019) with so-called reifications (or de-reification as an inverse mechanism). A subset of relations is selected to constitute a semantic relation core set (e.g., :arg0, :arg1, ..., :op1, :op2, ...), and for all other remaining relations (e.g., :location, :time), we use rules to map the relation to a sub-graph, where the rule-triggering relation label is projected onto a node, and the former source and target of the relation are attached with outgoing core relations. E.g., consider Figure 2, where a reification is applied to a <x, :location, y> relation. In this case, the rule is:

• location (de)reification: <x, :location, y> ⇐⇒ <z, :instance, beLocatedAt> ∧ <z, :arg1, x> ∧ <z, :arg2, y>,

where :arg1 indicates the thing that is found at a location :arg2.
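The location rule above can be sketched as a rewrite over triple sets. The fresh-variable naming scheme (z0, z1, ...) is an assumption for the example; a real implementation must guarantee that fresh variables do not clash with existing ones.

```python
import itertools

def reify_location(triples):
    """Apply the location reification rule:
    <x, :location, y> => <z, :instance, beLocatedAt> &
                         <z, :arg1, x> & <z, :arg2, y>.
    Fresh variables z0, z1, ... are assumed not to clash with existing
    variable names (illustrative sketch, not the smatchpp code)."""
    fresh = (f"z{i}" for i in itertools.count())
    out = set()
    for (s, rel, t) in triples:
        if rel == ":location":
            z = next(fresh)
            out |= {(z, ":instance", "beLocatedAt"),
                    (z, ":arg1", s), (z, ":arg2", t)}
        else:
            out.add((s, rel, t))
    return out
```

De-reification would run the rule in the opposite direction, subject to the well-definedness conditions discussed below (no incoming edge into z, at most two outgoing edges).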
The question whether an annotator should use either means of representation is answered in the guidelines as follows: whenever they feel like it (Banarescu et al., 2019). Therefore, a parser should not be penalized or rewarded for projecting reified (or non-reified) structures.
Empirical assessment of effect To understand the effect that reification can have on the final SMATCH score, it is interesting to study an edge case: evaluating graphs that are fully reified against graphs that are fully de-reified. As a data set, we take LDC2017T10, a standard AMR benchmark. Additionally, we gather automatic parses by applying an AMR parser (Xu et al., 2020). The results of this experiment are shown in Table 1. In the first three lines (i-iii), we compare equivalent translated versions of the test partition (gold vs. gold). We find that two equivalent gold standards can be judged to be very different (73.9 points, -26.1 points). A similar phenomenon can be observed when looking at the parses. The best parser score is achieved when comparing parses and references in the domain of reified graphs (82.8 points). On the other hand, if only the reference is reified, the parser score drops by 20 points (viii).
However, we also see that the results of a basic evaluation (vii) are practically the same as the results when evaluating with de-reified graphs (vi), indicating that both parser and gold annotation abstain from reification, where possible.
Discussion Having established that rule-based graph translations can enhance evaluation fairness, we pose the question: should we prefer reification or de-reification for space standardization?
The answer should be reification, since it can be seen as a form of generalization. More precisely, we note that reification of non-core relations is always possible. In fact, an interesting effect of reified structures is that they equip us with the means to attach further structure, or features, to semantic relations. On the other hand, however, de-reification is not always possible. It is only well-defined if there is no incoming edge into the node that corresponds to the non-core relation, and if there are not more than two outgoing edges.
However, there are also (practical) arguments against reification. Consider that de-/non-reified MRs are smaller and have more edge label differentiation. This i) may facilitate more intuitive display for humans and ii) shrinks the alignment search space. Indeed, a large solution space may have ramifications for evaluation optimality and efficiency (in §6, we empirically study this issue). Therefore, taking into account that the empirical effect size appears negligible in the average case, these trade-offs may not always be justified, and we may instead use de-reification, where possible.

Triple removals
Duplicate triples are triples that occur more than once. We find that they are sometimes produced by some parsers. Additionally, some parsers introduce a node more than once, which results in two triples <x, :instance, a> and <x, :instance, b>. Currently, SMATCH removes all such introductions of a second concept, but does not remove duplicate triples. By contrast, we propose to remove all duplicate triples, since they have no clear semantics, and to stay agnostic to second introductions of a concept (in some MRs, it may be acceptable that an entity is an instance of two concepts), keeping all such triples (if they are not identical).
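The proposed policy can be sketched as an order-preserving deduplication over a triple list (a minimal illustration; exact duplicates are dropped, while distinct second concept introductions are kept untouched).

```python
def remove_duplicate_triples(triples):
    """Drop exact duplicate triples while keeping order. Distinct
    :instance triples for the same variable (second concept
    introductions) are deliberately kept, since their acceptability
    depends on the MR framework."""
    seen = set()
    out = []
    for t in triples:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out
```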

Module II: Alignment
The goal of this module is solving Eq. 2, finding a map ⋆ for optimal matching score.
SMATCH uses a hill-climber for solving Eq. 2. An issue with this is that such a heuristic terminates at local optima and cannot provide us with any upper-bounds. Upper-bounds, however, can inform users about the quality of the outputted solution and thus increase the trustworthiness of the final score (and any parser comparison that is based thereupon). Therefore, we can conclude that using a hill-climber seems practical but may not be optimal, especially when considering cases where fair comparison needs to be guaranteed. Instead, we would like to use an Integer Linear Program (ILP) to obtain the (optimal) solution. Alternatively, at least, we would like to know a tight upper-bound to inform ourselves about the trustworthiness of the final score. But ILP is NP-hard, and therefore it seems optimal but possibly not practical, a conception that might favor the usage of a hill-climber.

Footnote 8: Since reification can potentially be used to model n-ary relations, only in the case where n = 2 can we model the structure with a single (labelled) edge. Footnote 9: Due to the rare occurrence of such phenomena in our parsed data, we find the effects of either choice to be negligible.
Triggered by these considerations, we review the hill-climber and the ILP and assess their effects on MR evaluation, with two desiderata in mind: evaluation quality and efficiency. Additionally, we propose a strategy for lossless MR compression that can improve the efficiency of any solver.
Practical but not optimal: hill-climber

SMATCH hill-climbing uses two operations, which we denote as switch and assign. The assign operation assigns a variable from vars(a) to an unaligned variable from vars(b): (i, ∅) → (i, j = map′(i)), where map′ is a candidate map. The switch operation does an alignment cross-over with respect to two alignment pairs, i.e., (i, map(i)), (k, map(k)) → (i, map′(i) = map(k)), (k, map′(k) = map(i)), where map is the current alignment and map′ the candidate alignment. In each iteration, we examine all possible switch and assign options, and greedily choose the best one. An example alignment procedure is shown in Figure 3.
In practice, we can resort to multiple random restarts to find better optima. However, this hardly addresses the underlying issue: we lack any information on upper-bounds, which may lower the trustworthiness of results, especially when facing larger graphs with lots of local optima.
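A toy version of the hill-climbing idea follows. For brevity it implements only the switch operation and assumes both graphs have the same number of variables, so it is a simplification of the procedure described above, not the original SMATCH code.

```python
import random

def f(a, b, mapping):
    """Triple overlap of a and b under a variable mapping (Eq. 1)."""
    a_mapped = {(mapping.get(s, s), r, mapping.get(t, t)) for s, r, t in a}
    return len(a_mapped & b)

def hill_climb(a, b, vars_a, vars_b, seed=0):
    """Greedy hill-climbing sketch: start from a random full map, then
    repeatedly apply the best 'switch' move (swap two images) until no
    move improves the overlap. Terminates in a local optimum; provides
    no upper-bound on the achievable score."""
    rng = random.Random(seed)
    targets = list(vars_b)
    rng.shuffle(targets)
    mapping = dict(zip(vars_a, targets))   # random initial alignment
    improved = True
    while improved:
        improved = False
        best, best_score = mapping, f(a, b, mapping)
        va = list(mapping)
        for i in range(len(va)):
            for j in range(i + 1, len(va)):     # switch: cross over pairs
                cand = dict(mapping)
                cand[va[i]], cand[va[j]] = cand[va[j]], cand[va[i]]
                s = f(a, b, cand)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        mapping = best
    return mapping, f(a, b, mapping)
```

On this tiny example the local optimum happens to be global; on larger graphs this is exactly what cannot be guaranteed.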

ILP: Optimal, but less practical?
We would like to use Integer Linear Programming (ILP) for optimal solution of the graph alignment.
Problem statement Assume two graphs g, g′ with node sets V, V′. Let u(i, j) denote the amount of unary triple matches, given we align i from V to j from V′, counting matches of triples that involve one MR variable. On the other hand, b(i, j, k, l) will denote the amount of structural binary triple matches, given we align i from V to j from V′ and k from V to l from V′. Here, we count matching binary triples that involve two MR variables. Usually, these data are pre-computed. Let x indicate our current map, i.e., if x_ij = 1 then we align i from V to j from V′. We find our solution at

max_x Σ_ij u(i, j) x_ij + Σ_ijkl b(i, j, k, l) x_ij x_kl, s.t. Σ_j x_ij ≤ 1 ∀i, Σ_i x_ij ≤ 1 ∀j.

The constraint ensures that every node from one graph is aligned, at maximum, to one node from the other graph. By linearization, introducing structural variables y with y_ijkl ≤ x_ij and y_ijkl ≤ x_kl, we obtain the equivalent ILP

max_{x,y} Σ_ij u(i, j) x_ij + Σ_ijkl b(i, j, k, l) y_ijkl,

where the structural variables, if active, show us countable binary triple matches. This is an NP-complete problem, imposing limits on its capability to provide us with optimal solutions for larger graphs (note, however, that we can retrieve intermediate solutions and upper-bounds).
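Since an ILP solver is an external dependency, the following sketch instead finds the guaranteed optimum of Eq. 2 by brute force over all injective variable maps. This is exponential and only viable for tiny graphs, but it illustrates what the ILP computes (the ILP scales much further and additionally yields upper-bounds from its relaxation).

```python
from itertools import permutations

def overlap(a, b, mapping):
    """Triple overlap under a mapping (Eq. 1)."""
    a_mapped = {(mapping.get(s, s), r, mapping.get(t, t)) for s, r, t in a}
    return len(a_mapped & b)

def optimal_alignment(a, b, vars_a, vars_b):
    """Exhaustive search over all injective maps vars(a) -> vars(b).
    Returns a provably optimal map* and its score. Exponential in the
    number of variables, hence a didactic stand-in for the ILP."""
    best_map, best = {}, overlap(a, b, {})
    n = min(len(vars_a), len(vars_b))
    for chosen in permutations(vars_b, n):
        cand = dict(zip(vars_a, chosen))
        s = overlap(a, b, cand)
        if s > best:
            best_map, best = cand, s
    return best_map, best
```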

Reduced search space with lossless graph compression
We observe that in an MR a, every variable x ∈ vars(a) is related to a concept c, e.g., <x, :instance, cat>. This means that a concept c identifies a variable x ∈ vars(a) iff ∀y ∈ vars(a): <y, :instance, c> ⇒ y = x. Therefore, if x denotes a cat, and there is no other entity in the MR that also denotes a cat, then x may be referred to simply by cat. This carries over to pairs of MRs, which are the focus of this paper: instead of considering only vars(a), we simply consider vars(a) ∪ vars(b). Therefore, we can replace all n variables from vars(a) ∪ vars(b) that are identified by concepts with the corresponding concepts (see Appendix A.1 for a full example). This shrinks the search space by reducing the amount of variables that the optimizer has to consider. Note that such a compression is lossless, in the sense that the possibility of full reconstruction of the original MR is ensured. This implies that if two compressed MRs are assessed as (non-)isomorphic, then the uncompressed MRs are also (non-)isomorphic.
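A sketch of the compression idea for a graph pair follows. It assumes concepts and variable names never collide, which a real implementation would have to guarantee (e.g., by marking compressed nodes); it is illustrative, not the smatchpp implementation.

```python
from collections import Counter

def compress_pair(a, b):
    """Lossless MR compression sketch: a variable whose concept labels no
    other variable in either graph is replaced by the concept itself, so
    the aligner no longer has to consider it. Reconstruction is possible
    because the substitution is a bijection on the affected variables."""
    def instance_map(g):
        return {s: t for s, r, t in g if r == ":instance"}
    ca, cb = instance_map(a), instance_map(b)
    counts_a, counts_b = Counter(ca.values()), Counter(cb.values())

    def rename(g, cm):
        # a concept identifies a variable iff it labels at most one
        # variable in a AND at most one variable in b
        sub = {v: c for v, c in cm.items()
               if counts_a[c] <= 1 and counts_b[c] <= 1}
        return {(sub.get(s, s), r, sub.get(t, t)) for s, r, t in g}

    return rename(a, ca), rename(b, cb)
```

After compression, uniquely identified nodes carry identical labels in both graphs and thus match without any search, which is why the alignment problem shrinks.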

Solver experiments
Two questions are of main interest:
1. RQ1, solution quality: (How) do the final SMATCH results depend on the solver?
2. RQ2, solution efficiency: How does the evaluation time depend on the solver?
In addition, we would like to assess how our answers to RQ1 and RQ2 might be affected by reification (resulting in a bigger search space) and MR compression (resulting in a smaller search space).
Setup We simulate a standard AMR parsing evaluation setting. We parse the LDC2017T10 testing data with six parsers: P1 (Xu et al., 2020), P2 (Cai and Lam, 2020), P3 (Lindemann et al., 2020), P4 (Zhang et al., 2019), P5 (Lyu and Titov, 2018), P6 (Cai and Lam, 2019). We evaluate the parsers using ILP or the hill-climber. As is standard, we show F1 micro corpus scores. For reference, we also run evaluation with the standard SMATCH hill-climbing script (denoted as previous). We observe that we successfully reproduce the scores from the standard SMATCH script with our implementation (first two lines of Table 2).

(Table 2 caption, continued: -N indicates the hill-climber optimizer with N restarts; quality: solution quality of the solver, where the first number is the amount of matching triples summed over all six parser evaluations (yield), and the second number indicates the tightest found upper-bound, which is only known by the ILP.)

RQ1: solution quality
Insight: Better alignment → safer evaluation Importantly, we see that the ILP yields score increments for all parsers, which signals the occurrence of alignment problems where the hill-climber (despite multiple restarts) did not find the optimal solution. The effect size is larger for reified graphs. We find differences of up to 1 point F1 score (Table 2: reify), where the hill-climber yields 288,597 matches (99.04%) and misses the temporary ILP upper-bound by 2,786. The growing gap underlines the degrading quality of the hill-climber when facing larger graphs.
Finally, the (slight) differences in increments among parsers when we evaluate them on reified graphs indicate that different parsers do make different decisions on when to reify an edge. For instance, the score difference ∆ for reified vs. non-reified graphs (using ILP) is 2.5 points for P5 and P6, 2 points for P1, and 1.7 points for P2. This supports our theoretical insights from §5.2: reification can make parser comparison fairer.

RQ2: Solution efficiency
Insight I: ILP isn't that impractical It seems to be commonly presumed that original SMATCH uses a hill-climber to make evaluation more practical and fast. However, our results qualify this presumption. For evaluating a full corpus (1,371 graph pairs), SMATCH with ILP needs only about 48 seconds longer than original SMATCH with hill-climber (50s vs. 98s). When the search space grows (due to reification), the time gap widens to a difference of 165 seconds. However, the consistent improvement of scores due to ILP (signaling suboptimal hill-climber solutions) can make the time increase acceptable for evaluations where fairness is critical.
Insight II: MR compression increases evaluation speed Viewing the last four rows of Table 2, we see that the MR compression i) did not lead to switched system ranks and ii) increased the evaluation speed by a large factor. Using MR compression, the ILP runs a full system evaluation in 11.7 seconds for standard graphs and 27.3 seconds for the reified graphs. Given that the MR compression is lossless (cf. §6.2), it provides us with an option for more efficient evaluation that is also safe (i.e., optimal).
Module III: Scoring

Main scores: Precision, Recall, and F1

The goal of this module is to provide the user with a final result. As discussed in §3, the main scores (Precision, Recall, and F1) follow directly from map⋆. The final score is typically micro-averaged, summing matching statistics across all graph pairs before they are normalized. SMATCH++ makes two additions: macro-scoring and confidence intervals. Macro-averaging scores over graph pairs can be a useful complementary signal, specifically when comparing high-performance parsers (Opitz and Frank, 2022a). Additionally, we adopt the bootstrap assumption (Efron, 1992) for calculating confidence intervals. To make calculation feasible, bootstrapping is performed after the alignment stage. Table 3 shows results of the additional statistics. Confidence intervals range between ±[0.5, 1] points for all parsers. Macro score shows an outlier, where P6 (+2.6 points) is more positively affected than other parsers (+[1.0, 1.9] points).
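Micro vs. macro averaging and the post-alignment bootstrap can be sketched as follows, with per-pair match statistics (matched triples, |a|, |b|) as input. This is an illustrative percentile bootstrap, not the smatchpp implementation.

```python
import random

def micro_macro_f1(stats):
    """stats: list of per-pair (matched, len_a, len_b) tuples.
    Micro: sum statistics first, then normalize.
    Macro: compute per-pair F1, then average."""
    def f1(m, la, lb):
        p = m / la if la else 0.0
        r = m / lb if lb else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    M, LA, LB = (sum(x) for x in zip(*stats))
    micro = f1(M, LA, LB)
    macro = sum(f1(*s) for s in stats) / len(stats)
    return micro, macro

def bootstrap_ci(stats, n=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over graph pairs, performed after alignment:
    resample pairs with replacement and recompute the micro F1."""
    rng = random.Random(seed)
    scores = sorted(micro_macro_f1([rng.choice(stats) for _ in stats])[0]
                    for _ in range(n))
    lo = scores[int(alpha / 2 * n)]
    hi = scores[int((1 - alpha / 2) * n) - 1]
    return lo, hi
```

Bootstrapping after alignment is what makes the interval cheap: the expensive map⋆ per pair is computed once, and only the aggregation is resampled.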

Measuring aspectual semantic similarity
We observe considerable interest in applying fine-grained aspectual MR metrics (Damonte et al., 2017) for inspecting linguistic aspects captured by MRs (e.g., semantic roles, negation, etc.). Applications range from parser diagnostics (Lyu and Titov, 2018; Xu et al., 2020; Bevilacqua et al., 2021; Martínez Lorenzo et al., 2022) to NLG system diagnostics and sentence similarity (Opitz and Frank, 2021, 2022b). Formally, given an aspect of interest asp and an MR g, we apply a sub-graph extraction function sg(g, asp) to build an aspect-focused sub-graph, and compute a matching score (e.g., F1).

Footnote 13 (cf. §7.1): We find a potential explanation in a motivation of P6's creators to focus on semantics in proximity of an MR's top node (the proportion of such semantics increases when the graph is smaller, and smaller graphs have more influence on macro average than on micro average).

Review of previous implementation
We study the description in Damonte et al. (2017) and the most frequently used implementation (Lyu, 2018).
The treated aspects are divided into two broad groups: i) alignment-based matching: for some aspects, we extract aspect-related genuine sub-graphs, on which we calculate an optimal alignment. ii) bag-of-label matching: for other aspects, we detect aspect-related variables and gather associated node labels in a bag/list, to compute an overlap score based on simple set intersection. E.g., the SRL aspect belongs to the first category (i): we extract <x, :argn, y> relations and their corresponding instance triples (here: <x, :instance, c> and <y, :instance, c′>). Then we calculate SMATCH on such SRL sub-graphs. The Negation, Named Entity (NE), and Frames aspects are put into the second group (ii). We look for a relation/node label that signals a particular aspect, e.g., <x, :polarity, -> (for negation) or <x, :name, y> (for NEs), we extract x, and replace x with the descriptive label c from <x, :instance, c>. For Frames, we search for <x, :instance, c> where c is a PropBank predicate, and collect c. Finally, we can evaluate without an alignment, using set intersection.
Open questions We pose two questions: 1. Can the sub-graph extraction be improved? 2. Are there other aspects that we can measure?

Improving sub-graph extraction
Sensible range of extraction For some phenomena, the current extraction range is clearly too limited. For instance, let us consider named entities, which can be captured in more complex and nested MR structures. E.g., in AMR, one node typically indicates the type of the named entity (NE), and another multi-node structure represents the name and other attributes. Consider two AMRs a and b, from which we want to extract NE structures to measure the agreement of the graphs w.r.t. NE similarity. As shown in Figure 4, assume that one graph is about a cat named Bob, while the other graph is about a cat named Lisa. Obviously, the MRs have similarities in their NE structure (since there are named cats), but also differences (since the cats have different names). However, NE-focused SMATCH only extracts cat and cat, and returns the maximum score.
Hence, for all finer-grained aspects that are captured by non-atomic MR structures (e.g., named entities), we propose to gather the full sub-graph starting at the aspect-indicating relation or node label. In the NE example, as shown in Figure 4, we would then obtain a score of 0.5, better reflecting the similarity of the two NE structures.
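The proposed full sub-graph gathering can be sketched as a reachability computation from the aspect-triggering edges. This is an illustrative simplification (it follows all outgoing edges and keeps the trigger sources' instance triples); real MRs may need extra care, e.g., for inverse roles.

```python
def extract_subgraph(triples, trigger_rel):
    """Gather the full sub-graph around an aspect-indicating relation
    (e.g. :name for NEs, :time for tense): keep the trigger edges, the
    instance triples of their sources, and everything reachable from
    their targets via outgoing edges."""
    sources = {s for s, r, t in triples if r == trigger_rel}
    reached = {t for s, r, t in triples if r == trigger_rel}
    frontier = set(reached)
    while frontier:                       # transitive closure over targets
        nxt = {t for s, r, t in triples if s in frontier}
        frontier = nxt - reached
        reached |= nxt
    return {(s, r, t) for s, r, t in triples
            if r == trigger_rel
            or s in reached
            or (r == ":instance" and s in sources)}
```

On the Bob/Lisa example, the extraction keeps the cat type, the :name edge, and the whole name structure, so differing name operands correctly lower the aspect score instead of being ignored.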
Sub-graph compression, align and match We find a middle ground between the advantages of coarse matching (concreteness, efficiency) and graph alignment (safe matching) by using alignment with lossless MR compression. This is optimal and efficient, and alleviates the need to switch between fine and coarse extraction methods.

Extending fine-grained scores
Beyond negation and named entities: other semantic aspects We find that the fine-grained SMATCH metrics by Damonte et al. (2017) miss some interesting features captured by MRs. For instance, four interesting AMR aspects that are currently not captured are cause, location, quantification, and tense. SMATCH++ allows their integration in a straightforward way. An example for tense extraction is displayed in Figure 5, where our SMATCH++ sub-graph extraction extracts the complete temporal sub-graph, triggered by the edge label :time (if we resorted to the style of fine-grained SMATCH, we would miss larger parts of the temporal structure, only extracting the node label end).
Results are shown in Table 4. Interestingly, we see that projecting causality seems hard: all parsers tend to struggle when assessing causal structures (31.2 up to 47.8 F1 points), showing much room for improvement. The temporal structures, on the other hand, can be assessed with somewhat higher accuracy (48.4 up to 67.7 points). We also see some switched ranks, indicating different parser strengths. Overall, parser score differences seem notably more pronounced than when calculating SMATCH(++) on the full graphs, showing the difficulty of capturing finer phenomena, and highlighting strengths of more recent parsers.

(Footnote 17, triples of the second graph in Figure 4: <x, :instance, cat>, <y, :instance, name>, <x, :name, y>, <y, :op1, "lisa">.)

Conclusion
SMATCH++ is the first specification of a standardized, extended, and extensible SMATCH metric. We aim at i) standardized and transparent comparison of graph parsing systems, and ii) improved extensibility for custom applications. The applications can include finer parser diagnostics and measuring semantic sub-graph similarities such as quantification, cause, or tense with our fine-grained metrics.

Limitations
We have to leave some questions open. First, we would have liked to shed more light on the solvers' behaviors when facing large graphs, in isolation. On the one hand, our benchmark corpus indeed contains some large MRs with many variables, including reified MRs and MRs that represent multiple sentences (up to 174 variables, cf. Table 2). We found that ILP could cope with these harder problems, providing optimal solutions in reasonable time. When facing bigger graphs, however, we can expect that the solution quality of the hill-climber quickly degrades, while the ILP will struggle to find optimal solutions. While our graph compression strategy can help mitigate this issue by reducing the alignment search space, it would be interesting to study the quality of temporary solutions, or of solutions of the LP relaxation. There are also relaxed ILP solvers (Klau, 2009) that iteratively tighten the lower and the upper bound. They could prove useful for aligning larger MR graphs, or, at least, for finding useful upper-bounds.
Second, in this paper we studied SMATCH(++), which measures structural overlap and assigns each triple the same weight. But structural differences of similar degree can have a different impact on overall meaning similarity as perceived by humans, which can have ramifications for measuring sentence similarity (Opitz et al., 2021) and meaningful evaluation of strong AMR parsers (Opitz and Frank, 2022a). Therefore, for a deeper assessment of MR similarity, we may have to use conceptually different metrics, or explore SMATCH++-based strategies and (sensibly) weigh triples depending on label importance, or compose an overall score by weighting measured sub-aspect similarities.

(Figure 6 caption, continued: "... that would result in the same score; this is not captured in this figure. The X-axis shows the amount of alignment variables. In different terms, a higher point in this figure is equivalent to a larger pool of local optima of different quality, and thus we can conjecture a greater likelihood that the optimal solution is not returned by the hill-climber.")

A.3 Aspect overview
Previously measured aspects For all aspects, we retrieve F1, Precision, and Recall. We make two changes: we add a default option for extracting aspect sub-graphs, and we measure all aspects under alignment.

Aspects we added:
• Cause: cause is modeled via cause-01. We extract the label of :arg1 (what is caused?) and the sub-graph of :arg2, the cause itself.
• Tense: tense is modeled via a <x, :time, y> edge. We extract the label of the thing that happens and the sub-graph of y, the temporal description of when it happens.
• Location: similar to above, but with the :location edge.
• Quantifier: similar to above, but with the :quant edge.

A.4 Best practice
To provide a balance between efficiency, safety, and meaningfulness of scores, the default procedure of SMATCH++ is currently set to:

1. Pre-processing: lower-casing, duplicate removal, de-reify where applicable.
2. Alignment: solver: ILP; triple match: w_{t,t′} = 1 ∀t, t′; sim(c, c′) := I[c = c′].
3. Scoring: Precision, Recall, F1, bootstrap confidence intervals.

An option to increase efficiency without incurring a loss in safety and meaningfulness is to add graph compression to the pre-processing. It is set as the default for fine semantic aspect scores. Also, to ensure utmost safety, we have to consider applying reification standardization (incurring a significantly longer evaluation time).

Figure 1: A serialized MR string is read into a graph.

Figure 3: Sketch of the search space (top) and a hill-climber run (bottom). Every hill-climber step constitutes an improved lower bound, but we cannot obtain a tight upper-bound (an accessible trivial upper-bound is the amount of triples in the smaller of the two graphs: 3).

Figure 6: Assessing solution quality variability. Top: basic graphs; bottom: reified graphs. Diagonal line: linear trend. Horizontal line: arithmetic mean. See text in §A.2 for more description and §6.4.1 for discussion.
1. Measured under alignment:
(a) SRL: extract <x, :argn, y> triples and corresponding instance triples.
(b) Coreference/re-entrancies: extract <x, :rel, y> triples for which there is another triple <z, :rel′, y> (meaning y is a re-entrant node), and also extract corresponding instance triples.
2. Measured via bag-of-structure extraction and set operations:
(a) Concepts: collect all node labels.
(b) Frames: collect all node labels where the label is a PropBank predicate frame.
(c) NonSenseFrames: see above, but with the sense label removed.
(d) NE: named entities; collect all node labels that have an outgoing :name relation.
(e) Negation: collect all node labels that have an outgoing :polarity relation.
(f) Wikification: collect all node labels that have an incoming :wiki relation.
(g) IgnoreVars: replace all variables in triples with concepts, collect triples.
Additional aspects measured by us: Cause, Tense, Location, Quantifier.

Table 1: Results of meaning-preserving translations. rfyStd: score when we project X and Y into standardized reified space.

Table 2: Parser evaluation. time refers to the approximate total time needed to evaluate a single parser (i.e., processing 1,371 graph pairs).

Table 3: Evaluation with additional macro statistics and confidence intervals. Solver: ILP.