Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR

Systems that generate sentences from (abstract) meaning representations (AMRs) are typically evaluated using automatic surface matching metrics that compare the generated texts to the texts that were originally given to human annotators to construct AMR meaning representations. However, besides well-known issues from which such metrics suffer (Callison-Burch et al., 2006; Novikova et al., 2017), we show that an additional problem arises when they are applied for AMR-to-text evaluation, because mapping from the more abstract domain of AMR to the more concrete domain of sentences allows for manifold sentence realizations. In this work we aim to alleviate these issues and propose $\mathcal{M}\mathcal{F}_\beta$, an automatic metric that builds on two pillars. The first pillar is the principle of meaning preservation $\mathcal{M}$: it measures to what extent the original AMR graph can be reconstructed from the generated sentence. We implement this principle by i) automatically constructing an AMR from the generated sentence using state-of-the-art AMR parsers and ii) applying fine-grained, principled AMR metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form $\mathcal{F}$, which measures the linguistic quality of the generated sentences; we implement it using state-of-the-art language models. We show, theoretically and experimentally, that fulfillment of both principles offers several benefits for the evaluation of AMR-to-text systems, including the explainability of scores.


Introduction
Abstract Meaning Representation (short: AMR) (Banarescu et al., 2013) aims at capturing the meaning of a sentence in a machine-readable graph format. For instance, the AMR in Figure 1 represents a sentence such as "Perhaps, the parrot is telling herself a story?". Among other phenomena, AMR captures predicate senses, semantic roles, coreference and utterance type. In the example, tell-01 links to a PropBank (Palmer et al., 2005) predicate sense, and arg-n labels indicate participant roles: parrot is both speaker (arg0) and hearer (arg2), story is the utterance (arg1). AMR-to-text generation is a task that has garnered much attention over the recent years (Song et al., 2017, 2018; Konstas et al., 2017; Cai and Lam, 2020b; Ribeiro et al., 2019). The output of AMR-to-text systems is typically evaluated against the sentence from which the AMR was created, using standard surface string matching metrics such as BLEU (Papineni et al., 2002) or CHRF(++) (Stanojević et al., 2015; Popović, 2015, 2016, 2017), as employed in general NLG tasks. However, such metrics suffer from several issues; for example, they are highly sensitive to the reference translations used for assessment, which may easily lead to falsely confident conclusions about a metric's efficacy (Callison-Burch et al., 2006; Mathur et al., 2020).
Moreover, we find that this sensitivity to reference sentences is aggravated when evaluating AMR-to-text. The root cause lies in the fact that there are manifold ways to realize a sentence from a meaning representation. For example, in Figure 2 we see four candidate sentences (i-iv) generated from an AMR (left).

[Figure 2. AMR (left): possible-01 → play-01 → cat. Generated candidates: i: Maybe the cat is playing. ii: It is possible that a cat is playing. iii: Perhaps, the cat plays the flute. iv: Mayybe the cat are playing. Original sentence: Perhaps, the cat plays.]

In this case, one system generates i: Maybe the cat is playing. while another system generates iii: Perhaps, the cat plays the flute. Clearly, i better captures the meaning contained in the gold graph (left side) compared to iii, which contains 'hallucinated' content, a severe issue in neural generation models that is hard to detect (Koehn and Knowles, 2017; Dušek et al., 2019; Nie et al., 2019; Logan et al., 2019; Wang and Sennrich, 2020). Now, when we use a canonical surface matching metric (here: BLEU), we evaluate the Generated sentence against the Original sentence. Yet, when comparing i and iii against the original sentence, the system that produces the hallucinating sentence (iii) is greatly rewarded (∆: +36 BLEU points), to the disadvantage of the systems that produce the meaning preserving sentences (i) (only 18 BLEU points) and (ii) (only 5 BLEU points).
In conclusion, we aim at a better metric that measures meaning preservation of the generated output with respect to the MR given as input; we do this by (re-)constructing an AMR from the generated sentence and comparing it to the input AMR. In Figure 2, Reconstruction is the result of parsing iii. We see that the reconstructed AMR is flawed, in the sense that it deviates from the original meaning representation. Specifically, iii misrepresents the sense of play (01 vs. 11) and hallucinates a semantic role (arg1) with filler flute. By contrast, when converting sentences (i, ii, or iv) to AMRs, we obtain flawless reconstructions. We will measure their preservation of Meaning using well-defined graph matching metrics.
However, Figure 2 also illustrates that assessing meaning preservation alone is not sufficient to rate the quality of the generated sentence: sentence (iv) captures the meaning of the AMR perfectly, but its form is flawed: it contains a typo and a wrong verb inflection, a common issue (especially) in low-resource text generation settings (Brussel et al., 2018; Koponen et al., 2019; Matusov, 2019). In order to rate both meaning and form of a generated sentence, we combine the score for meaning reconstruction with a score called Form that allows us to judge the sentence's grammaticality and fluency. By this move we aim at an explainable and more suitable ranking with a combined MF score (last column: 1st/2nd: i; 3rd: iv; 4th: iii).
Generally, our contributions are as follows: • We propose two linguistically motivated principles that aim at a sound evaluation of AMR-to-text systems: (i) the principle of meaning preservation and (ii) the principle of (grammatical) form.
• From these complementary principles we derive and implement a (novel) MF β score for AMR-to-text generation, which composes its score based on individual measurements of meaning and form aspects. MF β allows users to modulate these two views on generation quality with respect to their impact on the final metric score.
• We conduct two major pilot studies involving a range of competitive AMR-to-text generation systems and human annotations. In the first study, we investigate the potential practical benefits of MF β when assessing systems, such as its prospects to offer interpretability of metric scores and finer-grained system analyses. In the second study, we assess potential weak spots of MF β , for example, its dependence on a strong AMR parser.
We will release all code and data.

Related work
Many NLP tasks involve generation of text, e.g., from text to text as in machine translation or document summarization, or generation of sentences from structured content, such as data-to-text generation in its most general form (tables or graphs), or generation from meaning representations such as AMR or DRT. Traditionally, the performance of such systems has been evaluated with word n-gram matching metrics such as the popular BLEU metric in MT (Papineni et al., 2002) or Rouge (Lin, 2004) in document summarization. Alternatively, researchers use character n-gram matching metrics such as chrF (Stanojević et al., 2015; Popović, 2015, 2016, 2017). Yet, such metrics suffer from several well-known issues (Callison-Burch et al., 2006; Novikova et al., 2017; Mathur et al., 2020); for instance, they depend on symbolic matching, greatly penalizing equivalent generations that differ from the gold reference in surface form. These issues may become aggravated in settings where one maps from more abstract input to more concrete output, including, but not limited to, AMR-to-text, table-to-text (Liu et al., 2017; Parikh et al., 2020) or knowledge-to-text (Koncel-Kedziorski et al., 2019). Recently, unsupervised (Zhang et al., 2020) or learned metrics (Sellam et al., 2020) based on contextual language models have been proposed. For example, the BERTSCORE (Zhang et al., 2020) metric uses BERT (Devlin et al., 2019) to encode the candidate and the reference sentence and computes the score based on a cross-sentence word-similarity alignment. This metric is computationally more expensive but tends to show higher agreement with human raters. Yet, all of these metrics have in common that they lack explainability and interpretability and are not well applicable for encoding an AMR graph. First practical attempts at assessing sentence quality via semantic analysis have been made in MT, using semantic role labeling (Lo, 2017) or WSD and NLI (Carpuat, 2013; Poliak et al., 2018); in between lies SPICE, which evaluates caption generation via inferred semantic propositions (Anderson et al., 2016).

Fusing meaning and form into MF β
Comparing sentences with surface matching metrics suffers from several well-known issues (see Section 1 and Section 2). Now, we will focus on another critical aspect of such metrics that is specific to tasks that map abstract input to natural language output (as in AMR-to-text). Equipped with this background, we start building our MF β score which targets the alleviation of these issues.
An issue of the typical evaluation setup that is specific to generating text from more abstract input Let us first denote the process of creating AMRs from sentences as $parse \equiv abstractify \equiv f$ and the process of generating sentences from such abstract representations as $generate \equiv concretize \equiv f^{-1}$. When evaluating AMR-to-text generation approaches, researchers typically assess how well the generated sentence $s'$ matches the sentence $s$ from which the original AMR was created: $s' = generate(parse(s))$ is matched against $s$. However, there is a problem with this approach, because we map from an abstract input to a more concrete output with a 'one-to-many mapping'. This means that there are possibly many different valid output sentences. So, in order to really assess whether two produced sentences are both valid outputs for the same AMR structure, we need to perform this assessment in the AMR domain. This can be achieved by applying an inverse system that generates AMR from text (a parser). Put differently, consider that a system $f^{-1}$ has generated $s'$ from an AMR $p = f(s)$; then we would like a metric $metric: D \times D \rightarrow [0, 1]$ to satisfy the following equivalence: $s' \equiv s \iff f(s') = p \iff metric(s', s) = 1$. Two outputs are equivalent if they lead to the same abstract meaning construction. This also means that we can consider the actual source sentence as distant, i.e., we may never use it directly. This is exemplified in Figure 3, where, in an analogue to AMR-to-text, we see a (surjective) function that generates concrete objects from abstract objects (e.g., mammal → {dog, mouse, cow}). Now, consider that we are given mammal and are tasked with generating a single concrete instance. How can we assess whether our output is correct? We observe that this cannot safely happen by testing whether the output (e.g., cow) equals another instance of mammal (e.g., dog). However, we can exploit that generation is a right-inverse of abstraction: re-applying the abstraction $f$ converts the concrete instance back to an abstract object, which can be compared against the input.
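To make this right-inverse idea concrete, here is a toy sketch (ours, not from the paper) of the mammal example: a generated concrete instance is judged by re-applying the abstraction f, never by surface-matching it against another concrete instance.

```python
# Toy illustration of evaluating a one-to-many generation task via
# the abstraction function f, mirroring the mammal example in Figure 3.
# All names are ours, for illustration only.

ABSTRACTION = {"dog": "mammal", "mouse": "mammal", "cow": "mammal",
               "sparrow": "bird"}

def f(concrete: str) -> str:
    """The abstraction function (the analogue of an AMR parser)."""
    return ABSTRACTION[concrete]

def is_valid_generation(abstract_input: str, generated: str) -> bool:
    """A generation is correct iff re-abstracting it recovers the input."""
    return f(generated) == abstract_input

# 'cow' and 'dog' are both valid realizations of 'mammal', even though
# they do not match each other on the surface:
assert is_valid_generation("mammal", "cow")
assert is_valid_generation("mammal", "dog")
assert not is_valid_generation("mammal", "sparrow")
```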
3.1 From principles to MF β

To alleviate the issue described above, we first introduce our Principle of meaning: Generated sentences should allow loss-less AMR reconstruction.
This principle expresses a key expectation that we formulate for a system that generates NL sentences from abstract meaning representations. Namely, the generated sentence should reflect the meaning of the AMR.
However, this principle alone is not sufficient: we also expect the system to generate grammatically well-formed and fluent text. For example, the following system output: Possibly, it(self) tells parrot a story. contains relevant content expressed in the AMR of Figure 1, but it is neither grammatically well-formed nor a natural and fluent sentence. This leads us to our Principle of form: Generated sentences should be syntactically well-formed, natural and fluent.
In the style of the well-established F β score, we fuse these two principles into the MF β score:

$$\mathrm{MF}_\beta = (1+\beta^2)\cdot \frac{M \cdot F}{\beta^2 \cdot M + F} \quad (1)$$

Here, β allows the user to gauge the evaluation towards Form or Meaning, accounting for their specific application scenario. We anticipate that most users will prefer the harmonic mean (β = 1), or giving Meaning a higher emphasis compared to Form (e.g., by setting β = 0.5). However, in our experiments we will also consider extreme decompositions into Meaning-only (β → 0) or Form-only (β → ∞).
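Under this F β -style reading of Eq. (1), the fusion can be computed in a few lines; a minimal sketch (function and variable names are ours):

```python
def mf_beta(meaning: float, form: float, beta: float = 1.0) -> float:
    """Fuse Meaning and Form scores (both in [0, 1]) in the style of F_beta.

    beta -> 0 recovers Meaning only; beta -> infinity recovers Form only;
    beta = 1 is the harmonic mean of the two scores.
    """
    if meaning == 0.0 and form == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * meaning * form / (b2 * meaning + form)

# Example: perfect meaning preservation but mediocre form.
print(mf_beta(1.0, 0.5, beta=1.0))  # ~0.667 (harmonic mean)
print(mf_beta(1.0, 0.5, beta=0.5))  # ~0.833 (emphasis on Meaning)
```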

Parameterizing meaning
We propose to measure Meaning with a score in [0, 1] by (i) reconstructing the AMR with a state-of-the-art parser and (ii) computing the relative graph overlap of the reconstruction and the source AMR using graph matching. We call this a RESMATCH. I.e., given a generated sentence s′ and source AMR p, we match parse(s′) against p and compute Meaning = amrMetric(parse(s′), p). This means that we have to decide upon parse and amrMetric. We propose two potential settings.
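Before turning to the two settings, the overall RESMATCH computation can be sketched as follows; `parse` and `amr_metric` are hypothetical wrappers (e.g., around the GSII parser and a Smatch implementation), not actual APIs:

```python
def meaning_score(generated_sentence: str, source_amr: str,
                  parse, amr_metric) -> float:
    """RESMATCH sketch: re-parse the generated sentence and match the
    reconstruction against the source AMR.

    parse      : callable, sentence -> AMR string
                 (hypothetical wrapper, e.g., around the GSII parser)
    amr_metric : callable, (AMR, AMR) -> score in [0, 1]
                 (hypothetical wrapper, e.g., Smatch or S2match F1)
    """
    reconstruction = parse(generated_sentence)
    return amr_metric(reconstruction, source_amr)

# Usage (with the hypothetical wrappers in scope):
#   m = meaning_score("Maybe the cat is playing.", source_amr,
#                     parse=gsii_parse, amr_metric=s2match_f1)
```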
AMR reconstruction To reconstruct the AMR using parse, we use the parser by Cai and Lam (2020a), henceforth denoted as GSII, as it constitutes the latest state-of-the-art in AMR parsing. Based on IAA estimates by Banarescu et al. (2013), this parser (80.3 Smatch F1) is almost on par with human agreement (estimates range between 0.71 and 0.83 Smatch F1).

AMR metric for reconstruction assessment
To obtain a single Meaning score we propose to use S2match (Opitz, 2020), which is based on the canonical AMR evaluation metric Smatch (Cai and Knight, 2013). It is essentially the same as Smatch, except that it uses a graded match for concept nodes. This offers the potential to compensate for some unwanted noise in automatically generated text or for lexical deviations from the original sentence.
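The graded concept match can be pictured as replacing Smatch's binary label test with an embedding similarity; a rough sketch of the idea (our simplification, not the official S2match implementation; the `embed` lookup and the cutoff `tau` are assumptions):

```python
import numpy as np

def concept_match(label_a: str, label_b: str, embed, tau: float = 0.5) -> float:
    """Graded concept match in the spirit of S2match: identical labels
    score 1.0; otherwise fall back to thresholded cosine similarity.

    embed : callable, word -> np.ndarray
            (hypothetical embedding lookup, e.g., GloVe vectors)
    tau   : similarity cutoff below which concepts count as distinct
    """
    if label_a == label_b:
        return 1.0
    va, vb = embed(label_a), embed(label_b)
    cos = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return cos if cos >= tau else 0.0
```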
Discussion All in all, MF β leaves researchers a lot of flexibility as to which parse or amrMetric they prefer. For our parser, we aimed at the best available one that achieves high IAA with humans. However, while this property makes it most suitable at first glance, we would also like to know whether the parser is vulnerable to specific peculiarities of generated sentences. Moreover, we would like to know the impact on the performance assessment of MF β when we use another parsing system. Therefore, we will investigate these issues more closely in Section 5.1. With regard to the amrMetric, there certainly exist use cases for other metrics, or custom metrics. For example, Anchiêta et al. (2019) and Song and Gildea (2019) propose metrics that aim at faster evaluation by ablating the costly variable alignment. This may prove valuable when one wants to apply MF β to large corpora.

Parameterizing form with LMs
Assessing the (related) aspects of sentence grammaticality and fluency is not an easy task (Heilman et al., 2014; Dickinson and Ragheb, 2015; Katinskaia and Ivanova, 2019). Recently, Lau et al. (2020) showed that probability estimates based on language models can serve as an indicator for complex notions of form, such as acceptability in context. Here, we want to measure Form with respect to grammaticality and fluency. Therefore, we investigate the performance of state-of-the-art LMs for predicting these two aspects as rated by humans. Since a graded acceptability score for Form is difficult to interpret, and we aim at producing a ratio score for Form that we can feed into MF β, we use a binary variable that reflects a threshold up to which a sentence is considered to be of acceptable form, or not. The Form performance of a system can then be well interpreted as the ratio of sentences it produced that are judged to be of acceptable form.

Binary form assessment Given a specific candidate generation s′, we use a binary variable to assess whether s′ is of satisfactory form. For this, we first calculate the mean token probability

$$mtp(s) = \frac{1}{n}\sum_{j=1}^{n} p_{LM}(tok_j \mid ctx_j),$$

where $ctx_j$ differs for uni-directional LMs ($ctx_j = tok_{1...j-1}$) and bi-directional LMs ($ctx_j = tok_{1...j-1,\,j+1...n}$). We compute this score both for the generated sentence, mtp(s′), and for the source sentence as reference, mtp(s), and calculate a score of preference $prefScore = \frac{mtp(s')}{mtp(s') + mtp(s)}$. The decision on whether the generated sentence s′ is acceptable is then calculated as

$$accept(s') = \mathbb{1}\left[prefScore \geq 0.5 - tol\right],$$

where tol is a tolerance parameter. Less formally, a sentence is considered to have an acceptable surface form in relation to its reference if its form is estimated as being at least as good as the reference, minus a tolerance, which we fix at 0.05. Finally, the corpus-level score for Form reflects the ratio of sentences a system has produced that are of acceptable form. This is inspired by Lau et al. (2020), except that the creation of the binary variable enables us to obtain a corpus-level score for Form that is interpretable by expressing a ratio in the range of [0, 1], which is necessary to ensure sound MF β calculation.

Form predictor selection Similar to Lau et al. (2020), we assess different LMs on data from the WebNLG task (Gardent et al., 2017), which contains human fluency and grammaticality judgements for machine-generated sentences. Based on the results, we select GPT-2 as our basis for Form assessment, since we find that it exhibits a good F1 score in the binary prediction of fluency and grammaticality, and it shows slightly better performance compared with the other LMs. More details on this experiment can be found in Appendix 7.1.
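Putting the pieces together for a uni-directional LM, here is a minimal sketch of the Form decision, assuming HuggingFace transformers and GPT-2 (the paper's exact implementation details may differ; bi-directional LMs such as BERT would condition on both context sides instead):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mtp(sentence: str) -> float:
    """Mean token probability of a sentence under a uni-directional LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # probability of token j given the preceding tokens 1..j-1
    probs = torch.softmax(logits[0, :-1], dim=-1)
    tok_probs = probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return tok_probs.mean().item()

def accept(generated: str, reference: str, tol: float = 0.05) -> int:
    """Binary form acceptability of `generated` relative to `reference`."""
    p_gen, p_ref = mtp(generated), mtp(reference)
    pref_score = p_gen / (p_gen + p_ref)
    return int(pref_score >= 0.5 - tol)

# The corpus-level Form score is then the mean of accept(...) decisions
# over all (generated, reference) pairs produced by a system.
```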

Goals of our pilot studies
Our proposed MF β metric for AMR-to-text generation aims at offering a more balanced and justified assessment of generated sentences according to Meaning and Form than currently offered by standard surface-matching metrics. However, as detailed in §3.2 and §3.3, it depends on a number of hyper-parameters, such as the parser applied for Meaning reconstruction or the LMs used for the assessment of Form.
To provide more insight into the properties of different modulations of the proposed decompositional MF β metric and its possible dependence on the introduced parameters, we will conduct a series of pilot studies to better assess the potential benefits and weaknesses of MF β when used to evaluate and rank AMR-to-text systems.
Specifically, we want to investigate i) to what extent MF β aligns with other metrics in system scoring; ii) whether MF β has the potential to explain its scores better than other metrics; iii) whether possible divergences in the assessment of system outputs are justified and in line with our principles for assessing Meaning and Form.
Since any dependence on parameters that are subject to changes over time (such as LM capacity or AMR parsing performance) may be not desirable, an important task is to assess the effects of these factors on metric scores and system rankings. To investigate these questions, we conduct two pilot studies.
In the first pilot study, we want to assess the relation of MF β to the conventionally applied string matching metrics when ranking state-of-the-art systems, and its potential advantages. For instance, we are interested in whether MF β can justify potential differences in rankings and whether it succeeds in disentangling Form and Meaning.
In the second pilot study, we investigate a potential Achilles' heel of MF β, namely its dependence on a parser and an LM. Therefore, we (i) investigate the effects of using another parser and (ii) assess a potential remedy for this problem by using parse quality control. Finally, (iii), we validate the binary predictions of Form in a small annotation study conducted by a native speaker.

To put the results of MF β into perspective, we display the scores of several metrics that align with the sentence-matching setup that was previously used for the evaluation of AMR-to-text. Along with BLEU, we display Meteor and chrF++ scores, since these three metrics are the most commonly used ones. Additionally, we calculate the recently proposed BERTSCORE (Zhang et al., 2020) based on RoBERTa-large. The results are displayed in Table 1, col. 3-6. MF β scores (col. 7-12) are divided into the core Meaning (RESMATCH using GSII) and Form scores, and the combined MF β scores with β = 1 (harmonic) vs. β = 0.5, giving higher weight to Meaning.

RESMATCH upper-bound approximation
As an upper-bound approximation for RESMATCH we propose parsing a gold sentence s and comparing the result against the gold AMR: $apprUB = metric(parse(s), m_{gold})$. Essentially, this is the same score as used in canonical parser evaluation. This means that we would not expect the reconstruction $m' = parse(s')$ of a generated sentence to score higher than had we applied parse to the original sentence: $metric(parse(s'), m_{gold}) \leq metric(parse(s), m_{gold}) = apprUB$, where $s'$, $s$ are the generated and original sentence, $parse(s')$ the reconstructed AMR $m'$, and $m_{gold}$ the original AMR.

4.2 Enhanced interpretability of system rankings with MF β

Surface matching metrics are not very discriminative and lack interpretability Table 1 shows that the baseline metrics tend to agree with each other on the ranking of systems, but there also exist differences; for example, BERTSCORE and Meteor select M'20 as the best performing system while BLEU and chrF++ select W'20. While certain differences may be due to individual properties of the metrics as such, e.g., Meteor allowing inexact word matching of synonyms, in general the underlying factors are difficult to assess, since the score differences between the systems with switched ranks are rather small, and none of these metrics can provide us with a meaningful interpretation of their scores that would extend beyond shallow surface statistics. Therefore, these metrics cannot give us much intuition about why and when one system may be preferable over the other.
MF β yields more discriminative rankings We assess the MF β score with harmonic mean (β = 1; Table 1, col. 11) and with emphasis on Meaning (β = 0.5; Table 1, col. 12). We see that, while the overall rankings stay similar, the ∆s between system scores tend to grow. E.g., BERTSCORE assigns only 1.3 and BLEU 6.0 points difference between their selected best and worst systems, while the two MF β configurations assign 8.1 and more than 15 points in difference.
MF 1 and MF 0.5 align well with BERTSCORE Both configurations correlate strongly with BERTSCORE (96.7 Pearson's ρ with β = 0.5 and 93.2 Pearson's ρ with β = 1). Interestingly, it appears that this is mostly due to Form, which exhibits, in contrast to RESMATCH, a very good agreement with BERTSCORE (Form: 90 Pearson's ρ, RESMATCH: 63.4 Pearson's ρ). However, Form differs from the other metrics in the aspect that it assigns greater ∆s among some systems, which indicates that some systems are capable of producing sentences of significantly improved form. At this point, it is also important to recall that Form, in contrast to the other metrics, does not match two inputs; instead it bases its decisions solely on the generated sentences, without matching their tokens against a reference. Thus, the high agreement with BERTSCORE could support the view that BERTSCORE may be more form-orientated than perhaps one would assume (Mehri and Eskenazi, 2020). However, this does not mean that BERTSCORE ignores the meaning, a conclusion that is supported by an even better correlation with MF β, i.e., when we factor some Meaning into our MF β score.
(Decomposing) MF β can provide explanations for system strengths We have seen above that BERTSCORE incorporates both aspects, Meaning and Form, without separating them. However, because it intermingles these two aspects in a way that is hardly transparent, it cannot provide us with insight into whether systems have different strengths with respect to Form and Meaning. Here, it is important that Form and Meaning are disentangled, as much as possible, so that they can provide complementary views on our problem that could explain different system rankings. That our metric indeed captures such complementary views is supported by the correlation statistics in Table 2, where we see that RESMATCH indeed appears to measure some properties different from the other metrics, since it exhibits the lowest average agreement with all other metrics. Therefore, we may conclude that the weak correlation of Meaning and Form points towards an achievement of a key goal of this work: the disentanglement of Form and Meaning, and that different systems tend to be better in one aspect than the other (W'20 slightly favors Meaning, achieving first place in this aspect, while M'20 favors Form, Table 1); in Section 5.2 we will see that the latter (M'20) indeed appears to produce sentences of considerably better form.
Using RESMATCH based on Damonte et al. (2017) leads to interpretable rankings RESMATCH, when parameterized with the fine-grained AMR metrics by Damonte et al. (2017), gives us deeper insight into the performance differences of competitive systems with respect to specific semantic aspects.
The results are shown in Table 3. For example, when researchers aim at high-quality generation of named entities, they might be better served by the system ranked last overall (R'19), which improves upon the best overall system by 3.4 points in NER recall and 1.9 points in NER F1.
Furthermore, we see that the third best system according to all main metrics (Mb'20) may be less suited for correct negation generation. In this aspect, it lags behind the overall fourth best system C'20 by 5.8 points negation F1. We provide a full example, where RESMATCH explains a negation error, in Figure 4 in Appendix 7.2. In sum, the system of W'20 appears to be the clear winner in most aspects of meaning. This is intuitive, since the system has been trained with an auxiliary signal that provides information on how well an AMR can be reconstructed from the generated sentence. However, this system suffers in Form performance, ranking much lower compared to the M'20 systems, which is why it is ranked only third when using MF β (and BERTSCORE), cf. Table 1. In Section 5.2, we will conduct a native speaker study to assess whether this lack in Form performance is really as great as indicated by our Form score. Nevertheless, researchers who want to focus completely on Meaning may set β = 0, which discounts the form factor completely. Our evaluation shows that these researchers may then want to prefer the W'20 system for generation.
Finally, we see that the fine-grained metrics of Damonte et al. (2017) enhance our Meaning component with the capacity to provide interpretations for system ranks. Additionally, in the Appendix of this paper, we provide detailed examples of AMR reconstructions that lead to different rankings of single candidate sentences: in one case, RESMATCH explains SRL confusion (Appendix 7.3), in another, aspect confusion (Appendix 7.4).
The gap to the apprUB indicates ample room for improvement of AMR-to-text systems All metrics, including the surface matching metrics, e.g., BLEU or BERTSCORE, have a mathematical upper-bound, which is 100 points. However, this upper-bound is not well interpretable, since we cannot expect a system to score 100 points, and estimating true upper-bounds is extremely costly. RESMATCH, however, has an interpretable upper-bound (approximation): apprUB. It shows researchers that there is room for improvement of AMR-to-text generation systems (the gap to the best system according to RESMATCH (W'20) is more than 6 points in F1 and almost 10 points in recall).
Form, being disentangled from the distant source sentence, also shows that for most systems there is much room for improvement in the generation of well-formed and fluent sentences.
5 Pilot study II: Assessing vulnerabilities of MF β

MF β has two apparent vulnerabilities: first, it depends on a parser for reconstruction. Here, we have used a state-of-the-art parser that is on par with human IAA. However, we cannot exclude the possibility that it introduces unwanted errors into the evaluation scores of MF β. Second, the Form component is based on an LM, and we have seen that it can change system rankings, even when it is discounted (in Table 1, both MF β with β = 0.5 and β = 1.0 slightly disagree with the ranks assigned by Meaning only). On the one hand, our LM was carefully selected, and other metrics (e.g., BERTSCORE) also heavily depend on LMs. Yet, on the other hand, we cannot exclude the possibility that the changed rankings are unjustified.
In this pilot study, we investigate these weak spots more closely by first assessing the outcome of MF β when using another parser and discussing a mitigation of parser errors using a parse-quality control mechanism. Then we discuss the results of a human annotation study that assesses whether the rankings provided by Form are really justified.

The parser: Achilles' heel of MF β ?
Using another parser In this experiment we assess RESMATCH's robustness against using a different parser. This is an important point, since the metric and rankings could change with the parser, and/or users may have reasons to use different parsers for the reconstruction. Here, we would hope that the difference of using one competitive parser over another will not be too extreme. To investigate this issue, we use GPLA (Lyu and Titov, 2018), a neural graph-prediction system that jointly predicts latent alignments, concepts and relations. We select GPLA because it constitutes a technically quite distinct approach compared to GSII.
The results are shown in Table 4, in the columns labeled with GPLA and GSII, without a ♦. We see that RESMATCH GPLA and RESMATCH GSII tend to agree in the majority of the ratings (F1: Spearman's ρ = 0.95, Pearson's ρ = 0.96, p<0.001).
When considering MF β with β = 0.5, the vulnerability further decreases (Spearman's ρ = 0.96, Pearson's ρ = 0.99, p<0.001). Thus, we may conclude that RESMATCH exhibits some vulnerability towards the choice between these two quite different parsers, but the extent of this vulnerability does not appear critical.
While we see that using GPLA has little effect on the ranks, the nominal scores can differ substantially (e.g., W'20: 73.1 F1 using GPLA and 75.3 F1 using GSII). However, we see that the increments are almost uniform. Therefore, we conjecture that no system was treated unfairly by parameterizing our metric with another parser. An unfair treatment could have arisen, e.g., if a parser unjustifiably produced overly bad AMR reconstructions for specific systems. In such a case, the score increments would not be uniform. Hence, these increments are very likely to stem from the fact that we simply used a better parser, which is more benevolent to all generation systems.
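The parser-robustness check amounts to correlating the two score columns of Table 4; a quick sketch with scipy (the score lists below are made-up placeholders, not the paper's values):

```python
from scipy.stats import pearsonr, spearmanr

# System-level Meaning (F1) scores under two different parsers.
# Illustrative placeholder numbers only, not the paper's results.
resmatch_gsii = [75.3, 71.0, 68.2, 66.5, 64.9, 63.1, 58.4]
resmatch_gpla = [73.1, 69.2, 66.0, 64.8, 62.7, 61.5, 56.9]

print(pearsonr(resmatch_gsii, resmatch_gpla))   # linear agreement
print(spearmanr(resmatch_gsii, resmatch_gpla))  # rank agreement
```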
More quality control: parse quality assessment An assessment of the reconstruction quality of single parses would allow researchers to obtain confidence estimates for the scores provided by MF β, or one could conduct the evaluation only on a subset of generations for which we can be sure that the quality of the parse reconstruction lies above a certain level. To assess the potential of such a solution, we use a parse quality estimation system (Opitz and Frank, 2019; Opitz, 2020). We then retain all tuples of generated sentences where the estimated quality of the parse lies above 95% F1 score. This leaves us with 169 tuples, on which we run the evaluation.
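A sketch of this quality-controlled evaluation, where `parse` and `estimate_parse_quality` are hypothetical wrappers around a parser and a parse quality estimator (the exact interface of the cited estimator may differ):

```python
def quality_controlled_subset(examples, parse, estimate_parse_quality,
                              threshold: float = 0.95):
    """Keep only examples whose reconstructed parse is estimated to be
    of high quality (here: estimated Smatch F1 above `threshold`).

    examples               : iterable of (generated_sentence, source_amr)
    parse                  : callable, sentence -> AMR (hypothetical)
    estimate_parse_quality : callable, (sentence, AMR) -> estimated F1
                             (hypothetical wrapper around a parse
                             quality estimation system)
    """
    kept = []
    for sentence, source_amr in examples:
        reconstruction = parse(sentence)
        if estimate_parse_quality(sentence, reconstruction) >= threshold:
            kept.append((sentence, source_amr))
    return kept  # the evaluation then runs on this filtered subset
```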
The results are given in Table 4, in the columns labeled with a ♦. With high-quality parses ensured, the RESMATCH ranking of systems changes slightly (GSII vs. GSII♦: Pearson's ρ = 0.92, Spearman's ρ = 0.80), as does the ranking of MF β (GSII vs. GSII♦: Pearson's ρ = 0.95, Spearman's ρ = 0.96). However, even though the evaluation data were changed by the filtering step, the tendency of MF β in discriminating better systems from worse systems stays stable: over all settings, the two groups containing the highest-scored three systems and the lowest-scored four systems do not change.

The Form component of MF β
In Section 4.2, we have seen that the Form component of MF β can impact the system rankings.
We also saw that it tends to be in large agreement with BERTSCORE (not in the absolute scores, but in the rankings). However, BERTSCORE is mostly used in MT, and therefore we would like to assess whether the scores provided by Form are really justified when evaluating AMR-to-text.
Human annotation To investigate this, we ask a native speaker of English to annotate 50 paired sentences of M'20 and W'20 with respect to their structural well-formedness, considering only grammaticality and fluency. The annotator was explicitly asked not to consider whether a sentence 'makes sense', by presenting the Green ideas sleep furiously example as free from structural error. We give more details on these annotations and provide examples in 7.5. The annotator agreed in 42 of 50 pairs with the preference as predicted by GPT-2, which is a significant result (binomial test, p<0.000001). Additionally, we manually examine several produced sentences. We find that the M'20 and Mb'20 generations indeed appear considerably better on the surface level, compared to the generations of all other systems. For instance, the best system on the meaning level, W'20, frequently produces inflection mishaps: Their hopes for entering the heat is already in-sight, while we find few such violations with M'20 (here: Their hopes for entering the heat are already in sight). We also find adverbial errors to varying degrees, e.g., W'20 writes They are the most indoor training at home., while M'20 writes They are most trained indoors at home. Arguably, both of these sentences are not of perfect form (correct: mostly), but the second sentence is substantially more well-formed.

Table 5: Form scores (ranks in parentheses) when parameterizing Form with GPT-2 vs. BERT.

        R'19      G'20      Wb'20     C'20      Mb'20     M'20      W'20
GPT-2   51.6 (4)  47.1 (6)  49.5 (5)  51.9 (4)  74.0 (1)  69.8 (2)  55.7 (3)
BERT    43.4 (6)  40.6 (7)  50.4 (4)  44.7 (5)  71.4 (1)  71.0 (2)  55.9 (3)

Using a different LM The human study indicates that GPT-2 was mostly right when it favored one sentence over the other with respect to fluency and grammaticality. However, considering the recent trend to build systems by fine-tuning LMs, we need to assess whether such systems may be favored (too) much if Form is parameterized with the same or a highly similar LM compared to the LM these systems use for tuning. We find such a case in M'20: while they did not fine-tune the same GPT-2 which we used for Form prediction, they fine-tuned its siblings GPT-2-medium and GPT-2-large, which may share great structural similarities. Therefore, we also use BERT for our Form prediction. The results (Table 5) support the unambiguous conclusion from the human annotation: by large margins, both M'20 and Mb'20 deliver generations of significantly improved form, and both LMs agree on the group of the best three systems. Note that this insight can be provided by MF ∞, but it cannot be carved out by using the conventional metrics, since they prohibit us from disentangling Form and Meaning.
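The significance claim of the annotation study (42 of 50 agreements against a chance-level null) can be reproduced with a one-sided binomial test; a quick check with scipy:

```python
from scipy.stats import binomtest

# 42 of 50 annotator preferences agree with GPT-2's preference; under the
# null hypothesis the annotator agrees with GPT-2 by chance (p = 0.5).
result = binomtest(k=42, n=50, p=0.5, alternative="greater")
print(result.pvalue)  # ~6e-7, i.e. p < 0.000001 as reported
```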

Conclusion
We proposed the MF β score, a linguistically motivated metric for the evaluation of text generation from (abstract) meaning representations. The metric is built on two pillars: Form, which measures grammaticality and fluency of the produced sentences, and Meaning, which assesses how much meaning of the input AMR is reflected in the produced sentence. We saw that MF β allows for a fine-grained system performance assessment that goes beyond what surface matching metrics can provide. Specifically, the β-parameter allows researchers to decompose the metric into either of its two parts, paving the way for custom gauging and selection of text generation systems. We observed that the MF β score behaves similarly to BERTSCORE, but offers the possibility to factorize and focus on the meaning aspects disentangled from form properties, and bears the potential for score interpretability via fine-grained semantic system assessment.
Conversely, and in sharp contrast to BERTSCORE, the Form component of MF β enables an assessment of grammaticality and fluency that does not rely on a match of the generated sentences against their references, and thus offers an assessment independent of lexical alignment. A critical hyper-parameter of our metric is its dependency on the parser used for meaning reconstruction. To alleviate this issue, we used the latest state-of-the-art parser in our experiments. Additionally, we investigated this dependency by trying out a different parser and by controlling for parse quality. Our studies show that the absolute scores tend to increase when a better parser or only high-quality parses are used, but the ranking of systems stays quite stable. In future work, we want to investigate more ways of reconstruction quality control, e.g., using ensemble parsing. Furthermore, while benchmarking of systems needs deeper exploration, we consider the usage of MF β scores to obtain better diagnostics and explainability of generated texts another interesting use case.

Form predictor selection

To estimate how well the LMs are able to assess Form, we make use of human-assigned scores for data from the WebNLG task as provided by Gardent et al. (2017). It contains grammaticality and fluency judgments by humans for more than 2000 machine-generated sentences. We report the F1 score, both for grammaticality and fluency, by converting the human assessment scores to accept predictions and using them as a gold standard to evaluate the LM-based accept predictions over (i) all 12k sentence pairs and (ii) only the 5k sentence pairs where both grammaticality and fluency were either rated as 'perfect' (max. score) or 'poor' (min. score) by the human. The results are displayed in Table 6 and show (i) that the LMs lie very close to each other with respect to their capacity to predict fluency and grammaticality, and (ii) that both fluency and grammaticality can be predicted fairly well.

Table 6: Results for assessing the Form score prediction (corpus-level) of different LMs for NLG-generated sentences against human judgements (separated by grammaticality and fluency); all: all 12k generated sentences vs. 'poor/perfect': the 5k instances of best/worst generations in both grammaticality and fluency.
Based on this, we select GPT-2 for assessing Form, since it provides the best score on average, outperforming the other LMs in grammaticality prediction.

RESMATCH explains negation error
In Figure 4, both systems struggle to fully capture the meaning of the original AMR f(s). However, the system based on GPT-2-medium (Mb'20) erroneously expresses that we are not responsible and that we fear. However, quite the opposite is true: the gold graph and gold sentence state that there is responsibility and there is no fear. This important facet of meaning is better captured by C'20. The reconstruction shows that it reflects the gold negated concepts much better and does not distort facts that are core to the meaning. In consequence, the negation F1 is zero for the left sentence with the distorted facts and maximal for the sentence that stays true to the facts.