Translate, then Parse! A Strong Baseline for Cross-Lingual AMR Parsing

In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, the aim is to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data to learn a single model that is able to project non-English sentences to AMRs. However, we find that a simple baseline tends to be overlooked: translating the sentences to English and projecting their AMR with a monolingual AMR parser (translate+parse, T+P). In this paper, we revisit this simple two-step baseline and enhance it with a strong NMT system and a strong AMR parser. Our experiments show that T+P outperforms a recent state-of-the-art system across all tested languages: German, Italian, Spanish and Mandarin, with +14.6, +12.6, +14.3 and +16.0 Smatch points, respectively.


Introduction
Abstract Meaning Representation (AMR), introduced by Banarescu et al. (2013), aims at representing the meaning of a sentence in a semantic graph format. Nodes represent entities, events and concepts, while (typed) edges express their relations.
AMR itself, as of now, is English-focused: e.g., predicate frames are linked to the English PropBank (Kingsbury and Palmer, 2002). However, the abstract nature of AMR, and the fact that its graphs are not explicitly linked to syntactic structure, make it appealing for extracting the semantic structure of sentences in various languages. This insight led to the recent interest in a new task: cross-lingual AMR parsing (Damonte and Cohen, 2018). Here, researchers develop models to project sentences from different languages onto AMR graphs. Models that have recently been proposed are typically trained on large-scale silver data and learn to directly project the non-English sentences onto their AMR graphs (see Figure 1) (Damonte and Cohen, 2018; Blloshmi et al., 2020). However, there is an intuitive baseline that we argue has so far received too little attention: translate+parse (T+P). It first translates a sentence to a pivot language and then applies a monolingual parser for that language. In light of the rapid progress of both NMT and AMR parsing models for English, our hypothesis is that this baseline has become more effective and thus more realistic. Moreover, we argue that it could be beneficial to disentangle two key latent representations involved in the process of cross-lingual AMR parsing: i) one that translates between two natural languages, and ii) one that translates between a natural language and a meaning representation. This way, the cross-lingual AMR construction process is more transparent and can be better analyzed.
In our work, we test these hypotheses by translating the source language sentences into English with a strong NMT system and parsing the resulting English sentences with a strong AMR parser. We show that our baseline delivers strong performance in cross-lingual AMR parsing across all considered languages, outperforming task-focused state-of-the-art models in all settings. We also discuss a fairer evaluation of cross-lingual AMR parsing and relevant implications of this work for future research on the task.
We will release all code under a public license.


Related work

Cross-lingual AMR parsing Cross-lingual AMR parsing was introduced by Damonte and Cohen (2018).
They trained an alignment-based AMR parser model that leverages large amounts of parallel silver AMR data obtained through annotation projection from a curated parallel corpus. The authors also discussed translate+parse (T+P) as a baseline, using either the NMT systems Google Translate and Nematus (Sennrich et al., 2017) or the SMT system Moses (Koehn et al., 2007), together with a monolingual transition-based parser (Damonte et al., 2017). However, their best T+P approach used Google Translate (GT), which cannot be fully replicated by other researchers since both its training data and model structure are hidden. Given the recent advances in NMT (Barrault et al., 2019, 2020) and monolingual AMR parsing (Xu et al., 2020), where parsers now achieve scores on par with human IAA assessments (cf. Banarescu et al., 2013), we show that the time is ripe to put more of a spotlight on T+P. Blloshmi et al. (2020) address the problem from complementary perspectives: i) they train a system that projects AMR graphs from parsed English sentences to target sentences via a parallel corpus, yielding gold non-English sentences and silver AMRs. Conversely, ii) they train a system that employs an NMT system to translate English sentences from a human-annotated AMR dataset into another language, yielding pairs of silver non-English sentences and gold AMRs. This alleviates the dependency on external AMR aligners.
(Mono-lingual) AMR parsing Monolingual AMR parsing has equally made big strides in recent years, so that today AMR parsers deliver benchmark scores that are on par with measured human IAA. The latest step forward was achieved with neural sequence-to-sequence models that are pre-trained on large-scale MT benchmark data (Roberts et al., 2020; Xu et al., 2020) or fine-tune self-supervised seq-to-seq language models such as T5 or BART (Lewis et al., 2019; Bevilacqua et al., 2021). Previous models performed parsing with different techniques, e.g., predicting latent alignments jointly with nodes (Lyu and Titov, 2018), or via an iterative BFS writing traversal (Cai and Lam, 2019, 2020).
Translate, then parse!

Our pipeline model contains two components:

Sent-to-Sent: NMT system We use Helsinki-NLP's Opus-MT models (Tiedemann and Thottingal, 2020) to translate the sentences into English. The models are freely accessible and achieve high scores on public evaluation benchmarks.

Sent-to-AMR: AMR parser For parsing the English target sentences into AMR, we use the parser from amrlib, which consists of a T5 language model (Roberts et al., 2020) that has been fine-tuned on English sentences and their AMRs.
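As a minimal sketch, the two-step pipeline can be expressed as a generic function that composes any translator with any monolingual English AMR parser. The Opus-MT and amrlib wiring shown in the comments is an assumption about typical usage of those libraries, not part of this paper's released code:

```python
from typing import Callable, List, Tuple

def translate_then_parse(
    sentences: List[str],
    translate: Callable[[List[str]], List[str]],
    parse: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Two-step T+P: translate source sentences into English,
    then run a monolingual English AMR parser on the result."""
    english = translate(sentences)
    graphs = parse(english)
    return list(zip(english, graphs))

# In practice (assumed wiring; requires `transformers` and `amrlib`):
#   translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
#   stog       = amrlib.load_stt_model()   # T5-based sentence-to-graph model
#   translate_then_parse(
#       de_sents,
#       lambda s: [t["translation_text"] for t in translator(s)],
#       stog.parse_sents,
#   )
```

Keeping the two components behind plain callables is what makes the pipeline's latent representations separable, as argued above.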

Experiments
Data We employ the cross-lingual AMR parsing benchmark LDC2020T07. It was built from the test split of the English mono-lingual LDC2017T10 data by translating its sentences to four languages: German, Spanish, Italian and Mandarin Chinese. This amounts to a total of 5,484 AMR-sentence pairs, or 1,371 AMR-sentence pairs per language.
Evaluation metrics Our main evaluation metric is Smatch F1. The Smatch metric aligns the predicted graph with the gold graph and computes an F1 score that measures normalized triple overlap. Additionally, we calculate F1 scores for finer-grained core semantic sub-tasks (Damonte et al., 2017). In our analyses (§4.2), we also study results with S2MATCH (Opitz et al., 2020), which offers a potentially fairer evaluation in cross-lingual AMR parsing, since it does not penalize allowed paraphrases that may emerge in translation (e.g., huckleberry → Heidelbeere (DE) → blueberry).
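To make the triple-overlap F1 concrete, the following sketch scores two triple sets under a fixed variable alignment. Note this is a simplification: the real Smatch additionally searches (e.g., via hill-climbing) for the variable mapping that maximizes this score.

```python
def triple_f1(pred_triples, gold_triples):
    """F1 over AMR triples, assuming variables are already aligned.
    (Smatch proper also searches for the score-maximizing variable
    mapping; here the alignment is taken as given.)"""
    pred, gold = set(pred_triples), set(gold_triples)
    if not pred or not gold:
        return 0.0
    matched = len(pred & gold)       # triples present in both graphs
    precision = matched / len(pred)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A triple is a `(source, relation, target)` tuple such as `("a", ":instance", "want-01")` or `("a", ":ARG0", "b")`.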

Main results
Results are displayed in Table 1. Overall, our translate+parse baseline outperforms previous work by large margins. In all assessed semantic categories, T+P outperforms the XL-AMR models (Blloshmi et al., 2020) by more than 10 Smatch points. The smallest improvement is achieved for Italian, with +12.6 points.
In some key semantic categories, the differences are extreme. For negation detection, we obtain performance improvements that range from +26.5 points (IT) to +37.3 points (DE). Named entity recognition improves by +20.3 points for German, +20.4 points for Spanish, and +17.6 points for Italian.

Studies
Using a graded metric for evaluation When evaluating predicted AMRs against reference AMRs in cross-lingual AMR parsing, we are essentially comparing AMRs from sentences that are not exactly the same. This means that predicted concepts that are valid may get erroneously penalized by the evaluation metric. For instance, consider a German source sentence that contains Heidelbeere, for which our cross-lingual AMR system predicts either i) huckleberry or ii) blueberry. Depending on which concept is mentioned in the reference AMR graph (based on the unseen sentence from which the human SemBank annotator created this graph), only one of the two options will be counted as correct, which results in an unfair evaluation. To mitigate this, we propose to conduct the cross-lingual AMR evaluation using S2MATCH (Opitz et al., 2020), a metric that admits graded concept similarity. S2MATCH has a hyper-parameter τ that sets a threshold for sufficiently similar concept nodes across AMRs, using cosine similarity. The alignment of similar concepts can increase the final score. The default τ is 0.5, but we also try 0.0, which is less strict and fosters dense alignment.
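The graded matching idea can be sketched as follows. Exactly matching concepts get full credit; otherwise, the cosine similarity of the concepts' embeddings counts if it exceeds τ. The real S2MATCH folds this score into the alignment search; the toy embeddings here are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concept_match(c1, c2, emb, tau=0.5):
    """Graded concept match in the spirit of S2MATCH: exact matches
    score 1.0; otherwise the embedding cosine similarity counts if it
    reaches the threshold tau, else 0.0 (no alignment credit)."""
    if c1 == c2:
        return 1.0
    sim = cosine(emb[c1], emb[c2])
    return sim if sim >= tau else 0.0
```

With this scheme, a predicted blueberry aligned to a gold huckleberry receives partial (here, near-full) credit instead of being counted as a plain mismatch.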
The results are displayed in Table 2. Interestingly, the largest score improvements are obtained for German (+3.9 points) and Mandarin Chinese (+5.2 points). We conjecture that this is because there is slightly less variety in EN-{ES, IT} translations than for EN-DE, and especially for EN-ZH. This is also visible from the results of our baseline XL-AMR, which we re-evaluate using S2MATCH: the largest gains are obtained for Mandarin Chinese, with an improvement of more than 7 F1 points. Inspecting test cases manually, we find many cases where S2MATCH made the evaluation fairer. For instance, the following gold-pred (DE: German source word) concept tuples are ignored by SMATCH but considered by S2MATCH: pledge-promise (DE: 'versprechen'); write-compose (DE: 'verfasst'); strong-resolute (DE: 'deutlich'); spirit-ghost (DE: 'Geist'), etc. In all these cases, the cross-lingual AMR system predicted a correct concept, but was penalized by SMATCH. A concrete example, with lexical (see colored nodes) and structural (see dotted nodes) meaning-preserving divergences, is shown in Fig. 2. For future work on cross-lingual AMR parsing evaluation, we recommend additional evaluation with S2MATCH.
NMT quality The quality of our automatic translations is evaluated with two metrics: i) BLEU score (Papineni et al., 2002) and ii) S(entence-)BERT (Reimers and Gurevych, 2019), in order to assess surface-oriented as well as semantic similarity. For SBERT, we create sentence embeddings for both our translations and the English reference sentences and compute pair-wise cosine similarity.
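The SBERT-based evaluation can be sketched as a mean pairwise cosine similarity over pre-computed sentence embeddings. The encoder call in the comment is an assumed usage of the sentence-transformers library (model name illustrative only):

```python
import math

def avg_translation_similarity(trans_embs, ref_embs):
    """Mean pairwise cosine similarity between translation and
    reference sentence embeddings (one pair per test sentence).
    Embeddings would come from an SBERT encoder, e.g. (assumed API):
        model = SentenceTransformer("all-MiniLM-L6-v2")
        trans_embs = model.encode(translations)
        ref_embs   = model.encode(references)
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u))
                      * math.sqrt(sum(b * b for b in v)))
    sims = [cos(t, r) for t, r in zip(trans_embs, ref_embs)]
    return sum(sims) / len(sims)
```

Unlike BLEU, this score is insensitive to surface-form variation, which is why the two metrics are reported side by side.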
Looking at the quality of our MT outputs (Table 4), we see that translation quality is generally quite high. The moderate BLEU scores seem to result more from variation in surface form than from incorrect translations, which is backed by the high cosine similarity scores across languages (and also highlights the need for a fairer, graded AMR evaluation, as proposed above). Finally, comparing the different source languages, translation quality appears higher for German, Spanish, and Italian than for Mandarin Chinese. This is reflected not only in the BLEU scores, but also in the SBERT cosine scores, which suggest a higher semantic similarity between our translations from DE, ES and IT and the reference sentences.
(This is also supported by a manual analysis of samples.)

Semantic cross-lingual consistency of cross-lingual AMR systems A cross-lingual AMR system should be expected to deliver the same or highly similar AMRs for two sentences from different languages, if the sentences carry the same meaning. We may say that a system is semantically consistent if it complies with this expectation.
To measure the degree of consistency, we evaluate the outputs of a cross-lingual AMR system for input language X against the outputs of the same system when fed sentences in language Y, drawn from a parallel dataset (X, Y). In the standard evaluation, we computed EVAL(system(X), A) and EVAL(system(Y), A), where A are the target AMRs. In this experiment, we instead calculate EVAL(system(X), system(Y)), assessing the degree of consistency of a system.
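This protocol can be sketched generically; `system` is any sentence-to-AMR parser and `eval_fn` any graph-comparison metric (in the paper, S2MATCH), both passed in as assumed callables:

```python
def consistency(system, sents_x, sents_y, eval_fn):
    """Cross-lingual consistency: instead of scoring each language's
    output against gold AMRs, score the system's output for language X
    directly against its output for the parallel language Y."""
    out_x = [system(s) for s in sents_x]
    out_y = [system(s) for s in sents_y]
    scores = [eval_fn(gx, gy) for gx, gy in zip(out_x, out_y)]
    return sum(scores) / len(scores)
```

Note that no gold AMRs are needed: only a sentence-parallel corpus and a pairwise graph metric.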
The results are provided in Table 3, where we see a very clear picture that holds true both for our joint baseline (XL-AMR) and our T+P approach, and across all examined semantic categories: the highest consistency is achieved for Spanish-Italian (ES-IT, XL-AMR: 63.9 S2MATCH; T+P: 81.8 S2MATCH), while the lowest consistency is achieved for German-Mandarin Chinese (DE-ZH, XL-AMR: 48.7 S2MATCH; T+P: 66.7 S2MATCH). When directly comparing the parsing systems, T+P overall offers better consistency in all categories, especially negation. However, the substantial variance between languages may indicate either i) that there is a great necessity for making cross-lingual parsers more robust, or ii) that AMR representations, as constructed from English, may be better suited to representing (besides English) Spanish and Italian than, e.g., German or Chinese.

Discussion
We believe that the surprising effectiveness of translate+parse touches upon a key question: to what degree can AMR be considered an interlingua? On the one hand, Banarescu et al. (2013) explicitly state that AMR 'is not designed as an interlingua'. Indeed, AMRs created for English sentences do have a flavour of English, since they are partially grounded in the English PropBank (Kingsbury and Palmer, 2002). But linking AMRs to a PropBank of another language, e.g., Brazilian Portuguese (Duran and Aluísio, 2012) or Arabic (Palmer et al., 2008), and parsing non-English sentences into corresponding AMRs, would not solve, but only displace, the problem of being tied to a specific language's lexical semantic inventory (potentially, this may be mitigated in the future by linking AMR to cross-lingual PropBanks; Akbik et al., 2015). On the other hand, AMR does contain abstract meaning components that represent language phenomena we may consider universals: negation, occurrence of named entities, semantic events and their related participants, as well as semantic relations such as Possession, Purpose or Instrument (cf. Xue et al., 2014). We argue that this abstract structure again pushes AMR more towards an interlingua. Hence, the emergent interest in cross-lingual (A)MR (Oepen et al., 2020; Fan and Gardent, 2020; Sheth et al., 2021; Sherborne and Lapata, 2021) is well justified. However, even if AMR's inventory may favor an interlingual representation, we cannot, in general, expect a homomorphism between AMRs constructed from semantically equivalent sentences in various languages, given wide-spread phenomena that can preclude a uniform AMR representation, such as constructions involving head-switching or differences in lexical meaning.

Such a middle ground is indicated by our results: (too) much divergence may be involved when mapping non-English sentences to original EN-AMRs directly, which is penalized by the strict(er) SMATCH metric. We show that evaluation with the softer S2MATCH metric admits small deviations in the conceptual inventory of different languages. The fact that our indirect two-step approach T+P shows very strong performance also strengthens the view that AMR is not fully an interlingua. The better performance of T+P may in part be due to a capacity of strong NMT systems to neutralize some amount of inter-lingual divergence, so that evaluation against EN-AMRs can yield better results in this setting. Note that in our T+P approach, two important intermediate (latent) representations are clearly separated: one in the NMT model (which builds a bridge between two natural languages) and one in the parser (which builds a bridge between English and a language of meaning with a flavor of English). By analyzing divergences between source and target in the T step, we can uncover aspects of semantic representations that are not isomorphic between languages, and which, by transfer via translation, may be neutralized to match the pivot-flavored AMR structure. Hence, the T+P approach offers an ideal framework for studying interlingual similarities and divergences in cross-lingual AMR parsing, by comparing the structural-semantic divergences of non-English sentences and their translated English counterparts (aka translational divergences), with the aim of identifying structural-semantic differences between languages that can affect the cross-lingual mapping of sentences into a uniform interlingual AMR.

Conclusion
We revisited translate+parse, an intuitive baseline for cross-lingual AMR parsing. Equipped with a recent NMT system and a monolingual AMR parser, T+P outperforms other approaches by large margins across all evaluation settings. We propose to employ a graded metric for fairer evaluation of cross-lingual AMR parsing. Our work can serve as a strong baseline for future development of cross-lingual AMR parsers. Finally, the T+P approach provides an ideal platform for deeper assessment, analysis, and break-down of potential interlingual aspects of AMR.