Bootstrapping Multilingual AMR with Contextual Word Alignments

We develop high-performance multilingual Abstract Meaning Representation (AMR) systems by projecting English AMR annotations to other languages with weak supervision. We achieve this goal by bootstrapping transformer-based multilingual word embeddings, in particular those from cross-lingual RoBERTa (XLM-R large). We develop a novel technique for foreign-text-to-English AMR alignment, using the contextual word alignment between English and foreign-language tokens. This word alignment is weakly supervised and relies on the contextualized XLM-R word embeddings. We achieve highly competitive performance that surpasses the best published results for German, Italian, Spanish and Chinese.


Introduction
Abstract Meaning Representation graphs are rooted, labeled, directed, acyclic graphs representing sentence-level semantics (Banarescu et al., 2013). In the example shown in Figure 1, the sentence The boy wants to go is parsed into an AMR graph. The nodes of the AMR graph represent AMR concepts, which may include normalized surface symbols (e.g. boy), PropBank frames (Kingsbury and Palmer, 2002) (e.g. want-01, go-02), as well as other AMR-specific constructs. Edges in an AMR graph represent the relations between concepts. In this example, :arg0 and :arg1 correspond to standard PropBank roles.
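The graph in Figure 1 can be written out in PENMAN notation. The sketch below shows the standard graph for this sentence (variable names w, b, g are conventional choices, not taken from this paper) together with its triple view, which is the form that alignment and Smatch scoring operate on:

```python
# The AMR for "The boy wants to go" in PENMAN notation (the standard
# example from Banarescu et al., 2013). Variables w, b, g name graph nodes.
amr = """
(w / want-01
   :arg0 (b / boy)
   :arg1 (g / go-02
            :arg0 b))
"""

# The same graph as explicit triples: instance triples give node labels,
# role triples give labeled directed edges.
instances = [("w", "want-01"), ("b", "boy"), ("g", "go-02")]
edges = [("w", ":arg0", "b"), ("w", ":arg1", "g"), ("g", ":arg0", "b")]

# The graph is rooted and directed: the root is the variable that appears
# as an edge source but never as an edge target.
roots = {s for s, _, _ in edges} - {t for _, _, t in edges}
print(sorted(roots))  # → ['w']
```

Note the re-entrancy: b is the :arg0 of both want-01 and go-02, which is why AMRs are graphs rather than trees.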
One distinctive aspect of AMR annotation is the lack of explicit alignments between nodes in the graph and words in the sentence. Since such alignments are essential for training many present-day AMR parsers, there have been various efforts to link AMR concepts to their corresponding spans of words (Flanigan et al., 2014; Pourdamghani et al., 2014; Lyu and Titov, 2018; Chen and Palmer, 2017). A significant emphasis of this paper is on deriving these alignments for multilingual AMR parsers.

Figure 1: AMR graph for The boy wants to go and its German translation Der Junge will gehen. Implicit alignments between the English text and AMR concepts are denoted by dotted arrows. Explicit alignments between English and German texts are denoted by solid arrows.
Even though AMR is by nature biased towards English, recent work has evaluated its potential to work as an interlingua. Xue et al. (2014), among others, categorize and propose refinements for divergences in the annotation between English and Chinese as well as Czech AMRs. Anchiêta and Pardo (2018) import the corresponding AMR annotation for each sentence from the English annotated corpus and revise the annotation to adapt it to Portuguese. However, Damonte and Cohen (2018) show that it may be possible to use the original AMR annotations devised for English as representations for equivalent sentences in other languages without any modification, despite translation divergences. This defines the problem of multilingual AMR parsing that we address in this paper: given a sentence in a foreign language, recover the AMR graph originally designed for its English translation. We implement multilingual AMR parsers for German, Spanish, Italian and Chinese.
In this paper we propose that transformer-based multilingual word embeddings can be a useful tool for addressing the problem of multilingual AMR parsing. Besides using contextual word embeddings as input token embeddings, we leverage them for annotation projection, where existing AMR annotations for English are projected to a target language using contextual word alignments. In our experiments, we employ XLM-RoBERTa large (Conneau et al., 2019) as the multilingual pre-trained transformer model. We show that our proposed procedure achieves results competitive with some of the classical methods for text-to-AMR alignment. Furthermore, such a procedure is easily scalable to the 100 languages that XLM-R is trained on.
We also combine different techniques for concept alignments and AMR parser training which significantly improve performance over the base models. For concept alignment, we combine the proposed contextual word alignments with previously established alignment techniques utilizing matching rules tailored to AMR as well as machine translation aligners (Flanigan et al., 2014;Pourdamghani et al., 2014). For AMR parser training, we pre-train an AMR parser on the treebanks of different languages simultaneously and subsequently finetune on each language. This is analogous to the techniques used for silver data pre-training (Konstas et al., 2017;van Noord and Bos, 2017) in AMR parsing and multi-lingual pre-training (Aharoni et al., 2019) in machine translation.
Finally, we conduct a detailed error analysis of the multilingual AMR parsing. One of the major errors we have found involves synonymous concepts, which share the same meaning as the original English concepts but differ in spelling. While this error is mainly caused by the fact that the multilingual word embeddings bridge non-English input tokens to English concepts, it also highlights the highly lexical nature of Smatch scoring, which does not take synonymous concepts into consideration. We also elaborate upon an error analysis directly comparing our proposed annotation projection method using contextual word alignment against a previous baseline using fast align.
The rest of the paper is organized as follows: In Section 2, we discuss related work. In Section 3, we present our main proposal on annotation projection based on contextual word alignments. In Section 4, we describe various combination approaches that significantly improve multilingual parser performance; these include combining word-to-concept alignments, using multilingual treebanks and combining human-annotated and synthetic treebanks. In Section 5, we discuss experimental results. In Sections 6 and 7, we present detailed error analyses. We conclude the paper in Section 8.

Related work
Multilingual AMR. There have been significant advances in AMR parsing for languages other than English. Previous studies (Xue et al., 2014; Migueles-Abraira et al., 2018; Sobrevilla Cabezudo and Pardo, 2019) investigated AMR annotations for a variety of languages such as Chinese, Czech, Spanish and Brazilian Portuguese. Vanderwende et al. (2015) automatically parse the logical representation for sentences in Spanish, Italian, German and Japanese, which is then converted to AMR using a small set of rules.
While much of this work, along with studies such as Li et al. (2016) and Anchiêta and Pardo (2018), produces AMR graphs whose nodes are labeled with words from the target language, Damonte and Cohen (2018) developed AMR parsers for English and used parallel corpora for annotation projection to train Italian, Spanish, German, and Chinese parsers that recover the AMR graph originally designed for the English translation. Their main results showed that the new parsers can overcome certain structural differences between languages.
Similar to Damonte and Cohen (2018), we also train multilingual AMR parsers by projecting English AMR annotation to target foreign languages (German, Spanish, Italian and Chinese), but we depart from their approach in the specifics of the annotation projection by exploring contextual word alignments directly derived from multilingual contextualized word embeddings. While both procedures utilize parallel corpora, the annotation projection of Damonte and Cohen (2018) requires additional supervised training of their statistical word aligner; our proposed contextualized word alignment, in contrast, is unsupervised in nature. Alternatively, a recent study by Blloshmi et al. (2020) showed that one may in fact not need alignment-based parsers for cross-lingual AMR, instead modelling concept identification as a seq2seq problem. In this paper, we compare our results to both Damonte and Cohen (2018) and Blloshmi et al. (2020).
Word vector alignment techniques. Traditional word alignment methods often use parallel corpora and the IBM alignment models (Brown et al., 1990, 1993) as well as their improved versions (Och and Ney, 2003; Dyer et al., 2013). More recently, there has been an advent of techniques that align vector representations of words under varying levels of supervision (Ruder et al., 2019). Often word vectors are learned independently for each language, and a mapping from source-language vectors to target-language vectors is then developed with a bilingual dictionary (Mikolov et al., 2013; Smith et al., 2017; Artetxe et al., 2017). To reduce the need for bilingual supervision, a recent body of work (Conneau et al., 2018; Schuster et al., 2019; Artetxe et al., 2018) employed the iterative method of starting from a minimal seed dictionary and alternating with learning the linear map.
The work most similar to ours is Cao et al. (2020), where the authors obtain contextual embedding alignments from multilingual BERT (Devlin et al., 2018; Pires et al., 2019) and subsequently improve the alignments via finetuning on supervised parallel corpora. Our contextual word alignment between two parallel sentences may be thought of as an adaptation of their contextual word retrieval task. However, we refrain from any finetuning of the contextual embeddings and show that the contextual word alignments from the off-the-shelf XLM-R model achieve results competitive with the word alignments of fast align (see Damonte and Cohen (2018)). This suggests the potential for inexpensive, massive scaling of AMR parsing up to the 100 languages on which XLM-R is trained.

Annotation projection
We adopt a transition-based parsing approach for AMR parsing, following Ballesteros and Al-Onaizan (2017) and Naseem et al. (2019). Such parsers produce an AMR graph g from an input sentence s by instead predicting an action sequence a from s as a sequence-to-sequence problem. This action sequence, applied to a state machine M, then produces the desired target graph as g = M(a, s). Transition-based parsers require the action sequence for each graph in the training data; this is determined by a rule-based oracle a = O(g, s), which relies on external word-to-node alignments. In all subsequent experiments we use the oracle and action set from Fernandez Astudillo et al. (2020).
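The g = M(a, s) idea can be illustrated with a toy state machine. The PRED/LA/SHIFT actions below are simplified stand-ins, not the actual oracle or action set of Fernandez Astudillo et al. (2020):

```python
# Toy illustration of g = M(a, s): replay an action sequence to build a
# graph. PRED emits a concept node, LA draws a labeled arc from the top
# stack node to the one below it, SHIFT advances over the input.
def run_machine(actions):
    nodes, edges, stack = [], [], []
    for act in actions:
        if act == "SHIFT":                      # move past the current token
            continue
        if act.startswith("PRED("):             # emit a concept node
            nodes.append(act[5:-1])
            stack.append(len(nodes) - 1)
        elif act.startswith("LA("):             # arc: top node -> second-top node
            edges.append((stack[-1], act[3:-1], stack[-2]))
    return nodes, edges

# Oracle-style actions for part of "The boy wants to go".
actions = ["SHIFT", "PRED(boy)", "SHIFT", "PRED(want-01)", "LA(:arg0)"]
nodes, edges = run_machine(actions)
print(nodes, edges)  # → ['boy', 'want-01'] [(1, ':arg0', 0)]
```

The oracle's job is the inverse direction: given the gold graph, the sentence, and word-to-node alignments, produce an action sequence that replays to exactly that graph.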

Projection method
In order to train AMR parsers in a non-English language, we use annotation projection to leverage existing English AMR annotation and overcome the resource shortage in the target language. First, the English text is aligned to the corresponding AMR concepts using both the rule-based JAMR aligner (Flanigan et al., 2014) and an IBM-model-style aligner (Pourdamghani et al., 2014); the latter will henceforth be referred to as the EM aligner. Given the English text-to-AMR concept alignments, we then project them to the target language using word alignment. In the following subsection we describe the proposed word alignment method, called contextual word alignment, which is trained in a weakly supervised manner.
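The projection step is essentially a composition of two maps. A minimal sketch with hypothetical toy indices (the real system derives the first map from the JAMR/EM aligners and the second from the contextual word alignment described next):

```python
# Compose English-token -> concept alignments with English -> target word
# alignments to obtain target-token -> concept alignments.
def project_annotations(en_to_concept, en_to_fr):
    """en_to_concept: {en_idx: concept}; en_to_fr: {en_idx: fr_idx}."""
    fr_to_concept = {}
    for en_idx, concept in en_to_concept.items():
        if en_idx in en_to_fr:                 # English token has a target-side match
            fr_to_concept[en_to_fr[en_idx]] = concept
    return fr_to_concept

# "The boy wants to go" -> "Der Junge will gehen" (indices are illustrative)
en_to_concept = {1: "boy", 2: "want-01", 4: "go-02"}
en_to_fr = {0: 0, 1: 1, 2: 2, 4: 3}
print(project_annotations(en_to_concept, en_to_fr))
# → {1: 'boy', 2: 'want-01', 3: 'go-02'}
```

A concept whose English token has no target-side alignment is simply left unaligned at this stage, which is what motivates the combination approaches of Section 4.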

Contextual word alignments
Given two languages, we align word pairs within parallel sentences if their vector representations derived from the underlying multilingual pre-trained model are similar according to cosine distance. As vector representation we use the average of all 24 layers of the XLM-R large contextual embeddings. We will refer to this average as the word's contextual embedding henceforth for simplicity.
More precisely, suppose we have two parallel sentences: E = e_0, e_1, ..., e_M in English and F = f_0, f_1, ..., f_N in the target language. We use r to represent the pre-trained multilingual model, such that r(S)_i is the contextual embedding of the i-th word in sentence S. A word e_i ∈ E is then contextually word aligned to f_j if and only if the cosine similarity score between their word embeddings is the highest. Thus we define the corresponding contextual alignment function χ(f_j | e_i) as

    χ(f_j | e_i) = 1  iff  j = argmax_k cos(r(E)_i, r(F)_k).    (1)

Figure 2: Annotation projection is achieved using JAMR and EM aligners for English text-to-AMR concept alignment and contextual word alignment between tokens of the source (English) and target languages.
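The alignment rule above amounts to a row-wise argmax over a cosine similarity matrix. A minimal sketch with hand-made toy vectors standing in for the layer-averaged XLM-R embeddings:

```python
import numpy as np

def contextual_align(E_emb, F_emb):
    """chi(F|E): align each English token to the target token whose
    contextual embedding has the highest cosine similarity.
    E_emb: (M, d) and F_emb: (N, d) arrays of word embeddings."""
    E = E_emb / np.linalg.norm(E_emb, axis=1, keepdims=True)
    F = F_emb / np.linalg.norm(F_emb, axis=1, keepdims=True)
    sim = E @ F.T                      # (M, N) cosine similarity matrix
    return sim.argmax(axis=1)          # for each e_i, the best f_j

# Toy 3-dimensional vectors (assumption: real embeddings are averages of
# the 24 hidden layers of XLM-R large, with a much higher dimension).
E_emb = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
F_emb = np.array([[0.1, 0.9, 0.0],    # closest to e_1
                  [0.0, 0.2, 0.8],    # closest to e_2
                  [0.9, 0.1, 0.1]])   # closest to e_0
print(contextual_align(E_emb, F_emb))  # → [2 0 1]
```

Note that the argmax always fires: every English token gets some target-side partner, which is why this alignment has near-total coverage (Section 7).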
As an example, the following are sentences from our English and German training datasets:

E: Establishing models in industrial Innovation
F: Etablierung von Modellen in der industriellen Innovation

Figure 2 pictorially illustrates our complete annotation projection method using the contextual word alignment χ(F|E). English tokens and AMR concepts are aligned using the JAMR and EM aligners. The resulting AMR annotation, augmented with English word-to-concept alignments, is then projected onto the given target language using contextual word embeddings. Henceforth, for brevity, we will at times refer to this approach as A.P.

Combination approaches
We apply three types of combination techniques to the multilingual AMR parsers trained by projecting English annotations using contextual word alignments derived from the multilingual contextual word embeddings; each technique improves parser performance significantly.

Alignment combination
One such technique is to combine the contextual word alignment based A.P. with the baseline word-to-concept alignment, which aligns the target tokens directly to AMR concepts using the JAMR and EM aligners. Since the EM aligner is an unsupervised method, it can be directly applied to the target language tokens and English AMR concepts. However, we note that this baseline alignment approach gives incomplete coverage (87% of concepts aligned to German, 88% to Italian and 91% to Spanish tokens). Thus, we supplement it by aligning the remaining concepts using the A.P. of Figure 2.

Figure 3: Illustration of the EM, JAMR + A.P. combination alignment: first align target tokens to AMR concepts using the JAMR + EM aligners, with any remaining concepts then aligned using the annotation projection method proposed in Figure 2.
For example, suppose we have as before two parallel sentences E = e_0, ..., e_M in English and F = f_0, ..., f_N in the target language, as well as AMR concepts N = n_0, ..., n_L. Annotation projection aligns a concept to a target token via the English side:

    AP(f_i | n_j) = 1  iff  n_j is aligned to some English token e_k and χ(f_i | e_k) = 1.    (2)

Then one of our proposed foreign text-to-AMR concept combination alignment procedures, EA(f_i | n_j) (see Figure 3), is defined as

    EA(f_i | n_j) = BA(f_i | n_j) if BA aligns n_j to some token; otherwise AP(f_i | n_j),    (3)

where BA(f_i | n_j) represents that the j-th concept is aligned to the i-th token in F by the baseline aligner BA. That is, for any concept n_j ∈ N with BA(f_i | n_j) = None, we use annotation projection to align it.

We also experiment with other such alignments, in particular using the intersection of the cosine alignments, χ(F|E) ∩ χ(E|F), as the contextual word alignment. In this case,

    iAP(f_i | n_j) = 1  iff  n_j is aligned to e_k and both χ(f_i | e_k) = 1 and χ(e_k | f_i) = 1.    (4)

As before, for all n_j ∈ N where iAP(f_i | n_j) = None, we align n_j using the baseline aligner BA(f_i | n_j). For any further remaining unaligned concepts, we employ

    maxAP(f_i | n_j) = AP under max(χ(F|E), χ(E|F)).    (5)

That is, we pick the uni-directional contextual word alignment with the higher score and project the AMR annotation accordingly.
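In sketch form, the EM, JAMR + A.P. combination is a simple fallback: keep the baseline aligner's decision wherever it covers a concept, and fill the gaps with annotation projection (the toy dictionaries below are hypothetical):

```python
# EM, JAMR + A.P. combination alignment in fallback form: use the
# baseline aligner BA where it covers a concept, otherwise use the
# annotation projection AP. ba/ap map concept -> target index or None.
def combined_align(concepts, ba, ap):
    out = {}
    for n in concepts:
        out[n] = ba.get(n) if ba.get(n) is not None else ap.get(n)
    return out

# Hypothetical coverage gap: BA handles 'want-01' and 'boy', misses 'go-02'.
ba = {"want-01": 2, "boy": 1, "go-02": None}
ap = {"want-01": 2, "boy": 1, "go-02": 3}
print(combined_align(["want-01", "boy", "go-02"], ba, ap))
# → {'want-01': 2, 'boy': 1, 'go-02': 3}
```

The intersection-based variant works the same way but with the fallback chain reversed: the high-precision intersected projection first, then the baseline aligner, then the max-score projection for anything still unaligned.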

Multilingual treebank combination
In addition to training the parser on the treebank of each language (derived from the English treebank via annotation projection), we also experiment with combining all the target language treebanks into a single multilingual treebank. We notice that pre-training an AMR parser on this multilingual treebank, with subsequent finetuning on the treebank of each language, improves performance over a parser trained only on each individual treebank.

Human and synthetic treebank combination
We create a synthetic AMR corpus by parsing 85k unlabeled sentences from the context portion of SQuAD-2.0. The resulting synthetic AMR graphs are filtered, following prior work, and combined with the AMR-2.0 training set (LDC2017T10) to produce an expanded AMR-2.0 + SQuAD training dataset of 94k sentences. We then project annotations of this expanded English treebank onto each of the target languages and train the corresponding target language parser. We observe that despite the lower quality of the synthetic AMRs compared to their human-annotated counterparts, their inclusion in the training set significantly improves parser performance.

AMR Parser and Data
For our experiments, we use the stack-Transformer model (Fernandez Astudillo et al., 2020) as our AMR parser. The stack-Transformer is a transition-based parser with a modified Transformer architecture that encodes the parser state. It uses a cross-entropy loss function and hyper-parameters similar to those of the machine translation models described in Vaswani et al. (2017). We decode our models with a beam size of 3 and evaluate them using Smatch scores. Model performance values in this manuscript are averages over the best performing models across 3 random seeds. Lastly, the input to the parser, the vector representation of each word, is obtained by averaging not only over all 24 layers of the pre-trained XLM-R large contextual embeddings but also over the constituent wordpieces within each word.

For all four languages (German, Spanish, Italian and Chinese) we experiment on AMR1.0 (LDC2015E86); for the first three we also experiment on AMR-2.0 (LDC2017T10). Results from the former are compared to Damonte and Cohen (2018) and from the latter to Blloshmi et al. (2020). Details of our training, dev and test sets are given in Table 1.

To train each target language parser, we first translate the input sentences of AMR-2.0 and AMR-1.0 with Watson Language Translator. This creates the supervised parallel corpus which we then use for our unsupervised annotation projection via contextual word alignment. We also align target language tokens directly to AMR concepts using the JAMR and EM aligners, for baseline system evaluation and for combination alignments. We select the best performing models using the dev set. Finally, for our best models, we report results using machine as well as human translations (LDC2020T07) of the test sets.
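The wordpiece-to-word averaging can be sketched as follows (the piece segmentation shown is a hypothetical example; the real embeddings are additionally averaged over all 24 XLM-R layers):

```python
import numpy as np

def word_embeddings(piece_embs, word_ids):
    """Average subword (wordpiece) embeddings into one vector per word.
    piece_embs: (num_pieces, d) embeddings; word_ids maps each piece to
    the index of the word it belongs to."""
    n_words = max(word_ids) + 1
    out = np.zeros((n_words, piece_embs.shape[1]))
    counts = np.zeros(n_words)
    for emb, w in zip(piece_embs, word_ids):
        out[w] += emb
        counts[w] += 1
    return out / counts[:, None]

# Suppose "Etablierung" splits into two pieces (hypothetical segmentation);
# both pieces map to word 0, the third piece is its own word.
pieces = np.array([[1.0, 3.0], [3.0, 1.0], [2.0, 2.0]])
print(word_embeddings(pieces, [0, 0, 1]))
# → [[2. 2.]
#    [2. 2.]]
```

This gives one vector per whitespace-level word, which is the unit both the parser input and the contextual word alignment operate on.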

Baselines
Our first baseline is zero-shot learning, where we train on the English dataset but test on a foreign language dev set (Baseline I). The reason behind this experiment is to test the ability of the XLM-R contextual word embeddings to capture the meaning of a given token irrespective of the underlying language; note that it is only in this experiment that the languages of the train and dev sets differ. In another set of experiments, we align the target language tokens directly to the AMR concepts using only the JAMR and EM aligners (Baseline II). Lastly, we also test the annotation projection procedure of Damonte and Cohen (2018). Note that while those authors use fast align (Dyer et al., 2013) for word alignment between the parallel data and only the JAMR aligner for the English text-to-AMR alignment, in Baseline III we utilize fast align in conjunction with both the JAMR and EM aligners (for English text-to-AMR alignment) for improved performance.

Results
Table 2 compares our different proposed approaches to the three baseline methods on the AMR2.0 and AMR1.0 datasets. We see that our proposed approach, annotation projection with contextual word alignment (here using χ(F|E)), shows fairly competitive results with those of Baseline III for the target languages of German, Italian and Spanish, especially when applied to the smaller corpus of AMR1.0. This is remarkable considering that our method requires no additional training and can easily be generalized for zero-shot learning on all the languages that XLM-R was pretrained on. We then train several parsers using our suggested combination approaches. The first such method comprises both the EM, JAMR + A.P. aligners (see Eq. 3).
In a different approach, we use the intersection cosine word alignment based annotation projection (i.e. χ(F|E) ∩ χ(E|F)). Since this leaves many AMR concepts unaligned, we follow it by aligning concepts using the baseline JAMR and EM aligners. Any leftover unaligned concepts are then aligned using max(χ(E|F), χ(F|E)) (Eq. 5).

In another set of experiments, we pre-train a parser on a multilingual treebank, where the train set is a combination of the LDC treebank in all target languages. The parser is then finetuned on each individual language. We surmise that such an experiment will give us a truly multilingual parser capable of successfully decoding all the target languages. Its strength is evident in its performance: it outperforms all our baseline approaches, in the case of the AMR1.0 dev set by at least 1.4 points. Finally, in the last two experiments on AMR2.0, we train on the language-specific LDC + SQuAD train set. We see that this gives us our best performing parsers, where the training data is aligned using the combination (EM, JAMR + A.P.) alignment.

We test a subset of the AMR2.0 models and all of the AMR1.0 models on the corresponding test sets. The results are shown in Tables 3 and 4. For AMR1.0, while all of our models including the baselines outperform previously published results, the best performing model is the parser trained on multilingual data whose training input text was aligned to its AMR concepts using the combination of the EM, JAMR and A.P. aligners. For AMR2.0, models trained on the LDC + SQuAD dataset outperform those trained on multilingual data; both outperform the recently published work of Blloshmi et al. (2020). We note that the parser performs better on the machine translated test data than on the human translated data.
This should be attributed to the training and testing condition mismatch of the human translated test data, since all models are trained on machine translated training data. For instance, the out-of-vocabulary (oov) ratio of the human translated test data is consistently higher than that of the machine translated test data: for AMR1.0, the oov ratio of human vs. machine translated test data is 10.2% vs. 9% for German, 7.3% vs. 6.8% for Spanish, 8.1% vs. 7.6% for Italian and 7.6% vs. 5.5% for Chinese. (We did not run experiments with the LDC + SQuAD dataset on AMR1.0, since our primary reason for running experiments on AMR1.0 was to compare our results more directly to Damonte and Cohen (2018).)

Figure 4: Histogram of different kinds of errors.
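The oov ratio used above is straightforward to compute; a minimal sketch (the German test tokens below are invented for illustration):

```python
def oov_ratio(train_tokens, test_tokens):
    """Fraction of test tokens unseen in the training vocabulary, used to
    quantify the mismatch between machine- and human-translated test data."""
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)

train = ["der", "junge", "will", "gehen", "in", "die", "stadt"]
test = ["der", "knabe", "moechte", "gehen"]   # hypothetical human translation
print(oov_ratio(train, test))  # → 0.5 ("knabe" and "moechte" are unseen)
```

A human translator's freer lexical choices push this ratio up relative to the machine translation the models were trained on, which is consistent with the score gap observed.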

Error analysis
We carried out an error analysis of 56 German sentences parsed by the best performing model trained on the combination of AMR2.0 and SQuAD training data. Statistics of the various errors are depicted in Figure 4. The five most frequent errors are (i) introduction of synonymous concepts, (ii) missing concepts, (iii) incorrect roles, (iv) target tokens in AMR concepts, and (v) incorrect parsing of multi-sentence inputs as instances of conjunction.

Synonymous concepts
The most common error we encounter is synonymous AMR concepts, as shown in Figure 5. Comparing the expected graph (top) to the parsed version (bottom), we note that the concept previous is synonymized to past. While this error is mainly caused by the fact that the multilingual word embeddings bridge non-English input tokens to English concepts, it also highlights the highly lexical nature of Smatch scoring, which does not take synonymous concepts into consideration. Given that AMR is supposed to represent the core meaning of a sentence regardless of its syntactic and morphological variations, Smatch scoring should be able to capture lexical variations such as synonymous concepts.

Figure 5: Gold and parsed AMR graphs for the German sentence Was ist in dieser Umgebung falsch, wenn sie die bisherige stupeftende Propaganda ein bisschen kritisieren? (What is wrong in this environment if they criticize the previous stupefying propaganda a bit?)
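The lexical penalty can be seen in a stripped-down triple-overlap score. Full Smatch additionally searches over variable mappings; the sketch below assumes variables are already matched and scores triples by exact string match, which is exactly what makes past vs. previous count as an error:

```python
def triple_f1(gold, pred):
    """Triple-level F1 in the spirit of Smatch (variable mapping omitted:
    variables here are assumed already matched). Exact string matching
    means a synonymous concept scores zero despite identical meaning."""
    g, p = set(gold), set(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

gold = [("x", "instance", "previous"), ("y", "instance", "propaganda"),
        ("y", "mod", "x")]
pred = [("x", "instance", "past"), ("y", "instance", "propaganda"),
        ("y", "mod", "x")]
print(round(triple_f1(gold, pred), 3))  # → 0.667 despite equivalent meaning
```

A synonym-aware variant would need to match concept labels through a lexical resource rather than by string equality, which is the direction the discussion above argues for.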
Figure 6: Gold and parsed AMR graphs for the sentence In critical moments, we are all descendants of Yan emperor and Huang emperor.

Missing concepts and incorrect roles
Some concepts are missing in the parsed AMR, such as stupefy-01 in Figure 5. The parser also incorrectly identifies relations between concepts. In Figure 5, arguments ARG1 and ARG2 for concept wrong-02 are swapped. In Figure 6, the relation :source is replaced by frame argument ARG1.

Incorrect parsing of Multi-sentence
Another frequent error is the incorrect parsing of multi-sentence inputs as instances of conjunction, especially when sentences are demarcated by commas. Note that multi-sentence errors are not specific to multilingual parsing and occur frequently when parsing English input sentences as well. This error is mostly caused by the ambiguity of commas, which can subsume various semantics depending on context, across languages.

Misrecognition of foreign token as a named entity
Some target tokens may legitimately be realized in the gold AMR, especially when the target tokens are named entities, e.g. Frankfurt, Anna, Noah, etc. This often leads to errors in the parsed AMR when a target token is incorrectly recognized as a named entity. In Figure 6, the German token Kaisers is incorrectly parsed as part of the named entities Yan Kaisers and Huang Kaisers. The failure to capture the correct concept emperor for the German token Kaisers leads to a subsequent error of not reifying the role to have-org-role-91, evident in the comparison of the parsed AMR with the gold AMR.

Others
Other errors include lack of stemming in the target language, such as Kaisers in Figure 6. Stemming errors are mostly caused by the fact that we have not incorporated target language stemmers, whereas we have incorporated spaCy for English. Some errors are caused by machine translation: the English fragmentary input taking a look is translated to Sehen Sie sich, which is then incorrectly parsed as an imperative sentence. Nominal target language tokens often fail to invoke predicates. Given the input in English "cultural tyranny in the cloak of nationalism", tyranny invokes the predicate tyrannize-01; its German counterpart Tyrannei, however, fails to invoke the predicate in "kulturellen Tyrannei im Mantel des Nationalismus".

Word alignment error analysis
We compared the annotation projection for AMR1.0 between fast align and the contextual alignment. As noted in Table 3, they perform comparably for German, Italian and Spanish. However, on detailed analysis we notice that annotation projection using contextualized alignments has greater coverage in terms of foreign text-to-AMR alignments than fast align (e.g., for German, contextual alignment A.P. gives 99.95% coverage compared to 97.47%). This is likely because fast align is based on an IBM alignment model, which relies on expected counts of alignment pairs and uses additional alignment constraints, whereas contextualized alignment relies on the unrestricted pairing by cosine distance of the XLM-R contextual word embeddings of the input tokens. Given an English token, the contextualized alignment necessarily aligns it to a foreign language word. Furthermore, since the embeddings are contextual and pre-trained on large amounts of data, they are robust to infrequent alignment pairs. The coverage difference between contextualized alignment and fast align is most noticeable for compounds. A German counterpart of the English non-tariff is nichttarifäre. While contextualized alignment aligns nichttarifäre to non, which is subsequently aligned to the concept "-" for polarity, fast align leaves nichttarifäre unaligned. This difference is evidenced in the parser performance on negations realized in diverse morphologies. Comparing the AMR1.0 parser performance on negations between fast align (Baseline III in Table 3) and the contextualized alignment (A.P. in Table 3), we find that contextualized alignment consistently outperforms fast align across the three European target languages, as shown in Table 5.

Conclusion and future directions
In this paper we propose to use transformer-based multilingual word embeddings for annotation projection of AMR annotations. We show that our proposed procedure achieves results competitive with some of the classical methods for text-to-AMR alignment. We apply combination techniques to concept alignments and AMR parser training, which significantly improve performance over the base models. We also provide a detailed error analysis of the multilingual AMR parsing.
Given pre-trained transformer-based multilingual word embeddings, contextual word alignment proves to be a useful avenue for overcoming differences amongst languages and addressing the multilingual AMR problem with weak supervision. Moreover, our annotation projection procedure not only achieves a highly competitive performance for German, Spanish, Italian and Chinese but also permits zero-shot learning to other languages included in the training set of the underlying XLM-R multilingual transformer.
Future work may include diversifying input texts using AMR2text (Mager et al., 2020) generation which can address the difference in results between machine translated and human translated test data. The potential of the AMR parser to overcome translation divergence also points to its utility in an end-to-end multilingual translation system, bypassing the need for supervised parallel corpora for machine translation system training.