XLPT-AMR: Cross-Lingual Pre-Training via Multi-Task Learning for Zero-Shot AMR Parsing and Text Generation

Due to the scarcity of annotated data, Abstract Meaning Representation (AMR) research is relatively limited and challenging for languages other than English. Building on the availability of an English AMR dataset and English-to-X parallel datasets, in this paper we propose a novel cross-lingual pre-training approach via multi-task learning (MTL) for both zero-shot AMR parsing and AMR-to-text generation. Specifically, we consider three types of relevant tasks: AMR parsing, AMR-to-text generation, and machine translation. We hope that knowledge gained while learning English AMR parsing and text generation can be transferred to their counterparts in other languages. With properly pre-trained models, we explore four different fine-tuning methods, i.e., vanilla fine-tuning with a single task, one-for-all MTL fine-tuning, targeted MTL fine-tuning, and teacher-student-based MTL fine-tuning. Experimental results on AMR parsing and text generation for multiple non-English languages demonstrate that our approach significantly outperforms a strong pre-training baseline and greatly advances the state of the art. In detail, on LDC2020T07 we achieve 70.45%, 71.76%, and 70.80% Smatch F1 for AMR parsing of German, Spanish, and Italian, respectively, while for AMR-to-text generation in these languages we obtain 25.69, 31.36, and 28.42 BLEU, respectively. We make our code available on GitHub: https://github.com/xdqkid/XLPT-AMR.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a widely used formalism that represents the semantics of a sentence with a directed and acyclic graph. Figure 1 (b) shows an example AMR graph where the nodes such as "doctor" and "give-01" represent concepts, and the edges such as ":ARG0" and ":ARG1" stand for semantic relations between two connected concepts. Recent studies on AMR mainly fall in two directions: AMR parsing, which converts a sentence into an AMR graph (Flanigan et al., 2014; Wang et al., 2015a; Konstas et al., 2017, to name a few), and its inverse, i.e., AMR-to-text generation, which produces a sentence from an AMR graph (Flanigan et al., 2016; Song et al., 2017, 2018, to name a few).
* Corresponding Author: Junhui Li.
Restricted by the availability of annotated corpora, most previous studies on AMR focus on English, while very few address Chinese and Portuguese (Sobrevilla Cabezudo et al., 2019; Anchiêta and Pardo, 2020). Cross-lingual AMR research, however, has received relatively less attention. In fact, cross-lingual AMR has mainly been studied in the scope of annotation work (Xue et al., 2014). More recently, Damonte and Cohen (2018) demonstrate that AMR annotated for English can be used as a cross-lingual semantic representation, and propose to conduct cross-lingual AMR parsing via annotation projection and machine translation. Blloshmi et al. (2020) follow the same line and create large-scale silver data to boost the performance of cross-lingual AMR parsing. Fan and Gardent (2020) focus on multilingual AMR-to-text generation for twenty-one different languages. The aforementioned studies consider AMR parsing and AMR-to-text generation separately.
In this paper, we formalize both AMR parsing and AMR-to-text generation as sequence-to-sequence (seq2seq) learning and propose a novel and effective approach to cross-lingual AMR, which is illustrated in Figure 1. Building on the availability of the English AMR dataset and English-to-X parallel datasets (X ∈ {German, Spanish, Italian} in this paper), our purpose is to boost the performance of zero-shot AMR parsing and text generation in the X-language. To this end, we borrow the idea of joint pre-training from prior work and explore three types of relevant tasks: machine translation, AMR parsing, and AMR-to-text generation. We conjecture that knowledge gained while learning English AMR parsing and text generation could be helpful to the X-language counterparts, and that machine translation tasks could act as a good regularizer. To the best of our knowledge, this is the first study that utilizes such a pre-training approach in cross-lingual AMR research.

[Figure 1 (a): Parallel sentences. English: "The doctors gave her medication and it's made her much better." Italian: "I medici le hanno dato un farmaco che la fa stare molto meglio."]
We also explore and compare four different fine-tuning methods to answer the question of whether combining AMR parsing and AMR-to-text generation tasks in the fine-tuning stage achieves better performance. Moreover, inspired by the teacher-student mechanism (Kim and Rush, 2016; Chen et al., 2017), we extend the fine-tuning method to improve a target fine-tuning task with the help of another relevant yet stronger task. Experimental results on the cross-lingual AMR dataset (LDC2020T07) show that the proposed approach greatly advances the state of the art of cross-lingual AMR.
Overall, we make the following contributions.
• We propose an effective cross-lingual pre-training approach for zero-shot AMR parsing and AMR-to-text generation. Our pre-trained models can be used for both AMR parsing and AMR-to-text generation.
• We explore and compare different fine-tuning methods. We also propose a teacher-student-based fine-tuning method that achieves the best performance.
• We evaluate our approach on three zero-shot languages of AMR, where it greatly advances the state of the art.

Related Work
We describe related studies on AMR from three perspectives: English AMR parsing, English AMR-to-text generation, and cross-lingual AMR.
English AMR-to-Text Generation. As the inverse task of AMR parsing, AMR-to-text generation aims to write a sentence from an AMR graph. Early studies on this task rely on grammar-based approaches (Flanigan et al., 2016; Song et al., 2017). More recent studies propose to regard AMR-to-text generation as a machine translation or seq2seq task (Pourdamghani et al., 2016; Ferreira et al., 2017; Konstas et al., 2017; Cao and Clark, 2019). However, seq2seq approaches tend to lose structural information in AMR graphs since they simply linearize AMR graphs into sequences before feeding them into the models. To prevent information loss caused by linearization, a variety of graph-to-sequence approaches have been proposed to better model structural information (Song et al., 2018; Beck et al., 2018; Damonte and Cohen, 2019; Guo et al., 2019; Ribeiro et al., 2019; Cai and Lam, 2020b; Zhao et al., 2020; Yao et al., 2020; Bai et al., 2020). By taking advantage of strong pre-trained language models, recent studies achieve a new state of the art (Mager et al., 2020; Harkous et al., 2020; Ribeiro et al., 2020; Bevilacqua et al., 2021).
Cross-Lingual AMR. All of the related studies above focus on English AMR research. Relatively limited effort has been put into other languages due to the lack of language-specific AMR corpora. Indeed, whether AMR can act as an interlingua is an open question (Xue et al., 2014). More recently, Damonte and Cohen (2018) demonstrate that a simplified AMR can be used across languages, and for the first time they study cross-lingual AMR parsing for languages other than English. Blloshmi et al. (2020) employ large-scale silver parallel AMR data to bridge the gap between different languages and greatly advance the performance of cross-lingual AMR parsing. Sheth et al. (2021) explore annotation projection to leverage existing English AMR and overcome the resource shortage in the target language. Furthermore, Fan and Gardent (2020) explore cross-lingual AMR-to-text generation based on a pre-trained cross-lingual language model (XLM) (Lample and Conneau, 2019). In this paper we build strong cross-lingual pre-trained models for both AMR parsing and AMR-to-text generation. Moreover, a nice property of our approach is that for AMR parsing, unlike related studies (Damonte and Cohen, 2018; Blloshmi et al., 2020), we do not need to perform lemmatization, POS tagging, NER, or re-categorization of entities, and thus require no language-specific toolkits in pre-processing.

Cross-Lingual Pre-Training
In this section, we first present the background of our pre-training approach (Section 3.1), followed by the description of cross-lingual pre-training tasks (Section 3.2). Then we present our joint pre-training (Section 3.3). For simplicity, in the following we use German as a representative to describe our approach to German AMR parsing and AMR-to-text generation.

Background
Transformer-based Seq2Seq Learning. Our models are built on the Transformer framework (Vaswani et al., 2017). The encoder in Transformer consists of a stack of multiple identical layers, each of which has two sub-layers: one implements the multi-head self-attention mechanism and the other is a position-wise fully-connected feedforward network. The decoder is also composed of a stack of multiple identical layers. Each layer in the decoder consists of the same sub-layers as in the encoder plus an additional sub-layer that performs multi-head attention to the distributional representation produced by the encoder. See Vaswani et al. (2017) for more details.

AMR Graph Linearization and Recovering.
To make the Transformer applicable to AMR parsing and AMR-to-text generation, we follow van Noord and Bos (2017) and linearize AMR graphs into sequences by removing variables and wiki links and by duplicating co-referring nodes. Conversely, for AMR parsing we need to recover the graph representation from linearized AMRs by assigning a unique variable to each concept, pruning duplicated and redundant material, restoring co-referring nodes, fixing incomplete concepts, and performing Wikification. In this paper, we adopt the linearization and recovering scripts provided by van Noord and Bos (2017).
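The linearization step described above can be illustrated with a small sketch. This is not the van Noord and Bos (2017) script; it is a simplified, regex-based approximation written for illustration, and the function name `linearize` is hypothetical. It assumes PENMAN-style input in which each variable is introduced as `(var / concept`, and it does not cover every corner case the original scripts handle.

```python
import re

def linearize(amr: str) -> str:
    """Simplified linearization: drop variables and :wiki links, copy co-referring nodes."""
    # Remember which concept each variable names, e.g. "b" -> "boy".
    var2concept = dict(re.findall(r'\(\s*(\S+)\s*/\s*([^\s()]+)', amr))
    # Remove wiki links such as :wiki "Rome" or :wiki -
    amr = re.sub(r':wiki\s+(?:"[^"]*"|-)\s*', '', amr)
    # Turn "(b / boy" into "(boy".
    amr = re.sub(r'\(\s*\S+\s*/\s*', '(', amr)

    def expand(match):
        # A role followed by a bare variable is a re-entrancy: duplicate the node.
        role, token = match.group(1), match.group(2)
        if token in var2concept:
            return f"{role} ({var2concept[token]})"
        return match.group(0)  # constants such as "-" or numbers stay unchanged

    amr = re.sub(r'(:\S+)\s+([^\s()":]+)(?=[\s)])', expand, amr)
    return " ".join(amr.split())

print(linearize('(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))'))
# (want-01 :ARG0 (boy) :ARG1 (go-02 :ARG0 (boy)))
```

Duplicating the co-referring node loses the re-entrancy, which is why the recovering step has to restore co-referring nodes afterwards.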

Cross-Lingual Pre-Training Tasks
Due to the unavailability of gold training data for German AMR parsing and AMR-to-text generation, we view English as a pivot and hope that knowledge gained while learning English AMR parsing and text generation could be helpful for the German counterparts. Specifically, given an EN-DE parallel dataset $\langle T_{EN}, T_{DE} \rangle$, we use an English AMR parser trained on annotated English AMRs (i.e., AMR2.0) to parse the English sentences into AMR graphs, thus obtaining a trilingual parallel dataset $T = \langle T_{EN}, T_{DE}, T_{AMR} \rangle$. Then, on the trilingual parallel dataset, we propose cross-lingual pre-training via multi-task learning. We consider three types of tasks, i.e., AMR parsing, AMR-to-text generation, and machine translation.
AMR Parsing Tasks, which include both English AMR parsing on the training data $\langle T_{EN}, T_{AMR} \rangle$ and German AMR parsing on $\langle T_{DE}, T_{AMR} \rangle$. Note that both AMR parsing tasks are trained on silver AMR graphs.
AMR-to-Text Generation Tasks, which include both English AMR-to-text generation and German AMR-to-text generation. Similar to AMR parsing, these two AMR-to-text generation tasks are also trained on silver AMR graphs, $\langle T_{AMR}, T_{EN} \rangle$ and $\langle T_{AMR}, T_{DE} \rangle$, respectively.
Machine Translation Tasks, which include both English-to-German and German-to-English machine translation on $\langle T_{EN}, T_{DE} \rangle$. The advantage of including the bi-directional translation tasks is three-fold. First, English-to-German translation enables the decoder to generate fluent German sentences, which is beneficial to German AMR-to-text generation. Second, German-to-English translation enables the encoder to capture syntactic and semantic information from German sentences, which is beneficial to German AMR parsing. Third, translation tasks can serve as regularization for the training of AMR parsing and AMR-to-text generation, both of which are apt to overfit to the training data.
Overall, our pre-training involves three types of tasks and six pre-training tasks in total. The pre-training is conducted on a trilingual parallel dataset $\langle T_{EN}, T_{DE}, T_{AMR} \rangle$, where $T_{EN}$ and $T_{DE}$ are parallel gold sentence pairs while $T_{AMR}$ is the set of corresponding silver AMR graphs.

Jointly MTL Pre-Training
To train the above six pre-training tasks with a single model, we follow prior work and add preceding language tags to both the source and target sides of the training data to distinguish the inputs and outputs of each training task. As illustrated in Table 1, we use <en>, <de>, and <amr> as the begin-of-sentence tags for English sentences, German sentences, and linearized AMRs, respectively.
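As a rough illustration of the tagging scheme, the sketch below expands one trilingual triple into the six tagged source/target pairs. The function name `build_tagged_pairs` and the task labels such as `"de->amr"` are hypothetical names introduced here; only the three tags themselves come from the text.

```python
# The six pre-training directions: two parsing, two generation, two translation.
TASKS = [("en", "amr"), ("de", "amr"),
         ("amr", "en"), ("amr", "de"),
         ("en", "de"), ("de", "en")]

def build_tagged_pairs(en: str, de: str, amr: str) -> dict:
    """Map each task label to a (source, target) pair with begin-of-sentence tags."""
    sequences = {"en": en, "de": de, "amr": amr}
    return {
        f"{src}->{tgt}": (f"<{src}> {sequences[src]}", f"<{tgt}> {sequences[tgt]}")
        for src, tgt in TASKS
    }

pairs = build_tagged_pairs("The boy sleeps .", "Der Junge schläft .", "(sleep-01 :ARG0 (boy))")
print(pairs["de->amr"])
# ('<de> Der Junge schläft .', '<amr> (sleep-01 :ARG0 (boy))')
```

Because the tag on the target side tells the decoder which kind of sequence to emit, a single set of parameters can serve all six directions.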
Table 1:
English   <en> English Sentence
German    <de> German Sentence
AMR       <amr> Linearized AMR

Our joint pre-training on multiple tasks falls into the paradigm of multi-task learning (MTL). In the training stage, we take turns loading the training data of these pre-training tasks. For example, we update model parameters on a batch of training instances from the first task, then update parameters on a batch of training instances from the second task, and the process repeats. We also note that, according to our preliminary experiments, the effect of the order in which these pre-training tasks are carried out is negligible.
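The alternating schedule can be sketched as a round-robin iterator over per-task batch streams. This is a minimal illustration rather than the actual OpenNMT-py training loop; `round_robin` and the toy batch names are assumptions made for the example.

```python
from itertools import cycle

def round_robin(task_batches: dict, num_steps: int):
    """Yield (task_name, batch) pairs, visiting the tasks in a fixed rotating order."""
    # Each task gets its own endlessly repeating stream of batches.
    streams = {name: cycle(batches) for name, batches in task_batches.items()}
    task_order = cycle(list(task_batches))
    for _ in range(num_steps):
        name = next(task_order)
        yield name, next(streams[name])

toy = {"en->amr": ["b0", "b1"], "de->amr": ["c0"]}
for task, batch in round_robin(toy, 4):
    # A real trainer would run one parameter update on `batch` here.
    print(task, batch)
```

With six tasks, every sixth update would come from the same task, so each task contributes the same number of updates regardless of the rotation order.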

Fine-Tuning Methods
To fine-tune a pre-trained model, we create a fine-tuning dataset from English annotated AMRs (i.e., AMR2.0). Given English-AMR parallel data $\langle F_{EN}, F_{AMR} \rangle$, we use an English-to-German translator to translate the English sentences into German sentences, thus obtaining a trilingual parallel dataset $F = \langle F_{EN}, F_{DE}, F_{AMR} \rangle$. As our goal is to improve the performance of zero-shot AMR parsing and AMR-to-text generation, our primary fine-tuning tasks are German AMR parsing and AMR-to-text generation. Moreover, we can include the other four fine-tuning tasks as auxiliary tasks when necessary, i.e., English AMR parsing and AMR-to-text generation, as well as English-to-German and German-to-English translation.
Once the fine-tuning dataset is ready, we can fine-tune a pre-trained model with different methods. The vanilla fine-tuning method, which fine-tunes a pre-trained model on the dataset of a primary task, is a natural choice. We can also fine-tune a pre-trained model jointly over all fine-tuning tasks, or over the primary tasks plus specifically chosen fine-tuning tasks that are relevant. In the following we explore and compare four different fine-tuning methods.

Vanilla Fine-Tuning
Given a pre-trained model, vanilla fine-tuning updates the parameters of the pre-trained model solely on the dataset of the downstream task. For example, for German AMR parsing, we fine-tune the pre-trained model on the fine-tuning dataset of the German AMR parsing task. In other words, vanilla fine-tuning involves only single-task learning.

One-for-All MTL Fine-Tuning
We fine-tune a pre-trained model synchronously for all six fine-tuning tasks, which are the same as the pre-training tasks. Related studies (Li and Hoiem, 2018) have shown that it is important to optimize for high accuracy on a primary fine-tuning task while preserving the performance of other tasks. Preserving the performance of the various pre-training tasks can be viewed as a regularizer for each fine-tuning task. Similarly to joint pre-training, we take turns loading the fine-tuning data of these fine-tuning tasks. Consequently, we obtain a single fine-tuned model for all tasks.

Targeted MTL Fine-Tuning
Rather than including all fine-tuning tasks within a single model, we can selectively choose relevant fine-tuning tasks. For German AMR parsing, we use AMR parsing on German as the primary fine-tuning task and German-to-English translation as an auxiliary fine-tuning task. The auxiliary task helps the encoder capture semantic information from German sentences. This is also consistent with the fine-tuning tasks designed for English AMR parsing in prior work. For German AMR-to-text generation, we choose English-to-German translation as the auxiliary fine-tuning task, which helps the decoder generate fluent German sentences.
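The three fine-tuning methods described so far differ only in which tasks are trained together, which the following sketch makes explicit for German. The task labels and the function `fine_tuning_tasks` are hypothetical names introduced for illustration; the primary/auxiliary pairings follow the text.

```python
ALL_TASKS = ["en->amr", "de->amr", "amr->en", "amr->de", "en->de", "de->en"]

# Auxiliary task chosen for each primary task in targeted MTL fine-tuning.
TARGETED_AUXILIARY = {
    "de->amr": "de->en",  # German parsing is paired with German-to-English translation
    "amr->de": "en->de",  # German generation is paired with English-to-German translation
}

def fine_tuning_tasks(method: str, primary: str) -> list:
    """Return the list of tasks trained jointly under a given fine-tuning method."""
    if primary not in TARGETED_AUXILIARY:
        raise ValueError(f"unknown primary task: {primary}")
    if method == "vanilla":
        return [primary]
    if method == "one4all":
        return list(ALL_TASKS)
    if method == "targeted":
        return [primary, TARGETED_AUXILIARY[primary]]
    raise ValueError(f"unknown method: {method}")

print(fine_tuning_tasks("targeted", "de->amr"))
# ['de->amr', 'de->en']
```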

Teacher-Student-based MTL Fine-Tuning
One notable property of the fine-tuning dataset is that the German sentences are produced automatically through machine translation. Noise in such a silver fine-tuning dataset may degrade the performance of fine-tuned models. Inspired by the teacher-student framework (Kim and Rush, 2016; Chen et al., 2017), we propose to address this problem by using a stronger fine-tuning task to help improve fine-tuning tasks on such noisy data. For example, we can use English AMR parsing (as the teacher) to help German AMR parsing (as the student), since English AMR parsing, which is fine-tuned on gold data, tends to have stronger performance.
Fine-Tuning for German AMR Parsing. We use E, G, and A to denote the English side, German side, and AMR side, respectively, and (e, g, a) to denote a triple instance. For German AMR parsing (i.e., G → A), we regard English AMR parsing (i.e., E → A) as its teacher and assume that the probability of generating a target AMR token $a_i$ from g should be close to that from its counterpart e, given the already obtained partial AMR $a_{<i}$. On this assumption, the student model can acquire knowledge from the teacher by applying word-level knowledge distillation for multi-class cross-entropy with the following joint training objective:

$$J(e, g, a, \hat{\theta}_{E \to A}, \theta_{G \to A}) + L_{\theta_{G \to A}}(a \mid g), \tag{1}$$

where $(e, g, a) \in D_{E,G,A}$, i.e., $\langle F_{EN}, F_{DE}, F_{AMR} \rangle$, the fine-tuning data for English/German AMR parsing, $\hat{\theta}_{E \to A}$ denotes the already learned model parameters for English AMR parsing, and $L_{\theta_{G \to A}}(a \mid g)$ denotes the log-likelihood function for translating g into a. The function J in Eq. 1 is defined as:

$$J(e, g, a, \hat{\theta}_{E \to A}, \theta_{G \to A}) = \mathrm{KL}\big(P(a \mid e, a_{<i}; \hat{\theta}_{E \to A}) \,\|\, P(a \mid g, a_{<i}; \theta_{G \to A})\big) = \sum_{i=1}^{|a|} \sum_{a \in V_a} P(a \mid e, a_{<i}; \hat{\theta}_{E \to A}) \log \frac{P(a \mid e, a_{<i}; \hat{\theta}_{E \to A})}{P(a \mid g, a_{<i}; \theta_{G \to A})},$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL divergence between two distributions, and $V_a$ is the vocabulary set. To sum up, in MTL fine-tuning we use Eq. 1 as the objective for the fine-tuning task of German AMR parsing, while we still use the log-likelihood function for the auxiliary fine-tuning task, i.e., German-to-English translation.
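To make the word-level distillation term concrete, the sketch below computes it from per-position logits in plain Python. It is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the teacher logits are treated as fixed inputs, and the joint quantity is written as a loss to be minimized (KL term plus negative log-likelihood), which is one common way to implement an objective of this form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]

def kd_term(teacher_logits, student_logits):
    """Sum over target positions of KL(teacher || student) over the vocabulary."""
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return loss

def nll_term(student_logits, gold_ids):
    """Negative log-likelihood of the gold AMR tokens under the student."""
    return -sum(math.log(softmax(row)[y]) for row, y in zip(student_logits, gold_ids))

def joint_loss(teacher_logits, student_logits, gold_ids):
    # Assumption: minimize KL + NLL; the teacher (English parser) is not updated.
    return kd_term(teacher_logits, student_logits) + nll_term(student_logits, gold_ids)

teacher = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]  # English -> AMR logits, two positions
student = [[1.0, 1.0, 0.1], [0.5, 0.1, 2.0]]  # German -> AMR logits, same positions
print(joint_loss(teacher, student, gold_ids=[0, 2]))
```

The KL term vanishes when the student reproduces the teacher's distribution at every position, so the teacher's soft predictions act as an extra training signal on top of the possibly noisy gold targets.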
Fine-Tuning for German AMR-to-Text Generation. Considering the fact that the performance of English-to-German translation is also better than that of German AMR-to-text generation, we view English-to-German translation as the teacher and assume that the probability of generating a target German token g i from a should be close to that from its counterpart e, given the already obtained partial German sentence g <i . The joint training objective for German AMR-to-text generation is similar to the aforementioned objective function for German AMR parsing. Due to limited space, we omit definition details of the objective function.
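For reference, one way to write the omitted objective by direct analogy with the parsing case is sketched below. This form is an assumption obtained by swapping roles (English-to-German translation, E → G, as the teacher; German AMR-to-text generation, A → G, as the student); the symbols $\hat{\theta}_{E \to G}$, $\theta_{A \to G}$, and the German vocabulary $V_g$ are introduced here by analogy and do not appear in the original text.

```latex
% Assumed analogue of Eq. 1 for German AMR-to-text generation
J(e, a, g, \hat{\theta}_{E \to G}, \theta_{A \to G}) + L_{\theta_{A \to G}}(g \mid a),
\qquad
J(e, a, g, \hat{\theta}_{E \to G}, \theta_{A \to G})
  = \sum_{i=1}^{|g|} \sum_{g \in V_g}
    P(g \mid e, g_{<i}; \hat{\theta}_{E \to G})
    \log \frac{P(g \mid e, g_{<i}; \hat{\theta}_{E \to G})}
              {P(g \mid a, g_{<i}; \theta_{A \to G})}
```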

Experimentation
In this section, we report the performance of our approach to AMR parsing and AMR-to-text generation for non-English languages, including German (DE), Spanish (ES), and Italian (IT). The models are pre-trained and fine-tuned on English data and one of DE, ES, or IT, and are evaluated in the target language.

Experimental Settings
Pre-Training Datasets. For German, we use the WMT14 English-German translation dataset, 5 which consists of 3.9M sentence pairs after preprocessing. For Spanish and Italian, we use Europarl parallel datasets, 6 which consist of 1.9M English-Spanish and 1.9M English-Italian sentence pairs, respectively. The English sentences of all the datasets are parsed into AMR graphs via an English AMR parser trained on AMR 2.0 (LDC2017T10) (Appendix A provides more details on the English AMR parser). We merge the English sentences, German (Spanish/Italian) sentences, and linearized AMRs together and segment all the tokens into subwords by byte pair encoding (BPE) (Sennrich et al., 2016) with 40K operations (30K for both Spanish and Italian).
In addition, we train NMT models to translate English into German, Spanish, and Italian on the above parallel datasets with Transformer-big settings (Vaswani et al., 2017). These NMT models are used in preparing the fine-tuning datasets (Appendix B provides more implementation details on the NMT models).
Fine-Tuning Datasets. We use English AMR2.0 which contains 36,521, 1,368, and 1,371 English-AMR pairs for training, development, and testing, respectively. We translate the English sentences into German, Spanish, and Italian, respectively. We segment all the tokens into subwords by using the BPE model trained on pre-training datasets.
Pre-Training and Fine-Tuning Model Settings. We implement the above pre-trained models based on OpenNMT-py (Klein et al., 2017). 7 For simplicity, we use the same hyperparameter settings to train all the models in both pre-training and fine-tuning, following the settings for the Transformer-base model in Vaswani et al. (2017). The number of layers in both the encoder and the decoder is 6, and the number of heads is 8. Both the embedding size and the hidden state size are 512, while the size of the feed-forward network is 2048. Moreover, we use the Adam optimizer (Kingma and Ba, 2015) with β1 of 0.9 and β2 of 0.98. Warm-up steps, learning rate, dropout rate, and label smoothing epsilon are set to 16000, 2.0, 0.1, and 0.1, respectively. We set the batch size to 4,096 (8,196) in pre-training (fine-tuning). We pre-train (fine-tune) the models for 250K (10K) steps and save them every 10K (1K) steps. Finally, we obtain the final pre-trained (fine-tuned) models by averaging the last 10 checkpoints.
5 https://www.statmt.org/wmt14/translation-task.html
6 https://www.statmt.org/europarl/index.html
7 https://github.com/OpenNMT/OpenNMT-py
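Checkpoint averaging, as used to obtain the final models, amounts to an element-wise mean of the parameters of the last few saved checkpoints. The sketch below shows the idea on plain Python dictionaries; the function `average_checkpoints` and the toy parameter layout (name mapped to a flat list of floats) are assumptions for illustration and do not reflect the OpenNMT-py checkpoint format.

```python
def average_checkpoints(checkpoints: list, last_n: int = 10) -> dict:
    """Element-wise mean of each parameter over the last `last_n` checkpoints."""
    selected = checkpoints[-last_n:]
    count = len(selected)
    averaged = {}
    for name in selected[0]:
        # zip(*...) walks the same position of this parameter across checkpoints.
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged

saved = [{"w": [0.0, 0.0]}, {"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
print(average_checkpoints(saved, last_n=2))
# {'w': [2.0, 4.0]}
```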
Evaluation. We evaluate on LDC2020T07 (Damonte and Cohen, 2018), a corpus containing human translations of the 1,371 test sentences of AMR 2.0 into German, Spanish, Italian, and Chinese. This data is designed for use in cross-lingual AMR research. Following Fan and Gardent (2020), we only evaluate on German, Spanish, and Italian, for which we have training data from EUROPARL. For AMR parsing evaluation, we utilize Smatch and other fine-grained metrics (Damonte et al., 2017). For AMR-to-text generation, we report performance in BLEU (Papineni et al., 2002).
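As a reminder of how the reported parsing score is formed, Smatch F1 is the harmonic mean of triple-level precision and recall once the best variable alignment between the predicted and gold graphs has been found. The sketch below covers only that final step and takes the matched-triple count as given; searching for the best alignment, which the Smatch tool performs, is not shown, and `f1_from_counts` is a hypothetical helper name.

```python
def f1_from_counts(matched: int, predicted: int, gold: int) -> float:
    """F1 from the number of matched, predicted, and gold triples."""
    precision = matched / predicted if predicted else 0.0
    recall = matched / gold if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 3 of 4 predicted triples match, and the gold graph has 6 triples.
print(f1_from_counts(matched=3, predicted=4, gold=6))
```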

Baseline Systems
We compare the performance of our approach against two baseline systems.
Baseline scratch. To build this baseline system, we directly train models from scratch on the fine-tuning datasets. Taking German AMR parsing as an example, we train the model on its fine-tuning dataset $\langle F_{DE}, F_{AMR} \rangle$ to get Baseline scratch.
Baseline pre-trained. Rather than training models from scratch, we pre-train the models on large-scale silver datasets. Taking German AMR parsing as an example, we first pre-train the model on the pre-training dataset, i.e., $\langle T_{DE}, T_{AMR} \rangle$, and then fine-tune the pre-trained model on the corresponding fine-tuning dataset, i.e., $\langle F_{DE}, F_{AMR} \rangle$.

Table 2: Performance of AMR parsing in Smatch F1 and AMR-to-text generation in BLEU for German (DE), Spanish (ES), and Italian (IT). Here, XLPT-AMR none denotes that we test the pre-trained models without fine-tuning them. XLPT-AMR one4all, XLPT-AMR targeted, and XLPT-AMR T-S indicate that we use one-for-all, targeted, and teacher-student as the MTL fine-tuning method, respectively. † is for using Google translator while ‡ is for pre-trained models.

Main Results
Comparing the two baseline approaches, it is not surprising to find that pre-training on silver datasets is a very effective way to boost performance (Konstas et al., 2017). By using silver datasets, we obtain improvements of 6.80 ∼ 7.87 Smatch F1 and 6.21 ∼ 10.54 BLEU for parsing and text generation, respectively.
With any of our fine-tuning methods, our cross-lingual pre-training approach further improves the performance over the strong baseline Baseline pre-trained in both parsing and generation tasks across all languages. The results show that, like the other fine-tuning methods, vanilla fine-tuning significantly boosts the performance of both parsing and generation. However, it still underperforms all of the MTL fine-tuning methods. This confirms that it is important to optimize for high accuracy on a certain fine-tuning task while preserving the performance of the other pre-training tasks. The performance comparison between XLPT-AMR one4all and XLPT-AMR targeted suggests that selectively choosing relevant fine-tuning tasks, rather than including all fine-tuning tasks, can further boost parsing and generation performance, with the exception of the Spanish generation task.
The XLPT-AMR T-S models perform the best, which reveals that using the teacher-student framework to guide the decoding process also helps the student task. This is owing to the fact that the teacher models achieve better performance than the student models. See Section 5.4 for a performance comparison of teacher and student models.
Finally, we compare our approach to previous studies. Among them, both Blloshmi et al. (2020) and Fan and Gardent (2020) adopt pre-trained models which cover either the encoder part or the decoder part. From the results we can see that even our baseline Baseline pre-trained outperforms them by pre-training the encoder and the decoder simultaneously. The results also show that our XLPT-AMR T-S models greatly advance the state of the art. For example, our XLPT-AMR T-S models outperform Sheth et al. (2021) by 3.4∼7.8 Smatch F1 on AMR parsing of the three languages and surpass Fan and Gardent (2020) by around 10 BLEU on AMR-to-text generation.
Table 3 compares the performance on fine-grained metrics for AMR parsing. It shows that our XLPT-AMR T-S models achieve the best performance on all the metrics, with the only exception of Concepts for Italian AMR parsing. It also shows that, as in English AMR parsing, all models predict Reentrancies poorly (Szubert et al., 2020), and that Negations is another metric which is hard to predict. In future work, we will pay particular attention to these two metrics.

Discussion
In this section, we try to answer the following three questions:
• First, what is the performance of teacher models when we use teacher models to guide student ones in teacher-student-based MTL fine-tuning?
• Second, what is the effect of the two machine translation tasks in pre-training?
• Third, in our approach we take English as the pivot language by taking advantage of large-scale English-to-German (or Spanish, Italian) datasets. What is the performance of English AMR parsing and AMR-to-text generation?
Performance of teacher models in teacher-student-based MTL fine-tuning. Table 4 compares the performance of teacher and student models. It shows that the performance of teacher models for English AMR parsing and English-to-X translation is much higher than that of their student counterparts (i.e., Stu.(before)).
Effect of machine translation tasks in pre-training. We use German as a representative. Note that when machine translation tasks are not involved in pre-training, the targeted MTL fine-tuning method is not applicable since we cannot use machine translation as the auxiliary task. Therefore, we use the vanilla fine-tuning method to fine-tune the pre-trained models. Table 5 compares the performance with and without machine translation tasks in pre-training. We observe that including machine translation tasks in pre-training achieves improvements of 2.77 Smatch F1 and 2.46 BLEU on German AMR parsing and text generation, respectively. This suggests that machine translation tasks are necessary in pre-training.
Performance of English AMR parsing and AMR-to-text generation. Based on the pre-trained models, we take the targeted MTL fine-tuning method (Section 4.3) as a representative. Specifically, for English AMR parsing, we choose English-to-X (X ∈ {German, Spanish, Italian}) as the auxiliary fine-tuning task, while for English text generation, we choose X-to-English as the auxiliary task. Table 6 shows that the performance of English parsing and generation is much higher than that of the other languages. Moreover, we find that the results of English AMR parsing are quite close when combining English with any of the other languages, whereas the results of English AMR-to-text generation are considerably different. One possible reason for this phenomenon is that English AMR-to-text generation is sensitive to the sizes of the machine translation datasets used in pre-training (i.e., 3.9M for EN-DE translation versus 1.9M for both EN-ES and EN-IT), while English parsing seems to be less affected by the sizes of the (silver) datasets. This indicates that having more English sentences in pre-training helps the generation models generate more fluent and correct English sentences.

Conclusions
In this paper we proposed a cross-lingual pre-training approach via multi-task learning for zero-shot AMR parsing and AMR-to-text generation. Building on an English AMR dataset and English-to-X parallel datasets, we pre-trained models on three types of relevant tasks: AMR parsing, AMR-to-text generation, and machine translation. We also explored and compared four different fine-tuning methods. Experimentation on the multilingual AMR dataset shows that our approach greatly advances the state of the art.

A English AMR Parser on AMR 2.0
Our English AMR parser is learned in a seq2seq framework and trained on AMR2.0, which consists of 36,521 training AMRs, 1,368 development AMRs, and 1,371 testing AMRs. We share the vocabulary between the input and the output by segmenting tokens into pieces by byte pair encoding (BPE) with 20K merge operations.
We use OpenNMT-py as the implementation of the Transformer, with the Transformer-base model setting. We use Adam with β1 = 0.9, β2 = 0.98 for optimization. Batch size, learning rate, warm-up steps, and dropout rate are set to 4096, 2.0, 16000, and 0.1, respectively. We train the model for 250K steps on 1 GPU and save models every 10K steps. Finally, we obtain the final model by averaging the last 10 checkpoints.
The English AMR parser achieves 73.68 and 73.24 Smatch F1 on the dev and test set, respectively.
B NMT Models for English-to-German, English-to-Spanish, English-to-Italian
In pre-processing, we tokenize all of the MT corpora with Moses scripts. 8 Then we segment words into pieces by BPE with 32K (30K) merge operations for EN-DE (both EN-ES and EN-IT). After filtering long and imbalanced pairs, we get 3.9M parallel sentence pairs for EN-DE and 1.9M for both EN-ES and EN-IT. We again use OpenNMT-py as the implementation of the Transformer, with the Transformer-big model setting. We use Adam with β1 = 0.9, β2 = 0.998 for optimization. Batch size, learning rate, warm-up steps, and dropout rate are set to 8192, 2.0, 8000 (16000 for both EN-ES and EN-IT), and 0.1, respectively. We train the model for 100K (110K for EN-ES and 150K for EN-IT) steps on 4 GPUs and save models every 5000 steps. For each translation task, we obtain the final model by averaging the last 5 (20 for both EN-ES and EN-IT) checkpoints.
For evaluation, we use case-sensitive BLEU measured by the multi-bleu script. Table 7 shows the performance of the three translation models on the test sets, i.e., newstest2014 for EN-DE and newstest2009 for both EN-ES and EN-IT.
8 https://github.com/moses-smt/mosesdecoder