Smelting Gold and Silver for Improved Multilingual AMR-to-Text Generation

Recent work on multilingual AMR-to-text generation has exclusively focused on data augmentation strategies that utilize silver AMR. However, this assumes that the generated AMRs are of high quality, potentially limiting transferability to the target task. In this paper, we investigate different techniques for automatically generating AMR annotations, aiming to study which source of information yields better multilingual results. Our models trained on gold AMR with silver (machine translated) sentences outperform approaches which leverage generated silver AMR. We find that combining both complementary sources of information further improves multilingual AMR-to-text generation. Our models surpass the previous state of the art for German, Italian, Spanish, and Chinese by a large margin.


Introduction
AMR-to-text generation is the task of recovering a text with the same meaning as a given Abstract Meaning Representation (AMR) (Banarescu et al., 2013), and has recently received much research interest (Ribeiro et al., 2019; Wang et al., 2020; Mager et al., 2020; Harkous et al., 2020; Fu et al., 2021). AMR has applications to a range of NLP tasks, including summarization (Hardy and Vlachos, 2018) and spoken language understanding (Damonte et al., 2019), and has the potential power of acting as an interlingua that allows the generation of text in many different languages (Damonte and Cohen, 2018; Zhu et al., 2019).

Figure 1: A generation example from English AMR to multiple different languages, e.g., IT "La loro vita sembra gloriosa." and ZH "他们的日子看起来很光鲜." (both roughly "Their life seems glorious.").
While prior work has shown that parsers can be effectively trained to transform multilingual text into English AMR, Mille et al. (2018, 2019) and Fan and Gardent (2020) discuss the reverse task, turning meaning representations into multilingual text, as shown in Figure 1. However, gold-standard multilingual AMR training data is currently scarce, and previous work (Fan and Gardent, 2020), while discussing the feasibility of multilingual AMR-to-text generation, has investigated synthetically generated AMR as the only source of silver training data.
In this paper, we aim to close this gap by providing an extensive analysis of different augmentation techniques to cheaply acquire silver-standard multilingual AMR-to-text data: (1) Following Fan and Gardent (2020), we parse the English sentences of parallel multilingual corpora into silver AMRs (SILVERAMR), resulting in a dataset of grammatically correct sentences paired with noisy AMR structures.
(2) We leverage machine translation (MT) and translate the English sentences from the gold AMR-to-text corpus to the respective target languages (SILVERSENT), resulting in a dataset with correct AMR structures but potentially unfaithful or non-grammatical sentences. (3) We experiment with utilizing the AMR-to-text corpus with both gold English AMR and sentences in multi-source scenarios to enhance multilingual training.
Our contributions and the organization of this paper are the following: First, we formalize the multilingual AMR-to-text generation setting and present various cheap and efficient alternatives for collecting multilingual training data. Second, we show that our proposed training strategies greatly advance the state of the art, finding that SILVERSENT considerably outperforms SILVERAMR. Third, we show that SILVERAMR performs relatively better for longer sentences, whereas SILVERSENT performs relatively better for larger graphs. Overall, we find that a combination of both strategies further improves performance, showing that they are complementary for this task.

Related Work
Approaches for AMR-to-text generation predominantly focus on English, and typically employ an encoder-decoder architecture with a linearized representation of the graph (Konstas et al., 2017; Ribeiro et al., 2020a). Recently, models based on the graph-to-text paradigm (Ribeiro et al., 2020b; Schmitt et al., 2021) improve over linearized approaches, explicitly encoding the AMR structure with a graph encoder (Song et al., 2018; Beck et al., 2018; Ribeiro et al., 2019; Guo et al., 2019; Cai and Lam, 2020b). Advances in multilingual AMR parsing have focused on a variety of different languages such as Brazilian Portuguese, Chinese, Czech and Spanish (Hajič et al., 2014; Xue et al., 2014; Migueles-Abraira et al., 2018; Sobrevilla Cabezudo and Pardo, 2019). In contrast, little work has focused on the reverse AMR-to-text setting (Fan and Gardent, 2020). We aim to close this gap by experimenting with different data augmentation methods for efficient multilingual AMR-to-text generation.

Multilingual AMR-to-Text Generation
In AMR-to-text generation, we transduce an AMR graph G to a surface realization as a sequence of tokens y = y_1, ..., y_{|y|}. As input we use an English-centric AMR graph where the output y can be realized in different languages (see Figure 1).
We define x = LIN(G), where LIN is a function that linearizes G into a sequence of node and edge labels using a depth-first traversal of the graph (Konstas et al., 2017). The sequence x is encoded, and the decoder predicts y autoregressively, conditioned on this encoding.
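As a concrete illustration, the depth-first linearization LIN can be sketched as follows. The nested-tuple graph format, the function name, and the toy graph are illustrative assumptions, not the paper's implementation; a real linearizer also handles AMR variables, whereas this sketch simply unrolls reentrancies:

```python
def linearize(node):
    """Depth-first traversal emitting node and edge labels with brackets."""
    concept, edges = node
    tokens = ["(", concept]
    for edge_label, child in edges:
        tokens.append(edge_label)
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

# Toy graph for "The boy wants to go"; the reentrant "boy" node is
# duplicated because the depth-first traversal unrolls reentrancies.
graph = ("want-01", [
    (":ARG0", ("boy", [])),
    (":ARG1", ("go-01", [(":ARG0", ("boy", []))])),
])
print(" ".join(linearize(graph)))
```

The resulting token sequence is what the encoder consumes in place of the original graph structure.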
Consequently, the encoder must learn language-agnostic representations of the English AMR graph amenable to a multilingual setup; the decoder attends over the encoded AMR and must generate text in different languages with varied word order and morphology.
To differentiate between languages, we prepend the prefix "translate AMR to <tgt_language>:" to the AMR graph representation. 2 We add the edge labels present in the AMR graphs of the LDC2017T10 training set to the encoder's vocabulary in order to avoid considerable subtoken splitting; this allows us to encode the AMR with a compact sequence of tokens and to learn explicit representations for the AMR edge labels. Finally, this multilingual approach gives us more AMR data on the encoder side as the number of considered languages increases, which could be particularly helpful for languages with little training data.
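The effect of registering edge labels as single vocabulary entries can be illustrated with a toy greedy subword segmenter. This is not the real mT5 tokenizer (in practice one would call the tokenizer's token-adding facilities and resize the model's embeddings); it only shows how a registered label avoids fragmenting into many subtokens:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation over whitespace-split symbols."""
    out = []
    for word in text.split():
        if word in vocab:
            out.append(word)  # a registered symbol stays a single token
            continue
        i = 0
        while i < len(word):  # otherwise split into longest known pieces
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    out.append(word[i:j])
                    i = j
                    break
    return out

# Toy vocabularies: the second registers an edge label and a concept.
base_vocab = {"(", ")", "boy", "want", ":", "A", "R", "G", "0", "1", "-"}
with_labels = base_vocab | {":ARG0", "want-01"}

amr = "( want-01 :ARG0 ( boy ) )"
prefixed = "translate AMR to Italian: " + amr  # task prefix, as in the paper
print(greedy_tokenize(amr, base_vocab))   # labels fragment into pieces
print(greedy_tokenize(amr, with_labels))  # compact, one token per symbol
```

With the labels registered, the same graph is encoded with roughly half as many tokens, and each edge label receives its own learned embedding.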

Data
Since gold-standard training data for multilingual AMR-to-text generation does not exist, data augmentation methods are necessary. Given a set of gold AMR training data for English and parallel corpora between English and target languages, we thus aim to identify the best augmentation strategies to achieve multilingual generation.
As our monolingual AMR-to-text training dataset, we consider the LDC2017T10 dataset (GOLDAMR), containing English AMR graphs and sentences. We evaluate our different approaches on the multilingual LDC2020T07 test set by Damonte and Cohen (2018). As parallel corpora for the European target languages, we use Europarl, a corpus of parliamentary debates; Tatoeba, 4 a large database of example sentences and translations; and TED2020, 5 a dataset of translated subtitles of TED talks. For ZH, we use the UM-Corpus (Tian et al., 2014).

Table 1: Results on the multilingual LDC2020T07 test set. When training on multiple seeds, the standard deviation is between 0.1 and 0.3 BLEU. The results of our models compared to the MT baseline are statistically significant.

Creating Silver Training Data
We experiment with two augmentation techniques that generate silver-standard multilingual training data, described in what follows.
SILVERAMR. We follow Fan and Gardent (2020) and leverage the multilingual parallel corpora described in §3.2 and generate AMRs for the respective English sentences. 6 While the multilingual sentences are of gold standard, the AMR graphs are of silver quality. Similar to Fan and Gardent (2020), for each target language we extract a parallel dataset of 1.9M sentences.
SILVERSENT. We fine-tune mT5 as a translation model for English to the respective target languages, using the same parallel sentences used in SILVERAMR. Then, we translate the English sentences of GOLDAMR into the respective target languages, resulting in a multilingual dataset that consists of gold AMRs and silver sentences. The multilingual training dataset contains 36,521 examples for each target language.
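The two pipelines can be summarized schematically as follows. Here `amr_parse` and `translate_en` are hypothetical stubs standing in for the trained English AMR parser and the fine-tuned mT5 translation model, and the tuple-based data format is an assumption for illustration:

```python
def amr_parse(sentence):
    """Stub standing in for a trained English AMR parser."""
    return f"(parsed: {sentence})"

def translate_en(sentence, tgt_language):
    """Stub standing in for the fine-tuned mT5 translation model."""
    return f"[{tgt_language}] {sentence}"

def build_silver_amr(parallel_pairs):
    """SILVERAMR: silver AMRs paired with gold target-language sentences.

    parallel_pairs: (english_sentence, target_sentence) tuples.
    """
    return [(amr_parse(en), tgt) for en, tgt in parallel_pairs]

def build_silver_sent(gold_corpus, tgt_language):
    """SILVERSENT: gold AMRs paired with machine-translated sentences.

    gold_corpus: (gold_amr, english_sentence) tuples from GOLDAMR.
    """
    return [(amr, translate_en(en, tgt_language)) for amr, en in gold_corpus]
```

The key contrast is which side of each training pair is noisy: SILVERAMR keeps the target sentence gold but uses a parsed (silver) graph, while SILVERSENT keeps the graph gold but uses a translated (silver) sentence.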

Experiments
We implement our models using mT5-base from Hugging Face (Wolf et al., 2020). We use the Adafactor optimizer (Shazeer and Stern, 2018) and employ a linearly decreasing learning rate schedule without warm-up. The hyperparameters we tune include the batch size, number of epochs and learning rate. 7 The models are evaluated on the multilingual LDC2020T07 test set, using the BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), chrF++ (Popović, 2015) and BERTscore (Zhang et al., 2020) metrics. We compare with an MT baseline: we generate the test set with an AMR-to-English model trained with T5 (Ribeiro et al., 2021) and translate the generated English sentences to the target language using MT. For a fair comparison, our MT model is based on mT5 and trained with the same data as the other approaches.
Training Strategies. We propose different training strategies under the setting of §3.2 in order to investigate which combination leads to stronger multilingual AMR-to-text generation. Besides training models using SILVERAMR or SILVERSENT, we investigate different combinations of multi-source training also using GOLDAMR.
Main Results. Table 1 shows our main results. 8 First, SILVERAMR substantially outperforms Fan and Gardent (2020) despite being trained on the same amount of silver AMR data. We believe this is because we utilize mT5, whereas Fan and Gardent (2020) use XLM (Conneau et al., 2020), and because our parallel data may cover different domains.
SILVERSENT considerably outperforms SILVERAMR in all metrics, despite SILVERAMR consisting of two orders of magnitude more data. We believe the reasons are twofold: Firstly, the correct semantic structure of gold AMR annotations is necessary to learn a faithful realization; Secondly, SILVERSENT provides examples of the same domain as the evaluation test set. We observe similar performance to SILVERSENT when training on both GOLDAMR and SILVERAMR, indicating that the combination of target domain data and gold AMR graphs is necessary for downstream task performance. However, training on both GOLDAMR and SILVERSENT yields small gains, indicating that the respective information is already adequately encoded within the silver standard dataset. We observe similar patterns when combining the silver standard datasets. While SILVERAMR+SILVERSENT complement each other, resulting in the overall best performance, adding GOLDAMR does not yield any notable gains. These results demonstrate that both gold AMR structure and gold sentence information are important for training multilingual AMR-to-text models, with SILVERSENT seemingly being the more important.
Effect of the Fine-tuning Order. In Figure 2 we illustrate the impact of different data source orderings when fine-tuning in a two-phase setup for IT. 9 Firstly, we observe a decrease in performance for all sequential fine-tuning settings, compared to our proposed mixed multi-source training, which is likely due to catastrophic forgetting. 10 Secondly, training on SILVERAMR and subsequently on SIL-VERSENT (or vice versa), improves performance over only using either, again demonstrating their complementarity. Thirdly, SILVERSENT continues to outperform SILVERAMR as a second task. Finally, GOLDAMR is not suitable as the second task for multilingual settings as the model predominantly generates English text.
9 Other languages follow similar trends and are presented in Figure 4 in the Appendix. 10 The model trained on the second task forgets the first task.

Impact of Sentence Length and Graph Size. As silver annotations potentially lead to noisy inputs, models trained on SILVERAMR are potentially less capable of encoding the AMR semantics correctly, and models trained on SILVERSENT potentially generate fluent sentences less reliably. To analyze the advantages of the two forms of data, we measure performance against sentence length and graph size. 11 We define γ as the ratio of the sentence length divided by the number of AMR graph nodes. In Figure 3 we plot the respective results for SILVERAMR and SILVERSENT, categorized into three bins. We find that SILVERAMR's BLEU increases for longer sentences in almost all cases, suggesting that training with longer gold sentences improves performance. In contrast, the BLEU of SILVERSENT improves with larger graphs, indicating that large gold AMR graphs are also important. SILVERAMR and SILVERSENT show relative gains on opposite ends of this ratio, suggesting that they capture distinct aspects of the data.
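The γ-based analysis can be sketched as follows. The bin thresholds and data format are illustrative assumptions, not the paper's exact setup:

```python
def gamma(sentence_tokens, num_graph_nodes):
    """Ratio of sentence length to AMR graph size (number of nodes)."""
    return len(sentence_tokens) / num_graph_nodes

def bin_by_gamma(examples, low=1.0, high=1.5):
    """Group (tokens, num_nodes, bleu) triples into three gamma bins.

    Small gamma: large graph relative to the sentence; large gamma:
    long sentence relative to the graph. Thresholds are illustrative.
    """
    bins = {"small": [], "mid": [], "large": []}
    for tokens, nodes, bleu in examples:
        g = gamma(tokens, nodes)
        key = "small" if g < low else ("mid" if g < high else "large")
        bins[key].append(bleu)
    return bins
```

Comparing per-bin BLEU for the two models then reveals which data regime each silver source handles better.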
Out of Domain Evaluation. To disentangle the effects of in-domain sentences and gold-quality AMR graphs in SILVERSENT, we evaluate both silver data approaches on the Weblog and WSJ subsets of the LDC2020T07 dataset; the domains of these subsets are not included in the LDC2017T10 training set. We present the BLEU results in Table 2. 12 While we find that SILVERSENT still achieves better performance, demonstrating that gold AMR structures are an important source for training multilingual AMR-to-text models, SILVERAMR and SILVERSENT perform more comparably than when evaluated on the full LDC2020T07 test set. This demonstrates that the domain transfer factor plays an important role in the strong performance of SILVERSENT. Overall, SILVERAMR+SILVERSENT outperforms both single-source settings, establishing the complementarity of both silver sources of data.

Table 3 (fragment), English reference: I wish I could wipe her out of my life - things would be so much better without her.
Case Study. Table 3 shows an AMR, its reference sentences in ES and EN, and sentences generated in ES by SILVERAMR, SILVERSENT, and their combination. The incorrect verb tense is due to the lack of tense information in AMR. SILVERAMR fails to capture the concept prep-without, generating an unfaithful first sentence. This demonstrates a potential issue with approaches trained with silver AMR data, where the input graph structure can be noisy, leading to a model less capable of encoding AMR semantics. On the other hand, SILVERSENT correctly generates sentences that describe the graph, while it still generates a grammatically incorrect sentence (wrongly generating que podía after desearía). This highlights a potential problem with approaches that employ silver sentence data, where the sentences used for training can be ungrammatical, leading to models less capable of generating a fluent sentence. Finally, SILVERAMR+SILVERSENT produces a more accurate output than both silver approaches, generating grammatically correct and fluent sentences with correct pronouns and mentions when control verbs and reentrancies (nodes with more than one entering edge) are involved.

Conclusion
The unavailability of gold training data makes multilingual AMR-to-text generation a challenging task. We have extensively evaluated data augmentation methods that leverage existing resources, namely a set of gold English AMR-to-text data and a corpus of multilingual parallel sentences. Our experiments have empirically validated that both sources of silver data - silver AMR with gold sentences and gold AMR with silver sentences - are complementary, and a combination of both leads to state-of-the-art performance on multilingual AMR-to-text generation tasks.

Appendices A Details of Models and Hyperparameters
The experiments were executed using version 4.4.0 of the transformers library by Hugging Face (Wolf et al., 2020). Table 4 shows the hyperparameters used to train our models. BLEU is used for model selection, computed on translated sentences of the LDC2017T10 development set. We train until development set BLEU has not improved for 6 epochs.
Table 4: Hyperparameters.
learning rate: 1e-04
batch size: 8
beam search size: 6
max source length: 350
max target length: 200

B Main Results: Additional Metrics
In Table 6 we present additional results on the multilingual LDC2020T07 test set using the METEOR (Denkowski and Lavie, 2014) and chrF++ (Popović, 2015) metrics.

C Results: Out of Domain Evaluation
In Table 5 we show BERTscore (Zhang et al., 2020) results for out of domain evaluation on the Weblog and WSJ subset of the LDC2020T07 dataset.