Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation

Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially in low resource conditions.


Introduction
Recently, neural approaches to language generation have become predominant in various tasks such as concept-to-text Natural Language Generation (NLG), Summarisation, and Machine Translation thanks to their ability to achieve state-of-the-art performance through end-to-end training (Dušek et al., 2018; Chandrasekaran et al., 2019; Barrault et al., 2019). Specifically in Machine Translation, deep learning models have proven easy to adapt to multilingual output (Johnson et al., 2017) and have been demonstrated to successfully transfer knowledge between languages, benefiting both low and high resource languages (Dabre et al., 2020).
In the concept-to-text NLG task, the language generation model has to produce a text that is an accurate realisation of the abstract semantic information given in the input (Meaning Representation, MR; see Figure 1). It is common practice to perform a delexicalisation (Wen et al., 2015) of the MR, in order to facilitate the NLG model's generalisation to rare and unseen input; lack of generalisation is a main drawback of neural models (Goyal et al., 2016) but is particularly prominent in concept-to-text. Delexicalisation consists of a preprocessing and a postprocessing step. In preprocessing, all occurrences of MR values in the text are replaced with placeholders. This way the model learns to generate text that is abstracted away from actual values. In postprocessing (relexicalisation), placeholders are re-filled with values. The main shortcoming of delexicalisation is that its efficacy is bounded by the number of values that are correctly identified. In fact, a naive implementation of "exact" delexicalisation (see Figure 1) requires the values provided by the MR to appear verbatim in the text, which is often not the case. This shortcoming is more prominent when expanding concept-to-text to the multilingual setting, as MR values in the target language are often only partially provided. Additionally, MR values are usually in their base form, which makes it harder to find them verbatim in the text of morphologically rich languages. Finally, relexicalisation also remains a naive process (see Figure 2) that ignores how context should affect the morphology of the MR value when it is added to the text (Goyal et al., 2016).
Figure 1: Delexicalisation examples from the WebNLG 2020 Challenge.
Gold Target Reference: bananaman first aired on the bbc on october 3rd, 1983 and broadcast its last episode on april 15th, 1986.
Exact Delexicalisation: X1 first aired on the X2 on october 3rd, 1983 and broadcast its last episode on april 15th, 1986.
Language Agnostic Delexicalisation (LAD): X1 first aired on the X2 on october X3 and broadcast its last episode on X4.

Gold Target Reference: amdavad ni gufa is located in india, where the leader is t s thakur and the demonym for people living there is indian.
Exact Delexicalisation: X1 is located in X2, where the leader is t s thakur and the demonym for people living there is indian.
Language Agnostic Delexicalisation (LAD): X1 is located in X2, where the leader is X3 and the demonym for people living there is X4.
We propose Language Agnostic Delexicalisation (LAD), a novel delexicalisation method that aims to identify and delexicalise values in the text independently of the language. LAD expands over previous delexicalisation methods and maps input values to the most similar n-grams in the text, by focusing on semantic similarity, instead of lexical similarity, over a language independent embedding space. This is achieved by relying on pretrained multilingual embeddings, e.g. LASER (Artetxe and Schwenk, 2019). In addition, when relexicalising the placeholders, the values are processed with a character-level post editing model that modifies them to fit their context. Specifically in morphologically rich languages, this post editing results in the value exhibiting correct inflection for its context.
Our goal is to explore the application of multilingual models with a focus on their generalisation capability to rare or unseen inputs. In this paper, we (i) apply multilingual models and show that they outperform monolingual models in concept-to-text, especially in low resource conditions; (ii) propose LAD and show that it achieves state-of-the-art results, especially on unseen input; (iii) provide experimental analysis across 5 datasets and 5 languages over models with and without pre-training.

Related Work
Multilingual generation techniques have mostly been the focus of Machine Translation (MT) as the appropriate data (multilingual parallel source and target sentences) are more readily available there. Earlier research enabled multilingual generation with no and partial parameter sharing (Luong et al., 2016; Firat et al., 2016), while Johnson et al. (2017) explored many-to-many translation with full parameter sharing in a universal encoder-decoder framework. Despite the successes of this many-to-many framework, the improvements were mainly attributed to the model's multilingual input. Later work improved on one-to-many translation (i.e. the input is always in a single language, while the output is in many) by introducing special label initialisation, language-dependent positional embeddings and a new parameter-sharing mechanism.
Figure 2: Relexicalisation examples; double underlining marks errors that ignore context.
Generated Output before Relexicalisation (assuming training with LAD): X1 first aired on the X2 on october X3 and broadcast its last episode on X4.
Exact Relexicalisation: bananaman first aired on the bbc on october 1983 10 03 and broadcast its last episode on 1986 04 15.
Automatic Value Post-Editing (VAPE): bananaman first aired on the bbc on october 3rd, 1983 and broadcast its last episode on april 15th, 1986.

Generated Output before Relexicalisation: X1 is located in X2 where X3 is the leader and the people are known as X4.
Exact Relexicalisation: amdavad ni gufa is located in india where t.s. thakur is the leader and the people are known as indian people.
Automatic Value Post-Editing (VAPE): amdavad ni gufa is located in india where t.s. thakur is the leader and the people are known as indians.
In other language generation tasks, the vast majority of datasets are only available with English output. To enable output in a different language, a number of Zero-Shot methods have been proposed with the most common practice being to directly use an MT model to translate the output into the target language (Wan et al., 2010; Shen et al., 2018; Duan et al., 2019). The MT model can be fine-tuned on task-specific data when those are available (Miculicich et al., 2019). For the purposes of this paper, we do not consider these previous works as multilingual, as the language generation model is disjoint from the multilingual component, i.e. the pipelined MT model. Contrary to this, Chi et al. (2020) proposed a cross-lingual pretrained masked language model to generate in multiple languages, outperforming pipeline models on Question Generation and Abstractive Summarisation.
An adaptation of Puduppully et al. (2019) was applied to multilingual concept-to-text NLG and participated in the Document-Level Generation and Translation shared task (Hayashi et al., 2019, DGT). However, this shared task, and by extension the dataset and participating systems, heavily focus on content selection and document generation. Additionally, the input's attributes are constant across training and testing, so there are no unseen data and no need to improve on the model's generalisation capability. As the goal of this paper is multilinguality (content selection is a language agnostic task) and generalisation, we opt to not use this dataset.
Multilinguality has also been explored in the related tasks of Morphological Inflection and Surface Realisation in the SIGMORPHON (McCarthy et al., 2019) and MSR (Mille et al., 2020) challenges. However, our Automatic Value Post-Editing approach focuses mostly on adapting values to context and does not assume additional input such as dependency trees, PoS tags or morphological information that Surface Realisation and Morphological Inflection often require.
Particularly for concept-to-text NLG, notable previous work includes the approach of Fan and Gardent (2020), who make use of pretrained language models through the Transformer architecture for AMR-to-text generation in multiple languages, and the WebNLG Challenge 2020 (Castro Ferreira et al., 2020). The goal of WebNLG 2020 was to generate output in both English and Russian but most of the participants focused on monolingual rather than multilingual approaches.

Rare and Unseen Inputs in NLG
Due to the existence of open-set and numerical attributes in the aforementioned datasets, it is common during testing for MRs to contain rare or unseen values. Certain datasets are even more challenging in this regard (e.g. WebNLG Challenge 2020) as they also contain unseen relations in the development and test subsets. Several techniques have been proposed to mitigate this problem.
Delexicalisation, also known as anonymisation or masking, is a pre/post-processing procedure that attempts to mitigate problems with data sparsity. In preprocessing, all values in the MR that appear verbatim in the target sentence are replaced in both input and output with specific placeholders, e.g. "X-" followed by the corresponding attribute (e.g. "X-type") so that the placeholder still captures relevant semantic information. In Figure 1 we use numbered placeholders instead, for clarity and space. The model is trained to generate the target text containing these placeholders, which are subsequently replaced with the corresponding true values (i.e. relexicalised) in post-processing. See Figures 1 and 2 for examples; we mark this strategy as Exact due to the exact matching of the values with the text. To improve delexicalisation accuracy, n-gram matching (Trisedya et al., 2018) has been proposed as an alternative. Thanks to its simplicity and efficacy, delexicalisation is widely used by many systems, including the winning systems of major concept-to-text NLG shared tasks (Gardent et al., 2017; Dušek et al., 2018; Castro Ferreira et al., 2020). Mapping the values as such can be sufficient for simple datasets, but otherwise, incorrect or incomplete delexicalisation will lead to inconsistent input and deteriorate performance.
Lastly, problems may also occur during relexicalisation as it does not take into account the context in which the placeholders are situated and may result in disfluent sentences. For a simplified example, observe how placing the unedited dates in the placeholders leads to disfluent output in Figure 2.
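The Exact strategy described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function names `exact_delexicalise` and `exact_relexicalise` are hypothetical, values are matched as plain substrings, and numbered placeholders are used as in Figure 1.

```python
def exact_delexicalise(mr_values, text):
    """Replace every MR value that appears verbatim in the text with a
    numbered placeholder; values that do not appear verbatim are left alone."""
    mapping = {}
    for idx, value in enumerate(mr_values, start=1):
        placeholder = f"X{idx}"
        if value in text:
            text = text.replace(value, placeholder)
            mapping[placeholder] = value
    return text, mapping


def exact_relexicalise(text, mapping):
    """Put the unedited values back, ignoring the surrounding context."""
    # Longest placeholders first, so that "X10" is not clobbered by "X1".
    for placeholder in sorted(mapping, key=len, reverse=True):
        text = text.replace(placeholder, mapping[placeholder])
    return text


values = ["bananaman", "bbc", "1983 10 03", "1986 04 15"]
reference = ("bananaman first aired on the bbc on october 3rd, 1983 "
             "and broadcast its last episode on april 15th, 1986.")
delexicalised, found = exact_delexicalise(values, reference)
print(delexicalised)  # the two dates are missed: they never appear verbatim
```

Relexicalising a fully delexicalised output with the unedited date values reproduces the disfluent "october 1983 10 03" of Figure 2, which is the problem that context-aware post-editing is meant to address.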
Segmentation strategies are commonly used in Neural Machine Translation to improve the generalisation ability of models. The objective is to break down words into smaller units, reducing the vocabulary and the number of unseen tokens (Sennrich et al., 2016). Unfortunately, applying segmentation in concept-to-text NLG, e.g. using Byte-Pair-Encoding (BPE) subword units (Gardent et al., 2017) or using characters as basic units (Goyal et al., 2016; Agarwal and Dymetman, 2017; Deriu and Cieliebak, 2018), underperforms against delexicalisation. Challenges include capturing long dependencies between segmented words, and generating non-existing words.
The copy mechanism is another method to address unseen input, by allowing the decoder of an encoder-decoder model to draw a token directly from the input sequence instead of generating it from the decoder vocabulary (See et al., 2017). While applications of the copy mechanism in concept-to-text NLG have achieved overall good results (Chen, 2018; Elder et al., 2018; Gehrmann et al., 2018), when dealing with rare and unseen inputs delexicalisation is still preferable (Shimorina and Gardent, 2018). To improve the generalisation ability of copy mechanism models, Roberti et al. (2019) propose applying the copy mechanism to character-level NLG systems. This is combined with an additional optimisation phase during training where the encoder and decoder are switched.

Language Agnostic Delexicalisation
In order to address the shortcomings of previous approaches to generalise over rare or unseen inputs, especially in cases of multilingual output, we propose Language Agnostic Delexicalisation (LAD). Figure 3 shows an overview of LAD; the input and output are first delexicalised using pretrained language-independent embeddings, and (optionally) ordered. The multilingual generation model is trained on the delexicalised training data, and the output is relexicalised using automatic value post-editing to ensure that the values fit the context. Each component is described in more detail below.
To enable multilingual generation, we adapt the universal encoder-decoder framework via "target forcing" (Johnson et al., 2017) since it can be directly applied to any NLG model without the need to modify the latter's architecture. To do so, we extend the input MR in the encoder with a language token that signals which language the model should generate output in. In addition, following prior work on one-to-many translation, we initialise the decoder with the language token. The rest of the components (i.e. delexicalisation, ordering, and value post-editing) are orthogonal to the model's architecture.
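As a rough illustration of target forcing, the snippet below attaches a language token to the encoder input and returns the same token for initialising the decoder. The token format `<2xx>`, the choice to prepend rather than append, and the function name `add_language_token` are assumptions made for this sketch; the text above does not specify them.

```python
def add_language_token(mr_tokens, lang):
    """Return the encoder input extended with a language token, together
    with the token used to initialise the decoder.

    The "<2xx>" format and the prepending are illustrative assumptions.
    """
    token = f"<2{lang}>"
    return [token] + list(mr_tokens), token


encoder_input, decoder_start = add_language_token(
    ["ENTITY", "1", "meyer", "werft", "location", "ENTITY", "2", "germany"], "ru"
)
print(encoder_input[0], decoder_start)
```

Because only the input sequence and the decoder's first symbol change, the same wrapper can sit in front of any encoder-decoder NLG model.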

Value Matching
As discussed in Section 3, one of the challenges of delexicalisation is matching the MR values with corresponding words in the text, especially in the multilingual setting. Even when the MR values are in the same language as the target, we observe from the examples in Figure 1 that token overlapping methods (i.e. exact and n-gram matching) are not sufficient to generate a complete and accurate delexicalisation as values may appear differently.
To counter this problem, LAD performs matching by mapping MR values to n-grams based on the similarity of their representations. Specifically, it calculates the similarity between a value $v$ and all word n-grams $w_i \dots w_j$ in the text, with $j - i < n$ and $n$ set to the maximum value length observed in the training data. LAD employs LASER (Artetxe and Schwenk, 2019) to generate language agnostic sentence embeddings of the values and n-grams, and calculates their distance via cosine similarity. Given an MR and text, all possible value and n-gram comparisons are calculated and the matches are determined in a greedy fashion.
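The matching procedure can be sketched as follows. To keep the example self-contained, `toy_embed` is a stand-in for LASER that uses a bag of character trigrams, so it only captures surface similarity and is not language agnostic; in practice a multilingual sentence encoder would be passed as `embed`. The `threshold` parameter, the non-overlap constraint and all function names are assumptions of this sketch, not details given above.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a multilingual sentence encoder such as LASER
    (assumption): a sparse bag of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_values(values, tokens, max_len, embed=toy_embed, threshold=0.3):
    """Greedily map each MR value to a word n-gram w_i..w_j, j - i < max_len."""
    candidates = []
    for v_idx, value in enumerate(values):
        v_emb = embed(value)
        for i in range(len(tokens)):
            for j in range(i, min(i + max_len, len(tokens))):
                score = cosine(v_emb, embed(" ".join(tokens[i:j + 1])))
                candidates.append((score, v_idx, i, j))
    # Greedy step: accept the best-scoring pairs first, skipping values that
    # are already matched and spans that overlap an accepted span.
    candidates.sort(key=lambda c: -c[0])
    matches, used = {}, set()
    for score, v_idx, i, j in candidates:
        span = set(range(i, j + 1))
        if score < threshold or v_idx in matches or span & used:
            continue
        matches[v_idx] = (i, j)
        used |= span
    return matches


tokens = "amdavad ni gufa is located in india , where the leader is t s thakur".split()
spans = match_values(["amdavad ni gufa", "india", "t s thakur"], tokens, max_len=3)
print(spans)
```

Each matched span would then be replaced by a placeholder in the target text. With a semantic encoder in place of `toy_embed`, the same loop can align values with paraphrases or with text in another language.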

Generic Placeholders and Ordering
In Section 3, we discussed how the WebNLG datasets are more challenging because they contain unseen attributes in the development and test subsets, in addition to unseen values. This is problematic when we use attribute-bound placeholders (e.g. "X-type") as unseen attributes will result in unseen placeholders. Following Trisedya et al. (2018), for the WebNLG datasets, LAD uses numbered generic placeholders "X#" (e.g. "X1"). Unfortunately, the adoption of generic placeholders creates problems for relexicalisation as it becomes unclear which input value should replace which placeholder. We address this by ordering the model's input based on the graph formed by its RDF triples, again by following Trisedya et al. (2018). We traverse every edge in the graph, starting from the node with the least incoming edges (or randomly in case of ties) and then visit all nodes via BFS (breadth-first search). We then trust that the model will learn to respect the input order when generating, and follow the order to relexicalise the placeholders.
We note that this is only required for models that employ delexicalisation strategies and for datasets with unseen attributes (i.e. the WebNLG Challenge datasets). Concept-to-text NLG systems do not generally require ordered input (Wen et al., 2015).
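The ordering step can be sketched as a breadth-first traversal over the graph formed by the triples. This is an illustrative reading of the description above, with two stated assumptions: ties between start nodes are broken by first appearance (instead of randomly), and the traversal restarts from the next unvisited node when the graph is disconnected. The function name `order_triples` is hypothetical, and the relation names in the example are illustrative.

```python
from collections import defaultdict, deque


def order_triples(triples):
    """Order (subject, relation, object) triples by a BFS over the RDF graph,
    starting from the node with the fewest incoming edges."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = []
    for subj, rel, obj in triples:
        out_edges[subj].append((rel, obj))
        in_degree[obj] += 1
        for node in (subj, obj):
            if node not in nodes:
                nodes.append(node)

    ordered, visited, emitted = [], set(), set()
    # sorted() is stable, so ties are broken by order of first appearance.
    for start in sorted(nodes, key=lambda n: in_degree[n]):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for rel, obj in out_edges[node]:
                edge = (node, rel, obj)
                if edge in emitted:
                    continue
                emitted.add(edge)
                ordered.append(edge)
                if obj not in visited:
                    visited.add(obj)
                    queue.append(obj)
    return ordered


triples = [
    ("India", "leaderName", "T_S_Thakur"),
    ("Amdavad_ni_Gufa", "country", "India"),
    ("India", "demonym", "Indian_people"),
]
for triple in order_triples(triples):
    print(triple)
```

Numbering the placeholders in this traversal order gives the relexicalisation step a fixed correspondence between each "X#" and an input value, provided the model respects the input order.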

Automatic Value Post-Editing
As discussed in Section 3, a naive relexicalisation of the placeholders may lead to disfluent sentences, as the procedure does not take into account the context in which the placeholders have been placed. For example, in the sentence "there are 2 X that have free parking", if we need to replace the placeholder "X" with the MR value "guesthouse", the value should be pluralised to fit the context. This problem is more evident in morphologically rich languages, where more factors affect the value's form. To alleviate this, the LAD framework incorporates an Automatic Value Post-Editing component, consisting of a character-level seq2seq model that iterates over values as they are placed in the text and modifies them to fit the context of their respective placeholders. Anastasopoulos and Neubig (2019) have already shown the benefits of character models on morphological inflection generation, but no previous work has addressed how relexicalisation should adapt to context. Our proposed VAPE model requires as input the MR placeholder $e_i$, the original value $v_i$ and the corresponding NLG output $w_1 \dots w_n$ for context; these are serialised and passed to the encoder. Similar to the multilingual model, we add an appropriate language token $L$ before the NLG output. The output of VAPE is the MR value $v_i$ in the proper form.
The training signal for VAPE is obtained during delexicalisation. For a given delexicalisation strategy, we obtain all pairs of MR values and matching n-grams in the training data, and subsequently train VAPE using these n-grams as the targets. Therefore, the VAPE model is dependent on the quality of the delexicalisation strategy; specifically for exact delexicalisation, VAPE cannot be trained as the MR values and matching n-grams are the same.
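One way to picture how such training examples could be assembled is shown below. The exact serialisation is not specified above, so the field separator `<sep>`, the visible-space symbol, the `<en>`-style language token and the decision to keep placeholders as single symbols are all assumptions of this sketch, as are the function names.

```python
import re

SEP = "<sep>"                     # hypothetical field separator
SPACE = "\u2581"                  # hypothetical visible-space symbol
PLACEHOLDER = re.compile(r"X\d+")


def to_chars(text):
    """Split a string into characters, making spaces visible."""
    return [SPACE if ch == " " else ch for ch in text]


def serialise_vape_input(placeholder, value, lang, nlg_output):
    """Encoder input: placeholder e_i, value v_i, language token L, and the
    NLG output w_1..w_n spelled out character by character."""
    source = [placeholder, SEP, *to_chars(value), SEP, f"<{lang}>"]
    for k, token in enumerate(nlg_output.split()):
        if k:
            source.append(SPACE)
        # Placeholders stay atomic; every other token is split into characters.
        source.extend([token] if PLACEHOLDER.fullmatch(token) else list(token))
    return source


def build_vape_examples(matches, lang, delex_text):
    """matches: (placeholder, MR value, matched n-gram) triples collected
    during delexicalisation; the matched n-gram is the training target."""
    return [
        (serialise_vape_input(p, v, lang, delex_text), to_chars(ngram))
        for p, v, ngram in matches
    ]


examples = build_vape_examples(
    [("X3", "1983 10 03", "3rd, 1983")],
    "en",
    "X1 first aired on the X2 on october X3",
)
source, target = examples[0]
print("".join(target))
```

Under exact delexicalisation every target would equal its value, which is why no useful training signal is available in that setting.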
Most edits VAPE performs concern incorrect inflections, but it is not limited to morphological edits and has the potential to deal with various types of modifications. During our experiments we observed VAPE performing value re-formatting (e.g. "1986 04 15" → "April 15th 1986"), synonym generation (e.g. "east" → "oriental") and value translation (e.g. "bbc" from Latin to Cyrillic).
The WebNLG Challenge 2017 (Gardent et al., 2017, WebNLG17) data consists of sets of RDF triples paired with corresponding English texts in 15 DBPedia categories. For our purposes, we use a later work (Shimorina et al., 2019) that introduced a machine-translated Russian version of WebNLG17, a part of which was post-edited by humans. Due to the limited amount of human-corrected Russian sentences, and to facilitate the most accurate evaluation, we use these solely for testing. To ensure that half of the domains in the new test set remain unseen during training, we create our own train/dev/test split by withholding the following DBPedia categories from the training and development sets: Astronaut, Monument and University.
The latest incarnation of the WebNLG Challenge (Castro Ferreira et al., 2020, WebNLG20) is fully human annotated for both English and Russian. We use this as the main dataset in our experiments, as it is designed to promote multilinguality. However, due to the fact that the provided test set does not contain unseen Russian instances, we perform our experiment on a custom split (WebNLG20*) ensuring that part of the domains in the test data remain unseen during training. The split was performed similarly to the previously described WebNLG17.
MultiWOZ 2.1 (Eric et al., 2020) and CrossWOZ (Zhu et al., 2020) are datasets of dialogue acts and corresponding utterances in English and Chinese respectively. The two datasets share the same structure, with MultiWOZ covering 7 domains and 25 attributes, and CrossWOZ covering 5 domains and 72 attributes; 4 of the domains are common to both datasets, though CrossWOZ has more attached attributes. Multilingual WOZ 2.0 (Mrkšić et al., 2017) is also a dialogue dataset with utterances available in three languages: English, Italian and German. Its scope is more limited than MultiWOZ and CrossWOZ as it only covers a single domain.
For all models in our experiments, the input consists of a simple linearisation of the MRs. Particularly, for the delexicalisation based models, the values are extended with their respective placeholders as shown in the following example: "ENTITY 1 meyer werft location ENTITY 2 germany".
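A linearisation of this kind can be sketched as follows. The function name `linearise` is hypothetical, and repeating the placeholder every time an entity recurs is an assumption; only the single-triple example above is given in the text.

```python
def linearise(triples):
    """Flatten (subject, relation, object) triples into one string, prefixing
    every entity with a numbered placeholder."""
    ids, parts = {}, []
    for subj, rel, obj in triples:
        for item, is_entity in ((subj, True), (rel, False), (obj, True)):
            if is_entity:
                # An entity keeps the number it received on first appearance.
                idx = ids.setdefault(item, len(ids) + 1)
                parts.append(f"ENTITY {idx} {item}")
            else:
                parts.append(item)
    return " ".join(parts)


print(linearise([("meyer werft", "location", "germany")]))
```

Feeding the triples in traversal order keeps the placeholder numbers aligned with the order in which values are expected to be relexicalised.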

Ablation Study
First we perform an ablation study to determine how the different components of LAD (ordering and VAPE) affect its performance; LAD being our full Language Agnostic Delexicalisation model as described in Section 4. In addition to LAD, where these components are incrementally removed, we explore how their addition would influence exact and n-gram delexicalisation (Trisedya et al., 2018). We do not explore adding VAPE to exact delexicalisation (there is no EXACT + O + V variant), as it cannot be trained in this setting (see Section 4.3).
In Table 1, we observe that both components are beneficial, but less so for seen English data. For the more morphologically rich and lower resourced Russian, the components are helpful for both seen and unseen. VAPE leads to an improvement in performance in almost all cases and even when added on NGram. An exception is unseen English data, where removing VAPE is beneficial; this suggests that VAPE is overeager to make edits in English.
By studying the output, we observe that VAPE modified 20% of values in English, and 66% in Russian; directly copying the value was insufficient in Russian where proper inflection is needed. We identified three consistent errors where copying the original value would be preferable to using VAPE: the removal of date information (e.g. "1969-09-01" → "1st, 1969"), misspelling of proper nouns (e.g. "atatürk monument" → "atat erk monument"), and mishandling of long values (e.g. "ottoman army soldiers killed in the battle of baku" → "ottoman army soldiers killed in the batttle of kiled in the bathe batom"). We observe that these errors occur more frequently for English unseen cases, but could be reduced by extending VAPE with a control mechanism that decides whether copying the values themselves is preferable. Such errors occur in part because VAPE, as a character-level model, suffers from the same challenges as other segmentation methods (see Section 3). However, since VAPE's input is much shorter, the problem is not as prevalent. Overall, LAD outperforms the previous delexicalisation strategies Exact and NGram, and VAPE is shown to be integral to its performance.

Monolingual vs Multilingual
Here we explore the performance of monolingual and multilingual models on concept-to-text datasets. The Word model has the exact same architecture as LAD but no delexicalisation is performed, and consequently no automatic value post-editing and no ordering. Since no relexicalisation needs to occur during post-processing, the input to the Word model need not be specifically ordered, and is just a concatenation of the RDF triples as they appear in the original dataset. For multilingual, we add the appropriate language tokens to the input of Word, in the same manner we added them to LAD. For the monolingual (Mono) configuration we train the models to produce a single language, while for multilingual (Multi) we train them to produce all languages available in that dataset. Please refer to Table 2 for the results. We observe that the multilingual models outperform their monolingual counterparts in most datasets and languages, especially with LAD as its delexicalisation and relexicalisation modules are more robust to multilingual input and output. Specifically for the MultiWOZ and CrossWOZ datasets, in the monolingual setting the models are trained exclusively on the respective dataset, i.e. MultiWOZ for English, and CrossWOZ for Chinese. For multilingual, we take advantage of the fact that these datasets share the same structure, and train the models on both datasets. For English, we observe that the multilingual model improves, suggesting that domain knowledge is transferred from CrossWOZ. For Chinese however, the multilingual Word model underperforms. This is not very surprising, as the overlap between the datasets is favourable to MultiWOZ, i.e. most of the attributes of MultiWOZ also appear in CrossWOZ, while the majority of CrossWOZ's attributes do not appear in MultiWOZ.

Multilingual Generalisation
Table 3 contains full results for English and Russian on WebNLG20*. We include the Word configuration (see Section 5.2), as well as Char, BPE, and SP, which are variations that use characters, Byte-Pair-Encoding, and SentencePiece as subword units respectively. Copy refers to the copy mechanism model by Roberti et al. (2019). The SP model performs very well for seen categories, but fails to generalise on unseen data. The Copy model performs well for unseen categories in English, but underperforms in Russian as values for it are only partially translated, i.e. some values in the MR may appear in English while others appear in Russian. This is challenging for Copy models as the target reference does not closely match the input, but LAD can handle it more robustly. Observing the output, LAD's main advantage is that it avoids under- and over-generating values as they are being controlled by the placeholders. SP is often the most fluent of the models. Overall, LAD helps the multilingual model outperform all other models in both English and Russian. It is especially beneficial in generalising to unseen data, which was its main objective.

Generalising with Pretrained Models
Here we explore the generalisation capabilities of multilingual pretrained models, by replacing the underlying NLG model with mBART (Liu et al., 2020), a multilingual denoising autoencoder pretrained on a large-scale dataset containing 25 languages (CC25). Similarly to Kasner and Dušek (2020), we fine-tune mBART with the default EN-RO configuration for up to 10000 updates. Using mBART as the underlying model also helps facilitate a comparison against a configuration that is similar to many of the state-of-the-art participants in the WebNLG 2020 Challenge, although some of them used different pretrained models. Table 4 shows the performance of the fine-tuned models on the WebNLG20* dataset. The mBART-based model outperforms the non-delexicalisation SP and the non-pretrained LAD in English. However, LAD still performs better in Russian. This makes sense as the CC25 dataset is heavily biased towards English: it contains twice as many English tokens as Russian ones, and far more than for other lower-resource languages. Combining the LAD framework with mBART (mB-LAD) resulted in a general improvement in performance, especially for lower-resource unseen data. However, as discussed in Section 5.1, the VAPE component remains to some degree susceptible to unseen contexts. To tackle this issue, we improve VAPE by pre-loading mBART and fine-tuning it for value post-editing as well (mB-LAD+), achieving increases of 3 and 29 BLEU points for unseen English over the vanilla mBART and LAD models respectively, and of 26 and 20 points for unseen Russian. Additionally, to take advantage of mBART's denoising ability, we extend the fine-tuned VAPE to edit the "exact" relexicalised NLG output and provide a sentence-level output (mB-LAD-SPE), i.e. edits are not exclusively focused on the values. Results show that mB-LAD-SPE further improves on mB-LAD+ in Russian for both seen and unseen data.
Tables 5 and 6 also show the automatic evaluation of the fine-tuned mBART models on the official WebNLG20 Challenge test set; the official test set had no unseen subset for Russian. The results are consistent with the findings in our previous experiments, with small improvements of LAD-based mBART models over the mBART baseline.

Synthetic Data
We use the WebNLG17 automatically translated Russian "silver" data to determine how useful they are for training multilingual concept-to-text NLG. As preliminary results were not promising, we limit the scope of the experiment to only a few systems. Table 7 gathers the results. It is apparent that automatically translated data are insufficient; LAD seems to achieve higher performance more consistently than other models, but all scores are too low to draw any sufficiently supported conclusions.

Conclusion
We proposed Language Agnostic Delexicalisation, a novel delexicalisation framework that matches and delexicalises MR values in the text independently of the language. For relexicalisation, an automatic value post editing model adapts the values to their context. Results show that multilingual models outperform monolingual models, and that LAD outperforms previous work in improving the performance of multilingual models, especially in low resource conditions. LAD also improves on the performance of pre-trained language models achieving state-of-the-art results. The automatic value post editing component is especially beneficial in morphologically rich languages.

A Configurations
The multilingual NLG and VAPE models use a Transformer as the underlying architecture. We use the fairseq toolkit for our experiments (Ott et al., 2019). The models are trained with shared embeddings, 8 attention heads, 6 layers, a hidden size of 512, and a size of 2048 for the feed-forward layers. We trained with 0.3 dropout and the Adam optimiser with a learning rate of 0.0005. The NLG models are trained with early stopping and patience set to 20. Automatic value post-editing models are trained with the same configuration but with patience set to 6. For the copy mechanism-based model we use the EDA-CS implementation provided by Roberti et al. (2019) with the default configuration. Due to its extremely high computational training cost, the models are trained for 15 epochs. BPE and SentencePiece (Kudo and Richardson, 2018) models are trained with a vocabulary size set to 12000 tokens.
B Input examples
Figure 6 shows some examples of how, during training, LAD maps MR values to n-grams of the target reference, based on the similarity of their representations. We can observe that these values could not have been matched by exact and n-gram delexicalisation as they constitute significant paraphrases of the value. Figures 4 and 5 show some additional examples of delexicalisation and relexicalisation for the various approaches from the WebNLG Challenge 2020. Table 8 shows more delexicalisation examples from the WebNLG, MultiWOZ and CrossWOZ datasets, where we can observe the shortcomings of exact and n-gram delexicalisation.

Figures 4 and 5: Delexicalisation and relexicalisation strategies on English and Russian examples from the WebNLG 2020 Challenge.
Gold Target References:
bananaman first aired on the bbc on october 3rd, 1983 and broadcast its last episode on april 15th, 1986.
бананамен впервые вышел в эфир на bbc 3 октября 1983 года, а его последний эпизод вышел 15 апреля 1986 года.
Language Agnostic Delexicalisation (LAD): X1 first aired on the X2 on october X3 and broadcast its last episode on X4.

Gold Target References:
amdavad ni gufa is located in india, where the leader is t s thakur and the demonym for people living there is indian.
Exact Delexicalisation: X1 is located in X2, where the leader is t s thakur and the demonym for people living there is indian.
Language Agnostic Delexicalisation (LAD): X1 is located in X2, where the leader is X3 and the demonym for people living there is X4.
Exact Relexicalisation: amdavad ni gufa is located in india where t.s. thakur is the leader and the people are known as indian people.
Automatic Value Post-Editing (VAPE): amdavad ni gufa is located in india where t.s. thakur is the leader and the people are known as indians.

Figure 6: Examples of LAD's value mapping to target reference n-grams.

MR: William_Anders, dateOfRetirement, "1969-09-01"; William_Anders, occupation, Fighter_pilot; William_Anders, birthPlace, British_Hong_Kong; William_Anders, was a crew member of, Apollo_8
SP: the birth place of greek born , adonis georgiadis , is the company , of which was in office at the same time that m ogenenenenenenenenville , new britain , connecticut , is a member of the order of poales and a division of 45000 kilometres .
Copy: william anders was born in british hong kong and has a crew mew member of the fighter pilot . the was a crew member of the was a crew member of the was a crew member of the was a crew member of
LAD: william anders , which was followed by 1st , 1969 and fighter pilot , was born in british hong kong and has been a number of apollo 8 .

MR: Trane, numberOfEmployees, 29000
SP: trane has a revenue of $ 10,264,000,000 , with a net income of $ 556,300,000 and a revenue of $ 10,264,000,000 .
Copy: trane , a company with 29,000 employees , has 29,000 employees and was connected at $ 556,300,000 .
LAD: trane , which has a revenue of $ 10,264,000,000 , has a net income of $ 556,300,000 and employs 29,000 people .