Promoting Graph Awareness in Linearized Graph-to-Text Generation

Generating text from structured inputs, such as meaning representations or RDF triples, has often involved the use of specialized graph-encoding neural networks. However, recent applications of pretrained transformers to linearizations of graph inputs have yielded state-of-the-art generation results on graph-to-text tasks. Here, we explore the ability of these linearized models to encode local graph structures, in particular their invariance to the graph linearization strategy and their ability to reconstruct corrupted inputs. Our findings motivate solutions to enrich the quality of models' implicit graph encodings via scaffolding. Namely, we use graph-denoising objectives implemented in a multi-task text-to-text framework. We find that these denoising scaffolds lead to substantial improvements in downstream generation in low-resource settings.


Introduction
Parameter-rich pretrained transformer language models succeed at generating text that is prima facie fluent, but that closer inspection will often reveal to be semantically transgressive (Bisk et al., 2020). Indeed, there is limited practical use for unconditional text generation: we expect language to relate to some identifiable, extrinsic meaning. When a system communicates information to an individual in natural language, it will typically rely on a structured representation of that information. Consequently, generating text that faithfully conveys structured data is an important goal in NLP, where inputs can take the form of tables (ToTTo, Parikh et al., 2020), RDF triples (e.g., WebNLG, Gardent et al., 2017), or Abstract Meaning Representations (AMR, Flanigan et al., 2016).
To accomplish this task, models have used neural architectures that explicitly encode graphs, such as graph neural networks (GNNs, Kipf and Welling, 2017) and graph transformers, in order to accurately capture the structural properties of the input graph (Zhu et al., 2019; Zhao et al., 2020; to name a few). As an alternative to constraining a model architecture with a graph structure, another line of work linearizes a graph into a string (Figure 2) and trains a sequence-to-sequence model from scratch (Pourdamghani et al., 2016; Konstas et al., 2017; Vinyals et al., 2015). Initially, this approach was outperformed by graph-based encoders, but such models have recently seen their generation performance far surpassed by pretrained transformer language models (LMs) finetuned on pairs of linearized graphs and their corresponding surface realizations (Mager et al., 2020; Kale and Rastogi, 2020; Harkous et al., 2020; Ribeiro et al., 2020, henceforth termed pretrained linearized models). Moreover, both automated and human assessments indicate that text generated with LMs retains meaning at least as well as graph-encoding baselines (Mager et al., 2020). This is not the sole product of pretrained models' general language knowledge: Mager et al. (2020), using a GPT-2-based (Radford et al., 2019) model, report that ablating structural graph information (e.g., edges) in the linearized representation notably degrades generation performance, particularly in AMR-to-text tasks. The remarkable performance of pretrained linearized models is intriguing: explicit representation of the input graph by way of the model architecture appears to be well-substituted by simply writing the graph as a linear sequence.
[Figure 1: (1) Linearize graph; (2) finetune with one linearization; (3) evaluate with an alternative linearization.]
* Work undertaken during an internship at AI2.
In this work, we further investigate the extent to which pretrained models can leverage linearized graph inputs. Focusing on AMR graphs and sets of RDF triples in English-language datasets, we structure our investigation by first testing whether models' encodings are invariant to the linearization strategy, that is, the way in which a graph is traversed and encoded when producing the linearized representation (see Figure 1). We discover that generation suffers under adversarial permutations of the linearization, and embrace a simple-but-effective training strategy to mitigate this problem: adversarial training (Goodfellow et al., 2015). Motivated by this finding, we encourage more faithful encodings of graph structure via denoising objectives in the more complex AMR setting. This multi-task scaffolding reveals that straightforward masking of the graph input is sufficient to improve generation quality in low-resource settings.
Moreover, when treating this denoising performance as a proxy for the quality of models' implicit graph encoding, we find that it explains the semantic fidelity of the resulting generation better than reasonable alternatives, suggesting possibilities for future evaluation metrics.

Background: Graph-to-Text Generation
In a graph-to-text setting, we transduce graph inputs $g$ to their corresponding surface realization $y = y_1, \ldots, y_N$ via a parameterized probabilistic model $p_\theta(\cdot)$. In linearized models specifically, the graph $g$ is first mapped to text by way of a (usually deterministic) linearization function $x = l(g)$, where $p_\theta(\cdot)$ is an off-the-shelf sequence-to-sequence model. This leads to the likelihood objective:
$$p_\theta(y \mid g) = \prod_{i=1}^{N} p_\theta(y_i \mid x, y_{1:i-1}).$$
When $p_\theta(\cdot)$ is a left-to-right (autoregressive) pretrained transformer, generation quality far exceeds that of architectures with encoders specifically engineered to encode graphs (Mager et al., 2020; Kale and Rastogi, 2020; Harkous et al., 2020; Ribeiro et al., 2020).

Graph-to-Text Generation Datasets
We explore two datasets for generation from a graph structure to English text.
Abstract Meaning Representation (AMR, Banarescu et al., 2013) is a formalism intended to represent the propositional meaning of utterances ("who is doing what to whom") using graphs that have minimal dependence on the surface form. AMR graphs are directed and acyclic with a single "top" node (Goodman, 2020). They can be represented as either a graph, a tree, or sets of triples (van Noord and Bos, 2017). For our data, we use the AMR 2.0 release (LDC2017T10), both because it spans a varied set of domains and styles, and because of its extensive use in prior work.
A simpler graph-to-text problem involves converting a set of RDF triples to natural text realizations of the information contained in the set, exemplified by the WebNLG dataset (Gardent et al., 2017). WebNLG pulls information from an existing knowledge base (DBPedia, Mendes et al., 2012) for a specific subset of 15 categories (e.g., "astronaut"). To generate the paired sentences, crowdworkers verbalize individual triples. Then, for examples consisting of multiple triples, they merge already-annotated sentences and apply minimal changes (leading to reduced sentence complexity relative to AMR, see perplexity scores in Table 1). There can be multiple surface realizations per input.
We modify the T5 implementation from the transformers library (Wolf et al., 2020). We use the Adafactor optimizer (Shazeer and Stern, 2018) with a learning rate of 0.0001, selected from the set {0.001, 0.0001, $3 \times 10^{-5}$, $1 \times 10^{-5}$, $1 \times 10^{-6}$} after tuning on 1000 training examples across five random seeds. We set the batch size to 6 and train until development set BLEU has not improved for 10 epochs. During decoding, we use a beam size of 10 for WebNLG and 5 for AMR.
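The stopping rule above (halt once development-set BLEU has gone 10 epochs without improving) can be sketched as follows. This is a minimal illustration, not the authors' training code: the callables `run_epoch` and `evaluate_bleu`, the `max_epochs` cap, and the layout of `CONFIG` are hypothetical; only the numeric values are taken from the text.

```python
# Hyperparameter values reported in the text; the dictionary layout is an assumption.
CONFIG = {
    "optimizer": "Adafactor",
    "learning_rate": 1e-4,
    "batch_size": 6,
    "patience_epochs": 10,
    "beam_size": {"WebNLG": 10, "AMR": 5},
}


def train_with_early_stopping(run_epoch, evaluate_bleu, patience=10, max_epochs=1000):
    """Train until dev BLEU has not improved for `patience` consecutive epochs.

    `run_epoch(epoch)` performs one pass over the training data and
    `evaluate_bleu(epoch)` returns the development-set BLEU (both hypothetical).
    Returns (best_bleu, best_epoch, last_epoch_run).
    """
    best_bleu, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch in range(max_epochs):
        last_epoch = epoch
        run_epoch(epoch)
        bleu = evaluate_bleu(epoch)
        if bleu > best_bleu:
            best_bleu, best_epoch = bleu, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_bleu, best_epoch, last_epoch
```

In practice one would also checkpoint the model at `best_epoch`, which this sketch omits.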
Evaluation Measures. As a primary metric, we evaluate generated text using BLEU (Papineni et al., 2002), calculated with SacreBLEU (Post, 2018). Despite its limitations in generation settings, BLEU still generally accords with rankings of models, either by human evaluations or by alternate metrics (Manning et al., 2020). We also evaluate our scaffolding models (§4) using BertScore, which measures token similarity with contextual embeddings, permitting a more nuanced measure of semantic similarity. Lastly, we use the M portion of the MF-score (Opitz and Frank, 2020), which measures how well the source AMR graph can be reconstructed from the generated target sentence using an off-the-shelf parser. Unlike BLEU, which applies corpus-wide, this metric provides a best guess at sentence-level accuracy.
[...] particular method used to linearize the input graph. Motivated by the strong graph-to-text performance of these models, we ask: do they implicitly develop a robust internal encoding of the input graph? Whereas a GNN-based model has an architecture designed for graph representation (e.g., information flows between adjacent nodes in a message-passing update), a linearized model must infer how connections are specified in a sequence during training.
If linearized models do form a representation, then their estimates of the target sentence should be invariant to an alternative linearization of the same graph, so long as the original linearization is in principle recoverable from this alternative. If a model meets this criterion, we call it linearization-invariant.
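One way to write this criterion down, offered as a sketch rather than a definition taken from the text, reuses the symbols $g$, $y$, $l$, and $p_\theta$ from the background section. The alternative linearization function $l'$ is notation introduced here for illustration.

```latex
% l' is an alternative linearization from which l(g) is recoverable.
% A model is linearization-invariant if, for every graph g and target y,
\[
  p_\theta\bigl(y \mid l(g)\bigr) \;=\; p_\theta\bigl(y \mid l'(g)\bigr).
\]
% Exact equality is an idealization; empirically, invariance can only be
% assessed approximately, e.g., by comparing generation quality under l and l'.
```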

Experimental Setup
To better understand models' graph-encoding behavior, we experiment with adversarial linearization strategies in two graph-to-text settings.

Permutations of AMR-Graph Linearizations
Standard AMR corpora are linearized as spanning trees over the graphs in PENMAN notation (Matthiessen and Bateman, 1991; see Fig. 2a). In the present work, we also linearize graphs using PENMAN, doing so for several reasons: (1) it is sufficiently flexible to accommodate significant changes to the linearization, discussed below; (2) it is more concise than sets of directed triples, both reducing training time and ensuring that inputs fit in the transformer context window; (3) the format leads to superior generation over reasonable alternatives, e.g., DFS traversal paths (Mager et al., 2020).
We will refer to the human-created linearizations in AMR corpora as CANONICAL, since annotators follow a standardized process. There is evidence that this format, in particular the relative ordering of edge types, leaks information about the associated sentence order (Konstas et al., 2017). We speculate that overparametrized models may overfit to such correlations rather than develop robust implicit graph encodings, since it has been repeatedly reported that large models use dataset shortcuts (Jia and Liang, 2017;Gururangan et al., 2018;Geva et al., 2019, among others).
As an alternative linearization, Goodman (2020) defines the RECONFIGURE operation as creating a tree from an AMR graph, where order information from the canonical linearization is ignored, except for the top node (e.g., "and" in Figs. 2a and 2b).
Although it is not a labeled element in the graph, the top node conveys structural information about the sentence; for instance, it is often the main verb. Reconfiguration can include reversals of edge labels (e.g., ARG0 to ARG0-of), therefore constituting a substantive change to the linearization. We also experiment with a more drastic restructuring of the graph, where we construct a tree from a RANDOMIZED triple set alone, disregarding all order information from the canonical format (Fig. 2c). Since it remains a valid traversal of the graph, in principle a model should be able to use this information to construct the surface sentence.
We parse, reconfigure, and randomize graphs using the Penman library (Goodman, 2020), 4 then replace variable names with their references and remove word sense information, following Ribeiro et al. (2019).
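To make the idea of a randomized traversal concrete, the following pure-Python sketch builds a PENMAN-style string from a triple set by a depth-first walk that visits edges in random order and inverts edge labels when an edge is traversed against its direction. It is an illustration under simplifying assumptions (a connected graph, no attributes), not the Penman library's implementation; the function names and the toy graph for "The boy wants to go" are hypothetical.

```python
import random
from collections import defaultdict


def invert(role):
    """Flip an edge label, e.g. :ARG0 <-> :ARG0-of."""
    return role[:-3] if role.endswith("-of") else role + "-of"


def randomized_linearization(top, concepts, edges, seed=None):
    """Linearize a connected graph as a tree rooted at `top`, in random edge order.

    `concepts` maps variable -> concept label; `edges` is a list of
    (source, role, target) triples. Each edge is emitted exactly once;
    already-visited nodes are written as bare variables (re-entrancies).
    """
    rng = random.Random(seed)
    incident = defaultdict(list)
    for idx, (src, role, tgt) in enumerate(edges):
        incident[src].append((idx, role, tgt))          # forward direction
        incident[tgt].append((idx, invert(role), src))  # inverted direction
    visited, used = set(), set()

    def visit(node):
        visited.add(node)
        parts = [f"{node} / {concepts[node]}"]
        options = incident[node][:]
        rng.shuffle(options)
        for idx, role, other in options:
            if idx in used:
                continue
            used.add(idx)
            if other in visited:
                parts.append(f"{role} {other}")
            else:
                parts.append(f"{role} {visit(other)}")
        return "(" + " ".join(parts) + ")"

    return visit(top)


# Toy graph (hypothetical example data).
concepts = {"w": "want-01", "b": "boy", "g": "go-02"}
edges = [("w", ":ARG0", "b"), ("w", ":ARG1", "g"), ("g", ":ARG0", "b")]
print(randomized_linearization("w", concepts, edges, seed=0))
```

Because only the top node is fixed, different seeds can yield structurally different trees over the same triples, which is the property the RANDOMIZED condition relies on.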

Permutations of RDF-Triple Linearizations
We follow the procedure of Ribeiro et al. (2020) to form our standard linearization: we prepend a special token to each element of the triple, and separate triples with another dedicated token. For the output sentence "Ned is the father of Rod and Todd," we would have:
In: (Ned fatherOf Rod), (Ned fatherOf Todd)
Out: <rel> <S> Ned <V> father of <O> Rod <rel> <S> Ned <V> father of <O> Todd
For our adversarial permutation, we RANDOMIZE the ordering of the relations.
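The In/Out example above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and splitting camelCase predicates into lowercase words (fatherOf to "father of") is inferred from the example rather than described in the text.

```python
import random
import re


def verbalize(predicate):
    """Split a camelCase predicate into lowercase words (assumed preprocessing)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", predicate).lower()


def linearize_triples(triples, randomize=False, seed=None):
    """Linearize (subject, predicate, object) triples with special tokens.

    With randomize=True, the order of the relations is shuffled, which
    corresponds to the RANDOMIZE permutation for RDF inputs.
    """
    order = list(triples)
    if randomize:
        random.Random(seed).shuffle(order)
    return " ".join(
        f"<rel> <S> {s} <V> {verbalize(p)} <O> {o}" for s, p, o in order
    )


triples = [("Ned", "fatherOf", "Rod"), ("Ned", "fatherOf", "Todd")]
standard = linearize_triples(triples)
permuted = linearize_triples(triples, randomize=True, seed=1)
```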
Encouraging Robustness to Linearization. We train additional models with the goal of encouraging an agnosticism to graph linearization strategy. We adopt an adversarial training approach (Goodfellow et al., 2015), and alter the graph linearization presented to the model at each epoch. We argue that this scheme ought to reduce any model dependence on the human-derived annotation.
4 github.com/goodmami/penman
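The per-epoch alteration of the linearization can be sketched as a data generator that re-linearizes every graph each time it is seen. The names below are hypothetical, and the RDF-style shuffling function is only a stand-in for whichever permutation (RECONFIGURE or RANDOMIZE) is being trained on.

```python
import random


def relinearize_each_epoch(dataset, linearize, num_epochs, seed=0):
    """Yield (epoch, source_text, target_text), re-linearizing each graph per epoch.

    `dataset` is a list of (graph, target_text) pairs and `linearize(graph, rng)`
    returns a freshly permuted linearization (both hypothetical interfaces).
    """
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        for graph, target in dataset:
            yield epoch, linearize(graph, rng), target


def shuffled_triples(triples, rng):
    """Stand-in permutation: shuffle the order of the relations."""
    order = list(triples)
    rng.shuffle(order)
    return " ".join(f"<rel> <S> {s} <V> {p} <O> {o}" for s, p, o in order)


dataset = [(
    [("Ned", "father of", "Rod"), ("Ned", "father of", "Todd")],
    "Ned is the father of Rod and Todd.",
)]
rows = list(relinearize_each_epoch(dataset, shuffled_triples, num_epochs=3))
```

The target sentence is unchanged across epochs; only the source-side ordering varies, so the model cannot rely on a single fixed ordering.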

Robustness Results
For both tasks, we train the model on the canonical linearization, then evaluate on the various linearizations described in Section 3.1.

Impact of Adversarial Linearizations
The CANONICAL columns of Table 2 show results for models trained on that linearization, then evaluated on permuted graph linearizations. We note a strong negative impact in models' generation capacity for both tasks, with a starker decrease for the AMR data. These results suggest that pretrained linearized models are not linearization-invariant, failing to learn robust implicit graph representations, even in the case of the much simpler WebNLG data.
The remaining columns of Table 2 show that our straightforward adversarial training technique improves robustness, with only minor cost to generation performance. This is the case even with the more drastic RANDOMIZED AMR linearization. Moreover, it only incurs a minor impact on training time: for AMR, the CANONICAL, RECONFIGURE, and RANDOMIZE variants attain 40 BLEU at 2, 3, and 5 epochs, respectively.
Given that elements of canonical annotations are known to correlate with the target sentence order (Konstas et al., 2017), we do not find it surprising that the models trained and evaluated on the permuted linearizations show decreased performance. However, it is meaningful that the canonical linearization at evaluation time still leads to the best results, even for models trained with the randomized inputs, even though these models did not learn to associate the canonical ordering signal with the input graph. One possible explanation is that the earlier pretraining induces a sensitivity to input token order that persists despite the adversarial fine-tuning, but the behavior merits further exploration.

RQ2: Better Implicit Graph Encodings with Text-to-Text Scaffolding
The positive results of our adversarial training procedure (§3.2) suggest that pretrained linearized models can form a robust internal graph representation, even though they rely on linearized inputs. Under substantively different linearizations, models retain the ability to generate accurately (even the RANDOMIZE model outperforms best-in-class graph transformers). Prior work, involving both GNNs and pretrained linearized models, has explored various ways of improving models' sensitivity to the structure of the input graph. To better maintain fidelity to the graph, previous graph-to-text methods incorporate additional loss terms, specialized architectures, or generation-time ranking to influence the semantic accuracy of generation: ranking outputs by the correctness of the AMR parse (Mager et al., 2020; Harkous et al., 2020), jointly "back-parsing" graphs when decoding (Bai et al., 2020), or using distinct components to model different graph traversals (Ribeiro et al., 2019).
These efforts suggest that explicitly accounting for graph structure can assist generation. Can we expand on this idea, and improve generation quality by inducing more robust internal graph representations? To answer this question, we propose secondary objectives designed to promote graph "awareness." In addition to the above graph-to-text approaches, we also draw inspiration from denoising methods used in language model pretraining (Raffel et al., 2020; Lewis et al., 2020), as well as syntactic scaffolds that support semantic tasks with an auxiliary syntax-dependent loss. Intermediate auxiliary pretraining has been repeatedly shown to be successful in other contexts (Phang et al., 2018; Gururangan et al., 2020).

Experimental Setup
In particular, we propose unsupervised graph-denoising tasks that we train alongside AMR-to-text generation, following the multi-task setup of Raffel et al. (2020). For each batch, we either optimize the likelihood in Section 2 or one of the objectives described below.
Masked Graph Modeling. When training transformers to have wide-ranging natural language capabilities, unsupervised denoising objectives like masked language modeling have proved extremely successful (Devlin et al., 2019; Raffel et al., 2020). We argue that a similar principle ought to apply to graph understanding, and therefore apply masking directly to linearized graphs.
In masked language modeling, each word token is masked with probability 15%. Here, we mask different sets of tokens, depending on the experimental condition, always setting the probability such that 15% of all tokens will be masked. Specifically, we mask: all tokens in the linearized graph, the graph components alone (edge labels and parentheses), and the semantic nodes. We also experiment with standard masking of the surface sentence, which mirrors the unsupervised domain-adapted pretraining employed by Ribeiro et al. (2020). Graph masking can also be performed on any of the linearization variants defined in Section 3.1. 7
Graph Reordering. Building on our findings from Section 3.2, we introduce a reordering objective. Specifically, we provide the model with a RECONFIGURED or RANDOMIZED linearization, then task the model with reconstructing the canonical version. We suspect that learning this mapping requires that the model captures the graph structure better, leading to superior graph-to-text generation. Unlike the joint re-generation approach of Mager et al. (2020), where the input graph is copied alongside the target text, our method both requires a nontrivial encoding of the graph and has the effect of augmenting the data (due to the nondeterministic reconfiguration). 8
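The masking conditions described above can be sketched as follows. To keep 15% of all tokens masked when only a subset is eligible, the per-token probability is scaled by the inverse of the eligible fraction. This is an illustrative reading of the text, not the authors' code: the `<mask>` symbol, the whitespace tokenization, and the function names are assumptions.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; the actual sentinel format is not specified


def is_component(token):
    """Graph components are edge labels (':ARG0', ...) and parentheses."""
    return token in ("(", ")") or token.startswith(":")


def mask_graph(tokens, target="all", overall_rate=0.15, seed=None):
    """Mask tokens of a linearized graph.

    target: 'all' (every token), 'components' (edge labels and parentheses),
    or 'nodes' (semantic nodes). The per-token probability is chosen so that,
    in expectation, `overall_rate` of ALL tokens end up masked.
    """
    rng = random.Random(seed)
    if target == "all":
        eligible = [True] * len(tokens)
    elif target == "components":
        eligible = [is_component(t) for t in tokens]
    elif target == "nodes":
        eligible = [not is_component(t) for t in tokens]
    else:
        raise ValueError(f"unknown target: {target}")
    n_eligible = sum(eligible)
    if n_eligible == 0:
        return list(tokens)
    p = min(1.0, overall_rate * len(tokens) / n_eligible)
    return [MASK if ok and rng.random() < p else tok
            for tok, ok in zip(tokens, eligible)]


tokens = "( want :ARG0 ( boy ) :ARG1 ( go :ARG0 boy ) )".split()
masked_nodes = mask_graph(tokens, target="nodes", seed=0)
```

For the toy graph above, 4 of the 13 tokens are nodes, so the node-only condition masks each node with probability 0.15 × 13 / 4 ≈ 0.49.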

Scaffolding Results
We find that, overall, denoising objectives drive substantial improvements over the baseline when training on the reduced n = 1000 dataset (Table 3). In fact, using less than 3% of the full data produces results that exceed those of state-of-the-art GNN models from a year prior to this writing (BLEU 27.37, Ribeiro et al., 2019). Moreover, the results suggest that focusing on the graph representation itself is most important: standard sentence masking (i.e., MLM-style) is less beneficial than graph masking, although it still outperforms the baseline. Surprisingly, the various graph-masking objectives perform similarly to one another: there is little benefit to more complex strategies that specifically account for the graph structure.
7 We restrict ourselves to the RECONFIGURE setting given that early results showed little difference from RANDOMIZE.
8 Simultaneously generating the surface text and reordering to the canonical linearization did not improve results.
Table 4: Test-set results of scaffolding objectives and baselines trained on the full AMR dataset (LDC2017T10). Bai et al. (2020) is a state-of-the-art graph transformer. Ribeiro et al. (2020) finetunes T5-LARGE, which we re-implement as our baseline model. BS is BertScore, and M is the meaning component of the MF-score (Opitz and Frank, 2020).
While the increased generation quality from the graph-denoising methods is not drastic relative to the MLM case, we contextualize our gains by noting that other ways of promoting greater graph awareness yield similar improvements in absolute terms, and come at the cost of greater model complexity or generation time. For instance, the use of two graph representations in Ribeiro et al. (2019) achieves a roughly 1-BLEU increase over the use of one alone.
Based on the findings from the n = 1000 setting (Table 3), we select three of the best-performing scaffolding objectives (mask nodes, reconfigure & mask all tokens, and reorder from reconfigured) and train them at n ∈ {500, 1000, 5000, 10000, N}. Results are shown in Fig. 3. At n = 5000, representing 14% of the data, the impact of scaffolding is no longer strong across all objectives. When evaluating on the full dataset, the difference is minor (Table 4). For both BLEU and BertScore, we observe slight improvement over the baseline on average for the mask nodes case, but it is within a standard deviation of the baseline (estimated over 5 seeds). M-score does not vary between models, but it is also not yet established for fine-grained model selection. It appears that the increased size of the data supplants the need for scaffolding losses: the sheer diversity of the source graphs encourages a graph-reasoning ability sufficient to generate accurate sentences. Of course, in a realistic application, hundreds or thousands of training examples are more attainable than tens of thousands. That such straightforward methods can yield strong gains is extremely promising for future work in low-resource graph-to-text generation.

Table 5
Target: Both Norway and Sweden have been spared violent terror acts but authorities in both countries have voiced concern about terrorists or terror financiers operating out of Scandinavia.
Baseline: Norwegian and Swedish authorities have spared Norway and Sweden from violent acts of terror but have voiced concern about terrorists or financiers of terror operating out of Scandinavia.
Ours: Norway and Sweden have been spared terror acts of violence but Norwegian and Swedish authorities have voiced concern about terrorists or financiers of terror operating out of Scandinavia.

Target: The 30-day simple yield fell to an average 8.19% from 8.22%; the 30-day compound yield slid to an average 8.53% from 8.56%.
Baseline: The simple 30 day yield fell to 8.22 percent from 8.19 percent on average and the compound 30 day yield slid to 8.56 percent from 8.53 percent on average.
Ours: Simple 30 day yields fell from 8.22 to an average 8.19% and compound 30 day yields slid from 8.56 to an average 8.53%.

Target: Many young Saudi radicals have crossed the long and porous border between the Kingdom and Iraq and joined up with Sunni Muslim insurgents there.
Baseline: Many young Saudi radicals have crossed the porous border from Iraq to the Kingdom and joined up with Sunni Islamic insurgents there.
Ours: Many young Saudi radicals have crossed the porous long-term border with Iraq and joined up with Sunni Islamic insurgents there.
Qualitative Analysis. In a manual analysis of 100 random model predictions, we generally observe broad agreement between the model trained with the reordering-from-reconfigured scaffold and the baseline (73% agreement in fidelity), both trained with the full dataset. However, in three cases, the baseline model fails to capture the order of arguments (e.g., "from y to x" when "from x to y" is correct), whereas the scaffolded model remains true to the graph (see Table 5; we did not note instances of the reverse case). While we fail to note "hallucinations" (material information that is not contained in the graph input), both models occasionally drop modifiers (e.g., adjectives or adverbs). Finally, a common error in both models is word-sense confusion (see the third row in Table 5, where "long [in length]" is substituted with "long [in duration]"). This is likely due to the removal of word-sense suffixes during preprocessing to avoid sparsity issues (long-03 → long). While currently standard practice, a system aiming to achieve perfect fidelity would require this data.

Encoding Graphs and Generation Performance
The results of Section 4.2 show that the denoising scaffolds impact generation performance. If we consider the sentence-level scaffolding loss as a proxy for the quality of its implicit graph encoding, can it help explain generation fidelity? In order to determine this relationship, we quantify generation accuracy using the M component of the MF-score (Opitz and Frank, 2020). It is calculated by first using an off-the-shelf parser to create an AMR graph from the generated target sentence, then by measuring the overlap with the gold source AMR (from 0 to 1). As seen in Fig. 4, there is a substantial negative relationship (Pearson's ρ = −0.35*) between these two variables, measured using outputs from the model trained with the reordering-from-reconfigured scaffold on the full data.
To fully operationalize the above question, we estimate a linear regression on the M score of predicted sentences from the validation set. As covariates, we include the above (logged) scaffolding loss, in addition to other metrics that have a significant independent correlation with generation quality. In particular, we use sentence-BLEU, the number of edges in the graph, graph re-entrancies, words in the target sentence, and the (also logged) sentence generation loss. We use the Bayesian information criterion (BIC) to select the model from all possible combinations of the above covariates. We find that the preferred model with p covariates, p = 1, ..., 6, includes the reordering loss in all but one case (p = 2), suggesting its validity as an indicator of graph fidelity above and beyond other alternatives. As seen in Table 6, it has a significant negative relationship with the M score, larger than that of the comparably-scaled generation loss. These results indicate that the reordering loss captures important information about the quality of the graph encoding.
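For reference, the selection criterion used above can be written out explicitly. The text does not state which form of the BIC was computed, so the Gaussian-error expression below is the textbook version and should be read as an assumption. Here $n$ is the number of validation sentences, $k$ the number of estimated parameters, $\hat{L}$ the maximized likelihood, and $\mathrm{RSS}$ the residual sum of squares of the fitted regression.

```latex
% General definition of the Bayesian information criterion:
\[
  \mathrm{BIC} \;=\; k \ln n \;-\; 2 \ln \hat{L}.
\]
% For ordinary least squares with Gaussian errors, up to an additive
% constant n(\ln 2\pi + 1) that is shared by all candidate models:
\[
  \mathrm{BIC} \;=\; n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) \;+\; k \ln n .
\]
% With six candidate covariates there are \binom{6}{p} models of size p;
% for each p, the preferred model is the subset with the lowest BIC.
```

Lower values are better: the first term rewards fit, while $k \ln n$ penalizes each additional covariate.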

Related Work
Pretrained transformers for Graph-to-Text Generation. Mager et al. (2020) condition GPT-2 (Radford et al., 2019) on a linearized AMR graph, then fine-tune on the corresponding surface representation text. Later work using transformers has also found success on both AMR-to-text and data-to-text tasks (Kale and Rastogi, 2020; Harkous et al., 2020; Ribeiro et al., 2020). To our knowledge, across a diverse set of tasks and automated metrics, a pretrained transformer of sufficient capacity will always outperform a specialized GNN, often by a large margin. Ribeiro et al. (2020), following Gururangan et al. (2020), further pretrain on additional in-domain data, using both supervised (silver AMR parses to text) and unsupervised (denoising target text) objectives.
Graph-Dependent Losses. Mager et al. (2020) use various heuristics to improve fidelity. During training, they regenerate the input graph, and in inference, they parse generations and rank their consistency with the original graph. Harkous et al. (2020) instead rank with a trained classifier, and introduce additional "state embeddings" to help indicate the ordering of graph components. The encoder-decoder methods cited in the previous paragraph eschew these approaches and nonetheless perform better. In preliminary replications of the Mager et al. experiments with T5, we find that joint re-generation leads to no improvement and moreover that the longer output sequences increase training time. Experimenting with other graph-sensitive embeddings is a valuable direction for future work.
Graph Linearization. Other work also studies linearizations for AMR-to-text settings. As opposed to our efforts, the focus is not on enriching or measuring models' graph encoding, but instead on determining what elements of linearization (e.g., parentheses and edge labels) are necessary for generation.
Closest to our work is Konstas et al. (2017), who experiment with alternative graph traversals by randomizing the edge type order (less drastic than either RECONFIGURE or RANDOMIZE) with an LSTM-based model. Rather than randomizing at each epoch, as in our approach, they employ a consistent random ordering for each example during training, and do not evaluate models across different linearizations. The results help establish that LSTMs can be made agnostic to ordering, but fail to measure the extent to which models overfit to the training order (Section 3.2).
Ribeiro et al. (2020) report paired training and evaluation shuffling results (as in Table 2), but they ignore parentheses, only reordering node labels. Hence, their results cannot establish models' graph-encoding ability, instead revealing that node order is informative of word order, corroborating findings in Konstas et al. (2017). Both works, along with Mager et al. (2020), run ablations by removing parenthetical markers, finding that graph structure is necessary for strong generation.
Finally, Kedzie and McKeown (2020), appearing contemporaneously to our work, seek to control the output generation by manipulating the input linearization order, using a randomization similar to ours as an "uncontrolled" baseline. Given their focus on task-oriented dialogue planning, which uses simpler meaning representations and sentences than the AMR dataset used here (i.e., shallower graphs and limited domains), we view their work as complementary to our own.

Conclusion
In this work, we explore the graph-encoding ability of pretrained transformers through the lens of graph-to-text generation that relies on linearized graph inputs. First, we determine the extent to which these models are invariant to the method by which graphs are linearized, finding that models trained on the fixed, canonical linearizations fail to generalize to meaning-preserving alternatives. We rectify this shortcoming by training models on linearizations corresponding to alternative random traversals of the graph. Following prior work that has used graph-aware losses to improve generation quality, we then explore ways of improving models' sensitivity to the input graphs. Motivated by the success of denoising objectives in other text-to-text settings, we encourage robust internal graph encodings through additional scaffolding losses. Although scaffolding leads to tepid improvements in generation quality when training data is plentiful, it yields substantial gains in low-resource settings.