Structural Adapters in Pretrained Language Models for AMR-to-Text Generation

Pretrained language models (PLMs) have recently advanced graph-to-text generation, where the input graph is linearized into a sequence and fed into the PLM to obtain its representation. However, efficiently encoding the graph structure in PLMs is challenging because such models were pretrained on natural language, and modeling structured data may lead to catastrophic forgetting of distributional knowledge. In this paper, we propose StructAdapt, an adapter method to encode graph structure into PLMs. Contrary to prior work, StructAdapt effectively models interactions among the nodes based on the graph connectivity, training only graph structure-aware adapter parameters. In this way, we incorporate task-specific knowledge while maintaining the topological structure of the graph. We empirically show the benefits of explicitly encoding graph structure into PLMs using StructAdapt, outperforming the state of the art on two AMR-to-text datasets while training only 5.1% of the PLM parameters.


Introduction
Data-to-text tasks aim to generate meaningful and coherent natural language text that faithfully conveys structured data. Some examples of structured information include tables (Parikh et al., 2020), Knowledge Graphs (KGs) (Gardent et al., 2017; Vougiouklis et al., 2018) and Abstract Meaning Representation (AMR) (Banarescu et al., 2013). In this work, we focus on AMR-to-text generation, where the goal is to generate a fluent and grammatical sentence that is faithful to a given AMR graph (see Figure 1a). AMR is a semantic formalism that has received much research interest (Song et al., 2018; Guo et al., 2019; Ribeiro et al., 2019; Opitz et al., 2020; Fu et al., 2021) and has been shown to benefit downstream tasks such as text summarization (Liao et al., 2018) and machine translation (Song et al., 2019). Both statistical (Flanigan et al., 2016; Pourdamghani et al., 2016) and neural methods (Bai et al., 2020; Cai and Lam, 2020) have been investigated for AMR-to-text generation, and dominant methods make use of Graph Neural Networks (GNNs) (Kipf and Welling, 2017) or Transformers (Vaswani et al., 2017) for representing the input graph.
Our code and checkpoints are available at https://github.com/UKPLab/StructAdapt.
Pretrained language models (PLMs) (Devlin et al., 2019; Radford et al., 2019) have been shown to be useful as a general text representation method, giving much improved results on a wide range of tasks. However, they cannot be directly leveraged to benefit AMR-to-text generation, and more generally graph-to-text generation, due to the structural nature of the input. One solution is to transform the structured input into a sequence, which can be directly fed into PLMs (see Figure 1b). Recent studies (Mager et al., 2020; Harkous et al., 2020; Ribeiro et al., 2020a) transform AMRs into sequences by top-down linearization (Konstas et al., 2017). It has been shown that such a linearized graph representation can be used to fine-tune a PLM and improve graph-to-text generation performance (Kale, 2020).
The above methods, however, suffer from two salient limitations. First, linearized graph structures are different in nature from natural language. As a result, knowledge from large-scale pretraining intuitively cannot be fully transferred, and finetuning a sentence representation using linearized graphs can lead to catastrophic forgetting of such distributional knowledge (Goodfellow et al., 2014;Kirkpatrick et al., 2017). Second, a linearized representation weakens structural information in the original graphs by diluting the explicit connectivity information (i.e., which nodes are connected to each other), and PLMs must infer how edge connections are specified in the sequence. This fact was also observed by Song et al. (2018), Beck et al. (2018) and Ribeiro et al. (2019), who show that GNN encoders outperform sequential encoders for AMR-to-text generation without pretraining.
To mitigate these issues, we aim to explicitly encode the graph data into a PLM without contaminating its original distributional knowledge. To this end, we propose STRUCTADAPT, a novel structure-aware adapter that effectively injects the input graph structure into PLMs (see Figure 1c). The main idea is to add layer-wise modules, which extract information from the pretrained layers and make use of it in a graph-structure encoding. As shown in Figure 2, STRUCTADAPT employs a graph convolution in order to learn representations built upon the graph connectivity over the PLM encoder. Because STRUCTADAPT is added to each encoder layer, deep integration of linguistic knowledge and graph knowledge can be achieved. During fine-tuning, only the adapter parameters are trained, whereas the PLM parameters remain unchanged, in contrast to previous methods based on graph linearizations that fine-tune all model parameters.
Empirically we show that STRUCTADAPT significantly outperforms linearized fine-tuning baselines and naive sequential adapters (Houlsby et al., 2019). Moreover, STRUCTADAPT is more robust to different graph linearizations, better treats reentrancies (nodes with more than one entering edge) and long-range node dependencies. Our proposed models, based on STRUCTADAPT, surpass the current state of the art on LDC2017T10 and LDC2020T02 datasets by up to 3.1 BLEU points, training only 5.1% of the original PLM parameters.

Related Work
Fine-tuning for Graph-to-text Generation. While previous approaches (Song et al., 2018; Ribeiro et al., 2019; Cai and Lam, 2020; Schmitt et al., 2021; Zhang et al., 2020b) have shown that explicitly encoding the graph structure is beneficial, fine-tuning PLMs on linearized structured data has established a new level of performance in data-to-text generation (Nan et al., 2021; Kale, 2020). Our work can be seen as integrating the advantages of both graph structure encoding and PLMs, using a novel adapter module. Mager et al. (2020) employ cycle consistency to improve the adequacy of generated texts from AMRs using GPT-2 (Radford et al., 2019), whereas Harkous et al. (2020) train a classifier to rank candidate generations based on semantic fidelity. Ribeiro et al. (2020a) investigate encoder-decoder PLMs for graph-to-text generation, and show that task-specific pretraining can lead to notable improvements and that PLMs benefit much more from the graph structure of AMRs than of KGs. Hoyle et al. (2021) explore the extent to which PLMs are invariant to graph linearization, finding that models trained on canonical linearizations fail to generalize to meaning-preserving alternatives. Compared to this line of work, which tunes all PLM parameters, our method obtains a further 19x reduction in task-specific parameters, tuning only 5.1% of the parameters while achieving state-of-the-art performance, being more robust to permutations of the graph representation and better encoding larger graphs.
Lightweight Fine-tuning. Recently, different approaches have emerged as alternative training strategies that avoid fine-tuning all parameters of a PLM. One approach trains a lightweight "side" network that is fused with the pretrained model via summation. Li and Liang (2021) propose to prepend a trainable continuous prefix as an alternative to adapters, maintaining comparable performance in data-to-text tasks using fewer trained parameters. Liu et al. (2021) develop a method to automatically search prompts in the continuous space and evaluate it in few-shot NLU tasks. Hambardzumyan et al. (2021) propose adversarial reprogramming, which attempts to learn task-specific word embeddings to customize the language model for the downstream task.
Adapter-based approaches (Houlsby et al., 2019; Rebuffi et al., 2017; Lauscher et al., 2020; Pfeiffer et al., 2020a) introduce a small number of task-specific parameters, keeping the underlying pretrained model fixed. Pfeiffer et al. (2020b) propose an adapter method that extends to arbitrary tasks and languages by learning modular language and task representations. The above works are related to STRUCTADAPT in that it also trains far fewer parameters, but they differ because they do not explicitly encode the input structure, whereas STRUCTADAPT directly aims to encode it.

Graph-to-Text Model
Let $G_0 = (V_0, E_0, R_0)$ denote a rooted and directed AMR graph with a node set $V_0$ and labeled edges $(u, r, v) \in E_0$, where $u, v \in V_0$ and $r \in R_0$ is a relation type. An example of an AMR graph and its corresponding sentence is shown in Figure 1a.

Encoder-Decoder Architecture
Consider a conditional generation task where the input is a context $x$ and the output $y = y_1, \dots, y_{|y|}$ is a sequence of tokens. In AMR-to-text generation, the context $x$ is the AMR graph and $y$ is the sentence that describes the AMR graph in natural language.
Let $p_\phi(y \mid x)$ denote a PLM parametrized by $\phi$, where $x$ is encoded by a bidirectional encoder, and the decoder predicts $y$ autoregressively, conditioned on the encoded $x$ and its left context. We focus on PLMs based on the Transformer encoder-decoder architecture (Vaswani et al., 2017), as they are suitable for conditional text generation. We define $x = \mathrm{LIN}(G_0)$, where LIN is a function that linearizes $G_0$ into a sequence of tokens; the variable of a re-entrant node (a node with more than one incoming edge) is replaced with its co-referring concept. Following Damonte and Cohen (2019), as shown in Figure 1b, we linearize the AMR into a sequence of nodes and edges using the depth-first traversal of the canonical human-created AMR (other AMR linearizations are discussed in §6.1). In a nutshell, the hidden representation $h_i^l \in \mathbb{R}^d$, for all $x_i \in x$, is computed by the encoder layer $l$, where $d$ is the hidden dimension. The decoder hidden representation $\hat{h}_i^l \in \mathbb{R}^d$ is computed by the layer $l$ of the autoregressive decoder at time step $i$.
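As an illustration of LIN, the following sketch linearizes a small rooted graph by depth-first traversal, emitting node and edge labels in visit order and repeating the concept of a re-entrant node instead of a variable. The adjacency-dictionary input format, the function name `linearize`, and the toy graph are assumptions made for this example; it is not the released preprocessing code.

```python
def linearize(graph, labels, root):
    """Depth-first linearization of a rooted, edge-labeled graph.

    graph  maps a node id to an ordered list of (relation, child id) pairs.
    labels maps a node id to its concept string.
    A node reached a second time (a reentrancy) is emitted again by its
    concept but is not expanded again.
    """
    tokens, visited = [], set()

    def visit(node):
        tokens.append(labels[node])
        if node in visited:
            return
        visited.add(node)
        for relation, child in graph.get(node, []):
            tokens.append(relation)
            visit(child)

    visit(root)
    return tokens


# Toy graph (hypothetical): node "s" is the ARG0 of both "w" and "p".
labels = {"w": "want-01", "s": "she", "p": "sleep-01"}
graph = {"w": [(":ARG0", "s"), (":ARG1", "p")], "p": [(":ARG0", "s")]}
sequence = linearize(graph, labels, "w")
print(" ".join(sequence))
```

The concept she appears twice in the output, which is exactly the situation in which a purely sequential encoder has to infer that both mentions refer to one graph node.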

Fine-tuning
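The model is initialized with pretrained parameters $\phi$ (e.g., using T5; Raffel et al., 2019) and fine-tuned to optimize the following log-likelihood objective over each gold instance $(x, y)$:
$$\max_{\phi} \; \log p_\phi(y \mid x) = \sum_{i=1}^{|y|} \log p_\phi(y_i \mid y_{1:i-1}, x) \qquad (1)$$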
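To make the objective concrete, the sketch below scores one target sequence as the sum of per-token log-probabilities under an autoregressive factorization. The per-step distributions are toy values supplied by hand and the function name is hypothetical; a real PLM would compute each distribution from the encoded input and the previously generated tokens.

```python
import math


def sequence_log_likelihood(step_distributions, target):
    """Sum of log p(y_i | y_<i, x) over the target tokens.

    step_distributions[i] is a dict mapping each vocabulary token to its
    probability at decoding step i (toy stand-in for the decoder output).
    """
    return sum(math.log(dist[tok]) for dist, tok in zip(step_distributions, target))


# Hypothetical two-step example.
dists = [{"she": 0.5, "he": 0.5}, {"sleeps": 0.25, "runs": 0.75}]
ll = sequence_log_likelihood(dists, ["she", "sleeps"])
print(ll)
```

Fine-tuning maximizes this quantity with respect to all PLM parameters, whereas the adapter variants maximize the same quantity with respect to the adapter parameters only.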

Baseline Adapter
We employ an adapter module after the feed-forward sub-layer of each layer on both the encoder (Figure 2a) and decoder (Figure 2b) of the PLM. We modify the adapter architecture from Houlsby et al. (2019), computing the adapter representation at each layer $l$, given the encoder layer representation $h_i^l$ (or $\hat{h}_i^l$ in the decoder), as follows:
$$\hat{z}_i^l = W_o^l \, \sigma\big(W_p^l \, \mathrm{LN}(h_i^l)\big) + h_i^l,$$
where $\sigma$ is the activation function and $\mathrm{LN}(\cdot)$ denotes layer normalization. $W_o^l \in \mathbb{R}^{d \times m}$ and $W_p^l \in \mathbb{R}^{m \times d}$ are adapter parameters, and $m$ is the hidden dimension of the adapter. Figure 2c illustrates the baseline adapter module, which we call ADAPT.
Training. Let the set of adapters' parameters for the encoder and decoder layers be parametrized by $\theta$. The training objective is the same as Equation (1), but the set of trainable parameters changes: the PLM parameters $\phi$ are frozen and the adapter parameters $\theta$ are the only trainable parameters. In contrast to fine-tuning, adapters substantially reduce the number of trainable parameters that are used to adapt the PLM to the downstream task.
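A minimal sketch of such a bottleneck adapter for a single hidden vector is given below. It assumes a ReLU activation, a residual connection around the adapter, and a plain layer normalization without learned scale or bias; these choices and all function names are assumptions for illustration and may differ from the released implementation.

```python
import math


def layer_norm(h, eps=1e-6):
    """Plain layer normalization without learned scale or bias (a simplification)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(h, W_p, W_o):
    """Down-project with W_p (m x d), apply ReLU, up-project with W_o (d x m), add h."""
    hidden = [max(0.0, v) for v in matvec(W_p, layer_norm(h))]
    z = matvec(W_o, hidden)
    return [zi + hi for zi, hi in zip(z, h)]


# Toy sizes: d = 4, m = 2. A zero up-projection makes the adapter an identity map.
h = [0.5, -1.0, 2.0, 0.0]
W_p = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
W_o_zero = [[0.0, 0.0] for _ in range(4)]
print(adapter(h, W_p, W_o_zero))
```

With the up-projection at zero the layer output is passed through unchanged, which illustrates why an adapter can start from the behaviour of the frozen PLM and move away from it only as far as the task requires.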

Limitation
Intuitively, the connection between nodes in the input graph can influence the encoding of x by guiding what to extract from x in order to generate y. Note that in both fine-tuning and ADAPT approaches, the self-attention mechanisms of the encoder layers treat the sequence of nodes and edges x essentially as a fully connected graph, greatly diluting the original graph structure. In this way, the model has to retrieve the original connectivity of the graph from x. For example, the AMR linearization in Figure 1b has two mentions of the node she, and the model should capture that both mentions belong to the same node in the original graph.

Structural Adapter
We propose STRUCTADAPT, a lightweight alternative to injecting structural inductive bias into PLMs.
We first describe the intuition in §4.1 and define our method formally in §4.3.

Intuition
Injecting graph structural bias into graph-to-text models trained from scratch improves the performance compared to linearized approaches (Damonte and Cohen, 2019; Ribeiro et al., 2019). However, it is not straightforward how to effectively model the input graph structure when fine-tuning PLMs, which usually are pretrained using natural language and not structured data.
Our key idea is modeling the graph connectivity in the encoder utilizing an adapter module, using information flows between adjacent nodes in a message-passing update, employing a graph convolution (see Figure 2d). In this way, the graph structure substantially impacts the node representations, better encoding the input graph without impacting the knowledge learned during pretraining. This can lead to more efficient and better AMR-to-text generation as we will show in §5 and §6. Moreover, different adapters for distinct graph domains can be used with the same PLM, yielding a high degree of parameter sharing for graph-to-text tasks.

Graph Representation
We convert each $G_0$ into a bipartite graph $G_1 = (V_1, E_1)$, replacing each labeled edge $(u, r, v) \in E_0$ with two unlabeled edges $e_1 = (u, r)$ and $e_2 = (r, v)$. Similar to Beck et al. (2018), this process converts the graph into its unlabeled version. Figure 3 shows (a) an AMR subgraph and (b) its unlabeled representation.
Note that PLMs typically use a vocabulary with subword units (Sennrich et al., 2016). This presents a challenge in how to represent such a graph using subword tokens. Inspired by Ribeiro et al. (2020b), we transform each $G_1$ into a new token graph $G = (V, E)$, where each token of a node in $V_1$ becomes a node $v \in V$. We convert each edge $(u_1, v_1) \in E_1$ into a set of edges and connect every token of $u_1$ to every token of $v_1$. That is, an edge $(u, v)$ will belong to $E$ if and only if there exists an edge $(u_1, v_1) \in E_1$ such that $u \in u_1$ and $v \in v_1$, where $u_1$ and $v_1$ are seen as sets of tokens. Figure 3c shows an example of the token graph.
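The two conversions can be sketched as follows: labeled edges are first turned into relation nodes, and the resulting unlabeled graph is then expanded to the token level by connecting every token of the source to every token of the target. The function names, the tuple-based relation node ids, and the toy tokenizer that splits on hyphens are assumptions for illustration; in practice the PLM's subword tokenizer would be used and token positions would follow the linearized input.

```python
def to_bipartite(labels, triples):
    """Replace each labeled edge (u, r, v) by two unlabeled edges (u, rel), (rel, v)."""
    labels1 = dict(labels)
    edges1 = []
    for k, (u, r, v) in enumerate(triples):
        rel = ("rel", k)  # one fresh relation node per edge occurrence
        labels1[rel] = r
        edges1 += [(u, rel), (rel, v)]
    return labels1, edges1


def to_token_graph(labels1, edges1, tokenize):
    """Connect every token of the source node to every token of the target node."""
    tokens, span = [], {}
    for node, label in labels1.items():
        pieces = tokenize(label)
        span[node] = range(len(tokens), len(tokens) + len(pieces))
        tokens.extend(pieces)
    edges = {(i, j) for u, v in edges1 for i in span[u] for j in span[v]}
    return tokens, edges


# Toy example (hypothetical tokenizer: split a label on hyphens).
labels1, edges1 = to_bipartite({"w": "want-01", "s": "she"}, [("w", ":ARG0", "s")])
tokens, edges = to_token_graph(labels1, edges1, lambda label: label.split("-"))
print(tokens)
print(sorted(edges))
```

Here the concept want-01 is split into two tokens, and both are connected to the single token of the relation node, which in turn points to the token of she.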

Method
STRUCTADAPT employs a two-layer architecture in order to re-purpose the PLM for the graph-to-text task using a small number of new parameters. Formally, for each node $v \in V$, given the hidden representation $h_v^l$ from the encoder layer $l$, STRUCTADAPT computes:
$$g_v^l = \mathrm{GraphConv}_l\big(\mathrm{LN}(h_v^l), \{\mathrm{LN}(h_u^l) : u \in \mathcal{N}(v)\}\big), \qquad z_v^l = W_e^l \, \sigma(g_v^l) + h_v^l,$$
where $\mathcal{N}(v)$ is the immediate neighborhood of $v$ in $G$. $\mathrm{GraphConv}_l(\cdot)$ is the graph convolution that computes the node representation based on the local neighborhood of $v$, and $W_e^l \in \mathbb{R}^{d \times m}$ is a parameter. Figure 2d illustrates STRUCTADAPT.
Graph Convolution. The graph convolutional layer allows exploration of distinct strategies for neighborhood aggregation in order to model structural information of the input graph. Different GNN architectures (Velickovic et al., 2018; Xu et al., 2019) can be employed as the graph convolution. Moreover, in this way, we avoid changing the self-attention mechanism of the pretrained encoder, which allows the model to also capture global information based on the pretrained knowledge.
Our graph convolution is based on the Graph Convolutional Network (GCN) proposed by Kipf and Welling (2017). At each layer $l$, we compute the representation of a node $v \in V$ as follows:
$$g_v^l = \sum_{u \in \mathcal{N}(v)} \frac{1}{\sqrt{d_v d_u}} \, W_g^l \, \mathrm{LN}(h_u^l),$$
where $\mathcal{N}(v)$ is the set of nodes with edges into $v$, together with $v$ itself, $d_v$ is the degree of $v$, and $W_g^l \in \mathbb{R}^{m \times d}$ is a parameter.
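The GCN-based structural adapter can be sketched for a whole token graph as below. The sketch assumes that the degree of a node is the size of its neighborhood including the self-loop, that the activation is ReLU, that a residual connection adds the encoder state back, and that layer normalization has no learned parameters; these choices and the function names are assumptions rather than the released implementation.

```python
import math


def layer_norm(h, eps=1e-6):
    """Plain layer normalization without learned scale or bias (a simplification)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def struct_adapt_gcn(H, edges, W_g, W_e):
    """H: list of d-dim token states; edges: directed (u, v) token index pairs."""
    n = len(H)
    neigh = [{v} for v in range(n)]  # self-loop for every node
    for u, v in edges:
        neigh[v].add(u)  # u has an edge into v
    deg = [len(nb) for nb in neigh]
    P = [matvec(W_g, layer_norm(h)) for h in H]  # project each state to m dims
    out = []
    for v in range(n):
        g = [0.0] * len(W_g)
        for u in sorted(neigh[v]):
            c = 1.0 / math.sqrt(deg[v] * deg[u])  # symmetric degree normalization
            g = [gi + c * pi for gi, pi in zip(g, P[u])]
        z = matvec(W_e, [max(0.0, gi) for gi in g])  # ReLU, then up-projection
        out.append([zi + hi for zi, hi in zip(z, H[v])])  # residual connection
    return out


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
no_edges = struct_adapt_gcn(H, set(), identity, identity)
one_edge = struct_adapt_gcn(H, {(1, 0)}, identity, identity)
print(no_edges[0], one_edge[0])
```

Adding the edge from token 1 to token 0 changes only the representation of token 0, which shows that information flows strictly along the graph connectivity, unlike in self-attention.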
We also consider the relational GCN (RGCN) variant (Schlichtkrull et al., 2018) as graph convolution. RGCN allows capturing the reverse edge direction so that we can consider the differences in the incoming and outgoing relations, which has been shown to be beneficial (Beck et al., 2018). In particular, the node representation is computed as:
$$g_v^l = \sum_{r \in R} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{|\mathcal{N}_r(v)|} \, W_r^l \, \mathrm{LN}(h_u^l),$$
where $R$ denotes the set of relations, i.e., the edge types default and reverse, $\mathcal{N}_r(v)$ denotes the set of neighbors under relation $r \in R$, and $W_r^l \in \mathbb{R}^{m \times d}$ encodes the edge type between the nodes $u$ and $v$.
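The relational aggregation with the two edge types can be sketched as follows. The sketch covers only the aggregation step, takes already normalized token states as input, averages messages per relation, and uses no self-loop; the averaging constant, the absence of a self-loop, and the function names are assumptions for illustration.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_aggregate(X, edges, W):
    """X: list of d-dim token states; edges: directed (u, v) pairs;
    W: dict with an m x d matrix for each of the relations "default" and "reverse"."""
    n = len(X)
    neigh = {"default": [[] for _ in range(n)], "reverse": [[] for _ in range(n)]}
    for u, v in sorted(edges):
        neigh["default"][v].append(u)  # message along the edge direction
        neigh["reverse"][u].append(v)  # message against the edge direction
    m = len(W["default"])
    G = []
    for v in range(n):
        g = [0.0] * m
        for r in ("default", "reverse"):
            nb = neigh[r][v]
            for u in nb:
                msg = matvec(W[r], X[u])
                g = [gi + mi / len(nb) for gi, mi in zip(g, msg)]
        G.append(g)
    return G


# Toy example: one edge 0 -> 1, with a distinct matrix per direction.
X = [[1.0, 0.0], [0.0, 1.0]]
W = {"default": [[1.0, 0.0], [0.0, 1.0]], "reverse": [[2.0, 0.0], [0.0, 2.0]]}
G = rgcn_aggregate(X, {(0, 1)}, W)
print(G)
```

Token 1 receives the state of token 0 through the default matrix, while token 0 receives the state of token 1 through the reverse matrix, so the two directions of the same edge are encoded by different parameters.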
Note that STRUCTADAPT computes the refined structural node representation $z_v^l$ based on the local node context, using as input the global representation $h_v^l$ generated by the current PLM encoder layer. In this way, the model is able to capture both the global context based on the PLM linguistic knowledge and the local context based on the graph knowledge. Finally, we employ ADAPT in the decoder in order to adapt the language model to the graph-to-text task.

Experiments
Our models are initialized with pretrained T5 (Raffel et al., 2019), but our approach can be combined with other PLMs such as BART. Our implementation is based on Hugging Face Transformer models (Wolf et al., 2019). We use T5 base for all experiments and report results with T5 large for the test sets. We use the Adam optimizer (Kingma and Ba, 2015) and employ a linearly decreasing learning rate schedule without warm-up. BLEU is used for the stopping criterion. Following recent work (Mager et al., 2020; Zhang et al., 2020b), we evaluate our proposed models on the LDC2017T10 and LDC2020T02 corpora.
Evaluation. We evaluate the results with BLEU (Papineni et al., 2002) and chrF++ (Popović, 2015) metrics. We also report the meaning (M) component of the MF-score, which measures how well the source AMR graph can be reconstructed from the generated sentence. We use BERTScore (Zhang et al., 2020a), allowing a semantic evaluation that depends less on the surface forms. Finally, we also perform a human evaluation (§5.2).

Main Results
We compare STRUCTADAPT with four methods: fine-tuning (FINE-TUNE), fine-tuning only the top or bottom 2 layers (FT-TOP2, FT-BOTTOM2) and ADAPT. All models use the same graph linearization generated by the depth-first traversal. We also report recent state-of-the-art results on both datasets. Tables 1 and 2 show the results.
We find that, training only 5.1% task-specific parameters, STRUCTADAPT-RGCN achieves a BLEU score of 46.6 on LDC2017T10, substantially improving over FINE-TUNE and other lightweight baselines (ADAPT, FT-TOP2, FT-BOTTOM2), and outperforming Ribeiro et al. (2020a) and Hoyle et al. (2021), which fine-tune T5 updating significantly more parameters. STRUCTADAPT also achieves state-of-the-art performance on LDC2020T02, considerably improving over Bevilacqua et al. (2021), which implicitly models the graph structure information using linearization techniques.
In general, STRUCTADAPT is better than ADAPT when training the same number of parameters, and slightly better even when training only 1.7% of the parameters for both datasets. This highlights that the gains come not only from using an adapter architecture, but also from considering the graph connectivity. STRUCTADAPT-RGCN is more effective than STRUCTADAPT-GCN using fewer parameters, demonstrating that considering reverse relations is advantageous. ADAPT is consistently better than FINE-TUNE, agreeing with our intuition of catastrophic forgetting when fine-tuning. Interestingly, in contrast to popular strategies that focus on upper layers in fine-tuning (Howard and Ruder, 2018; Houlsby et al., 2019; Li and Liang, 2021), FT-BOTTOM2's performance is better than FT-TOP2's, suggesting that lower layers have a significant impact in adapting the PLM to structured data. Different from our work, both Mager et al. (2020) and Ribeiro et al. (2020a) use the PENMAN notation, which makes the input much longer (containing more tokens), and demonstrate that this representation is able to achieve strong results. This is orthogonal to our STRUCTADAPT representation and can be incorporated in future work. Overall, the results indicate that explicitly considering the graph structure using an adapter mechanism is effective for AMR-to-text generation, significantly reducing the number of trained parameters while improving generation quality.

Human Evaluation
To further assess the quality of the texts generated by the adapter-based models on LDC2020T02, we conduct a human evaluation via crowdsourcing using Amazon Mechanical Turk. We follow previous work (Ribeiro et al., 2019; Castro Ferreira et al., 2019) and evaluate the meaning similarity, i.e., how close in meaning the generated text is to the reference sentence. We divide the datapoints into 3 different sets by the graph size, i.e., the number of nodes, after converting edges into nodes (cf. §4.2). This setting allows us to evaluate the performance of the models based on the complexity of the AMR graph. We randomly select 100 generated texts for each set and each model (total of 600), which annotators then rate on a 1-7 Likert scale. For each text we collect scores from 3 annotators and use MACE (Hovy et al., 2013), a Bayesian model that incorporates the reliability of individual workers, to merge sentence-level labels. Table 3 shows that STRUCTADAPT improves the meaning similarity over ADAPT with statistically significant margins (p<0.05). Note that the gains mainly come from datapoints with >60 nodes, indicating that STRUCTADAPT is better when encoding larger graphs.

Detailed Discussion
Parameter/Performance Trade-off. We investigate how the number of parameters affects the models. A higher hidden dimensionality means more trainable parameters, and smaller adapters introduce fewer parameters at a possible cost to performance. That is, the adapter size controls the parameter efficiency. Figure 4a shows the effect of the number of trained parameters on the performance measured using BLEU. Each point in the ADAPT and STRUCTADAPT curves represents a hidden dimension in the range [8, 16, ..., 2048]. STRUCTADAPT-GCN is consistently better than ADAPT over all model capacities, even though both approaches train the same number of parameters. STRUCTADAPT-RGCN achieves performance similar to FINE-TUNE when training only 0.8% of the parameters, whereas ADAPT needs to train 8.5% of the parameters to reach similar performance, demonstrating the effectiveness of injecting the graph structure into the PLM.
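The relation between the adapter hidden dimension and the number of trained weights can be made concrete with a small counting sketch. It counts only the projection matrices and ignores biases and layer-normalization parameters, which is an assumption; the hidden size 768 and the 12 encoder and 12 decoder layers correspond to the commonly reported T5 base configuration, and the adapter dimension 256 is a hypothetical value chosen for the example.

```python
def adapter_params(d, m, n_enc, n_dec, n_relations=None):
    """Count adapter weights, ignoring biases and layer-norm parameters.

    n_relations=None: one m x d and one d x m matrix per layer, as in ADAPT
    or in STRUCTADAPT with a GCN in the encoder.
    n_relations=k: each encoder layer holds k relation matrices plus one
    up-projection, as in STRUCTADAPT with an RGCN; decoder layers keep the
    plain adapter.
    """
    per_dec = 2 * d * m
    per_enc = 2 * d * m if n_relations is None else (n_relations + 1) * d * m
    return n_enc * per_enc + n_dec * per_dec


gcn_or_adapt = adapter_params(768, 256, 12, 12)
rgcn = adapter_params(768, 256, 12, 12, n_relations=2)
print(gcn_or_adapt, rgcn)
```

Under these assumptions ADAPT and the GCN variant have identical counts for the same hidden dimension, while the RGCN variant holds one extra matrix per encoder layer and therefore reaches a given parameter budget with a smaller hidden dimension.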
Low-data Setting. Previous work (Li and Liang, 2021) has shown that lightweight fine-tuning has an advantage in some generation tasks when the training size is smaller. Therefore, we investigate how STRUCTADAPT behaves in a low-data setting. We subsample the LDC2017T10 training set to analyze different smaller training sets. For each size, we sample 5 different datasets and average over 2 training random seeds. Thus, we average over 10 models to get an estimate for each low-data setting. We use the LDC2017T10 dev set to choose hyperparameters and do early stopping. Figure 4b shows the results. First note that both adapter-based approaches improve over FINE-TUNE. When training with only 1000 datapoints, STRUCTADAPT outperforms FINE-TUNE by 8.2 BLEU points. Also note that the gap between ADAPT and FINE-TUNE decreases when the size of the training set increases. In general, STRUCTADAPT outperforms FINE-TUNE and ADAPT in low-resource scenarios by 7.3 and 4.8 BLEU points on average, respectively, while requiring far fewer trained parameters.
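The sampling protocol of the low-data experiments (5 subsets per size, 2 training seeds per subset) can be sketched as follows. The function name, the seeding scheme, and the integer stand-ins for training instances are assumptions for illustration; the actual training and scoring steps are omitted.

```python
import random


def low_data_runs(train_set, size, n_subsets=5, n_seeds=2, base_seed=0):
    """Yield (subset, training_seed) pairs: n_subsets random subsets of the
    training set, each paired with n_seeds different training seeds."""
    for s in range(n_subsets):
        rng = random.Random(base_seed + s)  # one reproducible sample per subset
        subset = rng.sample(train_set, size)
        for seed in range(n_seeds):
            yield subset, seed


# Hypothetical training set of 100 instances, subsampled to 10.
runs = list(low_data_runs(list(range(100)), 10))
print(len(runs))
```

Each of the resulting 10 runs would train one model, and the reported estimate for a given training size is the mean score over these runs.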
Case Study. We perform a case study to provide a better understanding of STRUCTADAPT's performance. Table 4 shows an AMR graph in PENMAN notation containing reentrancies (marked in bold) and sentences generated by FINE-TUNE and STRUCTADAPT trained on the full LDC2017T10 training set and in a low-data setting where the models are trained with 2000 datapoints. FINE-TUNE fails to generate a sentence with the correct concept break-up, whereas STRUCTADAPT correctly generates a sentence that describes the input graph. The incorrect verb tense is due to the lack of tense information in AMR. FINE-TUNE-2000 mixes up the semantic relation between I and son (i.e., it mistranslates the edges in the graph), whereas STRUCTADAPT-2000 generates a correct sentence (except for generating the number 8). Overall, STRUCTADAPT produces a more accurate text output than FINE-TUNE by generating correct pronouns and mentions when control verbs and reentrancies are involved, in both full and low-data scenarios.
Model Variations. In Table 5, we report an ablation study on the impact of distinct adapter components, using adapters only in the encoder or decoder. We evaluate different architecture configurations keeping the same number of parameters for a fair comparison. We find that only training adapters in the decoder is not sufficient for good performance, even with the same number of parameters. This suggests that adapting the PLM encoder to handle graph structures is key in AMR-to-text tasks. Interestingly, the model that only employs STRUCTADAPT in the encoder (i.e., no ADAPT is used in the decoder) has better performance (+1.7 BLEU) than using ADAPT in both encoder and decoder, highlighting STRUCTADAPT's strong graph encoding abilities. Finally, the best performance is achieved when we employ STRUCTADAPT in the encoder and ADAPT in the decoder, reaching 41.7 BLEU points.
Table 5: Impact of the adapter modules in the encoder or decoder on the LDC2017T10 dev set. All adapter-based models have the same number of parameters.

Graph Representation Evaluation
In this section, we explore how different graph properties impact the models' abilities to encode the input graph structure.

Impact of the Graph Representation
Inspired by Damonte and Cohen (2019), we investigate two different approaches when linearizing the AMR: (i) only nodes have explicit representations, whereas edge relations are represented by the adapter parameters using the RGCN; and (ii) the sequence of nodes and edges using depth-first traversal of the graph. We also propose and evaluate three different graph structures based on subwords (cf. §4.2): rep1: for each edge, we connect every token of the source node to every token of the target node; rep2: we connect the last token of the source node to the first token of the target node and connect the tokens of a node sequentially; rep3: we connect the first token of the source node to the first token of the target node and connect the tokens of a node sequentially. Figure 3 shows an example of the three representations for an AMR graph structure. Additionally, we also investigate a fully connected graph structure (complete graph), that is, similarly to the self-attention mechanism in Transformers, all nodes and edges are connected. As shown in Table 6, explicitly considering nodes and edges in the graph linearization is beneficial. This approach has the advantage of allowing the model to handle new edge relations during inference, as they are not encoded as model parameters. Note that the complete graph representation has relatively inferior performance, again demonstrating the advantage of explicitly encoding the input graph connectivity.
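The three subword-level structures can be sketched for a single edge between two tokenized nodes as follows. The function name and the left-to-right direction of the sequential connections inside a node are assumptions for illustration.

```python
def token_edges(src, tgt, rep):
    """Token-level edges for one graph edge.

    src and tgt are the ordered token indices of the source and target node.
    """
    if rep == "rep1":
        # every source token to every target token
        return {(i, j) for i in src for j in tgt}
    # rep2 and rep3 chain the tokens inside each node sequentially
    chain = set(zip(src, src[1:])) | set(zip(tgt, tgt[1:]))
    if rep == "rep2":
        return chain | {(src[-1], tgt[0])}  # last source token to first target token
    if rep == "rep3":
        return chain | {(src[0], tgt[0])}  # first source token to first target token
    raise ValueError(f"unknown representation: {rep}")


# Source node with tokens 0 and 1, target node with tokens 2 and 3.
for rep in ("rep1", "rep2", "rep3"):
    print(rep, sorted(token_edges([0, 1], [2, 3], rep)))
```

Only rep1 gives every target token a direct edge from every source token; in rep2 and rep3 some tokens are reachable only through the chain inside a node.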
Finally, we observe that the best configuration is using nodes and edges with rep1 (see an example in Figure 3c). We believe that this is because rep1 allows direct interactions between all source and target tokens, making all token representations of an AMR node directly influenced by the neighbouring tokens.

Robustness to Graph Linearization
A critical advantage of modeling the graph structure is to be less dependent on linearization strategies because the graph connectivity is invariant to the graph linearization. We thus are interested in measuring the impact of the graph linearization in the models.
Following Hoyle et al. (2021), we investigate three different graph linearizations: (i) CANON: the original order of the canonical human-created linearizations in AMR corpora; (ii) RECONF: the order from the canonical graph linearization is ignored, except for the top node; and (iii) RANDOM: constructs a linearization from a random node in the graph, disregarding all order information from the canonical format, while remaining a valid traversal of the graph. All linearizations are converted to a sequence of node and edge labels using depth-first traversal and used for both training and evaluation. Examples of such graph linearizations are shown in Appendix C. Table 7 presents the results. Note that while RECONF has a negative impact on all models, STRUCTADAPT has the best performance. ADAPT has similar performance gains over FINE-TUNE in all graph linearizations. Finally, note that for RANDOM, there is a drastic performance drop in FINE-TUNE and the gap between STRUCTADAPT and FINE-TUNE is widest (+5.9 BLEU), demonstrating that explicitly encoding the graph structure is beneficial and that STRUCTADAPT is much less impacted by different graph linearizations.

Graph Properties
Table 8 shows the effects of the graph size, graph diameter and reentrancies on the performance. First, note that the BLEU scores decrease as the graph size increases, since larger graphs are often more complex. The performance gap between STRUCTADAPT and FINE-TUNE becomes larger for relatively larger graphs, showing that STRUCTADAPT is able to better encode complex graphs. As ADAPT is not aware of the graph connectivity, it has much worse scores compared to STRUCTADAPT, especially for larger graphs.
It is expected that the benefit of STRUCTADAPT will be more evident for AMR graphs with a larger diameter, as the encoder is aware of the input graph structure. As seen in Table 8, similarly to the graph size, the scores decrease as the graph diameter increases. STRUCTADAPT achieves a clear improvement when handling graphs with diameter ≥20, with an improvement of +4.2 BLEU points over FINE-TUNE. Previous work (Damonte and Cohen, 2019; Szubert et al., 2020) showed that reentrancies (nodes with multiple parents) pose difficulties in encoding AMRs correctly. Because STRUCTADAPT is the only approach to model reentrancies explicitly, we expect it to deal better with these structures. The gap between STRUCTADAPT and the other models is widest for examples with more reentrancies, confirming our hypothesis. In particular, when graphs contain ≥4 reentrancies, STRUCTADAPT has an improvement of +3.6 BLEU points compared to ADAPT.
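The two graph properties used to group the examples can be computed as in the following sketch. It counts a reentrancy as a node with more than one incoming edge and measures the diameter as the longest shortest path when edges are treated as undirected; the undirected treatment, the function names, and the toy graph are assumptions for illustration.

```python
from collections import deque


def count_reentrancies(edges):
    """Number of nodes with more than one incoming edge."""
    indegree = {}
    for _, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
    return sum(1 for count in indegree.values() if count > 1)


def diameter(nodes, edges):
    """Longest shortest path in a connected graph, ignoring edge direction."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def eccentricity(start):
        # breadth-first search from start
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return max(dist.values())

    return max(eccentricity(n) for n in nodes)


# Toy graph (hypothetical): "s" has two parents, "b" hangs off "p".
nodes = ["w", "s", "p", "b"]
edges = [("w", "s"), ("w", "p"), ("p", "s"), ("p", "b")]
print(count_reentrancies(edges), diameter(nodes, edges))
```

In the toy graph the node s is the only reentrancy, and the longest shortest path (for example from w to b) has length 2.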

Conclusion
We presented STRUCTADAPT, a novel adapter architecture to explicitly model graph structures in pretrained language models, providing an extensive evaluation of our approach and showing that it achieves state-of-the-art results on two AMR-to-text benchmarks while training far fewer parameters. We also found that STRUCTADAPT is more effective when encoding complex graphs and when trained on fewer datapoints, and is more robust to different graph linearizations and reentrancies. In future work, we plan to consider other graph-to-text tasks, such as those based on Knowledge Graphs.

Appendices
In this supplementary material, we detail experiments' settings and additional information about the human evaluation and graph representations.

A Details of Models and Hyperparameters
The experiments were executed using version 3.3.1 of the transformers library released by Hugging Face (Wolf et al., 2019). In Table 9, we report the hyperparameters used to train the models presented in this paper. We train until the development set BLEU has not improved for 5 epochs.
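The stopping rule can be sketched as a patience check over the per-epoch development BLEU scores. The function name and the score values are hypothetical, and whether a tie counts as an improvement is an assumption (here it does not).

```python
def stopping_epoch(dev_bleu, patience=5):
    """Return (epoch at which training stops, epoch of the best score), 1-indexed.

    Training stops once the best development BLEU has not been exceeded
    for `patience` consecutive epochs.
    """
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(dev_bleu, start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(dev_bleu), best_epoch


# Hypothetical development BLEU curve.
scores = [30.0, 35.0, 34.0, 33.0, 35.0, 34.0, 33.0, 32.0, 31.0, 30.0]
print(stopping_epoch(scores))
```

For this curve the best score is reached at epoch 2 and training stops at epoch 7, after five epochs without improvement; the checkpoint of the best epoch would then be used for evaluation.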

B Details on the Human Evaluation
The human evaluation was conducted via Amazon Mechanical Turk. We randomly select 100 generated texts for each of the 3 sets and each adapter model (ADAPT, STRUCTADAPT-GCN), with a total of 600 texts to be evaluated. The annotators then rate the meaning similarity on a 1-7 Likert scale. For each text, we collect scores from 3 annotators. We use MACE (Hovy et al., 2013) to further improve upon these raw answers by unsupervised estimation of worker trustworthiness and subsequent recovery of the most likely score. Models are ranked according to the mean of sentence-level scores. We defined a filter for all our evaluations, admitting only workers who have more than 5000 approved HITs and an acceptance rate of 95% or higher. The task took workers a median time of 1.6 minutes per pair of sentences. We apply a quality control step, filtering out workers who do not properly score some known fake sentences or who completed the experiment in a very short time.

C Example of Graph Linearizations
In Table 10, we present three different linearizations for the same AMR graph and its corresponding reference sentence. Figure 5 shows the two possible graphs that are represented by the linearizations. In particular, Figure 5a shows a graph that is represented by CANON and RECONF linearizations and Figure 5b shows a graph that is represented by RANDOM. Note that whereas the linearizations can greatly differ from each other, the graph structure for all linearizations remains very similar.