ReGen: Reinforcement Learning for Text and Knowledge Base Generation using Pretrained Language Models

Automatic construction of relevant Knowledge Bases (KBs) from text, and generation of semantically meaningful text from KBs are both long-standing goals in Machine Learning. In this paper, we present ReGen, a bidirectional generation of text and graph leveraging Reinforcement Learning to improve performance. Graph linearization enables us to re-frame both tasks as a sequence-to-sequence generation problem regardless of the generative direction, which in turn allows the use of Reinforcement Learning for sequence training where the model itself is employed as its own critic, leading to Self-Critical Sequence Training (SCST). We present an extensive investigation demonstrating that the use of RL via SCST benefits graph and text generation on the WebNLG+ 2020 and TekGen datasets. Our system provides state-of-the-art results on WebNLG+ 2020 by significantly improving upon published results from the WebNLG+ 2020 Challenge for both text-to-graph and graph-to-text generation tasks. More details at https://github.com/IBM/regen.


Introduction
Graph representation of knowledge is a powerful tool to capture real-world information where complex relationships between node entities can be efficiently encoded. Automatic generation of Knowledge Bases (KBs) from free-form text and its counterpart of generating semantically relevant text from KBs are both active and challenging research topics.
Recently, there has been an increased interest in leveraging Pretrained Language Models (PLMs) to improve performance for text generation from graphs, i.e. the graph-to-text (G2T) task (Ribeiro et al., 2020). Indeed, large PLMs like T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), which have been pretrained on vast amounts of diverse and variedly structured data, are particularly good candidates for generating natural-looking text from graph data.
BART- and T5-related models have been employed by top performers in public challenges such as the WebNLG+ 2020 Challenge (Castro Ferreira et al., 2020b), where both graph-to-text and text-to-graph (T2G) tasks are offered, under the names RDF-to-Text and Text-to-RDF (semantic parsing) respectively; RDF stands for Resource Description Framework, a standard for describing web resources. One can notice that more teams entered the competition for the G2T task than for T2G, as the latter is a much harder task. The best models generally use PLMs and fine-tune them for the target modality at hand (either graph or text). This is possible by re-framing the T2G and G2T generations as a sequence-to-sequence (Seq2Seq) generation problem, which suits fine-tuning PLMs well. One can therefore hope to leverage the large pretraining of PLMs to improve the overall generation quality.
The Seq2Seq formulation requires any input graph to be linearized as a sequence, which is not unique. This creates an opportunity for data augmentation where multiple linearizations are provided to the model at training time so the model learns the content represented by the graph, not the order of its sequential representation.
In this work, we are interested in leveraging the power of PLMs for both G2T and T2G generation tasks, and will demonstrate the strength of our approach by improving upon the best results of the WebNLG+ 2020 Challenge (rev 3.0) as reported by Castro Ferreira et al. (2020a) for both T2G (Semantic Parsing) and G2T (Data-to-Text) tasks. We will also present results for the TEKGEN Corpus (Agarwal et al., 2021) to show performance on a different, much larger dataset. To illustrate the task of generation, Fig. 1 provides examples of G2T and T2G outputs obtained using the proposed generation framework. The first two sentences of the abstract of this paper were used as input for T2G using our best model. The model generates a graph from the input text by simultaneously extracting relevant nodes and linking them coherently.

Figure 1: Actual examples of generation for Text-to-Graph and Graph-to-Text tasks using our best RL models. The first two sentences of the abstract were processed through our best models. First, a graph was created capturing the facts from the input sentences. Then, this graph was used as input to generate text. Despite a strong domain mismatch between input data and models, the generated paragraph captures most of the original sentences' content. Both models were trained using RL, specifically Self-Critical Sequence Training (SCST).

For the G2T task, another model starts from the generated graph and generates semantically relevant text from it. As one can appreciate, the final text is quite readable and captures most facts from the original abstract sentences despite a strong domain mismatch between the input data and the training data both models were built on.
Since both T2G and G2T generative tasks can be formulated as a Seq2Seq problem, we propose to use Reinforcement Learning (RL) as part of the PLMs' fine-tuning on the target domain data. For both G2T and T2G tasks, a differentiable function such as the cross-entropy (CE) loss function is often used, since minimizing it results in maximizing the probability of generating the correct token/word. However, when it comes to evaluating a model's performance, benchmarks often use BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and chrF++ (Popović, 2017) for G2T, or simply F1, Precision, and Recall scores for T2G, none of which is differentiable. During training, one hopes that by minimizing the CE loss, the model will tend towards better prediction of the target tokens, hence improving evaluation metrics as a beneficial by-product. Thankfully, RL provides a framework where we can update our model parameters so as to improve evaluation metrics directly. Mixed Incremental Cross-Entropy Reinforce from Ranzato et al. (2016) introduced the use of REINFORCE (Williams, 1992) for sequence training. We propose to use one of its variants, known as Self-Critical Sequence Training (SCST) (Rennie et al., 2017), for both T2G and G2T training.
In summary, our main contributions are:
• We propose to use RL-based sequence training, specifically SCST, for both G2T and T2G tasks. This is the first time RL-based training is proposed for the bidirectional generation of text and graph and, to the best of our knowledge, the first time it is introduced for a T2G task.
• We demonstrate that our approach provides better performance than the best systems reported for the WebNLG+ 2020 Challenge.
• We provide a thorough investigation of SCST-based training for both T2G and G2T tasks, including the best reward combinations.
• We adapted the large-scale TEKGEN corpus (Agarwal et al., 2021) for the T2G and G2T tasks by constructing subject and relation-object boundaries from its sentence-triples pairs, and confirmed the benefit of our SCST-based fine-tuning approach over CE-trained baselines.

Related work
In the WebNLG+ 2020 Challenge, most top performing models relied on fine-tuning of PLMs. Interestingly, all four top teams in this Challenge proposed quite different approaches while leveraging PLMs. 1st place, Amazon AI (Guo et al., 2020a), pipelined a relational graph convolutional network (R-GCN) and a T5 PLM with some canonicalization rules. 2nd place, OSU Neural NLG (Li et al., 2020), the closest to our approach in spirit, used T5 and mBART PLMs fine-tuned after special data preprocessing. 3rd place, FBConvAI (Yang et al., 2020), used a BART PLM and multiple strategies to model input RDFs. 4th place, bt5, employed a T5 PLM trained in a bilingual approach on English and Russian, even using a WMT English/Russian parallel corpus.
Recently, Dognin et al. (2020) and Guo et al. (2020b, 2021) proposed models trained to generate in both T2G and G2T directions, with consistency cycles created to enable the use of unsupervised datasets.
In contrast, our approach of fine-tuning a T5 PLM is fully supervised but can produce either the specialized models for T2G and G2T tasks alone, or a hybrid model that can handle both T/G inputs simultaneously to generate the corresponding translated G/T outputs.
Note that in contrast to many WebNLG+ 2020 Challenge participants, e.g. Li et al. (2020), no preprocessing of the data is performed for text, while for graph triples, we add tokens to mark subject, predicate, and object positions in their linearized sequence representation. Moreover, data augmentation is performed by randomly shuffling the order of triples during graph linearization, to prevent the model from learning the exact order of triples, which matters especially for the T2G task.
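The triple-order augmentation described above could be sketched as follows (a minimal sketch; the function name and sample graph are our own illustration, not the ReGen codebase):

```python
import random

def augment_triple_orders(triples, num_permutations=3, seed=0):
    """Return the original triple order plus a few random shuffles.

    Each permutation is a valid linearization of the same graph, so the
    model learns the graph content rather than one fixed triple order.
    """
    rng = random.Random(seed)
    orders = [list(triples)]
    for _ in range(num_permutations):
        shuffled = list(triples)
        rng.shuffle(shuffled)
        orders.append(shuffled)
    return orders

# Hypothetical WebNLG-style graph used only for illustration.
graph = [("Alan_Bean", "occupation", "astronaut"),
         ("Alan_Bean", "mission", "Apollo_12"),
         ("Apollo_12", "operator", "NASA")]
variants = augment_triple_orders(graph)
```

Each variant encodes the same graph, only the sequential presentation changes.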
While the use of RL training for PLMs has been explored in many works, the approach of Chen et al. (2020) is closest to ours. However, their work focuses on improved text generation in the context of natural question generation, while our algorithm uses RL for both graph-to-text and text-to-graph generation.

Models
Models are trained on a dataset D composed of a set of (x^T, x^G)^i samples, where superscript i denotes the i-th sample in D, x^T is made of text (one or more sentences), and x^G is a corresponding graph represented as a list of triples x^G = [(s_1, p_1, o_1), ..., (s_K, p_K, o_K)], where the k-th triple is composed of a subject s_k, predicate (relationship) p_k, and object o_k. For G2T, the model is given x^G as input and must generate x̂^T. A cross-entropy loss is computed as an expectation:

L_CE^G2T = E_{x∼D}[ −log p_θ^G2T(x^T) ],  (1)

where p_θ^G2T(x^T) is the distribution of the generated sequence x̂^T = T_G2T(x^G), T_G2T(·) being the transformation from graph to text. Our model is parameterized by θ, and x^T is effectively sampled from the marginal distribution of text samples from D.
x̂^T = [ŵ_1, ŵ_2, ..., ŵ_T] is a sequence of generated tokens/words. Similarly, for training a T2G model, the cross-entropy loss used in training is simply

L_CE^T2G = E_{x∼D}[ −log p_θ^T2G(x^G) ],  (2)

where p_θ^T2G(x^G) is the distribution of the generated graph x̂^G = T_T2G(x^T), T_T2G(·) being the transformation from text to graph, and where x^G is drawn from the marginal distribution of graph samples from D.
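For intuition, the cross-entropy objective above reduces, under teacher forcing, to a sum of negative log-probabilities of the correct target tokens. A toy numeric sketch (the probabilities are made up for illustration):

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence, given the model's
    probability for each correct target token (teacher forcing)."""
    return -sum(math.log(p) for p in token_probs)

# Toy example: three target tokens predicted with these probabilities.
loss = sequence_nll([0.9, 0.5, 0.8])
```

Minimizing this quantity pushes each target-token probability toward 1, which is exactly what Eq. (1) and Eq. (2) express in expectation over the dataset.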
In both Eq. (1) and Eq. (2), x^G must be expressed as a sequence of tokens t_j such that a list of triples x^G turns into a list of tokens [t_1, t_2, ..., t_M]. This is simply done by adding tokens marking the subject, predicate, and object boundaries in the sequence such that each triple (s_k, p_k, o_k) is turned into a sequence such as [<S>, s_1, <P>, p_1, p_2, <O>, o_1, o_2, o_3], assuming our subject is made of 1 token, our predicate of 2 tokens, and our object of 3 tokens in this example. <S>, <P>, and <O> are just special marker tokens to help the model know where subjects, predicates, and objects are located in the sequence.
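Concretely, graph linearization with boundary markers can be sketched as follows (our own minimal implementation, not the authors' code):

```python
def linearize_graph(triples):
    """Turn a list of (subject, predicate, object) triples into a single
    token sequence with <S>, <P>, <O> boundary markers."""
    parts = []
    for s, p, o in triples:
        parts.extend(["<S>", s, "<P>", p, "<O>", o])
    return " ".join(parts)

# Hypothetical triple used only for illustration.
seq = linearize_graph([("Alan_Bean", "mission", "Apollo_12")])
```

In this sketch multi-token subjects, predicates, and objects are handled transparently by the downstream tokenizer; only the boundary markers are added explicitly.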
We start from a pretrained encoder-decoder model M that we fine-tune on either the T2G task to get M_T, or the G2T task to get M_G. We also propose a third kind of model, M_{T+G}, to be fine-tuned on both T2G and G2T samples, i.e. the model will learn to generate in either direction, by supplying an input sample x = [x^T; x^G] and its corresponding target. Input from each modality is prefixed by a task-specific string to distinguish transfer directions ("Text to Graph:" for x^T and "Graph to Text:" for x^G). For M_{T+G} models, the cross-entropy loss is defined similarly to Eq. (1) and Eq. (2) such that L_CE^{T+G} = E_{x∼D}[ −log p_θ(x) ]. All models are shown in Fig. 2. By convention, we refer to models in this paper by their input modality T, G, or T+G.
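The hybrid-model input construction amounts to prepending the task string mentioned above; a sketch (the function name and sample input are our own illustration):

```python
def make_hybrid_sample(x, direction):
    """Prefix an input with its task string so a single hybrid model can
    distinguish the two transfer directions."""
    prefixes = {"T2G": "Text to Graph: ", "G2T": "Graph to Text: "}
    return prefixes[direction] + x

inp = make_hybrid_sample("<S> Alan_Bean <P> mission <O> Apollo_12", "G2T")
```

At training time, a T2G sample would pair a prefixed text input with the linearized graph as target, and vice versa for G2T.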

Reinforcement Learning
Sequence generation can be seen as an agent making sequential decisions of picking words from a given vocabulary. The agent reacts to its environment by accounting for past predictions and getting rewarded along the way, while its state is defined by the partial sequence generated so far. This interpretation enables the reformulation of Seq2Seq generation within the Reinforcement Learning (RL) framework (Sutton and Barto, 2018; Silver, 2015). More precisely, a sequence generation task can be recast as a Markov Decision Process (MDP) where the agent behavior follows a policy π(a_t|s_t). Action a_t corresponds to picking a particular word w_t at time t from a vocabulary V, conditioned on state s_t expressed as the partial sequence generated so far, s_t = x̂_{1:t} = [ŵ_1, ..., ŵ_t], that is the sequence of words/tokens already picked. π(a_t|s_t) is a stochastic policy that defines a probability distribution over a_t.

Figure 2: Specialized and hybrid models rely on the same losses for fine-tuning. However, specialized models are dedicated to a particular generation task while hybrid models can handle both generation directions.

Once the action a_t is taken,
the agent receives a reward r_t = r(s_t, a_t) before it transitions to the next state s_{t+1}. A sequence of actions a_{1:T} = [a_1, ..., a_T] is selected until the end of generation is reached. The agent aims at maximizing the expectation of cumulative reward

J(π) = E_τ[ Σ_{t=1}^{T} γ^{t−1} r_t ],  (3)

where γ ∈ [0, 1] is a discounting factor used to control the horizon of the cumulative reward. The expectation is taken over trajectories τ, sequences made of {s_1, a_1, r_1, ..., s_T, a_T, r_T}, where a_t was chosen from policy π(a_t|s_t). RL provides both on-policy and off-policy approaches to maximize J(π) in Eq. (3). We are particularly interested in on-policy techniques that rely on data samples generated from the model being trained, especially since our models start from large fine-tuned PLMs that can already generate good samples. This helps avoid the common drawback of on-policy techniques, namely generating poor samples at first when trained from scratch. Policy-based methods focus on a parameterized policy π_θ where θ is optimized to maximize J(π_θ). The policy π_θ(a_t|s_t) is the PLM generative model p_θ, CE fine-tuned as described at the beginning of Section 3. REINFORCE, presented by Williams (1992), allows the optimization of a model's parameters θ by maximizing the expected value of the word-based reward R_w(x̂^T) of the generated sequence x̂^T = [ŵ_1, ..., ŵ_T]. For notation convenience, note that R_w(x̂^T) = R(τ) since we are now dealing with the sequence of words/tokens x̂^T selected by the actions in trajectory τ. We will also use the R(x̂^T) notation for simplicity. In order to match common Deep Learning conventions, we can minimize a loss expressed as the negative value of the expected cumulative reward:

L_RL(θ) = −E_{x̂^T ∼ p_θ}[ R(x̂^T) ],  (4)

where R(x̂^T) is the reward for the generated text, which is often associated with non-differentiable metrics such as BLEU, METEOR, chrF, etc.
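The discounted cumulative reward over one trajectory can be computed directly:

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative reward sum over t of gamma^(t-1) * r_t for one
    trajectory, with t starting at 1 (index 0 in the list)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

For sequence generation with a single terminal metric-based reward, the trajectory reward reduces to that final score, which is the setting SCST operates in.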
Note that in sequence generation, these metrics-based rewards are available only once a whole sequence is generated, trading sparsity/delay of reward for quality (i.e. we use the full sequence reward, not an estimation of partial future reward). We circumvent the non-differentiability issue by using the REINFORCE policy gradient method:

∇_θ L_RL(θ) = −E_{x̂^T ∼ p_θ}[ (R(x̂^T) − b) ∇_θ log p_θ(x̂^T) ],

where b is a baseline used to reduce the variance of our gradient estimate. b can be any function, even a random variable, as long as it is independent of the actions taken to generate x̂^T, as described in Chapter 13.4 of Sutton and Barto (2018). In Self-Critical Sequence Training (SCST) (Rennie et al., 2017), b is chosen to be the reward of x*^T, the output generated by the model by greedy max generation, hence the model serving as its own critic:

∇_θ L_SCST(θ) = −E_{x̂^T ∼ p_θ}[ (R(x̂^T) − R(x*^T)) ∇_θ log p_θ(x̂^T) ],

where x̂^T is sampled from our model and x*^T is generated by greedy max. An interesting property of the baseline is that if R(x̂^T) > R(x*^T), i.e. the sampled x̂^T has a higher reward than x*^T, then the model is updated to reinforce the choices made by this generation. In the opposite case where R(x̂^T) < R(x*^T), the model update will take the negative gradient to subdue such generation. When R(x̂^T) = R(x*^T), no update is performed on the model since the gradient is effectively zeroed out, regardless of the individual values R(x̂^T) and R(x*^T). This happens when x̂^T and x*^T are identical (greedy-max and sampled sequences are the same). In that case, the sample is lost for RL as no update to the model will result from it. Basically, REINFORCE is a Monte Carlo method of learning where a gradient update is applied in the direction decided by how R(x̂^T) compares to baseline b, the role of b being to reduce the variance of the gradient estimate. Variations around REINFORCE exist on how to apply the gradients, such as MIXER from Ranzato et al. (2016), or on how to evaluate the baseline (Luo, 2020) to minimize the gradient variance.
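The sign behavior of the SCST update discussed above can be sketched with a scalar stand-in for the policy-gradient term (a toy illustration, not a full training loop; in practice the log-probability is a differentiable tensor produced by the model):

```python
def scst_loss(sample_reward, greedy_reward, sample_logprob):
    """Self-critical surrogate loss: -(R(sample) - R(greedy)) * log p(sample).
    Minimizing it raises the likelihood of samples that beat the greedy
    baseline and lowers it for samples that fall short of it."""
    return -(sample_reward - greedy_reward) * sample_logprob

# Log-probabilities are negative; the advantage sign decides the update.
better = scst_loss(0.8, 0.5, -2.0)  # sample beat greedy: positive advantage
worse = scst_loss(0.3, 0.5, -2.0)   # sample lost to greedy: negative advantage
tie = scst_loss(0.5, 0.5, -2.0)     # equal rewards: zero gradient, no update
```

The `tie` case is exactly the situation described above where a sample contributes nothing to training because the advantage is zero.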
In our training, PLMs are first fine-tuned using L CE loss. Once they reach a good generation quality, the training is switched to RL fine-tuning by minimizing L SCST .

Experimental Setup
In this section, we present the experimental setup used for all the results reported in this paper. Models We used T5 PLMs from Wolf et al. (2020) for our experiments, for two distinct models: t5-large (770M parameters) and t5-base (220M parameters), with a special focus on t5-large as it is the better performing of the two on various NLP tasks. Models were fine-tuned to be either specialized on the T2G (M_T) or G2T (M_G) task, or to accommodate both directions of generation (M_{T+G}). Data processing Graphs are often represented as lists of triples. However, our model expects a sequence of input words/tokens to work on. The linearization of graph triples is obviously ambiguous as there are many ways to traverse a graph (Breadth First Search, Depth First Search, random walk, etc.). In practice, we linearize the triples in the order of the list provided by the dataset, but use this inherent linearization ambiguity as an opportunity for data augmentation. Indeed, models are first fine-tuned using a cross-entropy loss that strongly penalizes generation in any order different from the ground truth order. To prevent the model from overfitting to our data and memorizing the observed triples order, we augment the data by including a few permutations of the graph triples.
During graph linearization, we encode the subject, predicate, and object positions by using <S>,<P>,<O> tokens. In practice, we expand the model vocabulary with these special indivisible tokens that are not split during tokenization. No other preprocessing is done on the data for training. We explored masked and span-masked LM fine-tuning to match T5 pretraining (Raffel et al., 2020) which did not lead to any noticeable improvements.

Datasets
WebNLG+ 2020 We report results on WebNLG+ 2020 (v3.0) used in the WebNLG+ 2020 Challenge (Castro Ferreira et al., 2020b). The Challenge comprises two tasks: RDF-to-text generation (G2T) and Text-to-RDF semantic parsing (T2G). The Resource Description Framework (RDF) language is used to encode DBpedia and is commonly used in linked data frameworks. WebNLG+ uses RDF to encode graphs as sets of triples which are associated with one or more lexicalizations of one or more sentences each. Data for English and Russian are provided, but we only worked on the English subset made of 13,211 train, 1,667 dev, 2,155 testA (semantic parsing), and 1,779 testB (data-to-text) samples (triples sets w/ lexicalizations). The data is clustered semantically into 16 categories seen in train and dev sets (Airport, Astronaut, Building, etc.), while 3 categories (Film, Scientist, and Musical-Work) were introduced in test and are unseen, i.e. not present in training; see Castro Ferreira et al. (2020a) for more details. Results are aggregated for all, seen, and unseen categories during evaluation. Note that in the literature, prior works sometimes report 'WebNLG' results on previous dataset versions, with completely different performance ranges. We compare all our results to WebNLG+ 2020 (v3.0) numbers reported by Castro Ferreira et al. (2020a) in their Table 6 for G2T and Table 10 for T2G tasks, using the provided official scoring scripts. TEKGEN To further study the robustness of our system, we also provide experiments using the TEKGEN dataset recently introduced in Agarwal et al. (2021), for which we recreated subject, relation, and object boundaries using the Wikidata Query Service. Additionally, we limit the validation set and test set to 5K and 50K sentence-triples pairs respectively. Our training split after processing contains 6.3 million sentence-triples pairs. As a contribution of this work, we present the steps to augment the TEKGEN dataset with appropriate subject, object, and relation boundaries, which enables conventional evaluation of research systems.
An example of the processed TEKGEN is shown in Fig. 3 in Appendix.
Metrics WebNLG+ 2020 provides automatic metrics to evaluate models. For G2T, we used BLEU, BLEU_NLTK, METEOR, and chrF++ as provided by the challenge. For T2G, F1, Precision, and Recall scores are utilized and computed for 4 levels of match: Exact, Ent_Type, Partial, and Strict as described in Castro Ferreira et al. (2020a), which loosely correspond to different levels of relaxation of how close a match of an entity must be to the ground truth in content and position in a triple. Note that when generating graphs/RDFs, scoring metrics explore all possible permutations of a graph's edges. For TEKGEN, we use the same metrics as for WebNLG+ 2020.
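As an illustration, the Exact-match F1 over triples can be sketched as a set comparison (a simplification of the official scoring scripts, which also handle entity-type relaxations and permutation matching):

```python
def exact_match_scores(predicted, reference):
    """Precision/recall/F1 between predicted and reference triple sets,
    counting only exact (subject, predicate, object) matches."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical triples: one of two predictions matches the reference.
p, r, f1 = exact_match_scores(
    [("A", "b", "C"), ("A", "d", "E")],
    [("A", "b", "C"), ("A", "d", "X")])
```

The rigidity discussed later in the Results section is visible here: a near-miss triple like ("A", "d", "E") vs ("A", "d", "X") contributes nothing under Exact matching.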

Results
For all experiments, PLMs were first exposed to the target datasets (WebNLG+, TEKGEN) by fine-tuning using the L_CE loss. They were then switched to RL training by optimizing the L_SCST loss. Although no exact recipe has been established for Seq2Seq RL training, starting from a good CE model helps RL training performance in practice (Ranzato et al., 2016; Rennie et al., 2017). We therefore followed a simple approach: during fine-tuning, evaluations are conducted on the validation set. From the CE phase, the best performing model iteration is selected based on the METEOR score for the G2T task and the F1 score for the T2G task, and is used to pursue RL fine-tuning. In the case of G2T, potential ties in METEOR scores among candidate models are resolved using BLEU_NLTK, followed by the chrF++ metric. Note that early-stopping selection of CE models led to good performance for t5-base models as well. During the SCST phase, the best model iteration on the validation set is selected and its performance numbers on the test set are reported in our tables.
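The G2T checkpoint-selection rule above (METEOR first, ties broken by BLEU_NLTK, then chrF++) amounts to a lexicographic comparison; a sketch with hypothetical checkpoint scores:

```python
def select_g2t_checkpoint(checkpoints):
    """Pick the checkpoint with the best METEOR score, breaking ties
    with BLEU_NLTK, then chrF++ (lexicographic comparison)."""
    return max(checkpoints,
               key=lambda c: (c["meteor"], c["bleu_nltk"], c["chrf"]))

# Hypothetical validation scores, for illustration only.
ckpts = [
    {"name": "epoch3", "meteor": 0.41, "bleu_nltk": 0.52, "chrf": 0.64},
    {"name": "epoch5", "meteor": 0.42, "bleu_nltk": 0.53, "chrf": 0.65},
    {"name": "epoch7", "meteor": 0.42, "bleu_nltk": 0.54, "chrf": 0.64},
]
best = select_g2t_checkpoint(ckpts)
```

Here epoch5 and epoch7 tie on METEOR, so BLEU_NLTK decides between them.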
WebNLG+ 2020 G2T For the WebNLG+ 2020 Challenge, the results of the top four systems for the RDF-to-text task can be found in Tab. 1. Our models improve upon all metrics, achieving state-of-the-art results to our knowledge. The gain obtained by SCST alone is quite significant and demonstrates the benefits of RL fine-tuning for this task. We report our best model results in Tab. 1, as well as mean and standard deviation results for multiple random number generator seeds in Tab. 10 in Appendix. When averaging results for a few seeded models, sustained gains from SCST are seen for all metrics. Multiple reward candidates were investigated (BLEU, BLEU_NLTK, METEOR, chrF) as well as some linear combinations of pairs of them, as can be seen in Tab. 7 in Appendix. In Tab. 7, for t5-large, METEOR is consistently the best SCST reward, and improves all the other metrics scores as well. However, for 'smaller' models such as t5-base, BLEU_NLTK revealed itself to be the best reward for improving BLEU performance, as expected. Again, SCST brings significant gains across all the metrics in that case. Note that for the t5-base model, selecting a METEOR reward improves METEOR results significantly, as reported in Tab. 9 in Appendix.
Another interesting fact is that early stopping of CE model G2T.CE.ES (at 5 epochs) leads to the best SCST model G2T.RL.ES for t5-base, while selecting the best CE model G2T.CE.best (at 11 epochs) still showed some gains from SCST model G2T.RL.best. SCST needs a good starting point, but a better CE model that has seen many more epochs of our dataset may be harder for SCST to steer toward a better solution in the parameter space.
Moreover, the test split contains unseen categories not present in the validation dataset, which renders choices based on validation sub-optimal for the test dataset. The best models we report in this work are specialized models M_G. Early in our investigation, hybrid models were the best performing models for G2T, reaching 0.547 BLEU, 0.543 BLEU_NLTK, and 0.417 METEOR, and were the first to beat the Challenge winning team. However, when the batch size became larger (20-24 samples), the specialized models took the lead and have retained it since.
For training, we optimized all our models using AdamW (Loshchilov and Hutter, 2017), a variant of the Adam optimizer, with default values of β = [0.9, 0.999] and a weight decay of 10^-2. For the learning rate, we used 5·10^-6 for all our experiments as it was better than 10^-5 and 10^-6, as seen in Tab. 8 in Appendix. All our models were trained with a 20-24 minibatch size on WebNLG. Further details on our experimental setup are provided in the Appendix in Section A.
WebNLG+ 2020 T2G Results for the Text-to-RDF task are reported in Tab. 2 for all categories. Results for our best model on seen and unseen categories are given in Tab. 6 in Appendix. Amazon AI and bt5 are the top performing teams. Again, the proposed ReGen T2G.CE model shows strong results that are better in terms of all metrics, for all matching categories. In themselves, these numbers are a de-facto new state-of-the-art for this dataset, as far as we know. The SCST model T2G.RL fails to improve on this model, though. The Exact F1 metric was used as reward, but the model could never pull ahead of the CE model in our experiments. The Exact F1 metric may not be a strong enough reward to properly capture the dynamics of graph generation for WebNLG+, as it is very rigid in its measure (one must have an exact match), although the same reward gave good results on our second dataset, TEKGEN. A more sensitive metric could possibly help. We even tried to use n-gram based metrics (like BLEU), but to no avail. We further address this issue at the end of this Section. TEKGEN G2T For the TEKGEN dataset, we present our results on Graph-to-Text generation in Tab. 3. Similar to the experiments on WebNLG+, we pick the best model during CE fine-tuning based on the METEOR score and proceed with RL fine-tuning. We observe that the RL fine-tuning step helps boost the test split scores on all metrics. It is worth noting that the scores slightly underestimate the potential of our system because of the nature of the sentences in the TEKGEN dataset. Unlike WebNLG+, in a paired text-graph sample in TEKGEN, the linearized graph does not usually cover all the concepts described in the corresponding text. This leads to underestimation when the hypothesis is scored against the reference using n-gram metrics.

TEKGEN T2G
Results for the Text-to-Graph task on TEKGEN are reported in Tab. 4. Once the CE fine-tuning is done, we continue with RL fine-tuning using Exact F1 as reward. The performance is consistent with what we observe on the G2T task for TEKGEN, where the SCST step boosts the performance of the model. Since we reformulate this dataset (see Section 4.1) to offer T2G and G2T tasks, our approach is the first attempt at understanding the nature of the TEKGEN dataset, and our methods provide a baseline for future research. Please note that for both T2G and G2T tasks on TEKGEN, we only start from a t5-large PLM.
Summary Results on WebNLG+ 2020 and TEKGEN demonstrated that RL fine-tuning of models leads to significant improvements of results for T2G and G2T, establishing new state-of-the-art results for both tasks. For WebNLG+, T2G was a challenging task for RL fine-tuning. In further work, we plan to address this issue by investigating two points: first, look into a more sensible graph-dependent sampling for graph structures, rather than the current multinomial sampling of the best tokens at each generation step. Second, try different reward schemes where the reward is more attuned to the challenges of graph generation as well as graph structure, allowing for some curriculum learning, or increasing the harshness of rewards gradually during training. Results on TEKGEN showed that RL fine-tuning is a viable option even on large-scale datasets. To enrich this quantitative study of ReGen, we provide a few qualitative cherry-picked results in Tab. 11 and Tab. 12 in Appendix.

Conclusions
In this paper, we proposed to use RL to improve upon current generation for text-to-graph and graph-to-text tasks on the WebNLG+ 2020 Challenge dataset using pretrained LMs. We not only defined a novel Seq2Seq training of models for the T2G and G2T generation tasks, but also established state-of-the-art results on WebNLG+ for both tasks, significantly improving on the previously published results. We provided extensive analyses of our results and of the steps taken to reach these improvements. We then expanded our approach to large-scale training by means of TEKGEN, where we demonstrated that RL fine-tuning provides a robust way to improve upon regular model fine-tuning on a dataset that is orders of magnitude larger than the WebNLG+ starting point. We established gains despite a weaker content overlap in text-graph data pairs for TEKGEN. Along the way, we constructed subject and relation-object boundaries from TEKGEN sentence-triples pairs that we plan on releasing to benefit the research community. Future work will focus on developing a variant of SCST that leverages the unique structure of graphs, by either performing more sensible graph-dependent sampling, or by investigating different reward schemes more attuned to integrating the content and structure of graphs.

Broader Impact Statement
The techniques proposed in this paper are inherently dependent on the training data and the PLMs used for fine-tuning on this data. The models do benefit from the large amount of data seen by the PLMs they are derived from; however, it is fair to assume that any detectable bias in the original data or PLMs would most likely be transferred to the text-to-graph and graph-to-text generative models. This is something to keep in mind when building these generative models. Public datasets were used for all experiments. The TEKGEN dataset with recreated boundaries does not change the underlying data and should not add any further noise or bias to the original data.

A Training Setup
All our experiments were run using NVIDIA V100 GPUs for training and validation; some trainings were done on A100s. We distributed our training over 2-4 GPUs depending on availability. Each CE training epoch took 30 minutes to 1 hour depending on the number of GPUs utilized.
Validation and testing (1,779 and 2,155 samples for testA and testB of WebNLG+ 2020) lasted from 40 minutes to 1 hour depending on machines. Computation was dominated by beam search generation, as we used beam search with a beam size of 5 and a max sequence length of 192 (since linearized graph sequences can be quite long). We used the official scoring scripts released by the WebNLG+ 2020 Challenge to score all our experiments. The evaluation of graphs is the most computationally expensive, as all possible matching combinations are tested in what amounts to factorial complexity, taking the scoring of sets of more than 8 triples from impractical to infeasible.
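The factorial growth that makes graph scoring expensive is easy to see: brute-force matching of a hypothesis against a reference of n triples considers, in the worst case, all n! orderings (a simplified illustration of the blow-up, not the official script's exact algorithm):

```python
import math
from itertools import permutations

def orderings_to_check(n):
    """Number of triple orderings a brute-force matcher would consider
    for a graph of n triples."""
    return math.factorial(n)

# Counting permutations of 8 items directly agrees with the closed form.
count = sum(1 for _ in permutations(range(8)))
```

At n = 8 there are already over 40,000 orderings, and each additional triple multiplies the count by n + 1, which is why scoring larger triple sets quickly becomes infeasible.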
All our models were built using PyTorch. Total effective batch sizes were set to either 20 or 24 samples for our distributed training. We adjusted the batch size on each worker to ensure consistent global batch size of 20 or 24.
We did some search on learning rates for t5-large training and on SCST rewards; see discussion and results in Section C.
All our trainings have a seeded random number generator for reproducibility. We also report results on WebNLG+ 2020 G2T tasks for each training setup by showing results for 3 models from different seeds, and provide means and standard deviations of these results in Tab. 10.

B WebNLG+ 2020 Results per Category for Best G2T and T2G Models

In Tab. 5, we report results for all WebNLG+ 2020 categories for our best CE and RL models. While results for unseen categories are much worse than for seen categories, RL fine-tuning manages to improve on both seen and unseen categories.
Tab. 6 provides results for seen, unseen and all categories for our best CE model ReGen T2G.CE which established state-of-the-art results on T2G task of WebNLG+ 2020 Challenge dataset.

C Ablation Studies
In Tables 7 and 8 we present ablation studies of different optimized metrics and learning rates for SCST training. As can be seen from Table 7, when METEOR is used as a reward, we get the best performance across all the metrics. We also tried using a combination of multiple rewards with different scalings but did not get any gain over single-metric rewards. In Table 8, we also show the effect of the learning rate on SCST performance. Using lr = 5·10^-6 gave us the best performance, while higher rates, such as 10^-4, led to unstable training and collapse of SCST.

D G2T Results t5-base models for SCST with METEOR Reward
Results for SCST fine-tuning of t5-base models using a METEOR reward are compiled in Tab. 9. Clearly, these models achieve better METEOR results as expected since they are RL optimized on this metric.

E G2T Results for Models from Multiple Random Seeds
All our trainings have a seeded random number generator for reproducibility. We also report the mean and standard deviations for all our G2T models. Each model setup was run 3 times using three independent and distinct seeds, following the exact same process. This is to ensure that our results are not just the product of a lucky system configuration or otherwise advantageous random shuffling of our training dataset. All results are reported in Tab. 10. The gains reported between CE and RL for our t5-large models clearly still show after averaging all 3 models from distinct random seeds. For t5-base, gains between CE and RL are still present, albeit smaller than for our best systems.

F Generation Examples for G2T and T2G
We present some cherry-picked examples for G2T in Tab. 12 and for T2G in Tab. 11 for both WebNLG and TEKGEN datasets.

G Processed TEKGEN Dataset
In Fig. 3 we show an example of our processing of TEKGEN dataset in establishing subject, relation, object boundaries. This enables both training and evaluating systems for T2G and G2T tasks.