Stage-wise Fine-tuning for Graph-to-Text Generation

Graph-to-text generation has benefited from pre-trained language models (PLMs), which achieve better performance than structured graph encoders. However, these models fail to fully utilize the structure information of the input graph. In this paper, we aim to further improve the performance of pre-trained language models by proposing a structured graph-to-text model with a two-step fine-tuning mechanism that first fine-tunes the model on Wikipedia data before adapting it to graph-to-text generation. In addition to using the traditional token and position embeddings to encode the knowledge graph (KG), we propose a novel tree-level embedding method to capture the inter-dependency structures of the input graph. This new approach significantly improves performance on all text generation metrics for the English WebNLG 2017 dataset.


Introduction
In the graph-to-text generation task (Gardent et al., 2017), the model takes in a complex KG (an example is in Figure 1) and generates a corresponding faithful natural language description (Table 1). Previous efforts for this task can mainly be divided into two categories: sequence-to-sequence models that directly solve the generation task with LSTMs (Gardent et al., 2017) or Transformers (Castro Ferreira et al., 2019); and graph-to-text models (Trisedya et al., 2018; Marcheggiani and Perez-Beltrachini, 2018) which use a graph encoder to capture the structure of the KGs. Recently, Transformer-based PLMs such as GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020) have achieved state-of-the-art results on the WebNLG dataset thanks to the factual knowledge acquired in the pre-training phase (Harkous et al., 2020; Ribeiro et al., 2020b; Kale, 2020; Chen et al., 2020a).

Despite such improvements, PLMs fine-tuned only on the clean (or labeled) data may be more prone to hallucinating factual knowledge (e.g., "Visvesvaraya Technological University" in Table 1). Inspired by the success of domain-adaptive pre-training (Gururangan et al., 2020), we propose a novel two-step fine-tuning mechanism for the graph-to-text generation task. Unlike Ribeiro et al. (2020b), Herzig et al. (2020), and Chen et al. (2020a), who directly fine-tune the PLMs on the training set, we first fine-tune our model on noisy RDF-graph and related-article pairs crawled from Wikipedia before the final fine-tuning on the clean, labeled training set. The additional fine-tuning step benefits our model by leveraging triples not included in the training set and by reducing the chance that the model fabricates facts based on the language model alone.

Meanwhile, the PLMs may also fail to cover all relations in the KG, creating incorrect or missing facts. For example, in Table 1, although T5-large with Wikipedia fine-tuning successfully removes the unwanted content, it still ignores the "sports Governing Body" relation and incorrectly links the university to both "Telangana" and "Arabian Sea". To better capture the structure and inter-dependence of facts in the KG, instead of using a complex graph encoder, we leverage the power of Transformer-based PLMs with additional position embeddings, which have proved effective in various generation tasks (Herzig et al., 2020; Chen et al., 2020a,b). Here, we extend the embedding layer of Transformer-based PLMs with two additional embeddings, triple-role and tree-level, to capture the graph structure.

Table 1: Example system outputs for the KG in Figure 1. Each entity is framed in the same color as the corresponding entity in Figure 1; fabricated facts, [missed relations], and incorrect relations are highlighted in different colors.

T5-large + Wiki: The Acharya Institute of Technology is located in the state of Karnataka. It was given the Technical Campus status by the All India Council for Technical Education, which is located in Mumbai. The institute offers tennis and has Telangana to its northeast and the Arabian Sea to its west. [International Tennis Federation]

T5-large + Position: The Acharya Institute of Technology is located in the state of Karnataka, which has Telangana to its northeast and the Arabian Sea to its west. It was given the Technical Campus status by the All India Council for Technical Education in Mumbai. The Institute offers tennis, which is governed by the International Tennis Federation.

T5-large + Wiki + Position: The Acharya Institute of Technology in Karnataka was given the 'Technical Campus' status by the All India Council for Technical Education in Mumbai. Karnataka has Telangana to its northeast and the Arabian Sea to its west. One of the sports offered at the Institute is tennis, which is governed by the International Tennis Federation.

* This research was conducted during the author's internship at Salesforce Research.
1 The programs, data, and resources are publicly available for research purposes at: https://github.com/EagleW/Stage-wise-Fine-tuning
We explore the proposed stage-wise fine-tuning and structure-preserving embedding strategies for the graph-to-text generation task on the WebNLG corpus (Gardent et al., 2017). Our experimental results clearly demonstrate the benefit of each strategy in achieving state-of-the-art performance on the most commonly reported automatic evaluation metrics.

Method
Given an RDF graph with multiple relations G = {(s_1, r_1, o_1), (s_2, r_2, o_2), ..., (s_n, r_n, o_n)}, our goal is to generate a text that faithfully describes the input graph. We represent each relation with a triple (s_i, r_i, o_i) ∈ G for i ∈ {1, ..., n}, where s_i, r_i, and o_i are natural language phrases that represent the subject, type, and object of the relation, respectively. We augment our model with additional position embeddings to capture the structure of the KG. To feed the input to the large-scale Transformer-based PLM, we flatten the graph into a concatenation of linearized triple sequences |S s_1 |P r_1 |O o_1 ... |S s_n |P r_n |O o_n, following Ribeiro et al. (2020b), where |S, |P, and |O are special tokens prepended to indicate whether the phrases in the relations are subjects, relations, or objects, respectively. Instead of directly fine-tuning the PLM on the WebNLG dataset, we first fine-tune our model on a noisy but larger corpus crawled from Wikipedia, and then fine-tune the model on the training set.

Positional embeddings. Since the input of the WebNLG task is a small KG which describes properties of entities, we introduce additional positional embeddings to enhance the flattened input of pre-trained Transformer-based sequence-to-sequence models such as BART and TaPas (Herzig et al., 2020). As shown in Figure 2, we extend the input layer with two position-aware embeddings in addition to the original position embeddings:

• Position ID, which is the same as the original position ID used in BART, is the index of the token in the flattened sequence |S s_1 |P r_1 |O o_1 ... |S s_n |P r_n |O o_n.
• Triple Role ID takes 3 values for a specific triple (s_i, r_i, o_i): 1 for the subject s_i, 2 for the relation r_i, and 3 for the object o_i.
• Tree-level ID is the distance (the number of relations) from the root, which is the source vertex of the RDF graph (a sketch of the full input construction follows below).
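To make the input construction concrete, the following is a minimal Python/PyTorch sketch that flattens a toy RDF graph into the |S/|P/|O sequence, derives the triple-role and tree-level IDs, and sums the corresponding learned embeddings with the token and original position embeddings. It reflects our own simplifying assumptions rather than the authors' released code: word-level tokens instead of the PLM's subword tokenizer, every token of a triple tagged with the depth of that triple's subject, and hypothetical names such as linearize and StructureAwareEmbedding.

```python
import torch
import torch.nn as nn

def linearize(graph, root):
    """Flatten (subject, relation, object) triples into the |S / |P / |O token
    sequence and return parallel triple-role and tree-level ID lists."""
    # Tree level: number of relations between the root (source vertex) and the
    # triple's subject, computed by simple relaxation over the triples.
    depth = {root: 0}
    for _ in range(len(graph)):
        for s, _, o in graph:
            if s in depth and depth.get(o, len(graph) + 1) > depth[s] + 1:
                depth[o] = depth[s] + 1

    tokens, role_ids, level_ids = [], [], []
    for s, r, o in graph:
        lvl = depth.get(s, 0)  # assumption: tag the whole triple with its subject's depth
        for marker, phrase, role in (("|S", s, 1), ("|P", r, 2), ("|O", o, 3)):
            for tok in [marker] + phrase.split():
                tokens.append(tok)
                role_ids.append(role)   # 1 = subject, 2 = relation, 3 = object
                level_ids.append(lvl)
    return tokens, role_ids, level_ids

class StructureAwareEmbedding(nn.Module):
    """Sum of token, position, triple-role, and tree-level embeddings."""
    def __init__(self, vocab_size, d_model, max_len=512, n_roles=4, max_depth=8):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)      # original position ID
        self.role = nn.Embedding(n_roles, d_model)     # triple-role ID
        self.level = nn.Embedding(max_depth, d_model)  # tree-level ID

    def forward(self, token_ids, role_ids, level_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        pos_ids = pos_ids.unsqueeze(0).expand_as(token_ids)
        return (self.tok(token_ids) + self.pos(pos_ids)
                + self.role(role_ids) + self.level(level_ids))

# Toy graph in the spirit of Figure 1 (entity strings shortened).
graph = [("Acharya Institute of Technology", "state", "Karnataka"),
         ("Karnataka", "has to its northeast", "Telangana")]
tokens, roles, levels = linearize(graph, root="Acharya Institute of Technology")

emb = StructureAwareEmbedding(vocab_size=32128, d_model=64)
dummy_token_ids = torch.zeros(1, len(tokens), dtype=torch.long)  # stand-in for subword IDs
out = emb(dummy_token_ids, torch.tensor([roles]), torch.tensor([levels]))
print(tokens)     # ['|S', 'Acharya', ..., '|O', 'Telangana']
print(levels)     # 0 for the first triple, 1 for the second
print(out.shape)  # torch.Size([1, seq_len, 64])
```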
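The stage-wise schedule itself amounts to two successive fine-tuning runs over the same weights, which the following hedged sketch illustrates with the Hugging Face Seq2SeqTrainer. The toy datasets, hyperparameter values (epochs, batch size, learning rate), and output directory names are our own placeholders, and the extra triple-role/tree-level embedding channels from above are omitted so that only the two-stage schedule is shown; this is not the authors' training script.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def make_dataset(pairs):
    """pairs: list of (linearized graph, target description) strings."""
    enc = tokenizer([src for src, _ in pairs], truncation=True)
    lab = tokenizer([tgt for _, tgt in pairs], truncation=True)
    return Dataset.from_dict({"input_ids": enc["input_ids"],
                              "attention_mask": enc["attention_mask"],
                              "labels": lab["input_ids"]})

def fine_tune(train_dataset, output_dir, epochs):
    """One fine-tuning stage; the shared model object is updated in place."""
    args = Seq2SeqTrainingArguments(output_dir=output_dir,
                                    num_train_epochs=epochs,
                                    per_device_train_batch_size=4,
                                    learning_rate=3e-5,
                                    save_strategy="epoch")
    Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset,
                   data_collator=collator).train()

# Stage 1: noisy RDF-graph / related-article pairs crawled from Wikipedia
# (a single toy pair stands in for the crawled corpus here).
wiki_pairs = make_dataset([
    ("|S Acharya Institute of Technology |P state |O Karnataka",
     "The Acharya Institute of Technology is in the state of Karnataka."),
])
fine_tune(wiki_pairs, "stage1_wikipedia", epochs=1)

# Stage 2: continue from the stage-1 weights on the clean WebNLG training set
# (again a toy pair; in practice the full labeled training set).
webnlg_train = make_dataset([
    ("|S Karnataka |P has to its northeast |O Telangana",
     "Karnataka has Telangana to its northeast."),
])
fine_tune(webnlg_train, "stage2_webnlg", epochs=10)
```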

Results and Analysis
We report results with the standard NLG evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and TER (Snover et al., 2006), as shown in Table 2. Because Castro Ferreira et al. (2020) found that BERTScore (Zhang et al., 2020) correlates better with human evaluation ratings, we also use BERTScore to evaluate system results, as shown in Table 4; we only evaluate with BERTScore those baselines whose results are available online. When selecting the best models, we also evaluate each model with the PARENT metric (Dhingra et al., 2019), which measures the overlap between predictions and both the reference texts and the graph contents; Dhingra et al. (2019) show that PARENT correlates better with human ratings. Table 3 shows that the pre-trained models with two-step fine-tuning and position embeddings achieve better results (more examples are given in the Appendix). We conduct a paired t-test between our proposed model and all the other baselines on 10 randomly sampled subsets. The differences are statistically significant with p ≤ 0.008 for all settings.
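As a rough illustration of how these automatic scores and the significance test can be computed with common open-source packages (sacrebleu, bert-score, scipy), a minimal sketch follows. The file names, the single reference per example (WebNLG actually provides up to three references), and the half-size subsets are our own simplifications; this is not the official WebNLG evaluation script.

```python
import random

import sacrebleu
from bert_score import score as bert_score
from scipy import stats

hyps_ours = [l.strip() for l in open("outputs_ours.txt")]      # our system outputs
hyps_base = [l.strip() for l in open("outputs_baseline.txt")]  # a baseline's outputs
refs = [l.strip() for l in open("references.txt")]             # one reference per line

# Corpus-level BLEU (sacrebleu expects a list of reference streams).
print("BLEU:", sacrebleu.corpus_bleu(hyps_ours, [refs]).score)

# BERTScore F1 averaged over the corpus.
_, _, f1 = bert_score(hyps_ours, refs, lang="en")
print("BERTScore F1:", f1.mean().item())

# Paired t-test over 10 randomly sampled subsets, mirroring the test above.
random.seed(0)
ours, base = [], []
for _ in range(10):
    idx = random.sample(range(len(refs)), k=len(refs) // 2)
    ours.append(sacrebleu.corpus_bleu([hyps_ours[i] for i in idx],
                                      [[refs[i] for i in idx]]).score)
    base.append(sacrebleu.corpus_bleu([hyps_base[i] for i in idx],
                                      [[refs[i] for i in idx]]).score)
print("paired t-test:", stats.ttest_rel(ours, base))
```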
Results with Wikipedia fine-tuning. The Wikipedia fine-tuning helps the model handle unseen relations such as "inOfficeWhileVicePresident" and "activeYearsStartYear" by stating "His vice president is Atiku Abubakar." and "started playing in 1995", respectively. It also combines relations of the same type in the correct order; e.g., given two death places of a person, the model generates "died in Sidcup, London" instead of generating two sentences or placing the city name ahead of the area name.
Results with positional embeddings. For KGs with multiple triples, the additional positional embeddings help reduce the errors introduced by pronoun ambiguity. For instance, for a KG which has a "leaderName" relation for both a country's leader and a university's dean, position embeddings can distinguish these two relations by stating "Denmark's leader is Lars Løkke Rasmussen" instead of "its leader is Lars Løkke Rasmussen". The tree-level embeddings also help the model arrange multiple triples into one sentence, such as combining the city, the country, the affiliation, and the affiliation's headquarters of a university into a single sentence: "The School of Business and Social Sciences at the Aarhus University in Aarhus, Denmark is affiliated to the European University Association in Brussels".

Remaining Challenges
However, pre-trained language models still produce some errors, as shown in Table 5. Because the language model is heavily pre-trained, its strong priors can override the patterns that would enable it to infer the right relation. For example, for the "activeYearsStartYear" relation, the model may confuse it with the birth year. For some relations that do not have a clear direction, the language model is not powerful enough to capture the deep connection between the subject and the object. For example, for the relation "doctoralStudent", the model mistakenly describes a professor as a Ph.D. student. Similarly, the model treats an asteroid as a person because it has an epoch date. For KGs with multiple triples, the generator may still miss relations or mix up the subjects and objects of different relations, especially in the unseen category. For instance, for a soccer player with multiple clubs, the system may confuse the subject of one club's relation with another club's.

Related Work
The WebNLG task is similar to WikiBio generation (Lebret et al., 2016), AMR-to-text generation (Song et al., 2018), and ROTOWIRE (Wiseman et al., 2017; Puduppully et al., 2019). Previous methods usually treat graph-to-text generation as an end-to-end generation task. Those models (Trisedya et al., 2018; Gong et al., 2019; Shen et al., 2020) usually first linearize the knowledge graph and then use an attention mechanism to generate the description sentences. Because linearizing the input graph may sacrifice the inter-dependencies inside it, some works (Ribeiro et al., 2019, 2020a; Zhao et al., 2020) use graph encoders such as GCNs (Duvenaud et al., 2015) and graph transformers (Wang et al., 2020a; Koncel-Kedziorski et al., 2019) to encode the input graphs. Others (Shen et al., 2020) carefully design loss functions to control the generation quality. With the development of computation resources, large-scale PLMs such as GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020) achieve state-of-the-art results even with simple linearized graph input (Harkous et al., 2020; Chen et al., 2020a; Kale, 2020; Ribeiro et al., 2020b). Instead of directly fine-tuning the PLMs, we propose a two-step fine-tuning mechanism to obtain better domain adaptation. In addition, using positional embeddings as an extension of PLMs has shown its effectiveness in table-based question answering (Herzig et al., 2020), fact verification, and graph-to-text generation (Chen et al., 2020a). We capture the graph structure by enhancing the input layer with triple-role and tree-level embeddings.

Conclusions and Future Work
We propose a new structured model for the graph-to-text generation task based on a two-step fine-tuning mechanism and novel tree-level position embeddings. In the future, we aim to address the remaining challenges and extend the framework to broader applications.

Acknowledgement
This work is partially supported by Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.