GraDA: Graph Generative Data Augmentation for Commonsense Reasoning

Recent advances in commonsense reasoning have been fueled by the availability of large-scale, human-annotated datasets. Manual annotation of such datasets, many of which are based on existing knowledge bases, is expensive and not scalable. Moreover, it is challenging to build augmentation data for commonsense reasoning because the synthetic questions need to adhere to real-world scenarios. Hence, we present GraDA, a graph-generative data augmentation framework to synthesize factual data samples from knowledge graphs for commonsense reasoning datasets. First, we train a graph-to-text model for conditional generation of questions from graph entities and relations. Then, we train a generator with GAN loss to generate distractors for the synthetic questions. Our approach improves performance for SocialIQA, CODAH, HellaSwag and CommonsenseQA, and works well for generative tasks like ProtoQA. We show improvement in robustness to semantic adversaries after training with GraDA, and provide human evaluation of the quality of synthetic datasets in terms of factuality and answerability. Our work provides evidence for, and encourages future research into, graph-based generative data augmentation.


Introduction
Recent work has seen the emergence of several datasets for improving commonsense reasoning of language models through tasks like question answering (QA) (Sap et al., 2019b; Talmor et al., 2019; Bisk et al., 2020) and natural language inference (Bhagavatula et al., 2020; Zellers et al., 2019; Sakaguchi et al., 2020). Some of these datasets are based on existing knowledge graphs that represent different aspects of commonsense through entities and relations. For example, annotators for SocialIQA (Sap et al., 2019b) were shown an event from the inferential knowledge graph ATOMIC (Sap et al., 2019a) and instructed to turn it into a sentence by adding names, filling placeholders, adding context, etc. For multiple-choice QA datasets, annotators are also instructed to write distractor choices for each question. These useful datasets are collected through a time-consuming and expensive crowdsourcing process which is hard to scale. Large pretrained models like GPT2 (Radford et al., 2019) can be finetuned to generate sentences from narrow data distributions, and this ability has recently been leveraged to augment datasets for text classification (Anaby-Tavor et al., 2020) and question answering (Puri et al., 2020; Yang et al., 2020). However, it is challenging to generate augmentation data for commonsense reasoning because the generated questions and answers (referred to as "synthetic" in the rest of the paper) need to depict plausible real-world scenarios accurately. Hence, we develop GRADA, a graph-based generative data augmentation framework to generate synthetic samples from existing knowledge graphs that encode information about the real world. We focus on generating synthetic samples for models that perform discriminative and generative commonsense question answering. (Code and synthetic data files are available at https://github.com/adymaharana/GraDA.)
Each sample in commonsense reasoning datasets comprises a question which describes a real-world scenario and can be mapped to a set of predefined entities and relations from knowledge bases like ConceptNet and ATOMIC. For instance, the question "Besides a mattress, name something people sleep on." from the ProtoQA dataset (Boratko et al., 2020) can be mapped to the single-hop path (mattress, RelatedTo, people) using ConceptNet. If a pretrained language model is trained to conditionally generate questions from such input paths, we can expect it to generate sensible questions when provided new paths with similar relations. The model will likely generalize to unseen entity nodes and generate questions containing unique commonsense knowledge. Following this intuition, we finetune GPT2 (Radford et al., 2019) to generate questions which explicitly depict the entities and relations in the input path. When trained on the aforementioned example (alongside other similar examples) and provided with the new path (mattress, RelatedTo, soft), our model generates "Besides a mattress, name something that's soft.", which is a valid question for probing real-world commonsense. Usually, these paths contain multiple nodes with several hops and hence are referred to as graphs in the rest of the paper. In order to represent the graph, we explore both (a) encoding of the linearized graph and (b) augmentation of linear encodings with a structure-aware encoding of the graph, and find that the latter improves the transfer of semantic knowledge from graph to text. Synthetic questions need to be accompanied by synthetic answers and distractor choices (for multiple-choice datasets), which are similarly generated by finetuning GPT2 for conditional generation of answers/distractors from the question. However, Yang et al. (2020) report that human annotators find it hard to pick a unique/unambiguous answer in more than 50% of the synthetic dataset generated in this manner. Therefore, we explore an alternative where we finetune the generative model within a GAN framework (Nie et al., 2019a), in which it is continuously challenged by a discriminator model to generate unique distractors that can fool the discriminator (see OptionGAN, Figure 1).
The synthetic questions and answers thus generated are assembled into synthetic samples which are then used in a two-stage training pipeline (Mitra et al., 2019). Additionally, since the generative pipeline is only an approximate imitation of the human annotation process, we are left with several ambiguous and inaccurate samples in the synthetic pool. Hence, we retain the most informative data samples from the synthetic pool by using Question Answering Probability (Zhang and Bansal, 2019) to measure accuracy by answerability. Our contributions can be summarized as follows:
• We present a generative framework consisting of (i) a graph-to-text model to convert knowledge graphs to questions, (ii) a model finetuned with GAN loss to generate distractors for commonsense reasoning QA datasets, and (iii) a filter for selecting the most informative samples from the synthetic datasets.
• We improve performance on commonsense reasoning datasets, and perform ablation analyses to show the impact of the various modules in our framework, as well as human evaluation of synthetic dataset quality.

Related Work
Explicit reasoning over knowledge graphs has been a popular approach for improving the commonsense understanding of QA models, and prior work has applied adversarial perturbation methods (e.g., Cai and Wang, 2018) to QA samples for data augmentation. Yang et al. (2020) generate randomly initialized samples from finetuned GPT2 as augmentation data for target datasets. We ground the generated samples in real-world facts by providing knowledge graphs as input to the model. There has been a surge of efforts in neural graph-to-text modeling in recent years. Marcheggiani and Perez-Beltrachini (2018) encode input graphs using a graph convolutional encoder (Kipf and Welling, 2017). Koncel-Kedziorski et al. (2019) propose GraphWriter, which improves on the graph attention networks presented in Velickovic et al. (2018) by replacing the self-attention encoder with Transformer blocks (Vaswani et al., 2017). Several recent works have shown that pretrained generative models can be finetuned with or without structure-aware graph encoding to improve graph-to-text generation (Mager et al., 2020; Ribeiro et al., 2020; Hoyle et al., 2020; He et al., 2020; Ke et al., 2021). Query or question generation has also been shown to benefit from knowledge graphs (Shen et al., 2022; Bi et al., 2020). We combine the structure-aware encoding capabilities of graph-to-text models with the rich contextual knowledge of pretrained models in GraphGPT2 and generate rich real-world scenarios from sparse sub-graphs (Shen et al., 2022; Chen et al., 2020; Kumar et al., 2019).
Good distractors are necessary for a task model to learn the right reasoning when answering multiple-choice datasets. To this end, Liang et al. (2018) rank distractors using feature-based ensemble methods. Offerijns et al. (2020) and Yang et al. (2020) finetune GPT2 to generate distractors. Chung et al. (2020) approach distractor generation as a coverage problem and select distractors that maximize sample difficulty. Cai and Wang (2018) use adversarial training to sample high-quality negative training examples for knowledge graph embeddings. In a similar line of work, we use generative adversarial networks (GANs) (Goodfellow et al., 2014) with the Gumbel-Softmax relaxation (Kusner and Hernández-Lobato, 2016; Nie et al., 2019b) and train a generator with GAN loss to imitate the creation of human-authored tricky, incorrect answer options. Most NLP applications instead use the REINFORCE algorithm (Sutton et al., 2000) and its variants (Yu et al., 2017; Cai and Wang, 2018; Qin et al., 2018; Zhang et al., 2018) to circumvent the discrete sampling issue in text-based GANs.

Methods
In this section, we describe the various modules in the GRADA framework.

Graph-to-Text Generation
In the first module of our pipeline, we generate synthetic questions using knowledge graphs as input. Given a dataset of input graphs $g_i$, we finetune GPT2 with cross-entropy loss for conditional generation of questions $q_i$ from the graphs, i.e., $\mathcal{L}_q = \sum_{i=1}^{N} \log p(q_i \mid f(g_i))$, where $f(\cdot)$ is the function for encoding the graph and $p(\cdot)$ represents the model's output probabilities. We explore linearized graph encoding as well as structure-aware encoding of the graph.
Linearized Graph Input. Graph linearization is a simple way to use graphs like text when finetuning GPT2. We adopt depth-first search to linearize the input graphs, and preserve edge information to some extent by augmenting the GPT2 vocabulary with special tokens for edges. GPT2 is finetuned for conditional generation of the target question from this linearized graph input.
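To make the linearization concrete, here is a minimal sketch assuming the graph arrives as (head, relation, tail) triples; the <Relation> edge-token format is illustrative, not necessarily GRADA's exact vocabulary:

```python
# Depth-first-search linearization of a triple graph; edge information is
# preserved as special tokens (added to the GPT2 vocabulary in GraDA).
from collections import defaultdict

def linearize(triples, root):
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((r, t))

    tokens, visited = [], set()

    def dfs(node):
        visited.add(node)
        tokens.append(node)
        for rel, nxt in adj[node]:
            tokens.append(f"<{rel}>")   # edge kept as a special token
            if nxt not in visited:
                dfs(nxt)
            else:
                tokens.append(nxt)      # re-emit an already-visited endpoint
    dfs(root)
    return " ".join(tokens)

print(linearize([("mattress", "RelatedTo", "people")], "mattress"))
# -> "mattress <RelatedTo> people"
```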
Using linearized graphs with pretrained language models (PTLMs) surpasses graph-based architectures at data-to-text generation by a large margin (Ribeiro et al., 2020). However, Mager et al. (2020) show that omitting the edge information from linearized graphs notably degrades performance, implying that graph structure is beneficial for generation. Hence, we propose GraphGPT2.
GraphGPT2 for Structure-aware Graph Input. Instead of linearizing the input graph, we encode it using a Transformer-based graph encoder $f_s(\cdot)$ which preserves the graph structure by performing masked self-attention over edges and nodes. We use the Transformer-based graph encoder from GraphWriter (Koncel-Kedziorski et al., 2019) for structure-preserving encoding of graphs. First, we convert the input graphs $g_i$ into unlabeled connected bipartite graphs $G_i = (v_i, e_i)$, where $v_i$ is the list of entities, relations and the global vertex, and $e_i$ is the adjacency matrix describing the directed edges (Beck et al., 2018). The global vertex is connected to all entity vertices and promotes global context modelling by allowing information flow between all parts of the graph. Next, $v_i$ is projected to a dense, continuous embedding space $V_i$ and is sent as input to the graph encoder (see Figure 2). The encoder is composed of $L$ stacked Transformer blocks; each Transformer block consists of an $N$-headed self-attention layer followed by normalization and a two-layer feed-forward network. The resulting encodings, i.e., $f_s(g_i)$, are referred to as graph contextualized vertex encodings. These encodings are prepended to the embedded representation of the linearized graph in the form of past key values, and sent as input to the decoder. The decoder, i.e., pretrained GPT2, is finetuned to generate a coherent question from the combined embeddings. The graph encoder is initialized with GPT2 embeddings to force continuity in word representation across modules. Figure 2 shows the integration of graph contextualized encodings with GPT2 in GraphGPT2. The combined generative model is finetuned end-to-end to maximize the conditional log-likelihood of the target question $q_i$, i.e., $\mathcal{L}_q = \sum_{i=1}^{N} \log p(q_i \mid f_s(g_i))$.
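As a rough sketch of this integration (not the authors' exact implementation): the snippet below uses a generic PyTorch TransformerEncoder in place of the GraphWriter encoder, and splices the graph encodings in through inputs_embeds rather than past key values; class and argument names are illustrative.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class GraphGPT2(nn.Module):
    def __init__(self, n_layers=2, n_heads=8):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d = self.gpt2.config.n_embd
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.graph_encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, vertex_ids, edge_mask, input_ids, labels):
        # Vertex embeddings are initialized from GPT2's embedding table so
        # that word representations stay continuous across modules.
        v = self.gpt2.transformer.wte(vertex_ids)
        # edge_mask is a (num_vertices x num_vertices) boolean attention
        # mask derived from the adjacency matrix (True = blocked), which
        # realizes masked self-attention over edges and nodes.
        g = self.graph_encoder(v, mask=edge_mask)
        tok = self.gpt2.transformer.wte(input_ids)
        embeds = torch.cat([g, tok], dim=1)
        # The graph prefix carries no LM loss, so label it -100; `labels`
        # is usually input_ids with pad positions also set to -100.
        prefix = torch.full(vertex_ids.shape, -100, dtype=torch.long)
        return self.gpt2(inputs_embeds=embeds,
                         labels=torch.cat([prefix, labels], dim=1)).loss
```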
During inference, both of the above models are provided with graphs that do not appear in the training dataset, in order to generate synthetic questions containing new knowledge. See Sec. 4.1 for details on the creation of the training and inference datasets.

Answer & Distractor Generation
We finetune a GPT2 model for conditional generation of answers from questions, i.e., $\mathcal{L}_a = \sum_{i=1}^{N} \log p(a_i \mid q_i)$. However, as discussed in Sec. 1, a similar method for conditional generation of distractors does not guarantee good distractors. Hence, we finetune GPT2 within a GAN framework to generate maximally adversarial distractors, in a bid to imitate the best human annotator.
OptionGAN for Adversarial Choices. We train a model to generate distractors (in the multiple-choice QA task) for the synthetic questions obtained from GraphGPT2 (see Figure 1) using a generator-discriminator adversarial framework. The discriminator $D$ is a sequential classification model that takes the question $q_i$ concatenated with the ground-truth correct answer $a_i$, i.e., $[q_i; a_i]$, or with the distractor $\tilde{d}_i$ generated by the generator $G$, i.e., $[q_i; \tilde{d}_i]$, as input and classifies the pair as correct or otherwise. While training, the generator runs the risk of learning to generate correct answers instead of distractors, since its goal is to fool the discriminator into classifying the question-distractor pair $[q_i; \tilde{d}_i]$ as correct. To prevent this, we heavily bias the model by first pretraining it to generate only distractors using the conditional cross-entropy loss, $\mathcal{L}_d = \sum_{i=1}^{N} \log p(d_i \mid q_i)$, where $q_i$, $d_i$ are the question and distractor respectively, and then continue with adversarial training from the saved weights. We use the question as input instead of the knowledge sub-graph, since most generated questions contain additional semantics from the latent knowledge of the pretrained generative model which is not present in the original sub-graph. Then, the pretrained generator is finetuned within an adversarial framework to produce distractors that successfully fool the discriminator, so that we get adversarial options that are as tricky as human-annotated options (see Figure 3). We use the Gumbel-Softmax relaxation (Nie et al., 2019a) while sampling from the generator to allow the flow of gradients through the discriminator model, i.e., $z = \mathrm{softmax}\left(\frac{1}{\tau}(h + g)\right)$, where $h$, $g$ and $\tau$ are the logits generated from $G$, a Gumbel distribution sample, and the temperature respectively. The temperature is annealed using an exponential function during training. Following RelGAN (Nie et al., 2019a), we use the relativistic standard GAN loss for the adversarial training, i.e., $\min_G \max_D \; \mathbb{E}\left[\log \sigma\left(D([q_i; a_i]) - D([q_i; \tilde{d}_i])\right)\right]$, where $\sigma$ is the sigmoid function: the generator $G$ is trained to minimize the loss while the discriminator $D$ is trained to maximize it. In practice, we use GPT2 for both roles, i.e., generator as well as discriminator.
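A minimal PyTorch sketch of this sampling step follows; the exponential annealing schedule and its constants are illustrative, not the paper's exact values.

```python
# Gumbel-Softmax relaxation, z = softmax((h + g) / tau): the soft one-hot
# z is differentiable, so gradients can flow back from the discriminator
# through the (otherwise discrete) token-sampling step.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau):
    # g ~ Gumbel(0, 1) via inverse transform sampling
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-10) + 1e-10)
    return F.softmax((logits + g) / tau, dim=-1)

def temperature(step, total_steps, tau_max=5.0, tau_min=0.1):
    # exponential annealing from tau_max down to tau_min over training
    return tau_max * (tau_min / tau_max) ** (step / total_steps)

logits = torch.randn(4, 50257)   # generator logits over the GPT2 vocabulary
z = gumbel_softmax_sample(logits, temperature(step=100, total_steps=1000))
# z @ embedding_matrix then feeds the "soft" token into the discriminator
```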

Filtering and Selection of Samples
Despite the careful construction of synthetic samples using knowledge graphs, the pool of synthetic samples can be noisy and may consist of incoherent text, incorrect question-answer pairs or out-of-distribution samples. Hence, we use Question Answering Probability (QAP) (Zhang and Bansal, 2019) to measure the accuracy of synthetic samples.
The QAP score $\mu$ is the prediction probability of the true class by a model with parameters $\theta$ which has been trained on the original dataset, i.e., $\mu_i = p_\theta(y_i^* \mid x_i)$.
Samples with low prediction probabilities for the correct choices are either annotated incorrectly or are especially difficult instances for the model. We define low and high thresholds for the QAP filter, and samples lying within this range are retained in the dataset. See the supplementary material for a comparison of QAP with two other filtering methods, i.e., Energy (Liu et al., 2020) and Model Confidence & Variability (Swayamdipta et al., 2020).
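A minimal sketch of the QAP filter, assuming a HuggingFace-style multiple-choice model (e.g., RobertaForMultipleChoice) whose outputs expose one logit per answer choice; the thresholds and single-sample batching are illustrative:

```python
# Keep a synthetic sample only if the trained task model's probability for
# the labeled answer falls inside [low, high].
import torch
import torch.nn.functional as F

def qap_filter(model, samples, low=0.3, high=0.95):
    kept = []
    for x, label in samples:                  # x: encoded question + choices
        with torch.no_grad():
            logits = model(**x).logits        # one logit per answer choice
        mu = F.softmax(logits, dim=-1)[0, label].item()
        if low <= mu <= high:                 # drop too-easy / likely-wrong
            kept.append((x, label))
    return kept
```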

Datasets
SocialIQA (Sap et al., 2019b) and CommonsenseQA (Talmor et al., 2019) are annotated using knowledge graphs, making them a suitable choice for testing our approach. SocialIQA is a question answering dataset based on ATOMIC (Sap et al., 2019a); CommonsenseQA is based on ConceptNet (Speer et al., 2017) and contains an official split of 9741/1221/1241 samples. Following Yang et al. (2020), we also test our method on HellaSwag-2K (Zellers et al., 2019) and CODAH (Chen et al., 2019) for the low-resource scenario. HellaSwag-2K is created by sampling 2000/1000/1000 examples from the HellaSwag training and validation sets. We test our approach on the CODAH folds (2.8k samples) released by Yang et al. (2020) for comparison. Apart from these four multiple-choice datasets, we also experiment with the generative QA dataset ProtoQA (9762/52/102) (Boratko et al., 2020) and find that our approach works especially well with it. See Appendix for details.
Data Preparation. To prepare graph-to-text datasets for training GraphGPT2, we map the questions to multi-hop paths in ConceptNet (Bauer et al., 2018). We use SpaCy to tag the questions with part-of-speech labels and extract verbs and nouns as concepts, retaining those that appear in ConceptNet as entities along with the connecting relations (see example in Fig. 4). We remove inverse relations from the set of triples. The graphs extracted in this manner are acyclic and can be linearized with a depth-first search. For SocialIQA, we map the questions to a combination of ATOMIC and ConceptNet. ATOMIC events contain nouns and verbs which are representative of the social scenario being described in the event and are further extended in the context by SocialIQA annotators. We tokenize and stem the events and contexts to extract these representative words, and compute the percentage of overlapping words in the context with respect to each event. The event with maximum overlap with the context is selected as the corresponding ATOMIC subject. The ATOMIC relation is selected from the predefined map of ATOMIC relations to SocialIQA questions. In this way, we recover the ATOMIC alignments of nearly 20,000 samples from the training set of SocialIQA (88% accuracy).
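A small sketch of this extraction step, with a toy set standing in for the real ConceptNet entity lookup:

```python
# POS-tag the question with spaCy and keep nouns and verbs that exist as
# ConceptNet entities (toy entity set; GraDA matches the full graph).
import spacy

nlp = spacy.load("en_core_web_sm")
conceptnet_entities = {"mattress", "sleep", "people"}   # toy stand-in

def extract_concepts(question):
    doc = nlp(question)
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ in ("NOUN", "VERB")
            and tok.lemma_.lower() in conceptnet_entities]

print(extract_concepts("Besides a mattress, name something people sleep on."))
# -> ['mattress', 'people', 'sleep']
```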
Generation of Synthetic Data. In order to prepare synthetic datasets, we create a dataset of unseen input graphs by mutating the graphs from the training sets of the graph-to-text datasets. One or two entities are replaced by a randomly selected entity (or relation-entity pair) with similar adjacency to other entities in the input graph, to create a mutated graph. The maximum sequence length of the graph contextualized embeddings is set to 64, while that of GPT2 is set to 128. The synthetic dataset size (pre-filtering) is 100k/50k/10k/10k/50k for SocialIQA, CQA, HellaSwag-2K, CODAH, and ProtoQA respectively. For the generation of synthetic data for SocialIQA, we use the set of tuples from ATOMIC that do not appear in the original dataset. To prepare the synthetic dataset for CommonsenseQA, we select two adversarial choices from ConceptNet and two choices generated by OptionGAN. For ProtoQA, we find accurate answers by generating 30 sets of answers for each synthetic question, ranking the answer choices by frequency and retaining the ones that appear at least 5 times in the 30 sets (see the sketch below). See an example of synthetic context generation in Fig. 4.
Evaluation. To evaluate graph-to-text generation, we define an ORACLE score which measures the semantic relevance of a synthetic question when paired with the original answer options. We replace the original question in validation set samples with the synthetic question and re-evaluate models on this modified dataset. In addition, we adopt the following NLG metrics: BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015) and BERTScore (F1 score) (Zhang et al., 2020). Models trained on the synthetic and original commonsense reasoning datasets are evaluated using their respective task-specific accuracies (see Appendix). For ProtoQA, we report the accuracy in top-k answers where k = 1, 3, 5. We also perform human evaluation of the samples generated using GraphGPT2 and OptionGAN.
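A sketch of the frequency-based answer selection for ProtoQA; generate_answers is an assumed wrapper around the finetuned answer generator:

```python
# Sample 30 answer sets per synthetic question, count answer frequencies,
# and keep answers that appear at least 5 times across the 30 sets.
from collections import Counter

def rank_answers(question, generate_answers, n_sets=30, min_count=5):
    counts = Counter()
    for _ in range(n_sets):
        counts.update(generate_answers(question))   # one sampled answer set
    return [ans for ans, c in counts.most_common() if c >= min_count]
```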

Results & Analysis
First, we present results from the complete GRADA framework, followed by results from ablation experiments. Then, we discuss evaluation of the various generative models in GRADA using automated metrics as well as human annotators. Finally, we evaluate the robustness of models trained with and without GRADA to semantic adversaries, and discuss upper bounds of our data augmentation pipeline. See Appendix for visualization of the quality of the synthetic datasets.

Data Augmentation Results
Results from the best GRADA model are presented in Table 1. The baseline row represents results from the same task models used for GRADA but trained without any data augmentation, i.e., T5-3B for ProtoQA and RoBERTa for all other datasets. We see 1-2% improvements over baseline across all multiple-choice datasets using GRADA. For the best GRADA models (selected using validation results), synthetic samples are generated from structured GraphGPT2 and OptionGAN, and filtered using QAP. GRADA results in large improvements for ProtoQA, i.e., 4-6% higher values on the Max Answers 1/3/5 metrics (see Appendix), suggesting the effectiveness of our approach for similar generative tasks. We see 0.3%, 0.3% and 0.26% improvements with GRADA over G-DAUG for CQA, CODAH and HellaSwag-2K respectively. Our approach also performs similarly to the Option Comparison Network in HyKAS (Ma et al., 2019) for CQA (row 3 in Table 1). Our approach is orthogonal to HyKAS and KG-Fusion, as their instance-level approaches retrieve information for each sample while GRADA augments knowledge on a global level.
Ablation results from the GRADA framework on validation sets are presented in Table 2. The first row of Table 2 presents results from baseline task models, i.e., trained without data augmentation. Next, we compare results from two-stage training and see up to 1.7% (p<0.05 for all datasets) improvements (row 1 vs. 4 in Table 2) with the addition of synthetic data without filtering. Using structured GraphGPT2 leads to 0.47% (p=0.043), 0.39% (p=0.078) and 1.46% (p=0.12) improvements over linearized GraphGPT2 for SocialIQA, CQA and ProtoQA, and diminishing improvements for the smaller datasets. We see consistent but modest improvements, which are not significant, from the addition of distractors generated by OptionGAN. Even though the improvements with OptionGAN are marginal, it is necessary for the completeness of the synthetic generation pipeline. Next, adding a filter to denoise the synthetic pool unequivocally improves results by large margins for all datasets except CQA. Filtering by QAP (row 5 in Table 2) provides additional benefit (p=0.069 and p=0.093 for SocialIQA and CQA, p<0.05 for other datasets) to downstream task models over unfiltered synthetic data augmentation (row 4). See examples of high and low quality synthetic data samples filtered using QAP in Table 7. Smaller datasets benefit the most from GRADA.
Single-hop vs. Multi-hop Paths. Additionally, we finetune GraphGPT2 with sub-graphs made of single-hop paths only to generate the context. We perform data augmentation using the synthetic questions generated through this approach and compare to the GRADA results on validation sets. See results in Table 4. We observe 0.92%, 0.08%, 1.48% and 1.05% drops in performance for the validation sets of SocialIQA, CQA, CODAH and HellaSwag-2K respectively.
Generalization to Unseen Concepts. We measured the % overlap of entity nodes and single-hop paths (subject-relation-object) between the multi-hop KGs spanning the questions of correctly answered samples after GRADA training and the questions of the synthetic data, and observed 5-60% entity overlap and <20% path overlap. This suggests that GRADA also promotes the reasoning capabilities of downstream models for unseen concepts.

Generative Model Evaluation Results
ORACLE scores for the two variations of GraphGPT2 are presented in Table 3. The scores in the first column refer to the validation set performance of baseline models on the original datasets. These models are re-evaluated on the questions generated by GraphGPT2 (as described in Sec. 4.1). The largest improvement, i.e., 2.16% (p=0.068), is observed for SocialIQA, which may be attributed to its large dataset size. We see diminishing improvements for the low-resource scenarios, i.e., CODAH and HellaSwag-2K. We observe a similar trend when the synthetic questions are evaluated using NLG metrics (see Appendix). More importantly, since phrase-matching metrics are not ideal for NLG evaluation (Novikova et al., 2017), we also perform human evaluation to judge the quality of generation for SocialIQA and CQA, as we see significant improvements from structured GraphGPT2 vs. linearized GraphGPT2. We ask annotators on Amazon Mechanical Turk (AMT) to select the sentence which is more representative of the information encoded in the input graph, for 100 samples from the validation set. Questions generated from GraphGPT2 are preferred 46% and 53% of the time for SocialIQA and CQA respectively, compared to those from linearized inputs only, showing that the addition of the graph encoder improves the integration of knowledge in the generated text. We perform human evaluation (AMT) of the answerability of the generated questions/answers/distractors on 50 randomly selected samples from the filtered augmentation data (see Table 5). Annotators were provided with the question, answer and distractors, and asked to evaluate (a) if the question can be answered in a few words, (b) if the question can be answered by the given answer and (c) if the distractors are wrong answers for the question. More than 90% of the questions were judged as answerable, and 75-90% of the answers were judged as correct answers for the respective questions. The quality of distractors ranged from 50% for SocialIQA to 20-30% for the smaller datasets. However, the overall quality of distractors is high enough to benefit data augmentation. See examples in Table 7. We also perform human evaluation of the factuality of samples generated using our method GRADA and G-DAUG (Yang et al., 2020). We picked a randomly sampled set of 100 synthetic QA pairs from G-DAUG for the datasets CQA, CODAH and HellaSwag-2K. For a fair comparison, we collected 100 synthetic pairs from GRADA for the same datasets. We asked an annotator to evaluate whether each synthetic QA pair adheres to a plausible real-world scenario, and found that 56% of G-DAUG samples were judged as factual as compared to 68% of the GRADA samples (see examples in Table 6).

Upper Bounds
We ran experiments for augmentation with 20%, 40%, 60%, 80% and 100% of the training data from the original set (see Fig. 5). The improvement margins from the augmentation dataset are up to 4% at 20% of the original SocialIQA dataset. We see similar trends for CODAH, HellaSwag and ProtoQA, while the improvements for CQA were <1.5%.

Robustness Evaluation
We expect that data augmentation exposes the task model to diverse language and improves its robustness to semantic adversaries, in addition to boosting its performance on the target task. To evaluate this, we use the TextFooler system (Jin et al., 2020; Yang et al., 2020; Wei and Zou, 2019) to generate adversarial text by computing a word importance ranking and replacing the most influential words with their synonyms in the vector space. Overall, GRADA benefits the robustness of task models and improves their failure rate by 1-3% (see Table 9).

Semantic Analysis of OptionGAN
As outlined in Sec. 1, we use OptionGAN to provide better adversarial choices for the synthetic questions generated using GraphGPT2. In this section, we perform a qualitative analysis of the adversarial choices generated with and without OptionGAN in order to define the scenarios where OptionGAN provides more effective choices. First, from the analysis of a few examples, we find that OptionGAN improves the synthetic QA examples in two main ways:
• It provides wrong choices rather than the equally-correct choices (false negatives) provided by non-adversarial choice generation (see example A in Table 8).
• It provides wrong choices that require more complex reasoning (i.e., harder true negatives) than the ones provided by the non-adversarial choice generator (see example B in Table 8).
We conducted a larger human evaluation study of 50 randomly picked synthetic samples from the four synthetic QA datasets (SocialIQA, HellaSwag-2K, CQA, CODAH) and compared the synthetic adversarial choices generated with and without OptionGAN. We observe that, in 30.7% of the cases, OptionGAN choices were more adversarial than the ones generated without OptionGAN. Within those examples, nearly 60% of the samples fell into the first category and 30% into the second category, as described above. However, the improvements from OptionGAN are limited in the downstream task, suggesting that 'harder negatives' are required for effective training.

Conclusion
We present GRADA, a graph-based data augmentation framework for commonsense reasoning QA datasets. We train a graph-to-text question generator and a GAN-based adversarial choice generator for creating synthetic data samples, which are used to augment the original datasets. GRADA promotes factuality in synthetic samples and improves results on five downstream datasets.

Ethical Considerations
The usage of pretrained generative models in any downstream application requires careful consideration of the real-world impact of generated text.
In our approach, we provide concrete inputs for grounding the generated text to specific entities and relations which encode real-world facts, thereby reducing the possibility of propagating unintended stereotypical and social biases embedded within the pretrained models. However, since these entities and relations are derived from existing knowledge bases like ConceptNet (Speer et al., 2017), there is potential for transfer of bias present in these resources to the generated texts. Additionally, the graph-to-text generative models in GRADA pose the same risk as other data-to-text generative models (Ribeiro et al., 2020; Hoyle et al., 2020; Mager et al., 2020), i.e., the models can be made to generate incorrect facts by providing incorrect data as input. Therefore, we recommend restricting the use of GRADA to low-risk, unbiased graph inputs.

A Experiment Setup
Datasets: SocialIQA (Sap et al., 2019b) and CommonsenseQA (Talmor et al., 2019) are popular datasets based on knowledge graphs, making them a suitable choice for testing our approach. SocialIQA is a multiple-choice question answering dataset; each sample consists of a context, a question and three answer choices. CommonsenseQA is also a multiple-choice QA dataset, wherein each sample consists of a context and five answer choices. Of those five choices, three are taken from ConceptNet and the other two are authored by annotators. We only use the human-authored incorrect choices to train our adversarial choice generator OptionGAN. The ATOMIC knowledge graph contains 24K base events and 877K tuples describing a variety of social scenarios. We use the 710K training split introduced in Bosselut et al. (2019) to randomly sample 100K tuples as the seed sub-graphs for the generation of synthetic data for SocialIQA. For CommonsenseQA, we use the entire ConceptNet knowledge graph, subject to pruning as outlined in Talmor et al. (2019), to sample seed tuples for synthetic dataset generation. For SocialIQA, CQA, CODAH and HellaSwag-2K, we use simple accuracy for model evaluation.
ProtoQA (Boratko et al., 2020) is a generative QA dataset which is evaluated using 7 different metrics. We report the first 3 metrics, i.e., Max Answers 1/3/5. For tables showing only one number for ProtoQA, such as the ablation table in the main text, we report the Max Answers 1 metric. In order to train T5-3B for ProtoQA, we concatenate the ranked choices for each question and finetune the model for conditional generation of this concatenated string from the input question.
All of the above datasets are used for their intended purpose, i.e., research only, in our work. All of these datasets are in the English language.
Data Preparation: To prepare graph-to-text datasets for training GraphGPT2, we map the questions to multi-hop paths in ConceptNet (Bauer et al., 2018). We use SpaCy to tag the questions with part-of-speech labels and extract verbs and nouns as concepts, retaining those that appear in ConceptNet as entities. For SocialIQA, we map the questions to a combination of ATOMIC and ConceptNet. ATOMIC events contain nouns and verbs which are representative of the social scenario being described in the event and are further extended in the context by SocialIQA annotators (see Table 6). We tokenize and stem the events and contexts to extract these representative words, and compute the percentage of overlapping words in the context with respect to each event. The event with maximum overlap with the context is selected as the corresponding ATOMIC subject. The ATOMIC relation is selected from the predefined map of ATOMIC relations to SocialIQA questions. In this way, we recover the ATOMIC alignments of 20,000 samples from the training set of SocialIQA with 88% accuracy. We introduce additional tokens to the vocabulary of GPT2 in order to represent the set of relations present in the knowledge graph. Multi-word entities are embedded using the average of embeddings across their individual tokens.
Synthetic Data Generation: In order to prepare synthetic datasets, we create a dataset of unseen input graphs by mutating the graphs from the training sets of the graph-to-text datasets (see the sketch below). One or two entities are replaced by a randomly selected entity (or relation-entity pair) with similar adjacency to other entities in the input graph, to create a mutated graph. The synthetic dataset size (pre-filtering) is 100k/50k/10k/10k/50k for SocialIQA, CQA, HellaSwag-2K, CODAH, and ProtoQA respectively. For the generation of synthetic data, we use the sets of tuples from ATOMIC and ConceptNet that do not appear in the SocialIQA and CommonsenseQA datasets respectively. To prepare the synthetic dataset for CommonsenseQA, we select two adversarial choices from ConceptNet and two choices generated by OptionGAN. For ProtoQA, we find accurate answers by generating 30 samples of answers for each synthetic question, ranking the answer choices by frequency and retaining the ones that appear at least 5 times in the 30 samples. After this, the synthetic question and answer (a concatenation of the high-frequency answer choices) are subjected to filtering. A sample of the generated synthetic examples is included in Table 10.
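A sketch of the mutation step, using node degree as a crude proxy for "similar adjacency"; the exact criterion in GRADA may differ:

```python
# Replace one entity in the input graph with a random knowledge-graph
# entity of similar connectivity that is not already in the graph.
import random

def mutate(triples, kg_degree):
    graph_ents = {e for h, _, t in triples for e in (h, t)}
    h, r, t = random.choice(triples)
    # candidate replacements: entities with degree close to the original
    cands = [e for e, d in kg_degree.items()
             if abs(d - kg_degree.get(t, 0)) <= 1 and e not in graph_ents]
    if not cands:
        return triples
    new_t = random.choice(cands)
    return [(h2, r2, new_t if t2 == t else t2) for h2, r2, t2 in triples]

kg_degree = {"soft": 3, "people": 4, "pillow": 3, "mattress": 5, "bed": 4}
print(mutate([("mattress", "RelatedTo", "people")], kg_degree))
```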

A.1 Filtering and Selection of Samples
Despite the careful construction of synthetic samples using knowledge graphs, the pool of synthetic samples can be noisy and may consist of incoherent text, incorrect question-answer pairs or out-of-distribution samples. Hence, we compare the effect of three different methods for filtering samples on downstream task performance.
Question Answering Probability (QAP). The QAP score $\mu$ (Zhang and Bansal, 2019) is the prediction probability of the true class by a model with parameters $\theta$ which has been trained on the original dataset, i.e., $\mu_i = p_\theta(y_i^* \mid x_i)$. Samples with low prediction probabilities for the correct choices are either annotated incorrectly or are especially difficult instances for the model. We define low and high thresholds for the QAP filter, and samples lying within this range are retained in the dataset.

Model Confidence and Variability. Swayamdipta et al. (2020) propose the model confidence ($\hat{\mu}_i$) and variability ($\hat{\sigma}_i$) measures to identify the effect of data samples on the model's generalization error. Specifically, $\hat{\mu}_i = \frac{1}{E}\sum_{e=1}^{E} p_{\theta^{(e)}}(y_i^* \mid x_i)$, and $\hat{\sigma}_i$ is the standard deviation of $p_{\theta^{(e)}}(y_i^* \mid x_i)$ across epochs, where $E$ is the number of training epochs and $\theta^{(e)}$ are the model parameters at epoch $e$. They find that ambiguous samples, i.e., those with high variability and mid-range confidence, contribute the most to test performance on the downstream task. Following this, we define low and high thresholds for both confidence and variability in order to find the most informative samples.
Energy. Liu et al. (2020) show that the energy score can be reliably used for distinguishing between in- and out-of-distribution (OOD) samples, compared to the traditional approach of using softmax scores. We introduce an energy threshold to select samples which are out-of-distribution, i.e., $E_i = -\log \sum_{j=1}^{C} e^{p_\theta(y_i^j \mid x_i)}$, where $C$ is the number of choices in the QA sample, and measure the effect of using OOD samples as augmentation data.
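A sketch of how these two alternative filter scores can be computed, following the definitions above; the input layouts are assumptions and the selection thresholds are omitted:

```python
# epoch_probs[e][i] holds p_theta(y_i* | x_i) at the end of epoch e;
# logits holds the task model's per-choice scores for each sample.
import numpy as np

def confidence_variability(epoch_probs):
    p = np.asarray(epoch_probs)           # shape: (num_epochs, num_samples)
    return p.mean(axis=0), p.std(axis=0)  # mu_i (confidence), sigma_i (variability)

def energy(logits):
    # E_i = -log sum_j exp(s_ij) over the C answer choices,
    # computed as a numerically stable log-sum-exp
    s = np.asarray(logits)                # shape: (num_samples, num_choices)
    m = s.max(axis=1, keepdims=True)
    return -(m.squeeze(1) + np.log(np.exp(s - m).sum(axis=1)))
```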

A.2 Training Details
Baselines: We use pretrained RoBERTa-LARGE (Liu et al., 2019) for the multiple-choice datasets and T5-3B (Raffel et al., 2020) for ProtoQA as the task models. The baseline task model is finetuned on the original datasets with no data augmentation, and is used as the scoring model for filtering. We use GPT2-MEDIUM for GraphGPT2, and GPT2-SMALL as the pretrained generator and discriminator for OptionGAN. For GRADA, the model is first finetuned on synthetic samples using label smoothing (Szegedy et al., 2016) and then on the original dataset. We refer the reader to Koncel-Kedziorski et al. (2019) for full implementation details of the Graph Encoder.

B.1 Generative Model Evaluation
As shown in Table 12, we see small improvements in BLEU-4 and METEOR, but larger improvements in the other metrics from GraphGPT2, i.e., 3.07% (p=0.027) and 2.87% (p=0.035) in CIDEr, and 2.71% (p=0.042) and 1.58% (p=0.056) in BERTScore, for SocialIQA and CQA respectively. The phrase-matching metric scores are low for CQA, which may be attributed to its small sample size. However, BERTScore for CQA lies between 85-88%, showing that the model manages to convey a similar meaning as the human-annotated context, albeit with different words. More importantly, since phrase-matching metrics are not ideal for NLG evaluation (Novikova et al., 2017), we also perform human evaluation to judge the quality of generation for SocialIQA and CommonsenseQA, as we see significant improvements from structured GraphGPT2 vs. linearized GraphGPT2. We ask annotators on Amazon Mechanical Turk (located in the United States, HIT approval rate >98%, number of HITs approved >10K) to select the sentence which is more representative of the information encoded in the input graph, for 100 samples from the validation set. Results are shown in Table 13. Samples generated from structured input are selected significantly more times than those from linearized inputs, for both SocialIQA and CQA, showing that the addition of a graph encoder improves the representation of knowledge in the generated sample.
Additionally, we perform human evaluation of the samples generated using GraphGPT2 and OptionGAN. We randomly select 50 samples from the filtered augmentation datasets for each of the five datasets, and ask 2 annotators to answer 3 yes/no questions about the quality of the question, answer and distractors respectively. We present results from the survey in Table 5. More than 90% of the questions in each dataset were judged as answerable, showing the effectiveness of GraphGPT2 as well as the QAP-based filtering method. Similarly, 75-90% of the answers were judged as correct answers for the respective questions. The quality of the distractors was relatively lower, ranging from 50% for larger datasets like SocialIQA to 20-30% for the rest of the datasets. The inter-annotator agreement was also low (<0.6) for distractor judgements, suggesting the general difficulty of both tasks: distractor generation and measurement of distractor quality. However, the overall quality of distractors in our datasets is high enough to benefit data augmentation.
For both human evaluation annotation tasks, it was made clear in the instructions that the data is being collected for research purposes only.

B.2 Comparison of Filtering Methods
Table 14 demonstrates the effect of the various filtering methods, i.e., QAP, Energy and Model Confidence/Variability. Results are shown on the validation sets of the commonsense reasoning datasets. We see the largest improvements using QAP as the filter. Similar improvements are seen with the confidence/variability scores; however, they require scores from multiple finetuned models from various training checkpoints.

B.3 Robustness Evaluation
We expect that data augmentation exposes the task model to diverse language and improves its robustness to semantic adversaries, in addition to boosting its performance on the target task. To evaluate this, we use the TextFooler system (Jin et al., 2020; Yang et al., 2020; Wei and Zou, 2019) to generate adversarial text by computing a word importance ranking and replacing the most influential words with their synonyms in the vector space. Failure rate is the % of examples for which TextFooler fails to change the original model prediction, and average perturbation ratio is the average % of words replaced when TextFooler succeeds at changing the prediction. We use our best GRADA models in comparison with the baseline models (Table 9). Overall, GRADA positively impacts the robustness of task models to TextFooler and improves the failure rate by >3% for CODAH and up to 1% for all other datasets. We observe similar trends for the perturbation ratios. This shows that GRADA improves the semantic robustness of the models. It is also worth noting that generative task models like T5-3B for ProtoQA are especially prone to adversarial attacks like TextFooler, with a mere 5-6% failure rate, and there needs to be more research towards improving their robustness.

B.4 Cartography Quality Evaluation
We use dataset cartography (Swayamdipta et al., 2020) to visualize the quality of our synthetic datasets. Samples in the top left of the figure are easy, while samples towards the bottom and right of the figure are difficult and ambiguous respectively. We can observe from the figure that the synthetic dataset for CQA (left) has a higher % of easy samples than HellaSwag-2K, suggesting that the quality of synthetic samples generated by GRADA improves with original dataset size. Moreover, when applying QAP filtering, using the entire synthetic dataset yields the largest improvements for CQA, whereas for HellaSwag-2K (right) the lower cutoff for QAP is 0.3, which filters out most of the samples present in the bottom part of the plot. This suggests that in low-resource scenarios it is important to remove inaccurate samples, while larger datasets benefit from ambiguous and inaccurate samples.

Figure 1: GRADA framework: The original dataset is used to train GraphGPT2, a graph-to-text question generator, and OptionGAN, a distractor generator. The synthetic dataset is subjected to filtering and used to train the model in combination with the original dataset.

Figure 2: GraphGPT2: The Graph Encoder is composed of L Transformer blocks and its output is concatenated with GPT2 embeddings for input to GPT2.

Figure 4: Example of synthetic context generated from GraphGPT2 for the CODAH dataset.
Table 6: Comparison of randomly generated synthetic data from G-DAUG (Yang et al., 2020) and knowledge-grounded synthetic data generated using GRADA (S=Subject, R=Relation, O=Object). Recoverable G-DAUG examples: "There was a large, cold bite of ice on my where?"; "He hated flying, the controls were what?"; "What is a square leg made of made out of?"; "What country does a cow go to make a milk run?". Recoverable GRADA examples: (S: PersonX provides __ for PersonY's children, R: xIntent, O: to be helpful) "Taylor provided meals for Kendall's children and they all enjoyed it greatly. Why did Taylor do this? [A] to be a bad friend [B] to be helpful [C] to be rude"; (S: weasel, R: AtLocation, O: mafia organization) "The man was a weasel, he was part of a powerful what? [A] out of doors [B] terrarium [C] mafia organization [D] farmyard [E] backyard".
Table 7 (contents):
High-quality synthetic samples
- SIQA: Riley provided help to the community through his many charity events over the years. How would Others feel as a result? [A] selfish [B] appreciative [C] bored
- CQA: When a child is upset by something, what may they do? [A] fall down [B] wish to fly [C] start crying [D] play tag [E] boy or girl
- PQA: Name something you worry you're still doing when you're not supposed to. (drinking, smoking, sleeping, working, using cell phone)
Low-quality synthetic samples
- SIQA: Tracy raised her arm to her face to cover her eyes during the scary movie. What does Tracy need to do before this? [A] scared [B] be scared of the movie [C] to have a fundraiser
- CQA: What will you do if you want to go public? [A] prepare for worst [B] tell family first [C] own private company [D] telegram [E] charming
- PQA: Name a family tradition that has deep roots in the dialect of suzh. (cooking, caroling, knitting, hunting, fishing)

Figure 5: % improvement in accuracy over baseline with different % of the original dataset. The baseline is RoBERTa finetuned on the same % of the original dataset.

Table 1: Results on test sets of the commonsense datasets, with comparative results from other approaches taken from leaderboards. *We use T5-3B for the ProtoQA baseline and GRADA results, and RoBERTa for all other datasets.

Table 2: Ablation results on the validation sets of the commonsense reasoning datasets. *We use sample perplexity for filtering ProtoQA samples.

Table 5: Results from human evaluation of generated questions, answers and distractors.

Table 7: High and low quality synthetic samples generated through GRADA for SIQA, CQA and ProtoQA (PQA), ranked using QAP scores (and perplexity for PQA). Labels are marked in green.

Table 8: Examples of two scenarios where OptionGAN adversarial choices improve the synthetic question. Answers are marked in green. OptionGAN adversarial choices are marked in blue.

Table 9: Robustness evaluation. Failure rate / perturbation ratio (higher is better) from the TextFooler experiments, shown on the development sets.

Table 12: Comparison of performance for GPT2 and GraphGPT2 on development sets.

Table 13: Results from comparative human evaluation of generated questions. Wins and losses refer to the % of times the synthetic question generated from structured graph input was chosen over the one from the linearized graph.


Table 14: Ablation results on the validation sets of the commonsense reasoning datasets using various filtering methods. *We use sample perplexity for filtering ProtoQA samples.