Leveraging QA Datasets to Improve Generative Data Augmentation

The ability of generative language models (GLMs) to generate text has improved considerably in the last few years, enabling their use for generative data augmentation. In this work, we propose CONDA, an approach to further improve GLMs' ability to generate synthetic data by reformulating data generation as context generation for a given question-answer (QA) pair and leveraging QA datasets for training context generators. Then, we cast downstream tasks into the same question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which are in turn used as synthetic training data for their corresponding tasks. We perform extensive experiments on multiple classification datasets and demonstrate substantial improvements in performance for both few- and zero-shot settings. Our analysis reveals that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.


Introduction
Recent advances in NLP have substantially improved the capability of pretrained language models to generate high-quality text (Radford and Narasimhan, 2018; Radford et al., 2019; Lewis et al., 2020; Brown et al., 2020). Various approaches (e.g., Kumar et al., 2020; Anaby-Tavor et al., 2020; Mekala et al., 2021) leverage this capability for generative data augmentation. This process usually involves first fine-tuning the GLM on training samples prepended with their target label and then generating synthetic data by prompting the GLM with a given target label. However, it is not evident that the model parameters learnt during pretraining or fine-tuning should support data generation using such unintuitive formulations with label tokens as prompts: in low data regimes, fine-tuning can be unstable (Devlin et al., 2019) and relies on the pretrained parameters being reasonably well-suited for the target task (Phang et al., 2018). Therefore, for target domains that differ from the pretraining domain, such formulations may result in poor quality generation (Feng et al., 2020).

* Jingbo Shang is the corresponding author.
To address this challenge, we propose CONDA, an approach to leverage existing QA datasets for training Context generators to improve generative Data Augmentation. We propose to use a question answering (QA) formulation as a consistent format to prompt GLMs for synthetic data: we use QA datasets for training GLMs to be context generators for a given question and answer.
As illustrated in Figure 2, our method consists of two steps. The first step is QAC fine-tuning, where we fine-tune a pretrained language model on a QA dataset to obtain a general context generator that is capable of generating contexts for given questions and answers. To this end, we view the QA dataset in question-answer-context format instead of the context-question-answer format used to solve QA tasks (Radford and Narasimhan, 2018; Radford et al., 2019; Raffel et al., 2020). The second step adapts the general context generator to the target domain. Inspired by recent work on converting several NLP tasks into a common format (McCann et al., 2018; Raffel et al., 2020), we cast the target tasks into a question-answer schema. For example, as shown in Figure 1, topic classification and sentiment analysis data can be cast into the question-answer-context format with the respective label as answer and the text as context. We adapt the context generator to the target task domain by further training it on the target task's few-shot supervision, resulting in a target-task context generator. Finally, we generate synthetic training data for the target task by generating contexts for questions and answers pertaining to the respective dataset. Then, we add the generated samples to the few-shot supervision and train a target task model on the augmented data.
We perform extensive experiments on multiple sentiment analysis and topic classification datasets with several abstractive, extractive, and common-sense reasoning QA datasets. Through rigorous experiments and thorough analysis, we observe that QA datasets that require high-level reasoning abilities, such as abstractive and common-sense QA datasets, are best suited for generating high-quality data.
Our contributions are summarized as follows:
• We propose to use QA datasets for training generative language models to be context generators for a given question and answer.
• We formulate various classification tasks into a QA format and model synthetic training data generation for these tasks as context generation.
• We perform experiments on multiple sentiment analysis and topic classification datasets to demonstrate the effectiveness of our method in zero- and few-shot settings.
• We release the code on GitHub1.

Related Work
Data Augmentation Wei and Zou (2019) propose a simple data augmentation method using synonym replacement, random insertion, random swap, and random deletion. Sennrich et al. (2016) augment samples by translating them into a foreign language and then back to English. Du et al. (2021) compute task-specific query embeddings to retrieve sentences from unlabeled documents from the Internet. With the rise of pretrained generative language models, the generation capabilities of these models have been explored to generate synthetic data. Anaby-Tavor et al. (2020), Kumar et al. (2020), Schick and Schütze (2021b), and Mekala et al. (2021) generate labeled documents using GLMs, and Yang et al. (2020) do so specifically for commonsense reasoning. Puri et al. (2020) use GLMs to synthesize questions and answers and improve performance on question answering. Vu et al. (2021) generate data for NLI tasks.
Few-shot Learning Our work is closely related to few-shot learning, as we take a few annotated samples as supervision. The idea of formulating classification as a prompting task is becoming increasingly popular. Brown et al. (2020) introduce a new paradigm called in-context learning to infer from large language models using few annotated samples. Schick and Schütze (2021a) formulate input samples as cloze-style phrases and assign pseudo-labels that are used for training the classifier, and Tam et al. (2021) improve their approach further without using any task-specific unlabeled data. McCann et al. (2018) and Raffel et al. (2020) format several NLP tasks into a question-answer and text-to-text schema, respectively. Lin et al. (2021) train multilingual autoregressive language models to enable few-shot learning in multiple languages. Gao et al. (2021) propose to generate prompts and convert smaller pretrained language models into few-shot learners. Other work proposes to pretrain prompts by adding soft prompts into the pretraining stage (Gu et al., 2022; Vu et al., 2022b,a).
Language Model Fine-Tuning Pre-trained language models are applied to downstream tasks by fine-tuning them using task-specific objectives (Howard and Ruder, 2018). However, this process requires significant annotated downstream task data (Yogatama et al., 2019). Many methods have been proposed to address this challenge. Gururangan et al. (2020) propose to continue training on unlabeled data from the target task domain. Our approach differs in that we fine-tune the GLM on QA datasets cast as context generation, and we use the fine-tuned GLM to generate synthetic data instead of training it directly for the downstream tasks. It also differs from Vu et al. (2021) in terms of the generated data: they consider NLI as an auxiliary task, generate synthetic target-domain samples for the NLI task irrespective of the target task, and perform intermediate-task fine-tuning. CONDA formats target tasks into a question-answer format and directly generates samples relevant to the target task.
CONDA: QA Datasets for Generative Data Augmentation

In this section, we describe the problem statement and explain our method, including QAC fine-tuning, target-domain adaptation, and synthetic training data generation.

Problem Formulation
For a given task T, the input in a few-shot setting contains a very small labeled dataset L_T = {(D_1, l_1), (D_2, l_2), ..., (D_|L_T|, l_|L_T|)} and m target classes C = {C_1, C_2, ..., C_m}. Our method requires users to provide a question per dataset that is representative of the task to be solved. Our aim is to build a model for the task T that assigns a label C_j ∈ C to each document D.

QAC Fine-tuning
We consider question-answering datasets Q containing triplets (q, a, c) of a question q, the corresponding answer a, and a context c required to derive the correct answer. Question-answering datasets can roughly be divided into extractive (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017; Reddy et al., 2019) and abstractive datasets (Kočiský et al., 2018; Huang et al., 2019; Xiong et al., 2019; Sap et al., 2019). For extractive QA datasets, the answer can be found as a contiguous span in the context, whereas in abstractive QA datasets, the answer needs to be generated in natural language without being able to rely on the vocabulary of the question or context. We transform the QA dataset Q into training data D_QAC for fine-tuning the GLM. To this end, each triplet (q, a, c) is converted into a single text by prepending "question:", "answer:", and "context:", respectively, and concatenating q, a, and c separated by newlines. For example, a preprocessed training document in D_QAC from an extractive QA dataset might look as follows:

question: when did battle of plassey happen?
answer: 23 june 1757
context: the battle of plassey was a decisive victory of the british east india company over the nawab of bengal and his french allies on 23 june 1757.
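The triplet-to-text conversion described above can be sketched in a few lines of Python (a minimal illustration of the format; the function name `qac_to_text` is ours):

```python
def qac_to_text(question: str, answer: str, context: str) -> str:
    """Convert a (q, a, c) triplet into the QAC fine-tuning format:
    each field is prefixed with its tag and fields are newline-separated."""
    return "\n".join([
        f"question: {question}",
        f"answer: {answer}",
        f"context: {context}",
    ])

example = qac_to_text(
    "when did battle of plassey happen?",
    "23 june 1757",
    "the battle of plassey was a decisive victory of the british east india "
    "company over the nawab of bengal and his french allies on 23 june 1757.",
)
print(example)
```

Each such string becomes one training document in D_QAC.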
We fine-tune a pretrained GLM G on D_QAC to obtain a general context generator G_Q, using a language modeling objective to maximize the log-likelihood of the (q, a, c) triplet. The general context generator G_Q is capable of generating contexts for given questions and answers.

Domain Adaptation and Synthetic Training Data Generation
We adapt G_Q to the target domain by fine-tuning it further on the available few-shot data. To preserve its context-generating ability, we perform QAC fine-tuning instead of regular language model fine-tuning. This is enabled by transforming the few-shot supervision into our question-answer-context format. First, we manually design one question per dataset that is representative of the task and the dataset. Furthermore, following Schick and Schütze (2021a), we define a verbalizer as a mapping v: C → V that maps each label in C to a word from G_Q's vocabulary V. Finally, for every document D_i and its respective label l_i in our few-shot data, we consider the verbalizer mapping of the label, v(l_i), as the answer and the text D_i as the context. For example, a negative review "I hate this movie" from the IMDb dataset (Maas et al., 2011) is transformed as follows:

question: is the movie good or bad?
answer: bad
context: i hate this movie.
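The few-shot conversion can likewise be sketched (the verbalizer entries below are illustrative stand-ins; the actual questions and verbalized labels for each dataset are listed in Table 2):

```python
# Hypothetical verbalizer for a binary sentiment task.
verbalizer = {"positive": "good", "negative": "bad"}
question = "is the movie good or bad?"

def fewshot_to_qac(document: str, label: str) -> str:
    """Cast a labeled few-shot example into question-answer-context form,
    using the verbalized label as the answer and the text as the context."""
    return (f"question: {question}\n"
            f"answer: {verbalizer[label]}\n"
            f"context: {document}")

print(fewshot_to_qac("i hate this movie.", "negative"))
# question: is the movie good or bad?
# answer: bad
# context: i hate this movie.
```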
We fine-tune G_Q on the converted few-shot data to obtain a target-task context generator G_T.
Synthetic Training Data Generation Recall that our method requires a question q for every dataset that is representative of the task to be solved. To obtain synthetic training data, for every distinct label C_j, we create a question-answer prompt with q as the question and v(C_j) as the answer, and let G_T generate the context c_gen. The generated context c_gen and label C_j are considered a synthetic training sample. We repeat this process multiple times to generate n samples that we collect in a synthetic training dataset denoted by D_gen.
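The generation loop can be sketched as follows (a schematic: `generate_fn` is a hypothetical stand-in for sampling a continuation from the fine-tuned GLM G_T):

```python
def make_prompt(question: str, verbalized_label: str) -> str:
    """Build a question-answer prompt; the GLM completes the context."""
    return f"question: {question}\nanswer: {verbalized_label}\ncontext:"

def generate_synthetic_dataset(labels, verbalizer, question, generate_fn, n):
    """Collect n synthetic (context, label) pairs per label by prompting
    the target-task context generator with each verbalized label."""
    d_gen = []
    for label in labels:
        prompt = make_prompt(question, verbalizer[label])
        for _ in range(n):
            c_gen = generate_fn(prompt)  # sampled context continuation
            d_gen.append((c_gen, label))
    return d_gen
```

The resulting pairs form D_gen, which is then combined with the few-shot set L_T to train the classifier.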
As a final step, we train the target task model on the combination of D_gen and our original few-shot dataset L_T. We use this trained target-task model for inference.

Experiments
In this section, we evaluate our method against several data augmentation and few-shot methods on sentiment analysis and text classification tasks.

QA Datasets
We consider several extractive, abstractive, and common-sense QA datasets; common-sense QA datasets are also abstractive datasets that require common-sense reasoning to answer the questions. The QA dataset statistics are provided in Table 1. The details of these datasets are as follows:
• SQuAD (Rajpurkar et al., 2016, 2018) is a collection of questions and answers based on Wikipedia articles.
• NewsQA (Trischler et al., 2017)

Target Task Datasets
We evaluate our method on six English text classification datasets. In particular, we consider three sentiment analysis datasets: IMDb reviews (Maas et al., 2011), Yelp2, and SST-2 (Socher et al., 2013), as well as three topic classification datasets: Yahoo (Zhang et al., 2015), The New York Times3 (NYT), and AGNews (Zhang et al., 2015). The dataset-representative questions and the respective verbalized labels of the target task datasets are given in Table 2. We follow and adapt McCann et al. (2018) for the questions of the sentiment analysis datasets. The question for topic classification is intuitive and straightforward. More details about the datasets can be found in Appendix A.1.

Compared Methods
We compare with a wide range of data augmentation and intermediate-task fine-tuning (ITFT) methods described below:
• BERT-FT trains the BERT-base-uncased classifier (Devlin et al., 2019) on the few-shot supervision.
• ITFT-X (Phang et al., 2018) first trains a model on dataset X and fine-tunes it further on the target task. We compare with ITFT-MNLI and ITFT-SQuAD, fine-tuned intermediately on the MNLI (Williams et al., 2018) and SQuAD datasets, respectively.
• BackTranslation (Sennrich et al., 2016) augments samples by translating them into a non-English language and translating them back to English. We translate them to French, Spanish, and Portuguese, thereby augmenting three synthetic samples for every sample.
• PEGASUS (Zhang et al., 2019) is a state-of-the-art paraphrasing model. We paraphrase the input text, consider the paraphrase a synthetic sample, and add it to the training set.
• EDA (Wei and Zou, 2019) generates synthetic samples by synonym replacement, random insertion, random swap, and random deletion, and adds them to the training set.
• LAMBADA (Anaby-Tavor et al., 2020) fine-tunes a GLM on few-shot supervision prepended with the target labels and then generates synthetic data by prompting the GLM with a given target label.
We denote our method as CONDA, which includes QAC fine-tuning, domain adaptation, synthetic sample generation, and training the target task classifier. CONDA-X denotes that the QAC fine-tuning of the GLM is performed on QA dataset X. We also compare with CONDA\QA, where we perform no QAC fine-tuning and directly fine-tune the GLM on the target dataset.

Experiment Settings
We consider two low-data regimes: few-shot and zero-shot. We use 8 annotated samples per label in the few-shot setting. In the zero-shot setting, we skip the domain adaptation step, use G_Q directly for synthetic training data generation, and train the target task model only on the generated synthetic training data. We use GPT2-Medium (Radford et al., 2019) as our GLM and fine-tune it for 3 epochs in the QAC fine-tuning and domain adaptation steps. While generating synthetic training samples, we use top-k sampling with k = 20, a maximum length of 200 tokens, and generate n = 450 synthetic samples per label. We use BERT-base-uncased (Devlin et al., 2019) as the target task classifier. We feed the [CLS] representation into the classification head and train all parameters on the downstream target tasks. Following Devlin et al. (2019), we fix the number of epochs of target task BERT classifier training to 4 unless mentioned otherwise. We perform 3 random restarts and report the mean and standard deviation.4 We use the Transformers library (Wolf et al., 2020) and NVIDIA RTX A6000 GPUs for our experiments.
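The generation settings above can be expressed as the keyword arguments one would pass to a Hugging Face `model.generate` call (a config sketch; the argument names assume the transformers API):

```python
# Sampling configuration matching the settings reported in the paper.
generation_kwargs = dict(
    do_sample=True,   # sample rather than greedy decode
    top_k=20,         # top-k sampling with k = 20
    max_length=200,   # cap generated contexts at 200 tokens
)
n_per_label = 450     # synthetic samples generated per label
```

With a loaded GPT2 model, one would call `model.generate(input_ids, **generation_kwargs)` repeatedly per label prompt.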
To enable a fair comparison, we generate the same number of samples per label as CONDA (i.e., 450) for all data augmentation baselines. We use BERT-base-uncased as the target task classifier for all baselines. In the zero-shot setting, CONDA\QA amounts to a pre-trained GPT2. While training the target task classifier, since baselines like BERT-FT and ITFT see a different number of training samples than the data augmentation baselines and our method CONDA, we set the number of epochs for all baselines such that the number of update steps remains the same for a fair comparison.
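The epoch-matching setup can be made concrete with illustrative numbers (the paper does not give an exact formula; this is one plausible reading in which update steps scale with samples × epochs at a fixed batch size):

```python
def matched_epochs(n_samples_baseline: int, n_samples_conda: int,
                   conda_epochs: int = 4) -> int:
    """Scale a baseline's epochs so its total update steps match those of
    the classifier trained on CONDA's augmented data (illustrative
    heuristic; assumes identical batch size for both runs)."""
    total_steps = n_samples_conda * conda_epochs  # steps ∝ samples × epochs
    return max(1, round(total_steps / n_samples_baseline))

# Example: 2 labels, 8 few-shot samples per label, 450 generated per label.
few_shot = 2 * 8
augmented = few_shot + 2 * 450
print(matched_epochs(few_shot, augmented))  # → 229
```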

Results and Discussion
Results for the few- and zero-shot settings are shown in Table 3 and Table 4, respectively, using Micro- and Macro-F1 as evaluation metrics. We discuss the effectiveness of our method below.
CONDA vs Baselines. In the few-shot setting, CONDA with abstractive and common-sense based datasets outperforms all baselines for most of the datasets, beating the best baseline in five out of six cases. CONDA performs better than BERT-FT on all datasets, achieving up to 14% improvement on SST-2. Although ITFT performs better than vanilla fine-tuning, CONDA demonstrates better performance than ITFT on all datasets. For example, CONDA-TweetQA shows 11% improvement over ITFT-SQuAD on AGNews. CONDA demonstrates higher performance than the data augmentation baselines on all datasets except NYT. The comparison between CONDA and LAMBADA shows that our QA-formulation prompt is more intuitive and informative than just the target label. We attribute the superior performance of CONDA to the context-generating ability acquired during QAC fine-tuning, which is efficiently leveraged by generating synthetic samples that are added to the training set.
Abstractive vs Extractive QA Datasets. We observe that the performance of CONDA with abstractive QA datasets is significantly better than CONDA with extractive QA datasets in both few-shot and zero-shot settings. For example, CONDA-TweetQA has an improvement of more than 20% over CONDA-SQuAD on IMDb in the few-shot setting. We surmise that this is because of the intrinsic nature of extractive QA datasets (i.e., the answer always being present in the context as a contiguous span). We observe that GLMs fine-tuned on an extractive QA dataset retain the ability to generate contexts that encompass the answer. Note that, while generating synthetic training samples, the answer in the prompt is the respective topic. For example, out of 500 samples generated by CONDA-SQuAD for the Yelp dataset, 213 contain at least one occurrence of the corresponding verbalized label, whereas it is only 73 for CONDA-CosmosQA. Thus, many generated synthetic samples contain their corresponding label in the text. Therefore, a classifier trained on synthetic samples that have their corresponding labels in the text easily overfits on the label tokens and does not generalize well to unseen test data.
Comparison with CONDA\QA. CONDA with abstractive QA datasets performs better than CONDA\QA in both few-shot and zero-shot settings, attaining improvements of up to 40% and 35%, respectively, in macro-F1 on SST-2. This demonstrates that the context-generating abilities are learnt and reinforced during QAC fine-tuning on QA datasets, which is efficiently utilized by generating synthetic samples.
Zero-shot Performance. The zero-shot performance of CONDA follows a similar trend as the few-shot performance: abstractive and common-sense reasoning QA datasets lead to better performance than extractive datasets and no QAC fine-tuning.

Ablation Study
To understand the impact of domain adaptation and few-shot samples, we compare CONDA with two ablated versions in Table 5: (1) CONDA-DA represents our method without domain adaptation (i.e., generating synthetic data using G_Q and training the classifier on the combined few-shot supervision and synthetic data generated by G_Q); (2) CONDA-Few Shot represents the classifier trained only on the samples generated by G_T. We also present the results of our complete pipeline for reference. CONDA performs better than CONDA-Few Shot in most cases, demonstrating the importance of including few-shot samples in the training set for the classifier. The comparison between CONDA and CONDA-DA suggests that fine-tuning the language model further on the target dataset helps in some scenarios but does not always improve performance. This is in line with previous research findings (Du et al., 2021; Vu et al., 2021; Pryzant et al., 2022). We conjecture that domain adaptation is important when the structure of the target task dataset is very different from the QA dataset. For example, domain adaptation helps most of the QA datasets on the SST-2 dataset because the text in SST-2 is a single sentence, whereas most of the QA datasets have paragraphs as contexts. Moreover, it also depends on the number of samples the language model is fine-tuned on during domain adaptation. We observe that the higher the number of samples, the more positive their impact. For example, the number of few-shot samples is the highest in Yahoo compared to other datasets, and domain adaptation positively contributes to the performance on Yahoo for all QA datasets.

Table 6: Few-Shot Evaluation Results with GPT2-Large as GLM (-L denotes GPT2-Large). Macro-F1 is used as the evaluation metric. All results of CONDA-L that perform better than CONDA\QA-L are in bold.

Larger Generative Language Models
Experimental results with GPT2-Large as the GLM are shown in Table 6. We observe that the relative performance trend remains the same as for GPT2-Medium, i.e., CONDA with abstractive datasets performs better than CONDA with extractive datasets and CONDA\QA-L. This indicates that QAC fine-tuning improves the performance of generative data augmentation with larger GLMs as well.

Performance vs No. of Generated Samples
We fix the few-shot supervision size to 8 samples per label, vary the number of generated samples per label, and plot the performance of CONDA-TweetQA, CONDA-SocialIQA, and baselines such as LAMBADA and EDA on the AGNews and IMDb datasets in Figure 3. We repeat each setting with three different seeds and plot the mean performance. We observe that the performance increases and then plateaus. This shows that synthetic training data can give a substantial boost to the few-shot training data, minimizing the human effort in manual annotation; however, it cannot replace the original training data completely, as improving beyond a certain point requires more human-annotated data.

Performance vs Few-shot supervision Size
We fix the number of generated samples to 450 per label, vary the number of annotated samples, and plot the performance of CONDA-CosmosQA and CONDA-SocialIQA on the SST-2 and Yahoo datasets in Figure 4. We also plot the performance of baselines such as BERT-FT, EDA, and BackTranslation for comparison. We repeat each experiment with three random seeds and plot the mean performance. We observe that the performance of CONDA increases with the size of supervision and that the improvement over baselines in the low-data regime is substantial. For example, with only 4 annotated samples per label on the Yahoo dataset, the macro-F1 of CONDA-CosmosQA outperforms BERT-FT by 22% and EDA by 15%. However, we also observe that the performance gap between CONDA and the baselines decreases with increasing supervision size and eventually stagnates. As the size of supervision increases, the supervision by itself is sufficient for high performance, thus reducing the performance boost due to synthetic training data.

Self-Training
We perform an experiment to demonstrate that the performance can be further improved through self-training when in-domain unlabeled samples are provided. In-domain unlabeled samples are often easily available in real-world scenarios. Self-training is a commonly used approach to bootstrap the classifier on unlabeled samples (Mekala and Shang, 2020; Mekala et al., 2020; Vu et al., 2021). Following Vu et al. (2021), we obtain pseudo-labels by predicting on unlabeled samples using the trained classifier and train the classifier further on the available labeled and pseudo-labeled data. We consider the training set without ground truth labels as unlabeled data and experiment on the SST-2, NYT, and AGNews datasets. We repeat this process for 3 iterations without any filtering of pseudo-labels. From the results in Table 8, we observe a significant performance improvement of up to 4 points with self-training. It is noteworthy that this improvement is consistent for both the GPT2-Medium and GPT2-Large models.

Synthetic Data Adds Value
Unsupervised language model pre-training (LMPT) on target-task unlabeled data can improve performance (Gururangan et al., 2020). We consider the training set without ground truth labels as unlabeled data for LMPT and present a comparison in the few-shot setting in Table 7. We observe that CONDA performs better than LMPT, demonstrating the quality and importance of the generated synthetic data.

Case study: Evaluating Context Generator
We hypothesize that our method results in high-quality context generators that are capable of generating contexts for a given question and answer.

Conclusion
In this paper, we propose to train generative language models to be context generators for a given question and answer. To facilitate this, we use question answering as a common format and utilize QA datasets for training generative language models into context generators.

Limitations
One limitation of our approach is that the generated synthetic training data can boost performance only up to a point, beyond which more annotated samples are required. Thus, the generated synthetic training data cannot replace the training data altogether, but it can significantly reduce the annotation effort. Moreover, some tasks, such as NER, are challenging to cast into a question-answering format, which hinders generating synthetic data using our method.

A Appendix
A.1 Target Task Datasets

The details of the target task datasets are as follows:
• IMDb (Maas et al., 2011) is a movie review dataset with positive and negative sentiments.
• Yelp5 is a collection of reviews written by Yelp users with five fine-grained sentiment ratings.
• SST-2 (Socher et al., 2013) is a binary sentiment classification dataset with single-sentence texts.
• Yahoo (Zhang et al., 2015) is a topic classification dataset with question and answer pairs. Using these pairs, the task is to predict their corresponding topic.
• The New York Times (NYT) contains news articles written and published by The New York Times that are classified into 5 wide genres.
• AGNews (Zhang et al., 2015) is a topic categorization dataset in the news domain from AG's corpus.
The size of the test sets is mentioned in Table 10.

A.2 Performance vs k
We vary k in top-k sampling and plot the performance of CONDA-SocialIQA on the IMDb, SST-2, AGNews, and Yahoo datasets in Figure 5. We fix the few-shot supervision size to 8 samples per label and generate 450 samples per label. We repeat each experiment three times and plot the mean performance. Upon manual inspection, we observe that the samples generated with k = 20 are more diverse than those with k = 10; however, the influence of k on performance is not significant.

A.3 Experiments with a validation set
We perform experiments with a validation set. Since large validation sets are impractical in few-shot settings (Oliver et al., 2018), we consider a validation set of the same size as the few-shot training set, i.e., 8 annotated samples per label. In these experiments, we perform early stopping based on validation set performance. We present few-shot results with a validation set in Table 11. We seldom observe significant improvement upon introducing the validation set. This is because a small validation set of the same size as the few-shot supervision is not large enough to tune the hyperparameters.

A.4 Examples of Generated Training Data
Table 12 shows a few examples of synthetic training data corresponding to IMDb and AGNews datasets generated by our method with all QA datasets.

Figure 1 :
Figure 1: Examples of converting topic classification and sentiment analysis data into question-answer-context format.

Figure 2 :
Figure 2: We propose to use QA datasets for transforming pre-trained generative language models into high-quality target task data generators. We view QA datasets in question-answer-context format and fine-tune a pre-trained GLM (G) to obtain a general context generator (G_Q). Then, we adapt it to the target domain by training it further on few-shot target dataset supervision, resulting in G_T. Finally, using G_T, we generate synthetic training data for the target task, use it to augment the few-shot target dataset, and train the target task model on the augmented data.

Figure 3 :
Figure 3: Macro-F1 scores of CONDA-TweetQA and CONDA-SocialIQA w.r.t. the number of generated samples per class. We fix the few-shot supervision size to 8 samples per label. Each experiment is repeated with three different seeds and the mean performance is plotted.

Figure 4 :
Figure 4: Macro-F1 scores of CONDA-CosmosQA and CONDA-SocialIQA w.r.t. the number of few-shot annotated samples per class. Each experiment is repeated with three different seeds and the mean performance is plotted.

Figure 5 :
Figure 5: Macro-F1 scores of CONDA-SocialIQA w.r.t. k. Each experiment is repeated with three different seeds and the mean performance is plotted.

Table 1 :
Relevant statistics of the QA datasets used in our experiments.

Table 2 :
Questions and Verbalized labels of the target task datasets considered in our experiments.

Table 3 :
Few-Shot Evaluation Results. Micro- and Macro-F1 are used as evaluation metrics. All experiments are repeated with three random seeds. Mean and standard deviation (in the subscript) are reported. The best baseline for each dataset is underlined and all results of CONDA that outperform the best baseline are in bold.

Table 5 :
Ablation Study. Macro-F1 is used as the evaluation metric.
We train a BERT-base-uncased QA model on the augmented data. We compare it with the BERT model trained only on the original training set. We report F1 scores on the test set in Table 9. We observe a boost of 4% using our synthetic training data, validating our hypothesis in the in-domain setting.
Out-of-domain Analysis. In this experiment, we validate our hypothesis in the out-of-domain setting, i.e., the domain of the target dataset is different from the QA dataset used for QAC fine-tuning.

Table 9 :
Case Study: We evaluate our context generators in in-domain and out-of-domain settings. In both cases, we observe substantial improvements in performance, demonstrating the effectiveness of our method.

We view sentiment and topic classification tasks in question-answer form and generate contexts using our fine-tuned generative language models. These generated contexts are used as synthetic training data to augment existing few-shot data for training a classifier. Extensive experiments on multiple sentiment and topic classification datasets demonstrate the strong performance of our method in few-shot and zero-shot settings.