Is ChatGPT the ultimate Data Augmentation Algorithm?



Introduction
Textual data augmentation (DA) is a rich and complicated field, whose goal is to generate the most informative artificial examples to add to a training set without additional labelling cost. Many techniques have been developed and thoroughly tested for creating informative artificial data. Recently, ChatGPT has called into question much of what was known about dataset construction, with researchers even wondering whether its capacities spell the end of human labelling (Kuzman et al., 2023).
While extensive studies have examined the use of ChatGPT for many Natural Language Understanding tasks, its use for DA has received surprisingly little attention so far. To our knowledge, only two papers have looked into this, namely Møller et al. (2023) and Dai et al. (2023). Both, however, explore limited settings, which leaves it unclear whether ChatGPT is a good tool for data augmentation. In particular, Møller et al. (2023) study the performance of ChatGPT and GPT-4 on classification tasks of medium size (500 examples), comparing zero-shot data generation and few-shot generation with data augmentation to crowdsourced annotation, but not to other DA techniques. Dai et al. (2023) study DA with ChatGPT on large datasets (several thousand examples), focusing on bio-medical data. Their algorithm, however, does not isolate ChatGPT for data augmentation but combines it with pre-training, preventing an objective evaluation of the capacities of ChatGPT for DA.
In this paper, we compare the use of ChatGPT for paraphrasing as well as for zero-shot data generation against seven algorithms that have shown good performance in the past, on five classification datasets (three binary and two multiclass). We show that the performance of ChatGPT is highly dependent on the dataset, mainly because poorly defined tasks make prompting difficult. These tasks were chosen because they are standard in the textual DA literature, and as such these biases are important to point out. With efficient prompting, generating new data with ChatGPT remains the best way to perform textual DA.

Related Work
There are roughly three main types of DA techniques: word-level augmentation, paraphrasing, and generative methods. In word-level DA, operations modify the words of genuine sentences to create variations. Commonly, the operation is word substitution, replacing a word with a synonym (Wei and Zou, 2019; Liesting et al., 2021), with a neighboring word in a pre-trained embedding space (Marivate and Sefara, 2020), or by masking it and predicting a replacement with a neural network (Kobayashi, 2018; Wu et al., 2019; Kumar et al., 2020).
Paraphrasing techniques attempt to create paraphrases from the available sentences.
The best-known technique of this family is Back-Translation (BT), in which a sentence is translated into a pivot language and then back into English (Hayashi et al., 2018; Yu et al., 2018; Edunov et al., 2018; Corbeil and Ghadivel, 2020; AlAwawdeh and Abandah, 2021). Neural networks have also been used to directly generate paraphrases, with specialized decoding techniques for RNNs (Kumar et al., 2019) or with a BART model trained on a corpus of paraphrases generated through BT (Okur et al., 2022).
Generative methods learn the distribution of the training data and generate new data from it. While the obvious advantage is that the data should be more diverse, generative models are often more complicated to train and fine-tune correctly. Examples of this family of methods include using GPT-2 to generate new data (Kumar et al., 2020; Liu et al., 2020; Queiroz Abonizio and Barbon Junior, 2020), other generative models such as VAEs (Malandrakis et al., 2019; Qiu et al., 2020; Piedboeuf and Langlais, 2022), or conditional VAEs that generate examples conditioned on the class (Zhuang et al., 2019; Malandrakis et al., 2019; Rizos et al., 2019; Wang et al., 2020).
Finally, there has also been interest in the use of proprietary models for DA. Both Yoo et al. (2021) and Sahu et al. (2022) show that GPT-3 is able to generate excellent new data, either by completing a list of sentences from one class or by asking it to generate both new sentences and their labels. ChatGPT has also been studied as a data generator, being asked to paraphrase existing data for few-shot learning (Dai et al., 2023), but the authors first fine-tune the classifier on a large dataset from the same distribution, making it hard to isolate the impact of the generated sentences. Finally, Møller et al. (2023) compare ChatGPT and GPT-4 for data augmentation against human annotation and conclude that ChatGPT is better for simple datasets (such as product review analysis), but that human annotation otherwise outperforms data generated by ChatGPT. As they do not compare to other DA techniques, it remains hard to know how well ChatGPT performs.

Algorithms
It is not clear from the literature which DA algorithms perform best, so in order to thoroughly test the capacities of ChatGPT we select a variety of techniques to compare DA objectively: EDA, AEDA, CBERT, CBART, CGPT, BT, T5-Tapaco, ChatGPT-Par and ChatGPT-Desc. We briefly describe each of these algorithms and refer to the code for the full details of the implementations and hyper-parameters. EDA and AEDA are two simple word-level algorithms that have achieved strong performance in the past. In EDA, one of four operations is chosen (insertion of related words, swapping words, deleting words, and replacing words with synonyms) and applied to a percentage of the words in the sentence. In AEDA (Karimi et al., 2021), punctuation marks are randomly inserted into the sentence (among "?", ".", ";", ":", "!", and ","), the number of insertions being RANDINT(1, len(sentence)/3).
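As a concrete illustration, AEDA's insertion step can be sketched in a few lines of Python (a minimal re-implementation based on the description above, not the authors' exact code):

```python
import random

# Punctuation marks AEDA chooses from (Karimi et al., 2021).
PUNCT = ["?", ".", ";", ":", "!", ","]

def aeda(sentence, rng=random):
    """Insert RANDINT(1, len(words)//3) punctuation marks at random positions."""
    words = sentence.split()
    n_insertions = rng.randint(1, max(1, len(words) // 3))
    for _ in range(n_insertions):
        # Insert one random punctuation mark at a random slot.
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCT))
    return " ".join(words)
```

Because only punctuation is inserted, the original words and their order are preserved, so the label remains valid by construction.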
CBERT and CBART have very similar methodologies. We prepend the class of the example to all genuine sentences, mask a fraction of the tokens, and fine-tune the model on the available training set to predict the masked words. For generation, we then feed the modified sentence (masked and with the class prepended) through the transformer. The main difference between CBERT and CBART is that the latter can predict spans instead of single tokens, which allows more flexibility.
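The data preparation shared by CBERT and CBART can be sketched as follows (a simplified illustration: the label format and `[MASK]` token are assumptions, and real implementations operate on subword tokens rather than whitespace tokens):

```python
import random

def mask_for_cbert(sentence, label, mask_frac=0.15, mask_token="[MASK]", rng=random):
    """Prepend the class label, then mask a fraction of the tokens.

    The fine-tuned model is asked to fill the masks conditioned on the
    prepended label; its predictions become the augmented sentence.
    """
    tokens = sentence.split()
    n_mask = max(1, round(mask_frac * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = mask_token
    return f"{label} {' '.join(tokens)}"
```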
CGPT also works by prepending the class to the sentence, which allows GPT-2 to learn to generate conditionally on it. For generation, we give the class as well as the separator token and let GPT-2 generate new sentences. In BT, we first translate the sentence into a pivot language and then translate it back into English, creating paraphrases. We use the FSMT model from Hugging Face, with German as the intermediary language, which has been shown to obtain good performance (Edunov et al., 2018). Okur et al. (2022) propose to fine-tune BART on a corpus of in-domain paraphrases created with BT. We found in our experiments that we could get results just as good by using T5-small-Tapaco, the T5 model fine-tuned on the TaPaCo corpus of paraphrases (Scherrer, 2020).
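For CGPT, the conditioning amounts to a simple formatting convention at fine-tuning and generation time, which can be sketched as follows (the separator token name is illustrative):

```python
def cgpt_training_line(sentence, label, sep="<SEP>"):
    """Format one fine-tuning example: class label, separator, then the sentence."""
    return f"{label} {sep} {sentence}"

def cgpt_generation_prefix(label, sep="<SEP>"):
    """Prefix fed to the fine-tuned GPT-2 at generation time; the model completes it."""
    return f"{label} {sep}"
```

Because every generation prefix is also the prefix of the training lines for that class, GPT-2 learns to complete it with sentences of the requested class.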
Finally, we test the use of ChatGPT, either by asking for paraphrases of genuine sentences (ChatGPT-Par) or by giving a short description of the task and classes and asking for novel sentences (ChatGPT-Desc). We give the exact prompts in Appendix A. Because part of the experiments were done before the API became publicly available, we use the Pro version of the Web interface of ChatGPT in this paper and leave further fine-tuning for future work.
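A batched paraphrase request of the kind used for ChatGPT-Par can be assembled as below (hypothetical wording; the exact prompts used in the paper are given in Appendix A):

```python
def build_paraphrase_prompt(sentences, n_variants=5):
    """Assemble one batched paraphrase request for a chat model."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    return (
        f"Paraphrase each of the following sentences {n_variants} times, "
        f"keeping the original meaning and sentiment:\n{numbered}"
    )
```

Batching many sentences per request amortizes the per-call cost, but, as discussed later, a single refused sentence can invalidate the whole batch.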

Datasets and Methodology
We test on five datasets of varied complexity and characteristics to fully assess the performance of the algorithms. We use SST-2 (Socher et al., 2013), a binary movie review classification dataset; FakeNews, a dataset of news items to classify as real or fake; Irony and IronyB (Van Hee et al., 2018), binary and multiclass versions of a task consisting in classifying tweets as ironic or not and, for the multiclass version, which kind of irony (polarity clash, situational irony, other irony); and TREC6 (Li and Roth, 2002), a multiclass dataset where the goal is to classify questions into six categories (abbreviation, description, entities, human beings, locations, and numeric values). More information is available in Appendix C.
These datasets were chosen to cover a spread of tasks and because they are commonly used in the data augmentation literature. SST-2 and TREC6 are both fairly standard in DA research, being used for example in (Kumar et al., 2020; Quteineh et al., 2020; Regina et al., 2021; Kobayashi, 2018). The Irony datasets are also used quite regularly, for example in (Liu et al., 2020; Turban and Kruschwitz, 2022; Yao and Yu, 2021). Finally, while FakeNews has not been used in DA to our knowledge, it is commonly used for fake news detection, for example in (Verma et al., 2023; Chakraborty et al., 2023; Iceland, 2023).
We test data augmentation in two settings: few-shot learning (10 or 20 starting examples) and classification with dataset sizes of 500 and 1000 examples. While sampling the starting set, we make sure to balance the classes, so as to observe the performance of data augmentation without the additional factor of imbalanced data. We also tested the process on the full dataset but, similarly to other studies, observed only marginal gains; these results are reported in Appendix C.
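The class-balanced sampling of the starting set can be sketched as follows (a minimal illustration, assuming the target size divides evenly among the classes):

```python
import random
from collections import defaultdict

def balanced_sample(examples, labels, n_total, rng=random):
    """Draw n_total examples with an equal number per class."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    per_class = n_total // len(by_class)
    sampled, sampled_labels = [], []
    for y, xs in sorted(by_class.items()):
        sampled.extend(rng.sample(xs, per_class))
        sampled_labels.extend([y] * per_class)
    return sampled, sampled_labels
```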

Results
Table 1 shows results for dataset sizes of 10 and 20, and Table 2 for sizes of 500 and 1000.
For small dataset sizes, we observe that the performance of ChatGPT-Par is on par with the best algorithms, but does not beat them by a significant margin on average. While ChatGPT-Desc performs exceptionally well, its performance comes almost exclusively from SST-2, for which it generates highly informative data. For the other datasets, it mostly brings either no gain in performance or a very small one. Overall, all the algorithms provide a reasonable performance gain, except perhaps CGPT, which performs poorly. Excluding ChatGPT-Desc, BART and T5-TaPaCo obtain some of the best performance, although not by much. Given the generative nature of ChatGPT-Desc and its performance, one could also wonder whether it would perform better if we generated more sentences. In Appendix C we show that this is not the case: the performance plateaus quickly for all datasets.
For larger training sets, ChatGPT performs better, while BART and T5-TaPaCo degrade performance. We believe this is because these algorithms create paraphrases that stay close to the original sentences, leading to less diversity overall. While in few-shot learning this is not a problem, because the goal is to give the neural network enough relevant data to learn from, on larger datasets diversity seems to become a prevalent factor. Nevertheless, the overall effect of augmentation on moderately sized datasets is very small, which raises the question of whether data augmentation is relevant at all in these cases.

Discussion
Of all the algorithms used here, T5 and ChatGPT present the greatest novelty and show some of the best performance. As such, we center our discussion on these two algorithms. When we inspect T5's sentences (see Appendix B), we can see that they are not as good as one would expect, often being ungrammatical or badly formed. Still, it has been noted before that well-formed sentences are not an important criterion for the efficiency of a DA algorithm, which might explain why its performance is nonetheless high (Karimi et al., 2021).
ChatGPT-Desc often has difficulty generating sentences of the desired class, accounting for its overall poor performance, while ChatGPT-Par creates excellent paraphrases, bringing diversity to the dataset while maintaining class coherence. Nevertheless, there is a hidden cost that we found was not discussed in other papers, namely the need for data repair. ChatGPT-Par quite often refuses to create paraphrases, especially for the FakeNews and Irony/IronyB datasets, which contain occasional mentions of rape and sex. In these cases, we had to manually find the "bad" element of the batch and rerun it without that element, adding considerable complexity to the data augmentation pipeline. Another option would be simply not to correct these cases, but our preliminary studies indicate that this degrades the performance of DA significantly.

Poorly defined tasks and dataset biases
Although the description strategy can add something akin to external data, our experiments show that ChatGPT underperforms with this method, often doing worse than when paraphrasing the existing data. This raises questions, as adding more diverse data should improve performance.
We found that, for the most part, the poor performance of ChatGPT was related to the poor health of the datasets. Except for SST-2, we found that FakeNews, Irony, IronyB, and TREC6 have labels that are poorly defined with respect to the task, and that examples in these datasets were often ambiguous to human eyes. Under these conditions, it is difficult to expect ChatGPT to perform well. We underline these problems here because poor dataset health is not a rare phenomenon.
Irony and IronyB are two datasets from the SemEval 2018 competition. Data was collected from Twitter by gathering tweets containing specific hashtags such as #irony, #not, or #sarcasm, which were then removed to form the ironic class. The non-ironic class was then formed by collecting other tweets. This creates a heavy bias in the dataset, which shifts the task from predicting whether a tweet is ironic to predicting whether it originally carried an #irony hashtag. Without the clue these hashtags give us, it is often impossible to know whether a tweet is ironic, and we show in Appendix C some examples of ironic tweets which, from manual labelling, we found to be ambiguous in their classes.
TREC6 is a dataset from the Text REtrieval Conference that consists in classifying questions into six categories. While all the data was manually annotated, we found inconsistencies in the annotation. For example, "What is narcolepsy?" is labelled as a Description but "What is a fear of motion?" as an Entity. Other inconsistencies include "What is the oldest profession?" and "What team did baseball's St. Louis Browns become?" being labelled as Human versus "What do you call a professional map drawer?" as Entity, or "Where did Indian Pudding come from?" being labelled as Description but "Where does chocolate come from?" as Location. Given that the same mislabelling remains in the test set (e.g., "Where does dew come from?" is labelled as Location), ChatGPT generating sentences of the correct class will not help the classifier much. Note that these issues were already pointed out by Li and Roth (2002), who advise using multilabel classification to reduce the biases introduced in the classifier. In all of its uses for DA, however, we found it treated as a regular classification problem, with all the ambiguity problems that entails.
Finally, FakeNews is a Kaggle dataset that has since been used in many papers on fake news detection. We decided to use this dataset because it seemed a difficult and interesting task, but while analyzing it we found it biased in a way similar to Irony and IronyB. From what little information we could find, news items were gathered from various sources and split into real or fake based on the outlet they came from. This introduces bias because, while some outlets may have a tendency to spread fake news, it does not mean all of their articles are fake. Furthermore, we found strange labelling choices. For example, all articles from Breitbart are labelled as real news even though it receives a mixed factual-reporting score, while articles from Consortium News, which receives roughly the same score, are labelled as fake.
By refining the prompting, we can augment the TREC6 dataset to reach 68.6, which still underperforms compared to BERT. We found ChatGPT to have difficulty understanding the concepts of "Entity" and "Human" questions, often labelling such questions instead as "Description".

Conclusion
Data augmentation is a key technique for lowering annotation costs while maintaining good performance, but even today it is difficult to determine which technique works best. In particular, the use of ChatGPT has not been properly assessed for data augmentation, leaving industry practitioners unsure whether it is worth the price.
In this paper, we study nine data augmentation techniques, including a novel one using a pre-trained T5 system for paraphrasing, and show that while ChatGPT achieves among the best results, it does not outperform the other algorithms by a significant margin. This, coupled with the fact that using ChatGPT costs both time and money compared to the other algorithms, brings us to a different conclusion from previous studies of ChatGPT for DA, namely that it might not be worth it, depending on the task. We further found that while zero-shot generation of data can give outstanding results, it was often hindered by biased datasets, which prevented efficient prompting of ChatGPT.

Limitations
This paper explores the use of ChatGPT for DA, comparing it to other algorithms from the literature. Technical limitations of this paper include limited fine-tuning for some of the algorithms, including ChatGPT, for which we used the Web interface and therefore could not tune the hyper-parameters. While the other algorithms were tuned on some hyper-parameters, this tuning rested on certain assumptions (such as the use of German as the pivot language for BT), which may not be optimal.
This paper also focuses on the English language and on the classification of short texts, two assumptions that do not hold for many other tasks. As such, we do not guarantee that the results are applicable to other tasks, especially for low-resource languages (such as Inuktitut or Swahili) or for longer texts, for which most of the algorithms used would likely perform poorly due to a lack of training data or limited input context.

Ethics Statement
The use of pre-trained language models, and especially of "very large language models", comes with a plethora of ethical problems that have been well discussed in the literature, including the environmental cost (Schwartz et al., 2019; Bender et al., 2021) and environmental racism (Rillig et al., 2023), the repetition of learned biases against minorities (Singh, 2023), and concerns over data privacy (Li et al., 2023).
A major concern with the most recent models is their effect on employment, but we believe this paper mitigates that effect by showing the limitations of ChatGPT, especially in the context of data annotation and dataset creation.

B Example of generated sentences
We give in Table 3 examples of generated sentences for the SST-2 dataset and the negative class, with the starting sentence "makes a joke out of car chases for an hour and then gives us half an hour of car chases." for the algorithms that take a sentence as input (all except CGPT and ChatGPT-Desc). When fine-tuning is needed, we use a training set size of 20.

C Supplementary Results
In this section we give supplementary results. Table 4 gives information about the datasets, and Table 5 gives the results of data augmentation on the full training set with a generated-to-genuine ratio of one. Figure 1 shows the performance as we increase the ratio for the ChatGPT-Desc strategy, compared to AEDA and T5, with a dataset size of 10. As we can observe, performance plateaus quickly for all algorithms. Given that ChatGPT-Desc performs much better on SST-2 than on the other datasets, we also give in Figure 2 the results excluding SST-2. We leave for future work the question of whether the plateau for ChatGPT-Desc is due to a lack of fine-tuning or simply to the limits of ChatGPT's ability to generate diverse sentences.

D Technical details
All final hyper-parameters are detailed in the GitHub repository, and we give a summary of which hyper-parameters we tuned in Table 7. For fine-tuning the classifiers, we changed the number of epochs while leaving the other parameters fixed. For fine-tuning the algorithms, we explored random combinations around the hyper-parameters recommended in the original papers, as detailed in Table 7. To assess the capacities of the DA methods fairly across datasets, we keep the same hyper-parameters for a given dataset size across all datasets. Experiments were run on an NVIDIA GeForce RTX 3090 with 24 GB of memory.
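The random search over combinations can be sketched as follows (the grid values are purely illustrative, not the ones from Table 7):

```python
import random

def sample_hparams(grid, rng=random):
    """Draw one random combination from a per-parameter list of candidate values."""
    return {name: rng.choice(values) for name, values in grid.items()}

# Hypothetical candidate values centered on commonly recommended settings.
grid = {"lr": [1e-5, 2e-5, 5e-5], "epochs": [3, 5, 10]}
```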

Figure 1: Average metric vs. ratio for a dataset size of 10 with ChatGPT-Desc, AEDA, and T5.

Table 1: Average metric over 15 runs for training set sizes of 10 (left) and 20 (right) with a ratio of 10. We report accuracy for binary tasks and macro-F1 for multiclass ones. Standard deviations are between 1.5 and 5, depending on the dataset.

Table 2: Average metric over 15 runs for training set sizes of 500 (left) and 1000 (right) with a ratio of 1. We report accuracy for binary tasks and macro-F1 for multiclass ones. Standard deviations are between 0.6 and 3.0, depending on the dataset.

Table 3: Examples of generated sentences for each algorithm on the SST-2 dataset with a dataset size of 20.

Table 4: The tasks tackled in this study. Sentence length is defined as the number of tokens when tokenizing on whitespace.