LLM-powered Data Augmentation for Enhanced Crosslingual Performance

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data. We compare the performance of training with data generated in English and in target languages, as well as with translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g. a notable 13.4-point accuracy improvement in the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages; however, they struggle to generate meaningful text in certain languages such as Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency. We release the generated data at https://github.com/mbzuai-nlp/Gen-X.


Introduction
The success of NLP models greatly depends on the availability and quality of training data. This poses a significant challenge for multilingual NLP, as data for languages other than English is typically limited (Ponti et al., 2019; Joshi et al., 2020). One approach to addressing this data scarcity is zero-shot cross-lingual transfer or multitask training, in which a model is trained on data from diverse tasks and languages and thereby acquires the capability to handle unseen tasks, particularly in larger models (Artetxe and Schwenk, 2019; Nooralahzadeh et al., 2020; Huang et al., 2021). However, when aiming for task-specific objectives, a smaller model fine-tuned for that particular task outperforms general-purpose, zero-shot larger models. In addition, a smaller task-specific model is more practical and cost-effective to train and deploy. Nevertheless, developing a powerful task-specific model becomes challenging in the absence of training data (Lauscher et al., 2020).
Conversely, recent powerful large language models (LLMs) excel at handling general instructions and have shown promise in data generation tasks (Wang et al., 2022). In this work, we leverage LLMs to generate synthetic data for three multilingual commonsense reasoning tasks, XCOPA (Ponti et al., 2020), XWinograd (Tikhonov and Ryabinin, 2021), and XStoryCloze (Lin et al., 2022), where the training data is limited even for English (see Table 1). To augment the training data, we provide the LLMs with instructions and examples from the original training data, and then request that they generate diverse new examples. We explore the generation of synthetic data in English using different LLMs, including the open-source models Dolly-v2 and StableVicuna, as well as ChatGPT and GPT-4. Although the weights and capabilities of the latter two models remain undisclosed, they can generate text in languages beyond English.
We develop task-specific models by fine-tuning multilingual pre-trained language models, such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), on the generated data. We then compare their performance against models trained on a limited set of human-created data in the target language whenever available, and otherwise through zero-shot transfer from manually created English training data. Our experiments demonstrate that training the models on relatively large synthetically generated datasets yields better performance than training on the limited manually created datasets. This finding empirically confirms the utility of synthetic data generated by LLMs for improving downstream task-specific models.
We expand the multilingual data synthesis with ChatGPT and GPT-4 on XCOPA and find that generating multilingual datasets generally surpasses the effectiveness of zero-shot cross-lingual transfer, with the exception of ChatGPT-generated multilingual data on the larger fine-tuned XLMR. We further carry out a manual annotation to assess the quality of the generated datasets in different languages, evaluating their naturalness and logical soundness against the human-written data. The annotation results reveal that while ChatGPT and GPT-4 successfully generate natural text in most languages, they struggle to generate understandable text in certain languages such as Tamil. Moreover, a noticeable gap in commonsense coherence is observed between ChatGPT-generated and human-constructed data; GPT-4, on the other hand, significantly narrows this difference.
In brief, our work makes the following key contributions: (1) augmenting three low-resource, crosslingual commonsense reasoning datasets by leveraging and instructing four LLMs; (2) fine-tuning smaller models, mBERT and XLMR, on the synthesised data and showcasing the practical value of the LLM-generated data; (3) performing an extensive analysis of the effects of various target languages in data generation and scaling, including a human evaluation of the naturalness and logical coherence of the generated data in different languages; and (4) releasing the synthesised datasets for public use and reproducibility.

Related Work
Multilingual and Low-Resource NLP Recently, there has been increased attention on expanding NLP beyond English, including the development of multilingual models (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2021; Scao et al., 2022) as well as the creation of benchmarks to address multilingual challenges (Conneau et al., 2018; Artetxe et al., 2019; Adelani et al., 2021; Winata et al., 2023). A common theme among the prevailing challenges faced across various languages is the scarcity of available data.
Consequently, when data is lacking, one approach is to employ zero-shot cross-lingual transfer. Winata et al. (2023) demonstrate the effectiveness of zero-shot cross-lingual transfer for related languages. Additionally, Muennighoff et al. (2022) show that models fine-tuned only with English instruction data are capable of understanding multilingual instructions. In this work, we tackle a similar scenario where the availability of data is limited. Lauscher et al. (2020) show that a few shots can drastically increase the cross-lingual performance of small models, proving that multilingual data augmentation is an effective strategy. A series of works try to predict the cross-lingual accuracy of models through measurements and modelling (Xia et al., 2020) and study strategies for multilingual data augmentation, such as choosing the transfer languages (Lin et al., 2019) and predicting multilingual few-shot accuracies to derive optimal data augmentation approaches (Srinivasan et al., 2022).

Table 2 (excerpt), instruction for XCOPA: "We are gathering more examples for the COPA dataset which will be used to test a system's ability of Commonsense Causal Judgments. The format of the data: A premise: a statement of something that happened, and two choices that could plausibly {occur as the result / be the cause} of the premise. The correct choice is the alternative that is more plausible than the wrong choice."

Dataset Augmentation
This section explains the datasets used in the experiments and the detailed instruction setup.

Dataset
Our experiments use XCOPA, XWinograd, and XStoryCloze, which are selected due to the limited availability of training data and the fact that commonsense reasoning datasets present greater challenges for data synthesis. Table 1 summarises the statistics of the datasets. XWinograd has no train/validation/test split, so we follow an 80/10/10 split in our experiments.
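Since the paper does not detail how the 80/10/10 XWinograd split is produced, the following is a minimal sketch assuming a deterministic seeded shuffle; the function name and seed are illustrative, not the authors' code.

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle deterministically, then split into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_80_10_10(range(100))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing fine-tuned models trained on the same data.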
XCOPA is a crosslingual Choice of Plausible Alternatives dataset that translates and re-annotates the validation and test sets of the English (EN) COPA (Roemmele et al., 2011) into 11 target languages (ET: Estonian, HT: Haitian Creole, ID: Indonesian, IT: Italian, QU: Quechua, SW: Swahili, TA: Tamil, TH: Thai, TR: Turkish, VI: Vietnamese, and ZH: Chinese). Each instance consists of a premise, a question (cause/result), and two alternatives, and the task is to predict the more plausible alternative.
XWinograd expands the original English Winograd Schema Challenge (WSC) (Levesque et al., 2012) to five other languages (FR: French, JA: Japanese, PT: Portuguese, RU: Russian, and ZH). It consists of pronoun resolution problems that aim to evaluate the commonsense reasoning ability of a machine. Given a statement with two noun phrases and a pronoun, the challenge of WSC is to determine the referent of the pronoun, which can only be inferred from the context.
XStoryCloze was collected by Lin et al. (2022) by translating the validation split of the original English StoryCloze dataset (Mostafazadeh et al., 2016) into 10 other typologically diverse languages (RU, ZH, ES: Spanish, AR: Arabic, HI: Hindi, ID, TE: Telugu, SW, EU: Basque, and MY: Burmese). Each example consists of a four-sentence commonsense story, a correct ending, and a wrong ending.

LLMs for Data Generation
Our preliminary experiments reveal that language models specifically fine-tuned on downstream NLP tasks, such as BLOOMZ (Muennighoff et al., 2022), struggle to follow our data generation instructions. We therefore use four general-purpose, instruction-following LLMs: the open-source Dolly-v2 and StableVicuna, and the closed ChatGPT and GPT-4.

Instructions and Responses
We generate synthetic examples for all datasets by instructing LLMs. We construct the instructions using the descriptions from the dataset papers as a reference, provide the LLMs with a few examples randomly sampled from the train (+validation) split of the original dataset, and ask them to generate similar data points. We experiment with various instructions, evaluate the synthesised data on a smaller scale, update the instructions based on the errors, and then choose the best instruction to generate the final datasets. The final instruction and responses (with ChatGPT as an example) can be seen in Table 2.
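The few-shot prompt construction described above can be sketched as follows for XCOPA. The instruction text is abridged from Table 2, and the function names, field names, and default `k`/`m` values are illustrative assumptions rather than the authors' exact setup.

```python
import random

# Abridged instruction text from Table 2 (see the paper for the full wording).
XCOPA_INSTRUCTION = (
    "We are gathering more examples for the COPA dataset which will be used "
    "to test a system's ability of Commonsense Causal Judgments."
)

def format_example(i, ex):
    """Render one demonstration in the format shown in Table 2."""
    question = ("What happened as a result?" if ex["question"] == "effect"
                else "What was the cause?")
    return (f"Example {i}: Premise: {ex['premise']} {question} "
            f"Correct choice: {ex['correct']} Wrong choice: {ex['wrong']}")

def build_prompt(sample_pool, k=5, m=10, language="English", seed=0):
    """Sample k demonstrations from the original train(+validation) split
    and ask the LLM to generate m new examples in the given language."""
    rng = random.Random(seed)
    demos = rng.sample(sample_pool, k)
    lines = [XCOPA_INSTRUCTION, f"Here are {k} examples in {language}:"]
    lines += [format_example(i + 1, ex) for i, ex in enumerate(demos)]
    lines.append(f"Based on the examples above, generate {m} new examples "
                 f"in {language}.")
    return "\n".join(lines)
```

Resampling the demonstrations for each request (via a different seed) is one way to encourage diversity in the generated examples.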
We first request the LLMs to generate a total of 3∼4K data points for each dataset, and then parse and filter the responses, keeping only the unique examples. LLMs tend to produce output in an invalid, inconsistent format and often generate fewer samples than requested. We report the success rate for the different LLMs on the three datasets in Table 3, which indicates that GPT-4 is the most robust. Among the datasets, XWinograd has the lowest generation success rate, because the data requires both answers to come from the generated sentence, with only one pronoun being replaced. In addition, we observed pronoun inconsistency in the generated XWinograd data: despite the requirement that the options be interchangeable in the pronoun slot, models frequently fail to comply. For example, ChatGPT generates "The dog bit the mailman because _ entered the yard." with the options "The dog" and "the mailman"; however, the "_" in the sentence cannot be replaced by the same pronoun for both options, which may make the task easier, so the example is considered suboptimal. We keep those instances in the dataset and discuss them further in §6.1.
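The parse-and-filter step can be sketched as below for XCOPA-style responses: extract well-formed examples with a pattern match, drop duplicates, and compute the success rate relative to the number requested. The regular expression and field names are our assumptions about one plausible response format, not the authors' actual parser.

```python
import re

# One XCOPA-style example: premise, question, correct and wrong choice.
EXAMPLE_RE = re.compile(
    r"Premise:\s*(?P<premise>.+?)\s*"
    r"(?P<question>What happened as a result\?|What was the cause\?)\s*"
    r"Correct choice:\s*(?P<correct>.+?)\s*"
    r"Wrong choice:\s*(?P<wrong>.+?)\s*(?=Premise:|$)",
    re.DOTALL,
)

def parse_and_filter(response, requested):
    """Extract well-formed examples, drop duplicates, and report the
    success rate (kept examples / requested examples)."""
    seen, examples = set(), []
    for m in EXAMPLE_RE.finditer(response):
        ex = m.groupdict()
        key = (ex["premise"], ex["correct"], ex["wrong"])
        if key in seen:  # keep only unique examples
            continue
        seen.add(key)
        examples.append(ex)
    return examples, len(examples) / requested
```

Malformed output simply fails to match and is discarded, which is why the success rates in Table 3 fall below 100%.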

Experimental Setups
We first generate synthetic English data for XCOPA, XWinograd, and XStoryCloze. We contrast data generation across Dolly-v2, StableVicuna, ChatGPT, and GPT-4, and compare them with a baseline of training models on the original English data. The sizes of the final synthesised data for the three datasets are 3.7K, 2K, and 1.7K, respectively. We then fine-tune mBERT, XLMR-Base, and XLMR-Large on the synthesised data and measure the zero-shot cross-lingual transfer performance across different languages, using the original validation set in the target languages.
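Fine-tuning on these datasets treats each example as a multiple-choice problem: every alternative is paired with the context, each pair is scored by a multiple-choice head, and the higher-scoring pair is the prediction. A minimal sketch of the input construction for XCOPA (the connective wording is our assumption, not necessarily the paper's exact formatting):

```python
def xcopa_to_choice_inputs(example):
    """Build one (context, alternative) sequence pair per choice; a
    multiple-choice head scores each pair and predicts the argmax."""
    # "so" / "because" connectives are illustrative choices.
    connective = "so" if example["question"] == "effect" else "because"
    context = f"{example['premise']} {connective}"
    return [(context, example["choice1"]), (context, example["choice2"])]

pairs = xcopa_to_choice_inputs({
    "premise": "The man wanted to save money.",
    "question": "effect",
    "choice1": "He cut back on frivolous purchases.",
    "choice2": "He withdrew money from his savings account.",
})
```

Each pair would then be tokenised jointly (context as segment A, alternative as segment B) before being fed to mBERT or XLM-R.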
For XCOPA, we additionally experiment with generating data points directly in non-English languages by providing examples in the target language and specifying the desired language for the generated data (see Table 2). However, since no examples for cause appear in the TH and TR train/validation data (although they do appear in the test split), we do not generate XCOPA data for these two languages. We use ChatGPT and GPT-4 for multilingual synthetic data generation, as both Dolly-v2 and StableVicuna exhibit limitations in effectively generating multilingual text. The size of the multilingual synthesised data is ∼3.6K in each language.
We fine-tune models on all datasets as multiple-choice tasks.

Results and Discussion
This section presents the main results of the fine-tuned models on the three datasets and compares performance across data generated by different LLMs, in different languages, and at different scales.

General Result
Table 4 presents the average accuracy of fine-tuned mBERT, XLMR-Base, and XLMR-Large models across all languages on the three datasets. The models are trained using the original data (ORI), different LLM-generated data (GEN), and a combination of both sources (O+G) in English, compared under zero-shot cross-lingual transfer. Across datasets, LLMs, and fine-tuned models, consistent improvements are observed when using both original and LLM-generated data. Among the LLMs, Dolly-v2 performs best on XWinograd when fine-tuning mBERT, while GPT-4 achieves the highest accuracy in the other settings. The most significant improvement appears on XWinograd with XLMR-Base, where adding an extra 2K data points leads to an average accuracy gain of 12.8 over the baseline across all four LLMs.
When using only LLM-generated data, smaller models like mBERT and XLMR-Base generally outperform the baseline. However, for XLMR-Large, which achieves stronger baselines (e.g. >80 on XWinograd and XStoryCloze), the accuracy remains similar or even drops compared to using the original data. GPT-4-generated data is the most robust but still suffers a decline in performance on XWinograd when the generated data size is similar to that of the original data. This highlights the challenge of generating data of human-level quality.

Multilingual Data Generation
The zero-shot cross-lingual approach is commonly used when a multilingual dataset is insufficient. In this subsection, we investigate whether a synthetically generated multilingual dataset outperforms training solely in English. We choose the XCOPA dataset and explore two settings: synthetic multilingual data, obtained by asking the LLMs to generate responses directly in the target languages, and translations of the English-generated data into the target languages via the Google Translate API. We exclude Dolly-v2 and StableVicuna due to their limited effectiveness in generating non-English text. Although GPT-4 exhibits the most promising performance, it is significantly costlier than ChatGPT; therefore, we also consider ChatGPT as a contrasting experiment under resource-constrained conditions.

Table 5: Accuracy on XCOPA. ORI corresponds to the original data; GEN EN and GEN XX represent data generated in English and in the target languages; Trans denotes translations of the English-generated data. We show the languages that are available in all settings. Improvements and declines in performance are marked with green and red shading.
Table 5 shows the results for the languages available in all settings, excluding TR and TH (unavailable for LLM generation, see §4) and QU (not supported by the Google Translate API). The impact of the generated data varies across fine-tuned models and languages, aligning with the findings of Kumar et al. (2022). Training on GPT-4-synthesised data displays consistent improvement across all scenarios and languages, except for the zero-shot cross-lingual result on HT with XLMR-Large.
More fluctuating results can be observed with ChatGPT-generated data. A comparison between GEN EN + ORI and GEN XX + ORI indicates that utilising data generated in the target languages generally leads to improved performance with GPT-4-generated data, as well as for the base models with ChatGPT-generated data. However, for XLMR-Large, employing ChatGPT-generated data in the target languages mostly yields negative outcomes. In languages such as TA and VI, training on data generated in the target languages results in larger performance degradation than zero-shot cross-lingual transfer, suggesting that ChatGPT performs worse in those languages than XLMR-Large (Ahuja et al., 2023). Translating the English dataset generally gives better results overall than training on data generated directly in the target languages, with the exception of XLMR-Large with GPT-4. For SW, XLMR models fine-tuned with ChatGPT-generated data exhibit performance declines in most cases, even when the English-generated data benefits all other languages, suggesting that XLMR struggles with SW. In §6.1, we select TA and SW, the two best-performing languages ZH and ID, and EN for human evaluation.
Additionally, we conduct experiments in which the Target Languages are added to the Validation set (TLV). This results in only minor variations in performance, consistent with the findings of Ponti et al. (2020). We include the full results in Table 10 in Appendix C.

Dataset Scaling Up
We further investigate the impact of training on larger-scale generated data. We focus on the XCOPA dataset and expand the ChatGPT-generated data to 28.6K examples in English. We again compare zero-shot cross-lingual transfer with translating the English-generated data into the target languages. The results in Table 6 demonstrate the positive impact of scaling up the generated data on model performance, with XLMR-Large exhibiting the most significant improvement.

Human Evaluation
To better evaluate the quality of the generated datasets and compare them with the human-created data, we ask native speakers to annotate the multilingual data generated by ChatGPT and GPT-4.
For each dataset, we first select 50 generated examples in English and then ask two annotators to evaluate the examples in two categories: 1) Text Naturalness: the annotators choose one of the following options for each example: "the text sounds natural", "the text sounds awkward but understandable", or "the text is not understandable"; and 2) Logic Soundness: this category focuses on the commonsense aspect of the examples, and the annotators select the most appropriate description from: "the correct option is (clearly) more plausible", "both options are equally plausible", "both options are implausible", or "the wrong option is actually more plausible". We only ask the annotators to evaluate the logic if the text is at least understandable.
For XWinograd, we introduce an additional evaluation criterion: annotators are asked to determine whether the two noun phrases in an example can be replaced by the same pronoun (see §3.3). For XCOPA, we extend the annotation to non-English languages, choosing the two languages that demonstrate the most notable improvement, ZH and ID, as well as the two languages that exhibit the least improvement or a regression in performance with ChatGPT-generated data, TA and SW (see Table 5). In addition to the original examples and the examples generated in the target languages, we include 50 examples translated from the same English-generated examples that were selected for annotation.
To ensure impartiality, all the examples are shuffled, and the annotators are not provided with information regarding the source of the examples (human-created, LLM-generated, or translated).
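Aggregating the resulting labels into the per-language percentages plotted in Figure 1 amounts to averaging label distributions over the two annotators. A minimal sketch, where the shortened label strings are our own and not the exact annotation-interface wording:

```python
from collections import Counter

NATURALNESS_LABELS = ["natural", "awkward but understandable",
                      "not understandable"]

def label_distribution(annotations):
    """annotations: one list of labels per annotator (same set of examples).
    Returns the percentage of each label, averaged across annotators."""
    totals = Counter()
    for labels in annotations:
        counts = Counter(labels)
        for label in NATURALNESS_LABELS:
            totals[label] += 100.0 * counts[label] / len(labels)
    return {label: totals[label] / len(annotations)
            for label in NATURALNESS_LABELS}
```

Averaging per-annotator percentages (rather than pooling raw counts) keeps each annotator's judgement equally weighted even if one annotator skips an example.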

Human Evaluation Results
Figure 1 presents the annotation results for XCOPA, averaged from two annotators for each language.
Looking at the Text Naturalness plot, we can see that for EN, ID, ZH, and SW, both ChatGPT and GPT-4 achieve higher naturalness than the original dataset. This is particularly prominent for ID, revealing a fluency issue in the original ID data in XCOPA, which is also confirmed by a native speaker. In contrast, TA shows surprisingly low scores, with most examples classified as "not understandable". This accounts for the significant decline of XLMR-Large performance on TA when trained on ChatGPT-generated TA data. Intriguingly, however, models trained on TA data generated by GPT-4 improve over the baselines, despite the poor quality judgements from the human annotators. Upon further investigation, native speakers noted that the text sometimes contains unrelated, nonsensical words, yet readers can still intuitively grasp the meaning: although the text is extremely difficult to comprehend and appears unnatural, it is not strictly impossible to understand. We hypothesise that the trained model can still learn from such text. The translated text is typically less natural than the original and generated data (apart from ID, due to the issues in the original data). This result affirms that LLMs generally excel at generating fluent text for the languages they support.
In terms of logical soundness, ChatGPT falls short of the original dataset. We further illustrate the categorised issues in the last column of the plots in Figure 1. For ChatGPT, the majority of problematic examples are labelled as "both options are equally plausible"; only SW has more examples labelled "the wrong option is actually more plausible". We suspect that this issue arises from the instruction provided (taken from the description of the original COPA dataset), which states that "both options could be plausible, but one is more plausible". In some cases, ChatGPT generates two choices that are excessively similar in plausibility. On the other hand, GPT-4-generated examples exhibit logical coherence that is competitive with the human-written data.

Conclusions
This paper explores the effectiveness of utilising LLMs for data augmentation in cross-lingual datasets with limited training data. We specifically focus on commonsense reasoning tasks, which are challenging for data synthesis. Our experiments, covering four LLMs for data generation on three datasets, showcase enhanced cross-lingual zero-shot transfer for smaller fine-tuned task-specific language models. However, the impact varies across datasets and languages. Notably, larger models such as XLMR-Large, which have higher baselines, find it more difficult to achieve performance improvements with LLM-generated data. Among the four LLMs, GPT-4-generated data exhibits mostly consistent, superior performance.
Generating data directly in the target languages also shows general improvements over cross-lingual zero-shot transfer with the English-generated data. Human evaluation of the synthesised multilingual datasets shows that the ChatGPT- and GPT-4-generated data demonstrate high naturalness in most languages, even surpassing the original data. However, in certain languages such as TA, both models fail to generate natural text. Additionally, when assessing logical soundness, examples synthesised by ChatGPT reveal notable inconsistencies regarding the more plausible option compared to the original human-created data, whereas GPT-4 exhibits performance on par with the human-written data.
In conclusion, leveraging LLMs for data augmentation shows promise. However, the choice of LLM significantly influences the quality of the resulting data, as well as its applicability to the language under consideration. In circumstances where a more advanced model such as GPT-4 cannot be accessed, other models can be utilised, though this might result in performance difficulties in certain non-English languages (a challenge that also exists for GPT-4) and concerns regarding logical coherence.

B Sentences and Events of StoryCloze
As the StoryCloze dataset contains more sentences and richer content, we follow the analysis of the ROC story corpus and further compare stylistic features, in terms of sentence length and the most frequent events, between the ChatGPT-generated and the original data. This helps us determine whether ChatGPT-generated data can capture the corpus distribution when n examples are randomly sampled from the dataset for the instructions.
In Figure 2, we compare the generated data points with the original 300-example train set used as few-shot examples in the generation instructions. We can see that 23 of the 30 most frequent events in the original dataset also appear among the 30 most frequent events in the ChatGPT-generated data. Regarding sentence length, we observe that ChatGPT tends to generate longer sentences, especially for the ending sentences, whereas in the original dataset the endings tend to be the shortest of all sentences.
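The sentence-length side of this comparison can be reproduced with a simple per-position token count. The sketch below uses whitespace tokenisation as a simplification (the paper does not specify its tokeniser), and the field names are illustrative:

```python
def mean_sentence_lengths(stories):
    """stories: dicts with keys sent1..sent4, correct, wrong.
    Returns the mean whitespace-token length per sentence position."""
    positions = ["sent1", "sent2", "sent3", "sent4", "correct", "wrong"]
    return {pos: sum(len(s[pos].split()) for s in stories) / len(stories)
            for pos in positions}
```

Running this over the original and the ChatGPT-generated stories and comparing the `correct`/`wrong` positions exposes the pattern noted above: generated endings tend to be longer than the originals.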

C Additional Results
Table 2 (continued): instruction templates and example ChatGPT responses.

XCOPA instruction (continued): "Here are n examples in {language}: Example 1: Premise: The man wanted to save money. What happened as a result? Correct choice: He cut back on making frivolous purchases. Wrong choice: He withdrew money from his savings account. ... Example n: ... Based on the examples above, generate m new examples in {language}."

XWinograd instruction: "We are collecting more examples for the Winograd Schema Challenge. Each example has a short sentence that contains two noun phrases and one pronoun replaced by "_", and the challenge is to determine the referent of the pronoun, which can only be inferred from the context. Here are n examples of the data: Example 1: Sentence: Harley hides from Dyna because _ is scary. Who/What is scary? Correct answer: Dyna. Wrong answer: Harley. ... Example n: ... Based on the examples above, generate m new examples. Both noun phrases in each example can be males, females, inanimate objects, or groups of people or objects. There should only be one "_" in the sentence. The correct and wrong answer should be one of the noun phrases mentioned in the sentence."

XStoryCloze instruction: "We are collecting more examples for a story cloze dataset. Each example consists of a 4-sentence story, one correct ending sentence which is a plausible continuation of the story, and one wrong ending sentence which is logically inconsistent with the context. Here are n examples of the data: Example 1: Sent-1: Tina is very tired every single morning. Sent-2: She does not get enough sleep because of her two jobs. Sent-3: Tina decides to quit one of the jobs. Sent-4: She now gets enough sleep to function everyday. Correct ending: Tina is well rested. Wrong ending: Tina is more tired than ever before. ... Example n: ... Based on the examples above, provide m new similar examples. Requirements: 1) the story should read like a coherent story, with a specific beginning and ending, where something happens in between, 2) both ending sentences should be entirely reasonable, realistic and sensible when read in isolation, and 3) both ending sentences should follow up the story by sharing at least one of the characters of the story."

Example XCOPA responses: "Premise: The politician made a controversial statement. What happened as a result? Correct choice: The politician faced criticism from the media. Wrong choice: The politician's approval ratings increased." and, in ZH: "Premise: 我裤子口袋里的钥匙不见了。 (The keys in my trouser pocket are gone.) What was the cause? Correct choice: 这个口袋上有一个洞。 (There is a hole in this pocket.) Wrong choice: 裤子是新的。 (The trousers are new.)"

Example XWinograd responses: "Sentence: Sam gave Andrew the book because _ had already read it. Who/What had already read the book? Correct answer: Sam. Wrong answer: Andrew." and "Sentence: The dog chased the cat, but _ was too fast. Who/What was too fast? Correct answer: the cat. Wrong answer: The dog."

Example XStoryCloze response: "Sent-1: Jordan was a high school student who wanted to become a doctor. Sent-2: He spent all his free time studying biology and chemistry. Sent-3: One day, his school hosted a science fair competition. Sent-4: Jordan's project won first place. Correct ending: Jordan went on to study medicine in college. Wrong ending: Jordan gave up his dream of becoming a doctor."

Figure 2: Comparison between the 30 most frequent events and the lengths of the sentences in the original and the ChatGPT-generated English StoryCloze data.

Table 1: Number of examples available in XCOPA, XWinograd, and XStoryCloze. XX denotes the average number of non-English examples per language. Since a validation split is not specified in XStoryCloze, we take 60 random examples from the train split for validation.

Table 4: Comparison of average accuracy across all languages for mBERT, XLMR-Base, and XLMR-Large on XCOPA, XWinograd, and XStoryCloze. Training datasets include ORI (original EN data), GEN (LLM-generated EN data), and O+G (both); subscripts indicate the number of training examples (XCOPA: ORI 400, GEN 3.7K, O+G 4.1K; XWinograd: ORI 1.8K, GEN 2K, O+G 3.8K; XStoryCloze: ORI 300, GEN 1.7K, O+G 2K). The best results obtained with the same amount of training data are highlighted in bold. Green and red subscripts denote improvement and decline in performance compared to the baseline (ORI). See per-language results in Appendix C.

Table 6: Accuracy on XCOPA when scaling up the ChatGPT-generated data to over 28K examples. We report average results on all XCOPA languages excl. QU, which is not supported by the Google Translate API.
Figure 1: Human evaluation of 50 random examples from the original XCOPA, from ChatGPT-generated (top) and GPT-4-generated (bottom) data in the target languages, and from translations of the English-generated data. Examples are annotated by two native speakers per language. The subplots in the last column show the logic issues of the XCOPA data, where the three bars for each language represent Original, Gen XX, and Gen Trans.

Table 7, Table 8, and Table 9 show data generated in English by the different LLMs on XWinograd and XStoryCloze. Table 10 and Table 11 show the full results on XCOPA with ChatGPT and GPT-4.