Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Zhuoyan Li; Hangxiao Zhu; Zhuoran Lu; Ming Yin (尹明)

doi:10.18653/v1/2023.emnlp-main.647

Abstract

The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand the factors that moderate the effectiveness of LLM-generated synthetic data, in this study we look into how the performance of models trained on such synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and the instance level, is negatively associated with the performance of the model trained on synthetic data. We conclude by discussing the implications of our work for the potential and limitations of leveraging LLMs for synthetic data generation.

Anthology ID: 2023.emnlp-main.647
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10443–10461
URL: https://aclanthology.org/2023.emnlp-main.647/
DOI: 10.18653/v1/2023.emnlp-main.647
Cite (ACL): Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore. Association for Computational Linguistics.
Cite (Informal): Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations (Li et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.647.pdf
Video: https://aclanthology.org/2023.emnlp-main.647.mp4
