Making Large Language Models Better Data Creators

Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen White, Sujay Jauhar


Abstract
Although large language models (LLMs) have significantly advanced the state of the art in NLP, deploying them for downstream applications remains challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. To reduce this human effort, several techniques use LLMs to label or generate data. Although effective for certain applications, these methods encounter difficulties in real-world scenarios: labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single formatting example and is applicable to a broad range of tasks, including traditionally problematic ones with semantically devoid label spaces. In our experiments, we demonstrate that instruction-following LLMs are highly cost-effective data creators, and that models trained on such data outperform models trained on human-labeled data by up to 17.5% on out-of-distribution evaluation, while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real world.
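To make the setup concrete, below is a minimal sketch of single-formatting-example data creation, assuming an OpenAI-style chat-completions client. The prompt wording, JSON schema, model name, and example task are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of single-formatting-example data creation (illustrative only;
# the prompt wording and JSON schema are assumptions, not the paper's exact format).
import json

from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The single formatting example: one labeled instance shown to the model so it
# can infer the task format without task-specific prompt engineering.
FORMAT_EXAMPLE = {
    "question": "The cafe closes at 5pm. Can I still get coffee there at 6pm?",
    "answer": "no",
}

def create_examples(n: int, model: str = "gpt-3.5-turbo") -> list[dict]:
    """Ask an instruction-following LLM for n new examples in the same
    JSON format as the single formatting example."""
    prompt = (
        "Here is an example of a labeled data instance:\n"
        + json.dumps(FORMAT_EXAMPLE)
        + f"\n\nGenerate {n} more diverse instances of the same task, "
        "one JSON object per line, using exactly the same keys."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature encourages more varied instances
    )
    examples = []
    for line in resp.choices[0].message.content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the whole batch
        if obj.keys() == FORMAT_EXAMPLE.keys():  # keep only well-formed instances
            examples.append(obj)
    return examples

if __name__ == "__main__":
    for ex in create_examples(5):
        print(ex)
```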
Anthology ID: 2023.emnlp-main.948
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15349–15360
URL: https://aclanthology.org/2023.emnlp-main.948
DOI: 10.18653/v1/2023.emnlp-main.948
Cite (ACL): Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen White, and Sujay Jauhar. 2023. Making Large Language Models Better Data Creators. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15349–15360, Singapore. Association for Computational Linguistics.
Cite (Informal): Making Large Language Models Better Data Creators (Lee et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.948.pdf