Better Synthetic Data by Retrieving and Transforming Existing Datasets

Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig


Abstract
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, _DataTune_, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs _dataset transformation_, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We release a Python package and open-source repository to make this method accessible to the community (URL will be added upon acceptance).
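The core idea described in the abstract — retrieve a relevant public dataset, then transform its examples into the target task's format — can be sketched in miniature. Note this is an illustrative toy, not the paper's implementation: the function names are hypothetical, the keyword-overlap "retrieval" stands in for a real retriever, and the field-remapping stub stands in for the LLM-driven transformation step DataTune actually performs.

```python
# Toy sketch of the retrieve-then-transform idea behind DataTune.
# All names here (retrieve_datasets, transform_example, datatune_sketch)
# are illustrative stand-ins, not the paper's actual API.

def retrieve_datasets(task_description, catalog):
    """Rank candidate datasets by naive keyword overlap with the task
    description (a real system would use a learned retriever)."""
    task_words = set(task_description.lower().split())

    def score(entry):
        return len(task_words & set(entry["description"].lower().split()))

    return sorted(catalog, key=score, reverse=True)

def transform_example(example, target_format):
    """Stub for the transformation step: remap each example's fields
    into the target schema (DataTune uses an LLM for this step)."""
    return {
        target_format["input_key"]: example["text"],
        target_format["output_key"]: example["label"],
    }

def datatune_sketch(task_description, catalog, target_format):
    """Retrieve the best-matching dataset, then transform its examples."""
    best = retrieve_datasets(task_description, catalog)[0]
    return [transform_example(ex, target_format) for ex in best["examples"]]

# Usage with a two-entry toy catalog:
catalog = [
    {"description": "movie review sentiment classification",
     "examples": [{"text": "Great film!", "label": "positive"}]},
    {"description": "news topic labeling",
     "examples": [{"text": "Stocks rose today.", "label": "business"}]},
]
data = datatune_sketch(
    "classify sentiment of reviews",
    catalog,
    {"input_key": "input", "output_key": "output"},
)
# → [{"input": "Great film!", "output": "positive"}]
```

The two-stage split mirrors the abstract's framing: retrieval narrows the search over existing public datasets, and transformation aligns the retrieved examples with the target task's schema.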
Anthology ID:
2024.findings-acl.385
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6453–6466
URL:
https://aclanthology.org/2024.findings-acl.385
DOI:
10.18653/v1/2024.findings-acl.385
Cite (ACL):
Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, and Graham Neubig. 2024. Better Synthetic Data by Retrieving and Transforming Existing Datasets. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6453–6466, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Better Synthetic Data by Retrieving and Transforming Existing Datasets (Gandhi et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.385.pdf