Christopher Thierauf
2024
Automating Dataset Production Using Generative Text and Image Models
Christopher Thierauf | Mitchell Abrams | Matthias Scheutz
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Practical and ethical dataset collection remains a challenge for many empirical methods in natural language processing, leaving researchers without benchmarks or data on which to test hypotheses. We address part of this problem by presenting a pipeline that reduces the research burden of producing image and text datasets where none exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text-to-image transformers; and (3) low-cost human validation. Drawing on existing literature that has struggled with quantitative evaluation (due to the difficulty of data collection), we present the creation of three relevant datasets and conduct a user study demonstrating that this approach can aid researchers in obtaining previously challenging datasets. We provide sample data generated with this technique and the source code used to produce it, and we discuss applicability and limitations.
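The three stages named in the abstract can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not the authors' released tooling: `Sample`, `build_dataset`, and the stand-in functions `fake_llm`, `fake_t2i`, and `fake_reviewer` are hypothetical names introduced only for illustration. In practice they would be replaced by calls to a real LLM, a real text-to-image model, and a human review step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    prompt: str
    text: str
    image: bytes
    approved: Optional[bool] = None  # None until a reviewer has judged it


def build_dataset(
    prompts: Iterable[str],
    generate_text: Callable[[str], str],
    generate_image: Callable[[str], bytes],
    validate: Callable[[Sample], bool],
) -> List[Sample]:
    """Run the three stages in order and keep only approved samples."""
    kept: List[Sample] = []
    for prompt in prompts:
        text = generate_text(prompt)        # (1) text generation
        image = generate_image(text)        # (2) image vignette for that text
        sample = Sample(prompt, text, image)
        sample.approved = validate(sample)  # (3) human validation
        if sample.approved:
            kept.append(sample)
    return kept


# Hypothetical stand-ins so the sketch runs without any model or annotator.
def fake_llm(prompt: str) -> str:
    return f"A short vignette about {prompt}."


def fake_t2i(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder bytes, not a real image


def fake_reviewer(sample: Sample) -> bool:
    return "robot" in sample.text  # toy acceptance rule


dataset = build_dataset(
    ["a robot handing over a mug", "a cat on a shelf"],
    fake_llm,
    fake_t2i,
    fake_reviewer,
)
```

Passing the three stages in as callables keeps the loop independent of any particular model or annotation platform, which is one way the validation step could stay low-cost: reviewers only accept or reject already generated pairs.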