Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

Max Schaffelder; Albert Gatt

Abstract

As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, we observe a tendency for higher output quality in the latter case, thus making outputs potentially more usable and dangerous. Finally, we also find evidence that fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data. All code is available at https://github.com/maxschaffelder/synthetic_data_diversity.

Anthology ID: 2026.findings-acl.360
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7265–7293
URL: https://aclanthology.org/2026.findings-acl.360/
Cite (ACL): Max Schaffelder and Albert Gatt. 2026. Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7265–7293, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning (Schaffelder & Gatt, Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.360.pdf
Checklist: 2026.findings-acl.360.checklist.pdf
