Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?

Grgur Kovač; Jérémy Perez; Rémy Portelas; Peter Ford Dominey; Pierre-Yves Oudeyer

Abstract

Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops have been shown to lead to distribution shifts, in which models misrepresent the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.

Anthology ID: 2025.emnlp-main.1643
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 32278–32297
URL: https://aclanthology.org/2025.emnlp-main.1643/
Cite (ACL): Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, and Pierre-Yves Oudeyer. 2025. Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32278–32297, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data? (Kovač et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1643.pdf
Checklist: 2025.emnlp-main.1643.checklist.pdf
