The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel


Abstract
This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of model outputs across successive iterations, a decline that is most pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
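The paper's own metric suite is not reproduced on this page. As a rough illustration of what a lexical-diversity measure can look like, the sketch below computes distinct-n (the ratio of unique n-grams to total n-grams) over a set of model outputs. The function name, the toy data, and the choice of distinct-n are assumptions for illustration only, not the metrics adapted or developed in the paper.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a corpus of outputs.

    A common lexical-diversity proxy; lower values indicate more repetitive
    text. Illustrative stand-in, not the paper's metric suite.
    """
    ngrams = []
    for text in texts:
        tokens = text.split()  # naive whitespace tokenization for illustration
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Toy comparison of outputs from an early vs. a later finetuning generation.
gen0_outputs = ["the cat sat on the mat", "a dog barked at the mailman"]
gen3_outputs = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_n(gen0_outputs, n=2))  # higher ratio: more lexically diverse
print(distinct_n(gen3_outputs, n=2))  # lower ratio: diversity has collapsed
```

Tracking such a ratio across successive generations of models finetuned on their predecessors' outputs is one simple way to visualize the kind of diversity decline the abstract describes; the paper additionally considers syntactic and semantic diversity, which this sketch does not cover.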
Anthology ID: 2024.findings-naacl.228
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3589–3604
URL: https://aclanthology.org/2024.findings-naacl.228
Cite (ACL): Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. 2024. The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3589–3604, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text (Guo et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.228.pdf
Copyright: 2024.findings-naacl.228.copyright.pdf