Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting

Gauri Kambhatla, Chantal Shaib, Venkata S Govindarajan


Abstract
Fine-grained personas have recently been used for generating ‘diverse’ synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying a length cutoff in the prompt.
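The abstract does not name the individual metrics in its suite. As a rough illustration only, the sketch below computes two measures commonly used for this purpose: type-token ratio (lexical diversity) and a compression ratio (redundancy). The function names, the whitespace tokenization, and the choice of zlib are assumptions made for this example, not the authors' implementation.

```python
import zlib


def type_token_ratio(texts):
    """Unique tokens divided by total tokens over a collection of texts.

    Assumes simple lowercased whitespace tokenization (an illustrative choice).
    Higher values indicate more lexical variety.
    """
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def compression_ratio(texts):
    """Raw byte size divided by zlib-compressed size of the joined texts.

    Higher values indicate more repeated material, i.e. more redundancy.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical toy data: one repeated response vs. three distinct ones.
repetitive = ["the cat sat on the mat"] * 20
varied = [
    "the cat sat on the mat",
    "a dog ran through tall grass",
    "seven birds circled over quiet hills",
]

print(type_token_ratio(repetitive), type_token_ratio(varied))
print(compression_ratio(repetitive), compression_ratio(varied))
```

Type-token ratio is known to fall as the amount of text grows, so comparisons of this kind are usually made on samples of similar length.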
Anthology ID:
2025.findings-emnlp.1146
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
21024–21033
URL:
https://aclanthology.org/2025.findings-emnlp.1146/
Cite (ACL):
Gauri Kambhatla, Chantal Shaib, and Venkata S Govindarajan. 2025. Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21024–21033, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting (Kambhatla et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1146.pdf
Checklist:
 2025.findings-emnlp.1146.checklist.pdf