Gauri Kambhatla
2025
Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting
Gauri Kambhatla
|
Chantal Shaib
|
Venkata S Govindarajan
Findings of the Association for Computational Linguistics: EMNLP 2025
Fine-grained personas have recently been used for generating ‘diverse’ synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contribute to generated text diversity. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying a length cutoff in the prompt.
2024
Promoting Constructive Deliberation: Reframing for Receptiveness
Gauri Kambhatla
|
Matthew Lease
|
Ashwin Rajadesingan
Findings of the Association for Computational Linguistics: EMNLP 2024
To promote constructive discussion of controversial topics online, we propose automatic reframing of disagreeing responses to signal receptiveness to a preceding comment. Drawing on research from psychology, communications, and linguistics, we identify six strategies for reframing. We automatically reframe replies to comments according to each strategy, using a Reddit dataset. Through human-centered experiments, we find that the replies generated with our framework are perceived to be significantly more receptive than the original replies and a generic receptiveness baseline. We illustrate how transforming receptiveness, a particular social science construct, into a computational framework, can make LLM generations more aligned with human perceptions. We analyze and discuss the implications of our results, and highlight how a tool based on our framework might be used for more teachable and creative content moderation.
2023
Quantifying Train-Evaluation Overlap with Nearest Neighbors
Gauri Kambhatla
|
Thuy Nguyen
|
Eunsol Choi
Findings of the Association for Computational Linguistics: ACL 2023
Characterizing benchmark datasets is crucial to interpreting model performance. In this work, we study train-evaluation overlap as a measure of an individual dataset’s adequacy to evaluate model generalization over a wide range of datasets. We quantify the overlap with a simple novel metric based on a nearest neighbors approach between the training and evaluation sets. We identify nearest training examples for each evaluation example by mapping instances with generic and task-specific embedding methods. Our study on eleven classification and extractive QA tasks reveals a wide range of train-evaluation overlap, and we show that the data collection method of the dataset and the difficulty of the task may play a role in the amount of overlap. Lastly, we use our nearest neighbor analysis to identify challenging or potentially mislabeled examples. Our analysis quantifies train-evaluation overlap, providing insights for constructing datasets to study generalization.
Search
Fix author
Co-authors
- Eunsol Choi 1
- Venkata S Govindarajan 1
- Matthew Lease 1
- Thuy Nguyen 1
- Ashwin Rajadesingan 1
- show all...