Who Speaks for Whom? LLM-Generated Survey Data as a Proxy for Public Opinion

Radhakrishnan Venkatakrishnan; Travis Brodbeck; Michael D. Young

Abstract

Technological advancements such as Large Language Models (LLMs) offer a potential solution to a two-faceted problem facing social science researchers: rising costs and declining response rates. The use of artificial personas is a budding practice in which chatbots are given the demographic characteristics of a person, role-play as that person, and answer questions for researchers. Before scholars and practitioners augment or replace data collected by interviewing humans, it is essential to understand how well models generate accurate, reliable, and robust data, particularly given concerns that LLM training biases models towards the norms of WEIRD cultures. We present a procedure that practitioners can use to evaluate the quality of their synthetic data by measuring Intraclass Correlation (ICC), Earth Mover's Distance (EMD), variance, hedging, and the demographic drivers of LLM output. We find that the models may generate plausible results in the aggregate, but these synthetic data do not exhibit the depth or nuance of human respondents. Secondarily, we find that although the LLM generated definitive answers on a ten-point scale, the reasoning it provided exhibited varying degrees of hedging that did not consistently align with its answers. The distortion of the results was not uniformly distributed; instead, the effects were more extreme for some demographic groups. Our findings suggest that the technology for generating synthetic survey data may not be mature enough to address the increasing challenges of interviewing humans for public opinion research.
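As an illustration of one metric named in the abstract, the following is a minimal sketch of Earth Mover's Distance between a human and a synthetic answer distribution on a ten-point scale. It assumes the standard one-dimensional formulation (the sum of absolute differences between cumulative distributions); the abstract does not specify the authors' exact computation, and the function names and toy responses here are hypothetical.

```python
from collections import Counter


def distribution(responses, scale=range(1, 11)):
    """Return the share of responses at each point of an ordered scale."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return [counts.get(point, 0) / total for point in scale]


def earth_movers_distance(p, q):
    """1-D EMD between two distributions over the same ordered scale.

    Computed as the sum of absolute differences between the running
    (cumulative) totals of p and q, with unit spacing between scale points.
    """
    emd, cum_p, cum_q = 0.0, 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cum_p += p_i
        cum_q += q_i
        emd += abs(cum_p - cum_q)
    return emd


# Hypothetical toy responses, not data from the paper.
human = [1, 3, 5, 5, 6, 8, 10, 10]
synthetic = [5, 5, 5, 6, 6, 6, 6, 7]

print(earth_movers_distance(distribution(human), distribution(synthetic)))
```

In this sketch a value of 0 means the two distributions are identical, and larger values mean more probability mass would have to be moved further along the scale to turn one distribution into the other.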

Anthology ID: 2026.nlpcss-1.9
Volume: Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
Month: July
Year: 2026
Address: San Diego
Editors: Dallas Card, Anjalie Field, Katherine Keith, Julia Mendelsohn
Venues: NLP+CSS | WS
Publisher: Association for Computational Linguistics
Pages: 133–148
URL: https://aclanthology.org/2026.nlpcss-1.9/
Cite (ACL): Radhakrishnan Venkatakrishnan, Travis Brodbeck, and Michael D. Young. 2026. Who Speaks for Whom? LLM-Generated Survey Data as a Proxy for Public Opinion. In Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science, pages 133–148, San Diego. Association for Computational Linguistics.
Cite (Informal): Who Speaks for Whom? LLM-Generated Survey Data as a Proxy for Public Opinion (Venkatakrishnan et al., NLP+CSS 2026)
PDF: https://aclanthology.org/2026.nlpcss-1.9.pdf
