2024
pdf
bib
abs
Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas
Salvatore Giorgi
|
Tingting Liu
|
Ankit Aich
|
Kelsey Jane Isman
|
Garrick Sherman
|
Zachary Fried
|
João Sedoc
|
Lyle Ungar
|
Brenda Curtis
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one’s environment, attitudes, beliefs, and lived experiences. Thus, it may be the case that employing LLMs (which do not have such human factors) in these tasks results in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that explicit LLM personas show mixed results when reproducing known human biases, but generally fail to demonstrate implicit biases. We conclude that LLMs may capture the statistical patterns of how people speak, but are generally unable to model the complex interactions and subtleties of human perceptions, potentially limiting their effectiveness in social science applications.
2023
pdf
bib
abs
AWARE-TEXT: An Android Package for Mobile Phone Based Text Collection and On-Device Processing
Salvatore Giorgi
|
Garrick Sherman
|
Douglas Bellew
|
Sharath Chandra Guntuku
|
Lyle Ungar
|
Brenda Curtis
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
We present the AWARE-text package, an open-source software package for collecting textual data on Android mobile devices. This package allows for collecting short message service (SMS or text messages) and character-level keystrokes. In addition to collecting this raw data, AWARE-text is designed for on device lexicon processing, which allows one to collect standard textual-based measures (e.g., sentiment, emotions, and topics) without collecting the underlying raw textual data. This is especially important in the case of mobile phones, which can contain sensitive and identifying information. Thus, the AWARE-text package allows for privacy protection while simultaneously collecting textual information at multiple levels of granularity: person (lifetime history of SMS), conversation (both sides of SMS conversations and group chats), message (single SMS), and character (individual keystrokes entered across applications). Finally, the unique processing environment of mobile devices opens up several methodological and privacy issues, which we discuss.
2022
pdf
bib
abs
Measuring the Language of Self-Disclosure across Corpora
Ann-Katrin Reuel
|
Sebastian Peralta
|
João Sedoc
|
Garrick Sherman
|
Lyle Ungar
Findings of the Association for Computational Linguistics: ACL 2022
Being able to reliably estimate self-disclosure – a key component of friendship and intimacy – from language is important for many psychology studies. We build single-task models on five self-disclosure corpora, but find that these models generalize poorly; the within-domain accuracy of predicted message-level self-disclosure of the best-performing model (mean Pearson’s r=0.69) is much higher than the respective across data set accuracy (mean Pearson’s r=0.32), due to both variations in the corpora (e.g., medical vs. general topics) and labeling instructions (target variables: self-disclosure, emotional disclosure, intimacy). However, some lexical features, such as expression of negative emotions and use of first person personal pronouns such as ‘I’ reliably predict self-disclosure across corpora. We develop a multi-task model that yields better results, with an average Pearson’s r of 0.37 for out-of-corpora prediction.
2020
pdf
bib
abs
Explaining the Trump Gap in Social Distancing Using COVID Discourse
Austin Van Loon
|
Sheridan Stewart
|
Brandon Waldon
|
Shrinidhi K Lakshmikanth
|
Ishan Shah
|
Sharath Chandra Guntuku
|
Garrick Sherman
|
James Zou
|
Johannes Eichstaedt
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
Our ability to limit the future spread of COVID-19 will in part depend on our understanding of the psychological and sociological processes that lead people to follow or reject coronavirus health behaviors. We argue that the virus has taken on heterogeneous meanings in communities across the United States and that these disparate meanings shaped communities’ response to the virus during the early, vital stages of the outbreak in the U.S. Using word embeddings, we demonstrate that counties where residents socially distanced less on average (as measured by residential mobility) more semantically associated the virus in their COVID discourse with concepts of fraud, the political left, and more benign illnesses like the flu. We also show that the different meanings the virus took on in different communities explains a substantial fraction of what we call the “”Trump Gap”, or the empirical tendency for more Trump-supporting counties to socially distance less. This work demonstrates that community-level processes of meaning-making in part determined behavioral responses to the COVID-19 pandemic and that these processes can be measured unobtrusively using Twitter.