Lyle Ungar - ACL Anthology

Lyle Ungar

Also published as: Lyle H. Ungar

2026

SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Nikita Soni | H. Andrew Schwartz | Ryan L. Boyd | Phi Long Bui | Syeda Mahwish | August Håkan Nilsson | Adithya V Ganesan | Lyle Ungar | Niranjan Balasubramanian | Saif M. Mohammad
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present our shared task on predicting variation in emotional valence and arousal over time from ecological essays. The shared task uses a longitudinal dataset collected over 7 data collection phases of 14-day each spanning from 2021 to 2024, consisting of real-time essays and feeling words (e.g., happy, calm, sad, etc.) written in English by U.S. service-industry workers about “how they are feeling”. Each text is associated with self-reported valence (V) (0 - 4, highly negative to highly positive affect) and arousal (A) (0 - 2, low to high energy) scores. The shared task consists of three parts, Subtask (1): Longitudinal Affect Assessment, Subtask (2): Forecasting Variation in Affect as a (2a): \textit{state change}, and (2b): \textit{disposition change}.The task attracted over 200 member registrations on Codabench, receiving official system submissions from 31 teams (total 104 team members), of which 28 teams (with 90 team members) submitted system description papers making it to our leaderboard. We discuss baseline results along with findings from 28 systems, highlighting the best-performing systems, a deeper analysis of performance on essays versus feeling words, and assessments for authors seen versus unseen during training. The datasets for this task are publicly available.

Culture by Design: A Sociotechnical Framework for Culturally Grounded AI for Mental Health
Sunny Rai | Elizabeth Stade | Graise Zhou | Neil Sehgal | Simone Schriger | Sara Gerke | Lyle Ungar | Sharath Chandra Guntuku
Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

AI systems for mental health are developed predominantly using data from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, raising concerns about their validity, fairness, and generalizability across geo-cultural contexts. This limitation is especially consequential in mental health, where linguistic expression, symptom presentation, help-seeking behavior, and access to care vary substantially across populations. We argue that culturally responsive AI mental health systems require explicit attention to culture throughout the development lifecycle, from data collection to training and deployment. We present a sociotechnical framework for developing culturally responsive AI mental health applications to provide AI researchers and practitioners with an actionable roadmap for building more equitable, reliable, and contextually appropriate mental health technologies.

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG
Ilias Triantafyllopoulos | Renyi Qu | Salvatore Giorgi | Brenda Curtis | Lyle Ungar | João Sedoc
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-Augmented Generation (RAG) systems are increasingly deployed in high-stakes domains, where safety depends not only on how a system answers, but also on whether a query should be answered given a knowledge base (KB). Out-of-domain (OOD) queries can cause dense retrieval to surface weakly related context and lead the generator to produce fluent but unjustified responses. We study lightweight, KB-aligned OOD detection as an always-on gate for RAG systems. Our approach applies PCA to KB embeddings and scores queries in a compact subspace selected either by explained-variance retention (EVR) or by a separability-driven -test ranking. We evaluate geometric semantic-search rules and lightweight classifiers across 16 domains, including high-stakes COVID-19 and Substance Use KBs, and stress-test robustness using both LLM-generated attacks and an in-the-wild 4chan attack. We find that low-dimensional detectors achieve competitive OOD performance while being faster, cheaper, and more interpretable than prompted LLM-based judges. Finally, human and LLM-based evaluations show that OOD queries primarily degrade the relevance of RAG outputs, highlighting the need for efficient external OOD detection to maintain safe, in-scope behavior.

2025

Social Norms in Cinema: A Cross-Cultural Analysis of Shame, Pride and Prejudice
Sunny Rai | Khushang Zaveri | Shreya Havaldar | Soumna Nema | Lyle Ungar | Sharath Chandra Guntuku
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Shame and pride are social emotions expressed across cultures to motivate and regulate people’s thoughts, feelings, and behaviors. In this paper, we introduce the first cross-cultural dataset of over 10k shame/pride-related expressions with underlying social expectations from ~5.4K Bollywood and Hollywood movies. We examine how and why shame and pride are expressed across cultures using a blend of psychology-informed language analysis combined with large language models. We find significant cross-cultural differences in shame and pride expression aligning with known cultural tendencies of the USA and India – e.g., in Hollywood, shame-expressions predominantly discuss self whereas shame is expressed toward others in Bollywood. Women are more sanctioned across cultures and for violating similar social expectations.

Culturally-Aware Conversations: A Framework & Benchmark for LLMs
Shreya Havaldar | Young Min Cho | Sunny Rai | Lyle Ungar
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)

Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style — a key element of cultural communication — is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today’s top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.

Language-based Valence and Arousal Expressions between the United States and China: a Cross-Cultural Examination
Young Min Cho | Dandan Pang | Stuti Thapa | Garrick Sherman | Lyle Ungar | Louis Tay | Sharath Chandra Guntuku
Findings of the Association for Computational Linguistics: NAACL 2025

While affective expressions on social media have been extensively studied, most research has focused on the Western context. This paper explores cultural differences in affective expressions by comparing valence and arousal on Twitter/X (geolocated to the US) and Sina Weibo (in Mainland China). Using the NRC-VAD lexicon to measure valence and arousal, we identify distinct patterns of emotional expression across both platforms. Our analysis reveals a functional representation between valence and arousal, showing a negative offset in contrast to traditional lab-based findings which suggest a positive offset. Furthermore, we uncover significant cross-cultural differences in arousal, with US users displaying higher emotional intensity than Chinese users, regardless of the valence of the content. Finally, we conduct a comprehensive language analysis correlating n-grams and LDA topics with affective dimensions to deepen our understanding of how language and culture shape emotional expression. These findings contribute to a more nuanced understanding of affective communication across cultural and linguistic contexts on social media.

ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
Bowen Jiang | Yuan Yuan | Xinyi Bai | Zhuoqun Hao | Alyson Yin | Yaojie Hu | Wenyu Liao | Lyle Ungar | Camillo Jose Taylor
Findings of the Association for Computational Linguistics: EMNLP 2025

This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering.

Systematic Evaluation of Auto-Encoding and Large Language Model Representations for Capturing Author States and Traits
Khushboo Singh | Vasudha Varadarajan | Adithya V. Ganesan | August Håkan Nilsson | Nikita Soni | Syeda Mahwish | Pranav Chitale | Ryan L. Boyd | Lyle Ungar | Richard N. Rosenthal | H. Andrew Schwartz
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) are increasingly used in human-centered applications, yet their ability to model diverse psychological constructs is not well understood. In this study, we systematically evaluate a range of Transformer-LMs to predict psychological variables across five major dimensions: affect, substance use, mental health, sociodemographics, and personality. Analyses span three temporal levels—short daily text responses about current affect, text aggregated over two-weeks, and user-level text collected over two years—allowing us to examine how each model’s strengths align with the underlying stability of different constructs. The findings show that mental health signals emerge as the most accurately predicted dimensions (r=0.6) across all temporal scales. At the daily scale, smaller models like DeBERTa and HaRT often performed better, whereas, at longer scales or with greater context, larger model like Llama3-8B performed the best. Also, aggregating text over the entire study period yielded stronger correlations for outcomes, such as age and income. Overall, these results suggest the importance of selecting appropriate model architectures and temporal aggregation techniques based on the stability and nature of the target variable.

The Impact of Language Mixing on Bilingual LLM Reasoning
Yihao Li | Jiayi Xin | Miranda Muqing Miao | Qi Long | Lyle Ungar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing—alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We show that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on MATH500. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by 2.92 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.

Cross-Cultural Differences in Mental Health Expressions on Social Media
Sunny Rai | Khushi Shelat | Devansh Jain | Ashwin Kishen | Young Min Cho | Maitreyi Redkar | Samindara Hardikar-Sawant | Lyle Ungar | Sharath Chandra Guntuku
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Culture moderates the way individuals perceive and express mental distress. Current understandings of mental health expressions on social media, however, are predominantly derived from WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. To address this gap, we examine mental health posts on Reddit made by individuals geolocated in India, to identify variations in social media language specific to the Indian context compared to users from Western nations. Our experiments reveal significant psychosocial variations in emotions and temporal orientation. This study demonstrates the potential of social media platforms for identifying cross-cultural differences in mental health expressions (e.g. seeking advice in India vs seeking support by Western users). Significant linguistic variations in online mental health-related language emphasize the importance of developing precision-targeted interventions that are culturally appropriate.

Towards Style Alignment in Cross-Cultural Translation
Shreya Havaldar | Adam Stein | Eric Wong | Lyle Ungar
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style – biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.

2024

Building Knowledge-Guided Lexica to Model Cultural Variation
Shreya Havaldar | Salvatore Giorgi | Sunny Rai | Thomas Talhelm | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs’ failure to measure cultural variation or generate culturally varied language.

Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas
Salvatore Giorgi | Tingting Liu | Ankit Aich | Kelsey Jane Isman | Garrick Sherman | Zachary Fried | João Sedoc | Lyle Ungar | Brenda Curtis
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one’s environment, attitudes, beliefs, and lived experiences. Thus, it may be the case that employing LLMs (which do not have such human factors) in these tasks results in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that explicit LLM personas show mixed results when reproducing known human biases, but generally fail to demonstrate implicit biases. We conclude that LLMs may capture the statistical patterns of how people speak, but are generally unable to model the complex interactions and subtleties of human perceptions, potentially limiting their effectiveness in social science applications.

Using Daily Language to Understand Drinking: Multi-Level Longitudinal Differential Language Analysis
Matthew Matero | Huy Vu | August Nilsson | Syeda Mahwish | Young Min Cho | James McKay | Johannes Eichstaedt | Richard Rosenthal | Lyle Ungar | H. Andrew Schwartz
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Analyses for linking language with psychological factors or behaviors predominately treat linguistic features as a static set, working with a single document per person or aggregating across multiple posts (e.g. on social media) into a single set of features. This limits language to mostly shed light on between-person differences rather than changes in behavior within-person. Here, we collected a novel dataset of daily surveys where participants were asked to describe their experienced well-being and report the number of alcoholic beverages they had within the past 24 hours. Through this data, we first build a multi-level forecasting model that is able to capture within-person change and leverage both the psychological features of the person and daily well-being responses. Then, we propose a longitudinal version of differential language analysis that finds patterns associated with drinking more (e.g. social events) and less (e.g. task-oriented), as well as distinguishing patterns of heavy drinks versus light drinkers.

2023

Multilingual Language Models are not Multicultural: A Case Study in Emotion
Shreya Havaldar | Sunny Rai | Bhumika Singhal | Langchen Liu | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Emotions are experienced and expressed differently across the world. In order to use Large Language Models (LMs) for multilingual tasks that require emotional sensitivity, LMs must reflect this cultural variation in emotion. In this study, we investigate whether the widely-used multilingual LMs in 2023 reflect differences in emotional expressions across cultures and languages. We find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs do not successfully learn the culturally appropriate nuances of emotion and we highlight possible research directions towards correcting this.

AWARE-TEXT: An Android Package for Mobile Phone Based Text Collection and On-Device Processing
Salvatore Giorgi | Garrick Sherman | Douglas Bellew | Sharath Chandra Guntuku | Lyle Ungar | Brenda Curtis
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present the AWARE-text package, an open-source software package for collecting textual data on Android mobile devices. This package allows for collecting short message service (SMS or text messages) and character-level keystrokes. In addition to collecting this raw data, AWARE-text is designed for on device lexicon processing, which allows one to collect standard textual-based measures (e.g., sentiment, emotions, and topics) without collecting the underlying raw textual data. This is especially important in the case of mobile phones, which can contain sensitive and identifying information. Thus, the AWARE-text package allows for privacy protection while simultaneously collecting textual information at multiple levels of granularity: person (lifetime history of SMS), conversation (both sides of SMS conversations and group chats), message (single SMS), and character (individual keystrokes entered across applications). Finally, the unique processing environment of mobile devices opens up several methodological and privacy issues, which we discuss.

Automatic Reflection Generation for Peer-to-Peer Counseling
Emma O’neil | João Sedoc | Diyi Yang | Haiyi Zhu | Lyle Ungar
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Online peer counseling platforms enable conversations between millions of people seeking and offering mental health support. Among counseling skills, reflective listening, i.e., capturing and returning to the client something the client has said, is important for positive therapeutic outcomes. We introduce a reflection generation system for online mental health support conversations leveraging GPT-3, a large language model. We compare few-shot learning against fine-tuning and assess the impact of the quality of training examples as measured by fluency, reflection resemblance, and overall preference. Fine-tuned GPT-3 generates responses that human evaluators rate as comparable in reflection quality to responses used for tuning. Models based on high-quality responses generate substantially better reflections than ones tuned on actual responses from a large online counseling service–and better reflections than the actual counselor responses. These results suggest the care needed in selecting examples for tuning generative models.

Conditioning on Dialog Acts improves Empathy Style Transfer
Renyi Qu | Lyle Ungar | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2023

We explore the role of dialog acts in style transfer, specifically empathy style transfer – rewriting a sentence to make it more empathetic without changing its meaning. Specifically, we use two novel few-shot prompting strategies: target prompting, which only uses examples of the target style (unlike traditional prompting with source/target pairs), and dialog-act-conditioned prompting, which first estimates the dialog act of the source sentence and then makes it more empathetic using few-shot examples of the same dialog act. Our study yields two key findings: (1) Target prompting typically improves empathy more effectively while maintaining the same level of semantic similarity; (2) Dialog acts matter. Dialog-act-conditioned prompting enhances empathy while preserving both semantics and the dialog-act type. Different dialog acts benefit differently from different prompting methods, highlighting the need for further investigation of the role of dialog acts in style transfer.

Interactive Concept Learning for Uncovering Latent Themes in Large Text Collections
Maria Leonor Pacheco | Tunazzina Islam | Lyle Ungar | Ming Yin | Dan Goldwasser
Findings of the Association for Computational Linguistics: ACL 2023

Experts across diverse disciplines are often interested in making sense of large text collections. Traditionally, this challenge is approached either by noisy unsupervised techniques such as topic models, or by following a manual theme discovery process. In this paper, we expand the definition of a theme to account for more than just a word distribution, and include generalized concepts deemed relevant by domain experts. Then, we propose an interactive framework that receives and encodes expert feedback at different levels of abstraction. Our framework strikes a balance between automation and manual coding, allowing experts to maintain control of their study while reducing the manual effort required.

An Integrative Survey on Mental Health Conversational Agents to Bridge Computer Science and Medical Perspectives
Young Min Cho | Sunny Rai | Lyle Ungar | João Sedoc | Sharath Guntuku
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Mental health conversational agents (a.k.a. chatbots) are widely studied for their potential to offer accessible support to those experiencing mental health challenges. Previous surveys on the topic primarily consider papers published in either computer science or medicine, leading to a divide in understanding and hindering the sharing of beneficial knowledge between both domains. To bridge this gap, we conduct a comprehensive literature review using the PRISMA framework, reviewing 534 papers published in both computer science and medicine. Our systematic review reveals 136 key papers on building mental health-related conversational agents with diverse characteristics of modeling and experimental design techniques. We find that computer science papers focus on LLM techniques and evaluating response quality using automated metrics with little attention to the application while medical papers use rule-based conversational agents and outcome metrics to measure the health outcomes of participants. Based on our findings on transparency, ethics, and cultural heterogeneity in this review, we provide a few recommendations to help bridge the disciplinary divide and enable the cross-disciplinary development of mental health conversational agents.

Conceptor-Aided Debiasing of Large Language Models
Li S. Yifei | Lyle Ungar | João Sedoc
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Pre-trained large language models (LLMs) reflect the inherent social biases of their training corpus. Many methods have been proposed to mitigate this issue, but they often fail to debias or they sacrifice model accuracy. We use conceptors–a soft projection method–to identify and remove the bias subspace in LLMs such as BERT and GPT. We propose two methods of applying conceptors (1) bias subspace projection by post-processing by the conceptor NOT operation; and (2) a new architecture, conceptor-intervened BERT (CI-BERT), which explicitly incorporates the conceptor projection into all layers during training. We find that conceptor post-processing achieves state-of-the-art (SoTA) debiasing results while maintaining LLMs’ performance on the GLUE benchmark. Further, it is robust in various scenarios and can mitigate intersectional bias efficiently by its AND operation on the existing bias subspaces. Although CI-BERT’s training takes all layers’ bias into account and can beat its post-processing counterpart in bias mitigation, CI-BERT reduces the language model accuracy. We also show the importance of carefully constructing the bias subspace. The best results are obtained by removing outliers from the list of biased words, combining them (via the OR operation), and computing their embeddings using the sentences from a cleaner corpus.

Comparing Styles across Languages
Shreya Havaldar | Matthew Pressimone | Eric Wong | Lyle Ungar
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.

StyLEx: Explaining Style Using Human Lexical Annotations
Shirley Anugrah Hayati | Kyumin Park | Dheeraj Rajagopal | Lyle Ungar | Dongyeop Kang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Large pre-trained language models have achieved impressive results on various style classification tasks, but they often learn spurious domain-specific words to make predictions (Hayati et al., 2021). While human explanation highlights stylistic tokens as important features for this task, we observe that model explanations often do not align with them. To tackle this issue, we introduce StyLEx, a model that learns from human annotated explanations of stylistic features and jointly learns to perform the task and predict these features as model explanations. Our experiments show that StyLEx can provide human like stylistic lexical explanations without sacrificing the performance of sentence-level style prediction on both in-domain and out-of-domain datasets. Explanations from StyLEx show significant improvements in explanation metrics (sufficiency, plausibility) and when evaluated with human annotations. They are also more understandable by human judges compared to the widely-used saliency-based explanation baseline.

2022

A Holistic Framework for Analyzing the COVID-19 Vaccine Debate
Maria Leonor Pacheco | Tunazzina Islam | Monal Mahajan | Andrey Shor | Ming Yin | Lyle Ungar | Dan Goldwasser
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The Covid-19 pandemic has led to infodemic of low quality information leading to poor health decisions. Combating the outcomes of this infodemic is not only a question of identifying false claims, but also reasoning about the decisions individuals make. In this work we propose a holistic analysis framework connecting stance and reason analysis, and fine-grained entity level moral sentiment analysis. We study how to model the dependencies between the different level of analysis and incorporate human insights into the learning process. Experiments show that our framework provides reliable predictions even in the low-supervision settings.

Inducing Generalizable and Interpretable Lexica
Yilin Geng | Zetian Wu | Roshan Santhosh | Tejas Srivastava | Lyle Ungar | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2022

Lexica – words and associated scores – are widely used as simple, interpretable, generalizable language features to predict sentiment, emotions, mental health, and personality. They also provide insight into the psychological features behind those moods and traits. Such lexica, historically created by human experts, are valuable to linguists, psychologists, and social scientists, but they take years of refinement and have limited coverage. In this paper, we investigate how the lexica that provide psycholinguistic insights could be computationally induced and how they should be assessed. We identify generalizability and interpretability as two essential properties of such lexica. We induce lexica using both context-oblivious and context-aware approaches, compare their predictive performance both within the training corpus and across various corpora, and evaluate their quality using crowd-worker assessment. We find that lexica induced from context-oblivious models are more generalizable and interpretable than those from more accurate context-aware transformer models. In addition, lexicon scores can identify explanatory words more reliably than a high performing transformer with feature-importance measures like SHAP.

Measuring the Language of Self-Disclosure across Corpora
Ann-Katrin Reuel | Sebastian Peralta | João Sedoc | Garrick Sherman | Lyle Ungar
Findings of the Association for Computational Linguistics: ACL 2022

Being able to reliably estimate self-disclosure – a key component of friendship and intimacy – from language is important for many psychology studies. We build single-task models on five self-disclosure corpora, but find that these models generalize poorly; the within-domain accuracy of predicted message-level self-disclosure of the best-performing model (mean Pearson’s r=0.69) is much higher than the respective across data set accuracy (mean Pearson’s r=0.32), due to both variations in the corpora (e.g., medical vs. general topics) and labeling instructions (target variables: self-disclosure, emotional disclosure, intimacy). However, some lexical features, such as expression of negative emotions and use of first person personal pronouns such as ‘I’ reliably predict self-disclosure across corpora. We develop a multi-task model that yields better results, with an average Pearson’s r of 0.37 for out-of-corpora prediction.

Interactively Uncovering Latent Arguments in Social Media Platforms: A Case Study on the Covid-19 Vaccine Debate
Maria Leonor Pacheco | Tunazzina Islam | Lyle Ungar | Ming Yin | Dan Goldwasser
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

Automated methods for analyzing public opinion have grown in popularity with the proliferation of social media. While supervised methods can be very good at classifying text, the dynamic nature of social media discourse results in a moving target for supervised learning. Meanwhile, traditional unsupervised techniques for extracting themes from textual repositories, such as topic models, can result in incorrect outputs that are unusable to domain experts. For this reason, a non-trivial amount of research on social media discourse still relies on manual coding techniques. In this paper, we present an interactive, humans-in-the-loop framework that strikes a balance between unsupervised techniques and manual coding for extracting latent arguments from social media discussions. We use the COVID-19 vaccination debate as a case study, and show that our methodology can be used to obtain a more accurate, interpretable set of arguments when compared to traditional topic models. We do this at a relatively low manual cost, as 3 experts take approximately 2 hours to code close to 100k tweets.

Nonsuicidal Self-Injury and Substance Use Disorders: A Shared Language of Addiction
Salvatore Giorgi | Mckenzie Himelein-wachowiak | Daniel Habib | Lyle Ungar | Brenda Curtis
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

Nonsuicidal self-injury (NSSI), or the deliberate injuring of one?s body without intending to die, has been shown to exhibit many similarities to substance use disorders (SUDs), including population-level characteristics, impulsivity traits, and comorbidity with other mental disorders. Research has further shown that people who self-injure adopt language common in SUD recovery communities (e.g., “clean”, “relapse”, “addiction,” and celebratory language about sobriety milestones). In this study, we investigate the shared language of NSSI and SUD by comparing discussions on public Reddit forums related to self-injury and drug addiction. To this end, we build a set of LDA topics across both NSSI and SUD Reddit users and show that shared language across the two domains includes SUD recovery language in addition to other themes common to support forums (e.g., requests for help and gratitude). Next, we examine Reddit-wide posting activity and note that users posting in r/selfharm also post in many mental health-related subreddits, while users of drug addiction related subreddits do not, despite high comorbidity between NSSI and SUDs. These results show that while people who self-injure may contextualize their disorder as an addiction, their posting habits demonstrate comorbidities with other mental disorders more so than their counterparts in recovery from SUDs. These observations have clinical implications for people who self-injure and seek support by sharing their experiences online.

2021

WikiTalkEdit: A Dataset for modeling Editors’ behaviors on Wikipedia
Kokil Jaidka | Andrea Ceolin | Iknoor Singh | Niyati Chhaya | Lyle Ungar
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This study introduces and analyzes WikiTalkEdit, a dataset of conversations and edit histories from Wikipedia, for research in online cooperation and conversation modeling. The dataset comprises dialog triplets from the Wikipedia Talk pages, and editing actions on the corresponding articles being discussed. We show how the data supports the classic understanding of style matching, where positive emotion and the use of first-person pronouns predict a positive emotional change in a Wikipedia contributor. However, they do not predict editorial behavior. On the other hand, feedback invoking evidentiality and criticism, and references to Wikipedia’s community norms, is more likely to persuade the contributor to perform edits but is less likely to lead to a positive emotion. We developed baseline classifiers trained on pre-trained RoBERTa features that can predict editorial change with an F1 score of .54, as compared to an F1 score of .66 for predicting emotional change. A diagnostic analysis of persisting errors is also provided. We conclude with possible applications and recommendations for future work. The dataset is publicly available for the research community at https://github.com/kj2013/WikiTalkEdit/.

Characterizing Social Spambots by their Human Traits
Salvatore Giorgi | Lyle Ungar | H. Andrew Schwartz
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica
Shirley Anugrah Hayati | Dongyeop Kang | Lyle Ungar
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

People convey their intention and attitude through linguistic styles of the text that they write. In this study, we investigate lexicon usages across styles throughout two lenses: human perception and machine word importance, since words differ in the strength of the stylistic cues that they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmarking style datasets. We have crowd workers highlight the representative words in the text that makes them think the text has the following styles: politeness, sentiment, offensiveness, and five emotion types. We then compare these human word labels with word importance derived from a popular fine-tuned style classifier like BERT. Our results show that the BERT often finds content words not relevant to the target style as important words used in style prediction, but humans do not perceive the same way even though for some styles (e.g., positive sentiment and joy) human- and machine-identified words share significant overlap for some styles.

2020

Learning Emotion from 100 Observations: Unexpected Robustness of Deep Learning under Strong Data Limitations
Sven Buechel | João Sedoc | H. Andrew Schwartz | Lyle Ungar
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

One of the major downsides of Deep Learning is its supposed need for vast amounts of training data. As such, these techniques appear ill-suited for NLP areas where annotated data is limited, such as less-resourced languages or emotion analysis, with its many nuanced and hard-to-acquire annotation formats. We conduct a questionnaire study indicating that indeed the vast majority of researchers in emotion analysis deems neural models inferior to traditional machine learning when training data is limited. In stark contrast to those survey results, we provide empirical evidence for English, Polish, and Portuguese that commonly used neural architectures can be trained on surprisingly few observations, outperforming n-gram based ridge regression on only 100 data points. Our analysis suggests that high-quality, pre-trained word embeddings are a main factor for achieving those results.

Detecting Emerging Symptoms of COVID-19 using Context-based Twitter Embeddings
Roshan Santosh | H. Andrew Schwartz | Johannes Eichstaedt | Lyle Ungar | Sharath Chandra Guntuku
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning #COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before to their being reported by the Centers for Disease Control (CDC).

Learning Word Ratings for Empathy and Distress from Document-Level User Responses
João Sedoc | Sven Buechel | Yehonathan Nachmany | Anneke Buffone | Lyle Ungar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Despite the excellent performance of black box approaches to modeling sentiment and emotion, lexica (sets of informative words and associated weights) that characterize different emotions are indispensable to the NLP community because they allow for interpretable and robust predictions. Emotion analysis of text is increasing in popularity in NLP; however, manually creating lexica for psychological constructs such as empathy has proven difficult. This paper automatically creates empathy word ratings from document-level ratings. The underlying problem of learning word ratings from higher-level supervision has to date only been addressed in an ad hoc fashion and has not used deep learning methods. We systematically compare a number of approaches to learning word ratings from higher-level supervision against a Mixed-Level Feed Forward Network (MLFFN), which we find performs best, and use the MLFFN to create the first-ever empathy lexicon. We then use Signed Spectral Clustering to gain insights into the resulting words. The empathy and distress lexica are publicly available at: http://www.wwbp.org/lexica.html.

Predicting Responses to Psychological Questionnaires from Participants’ Social Media Posts and Question Text Embeddings
Huy Vu | Suhaib Abdurahman | Sudeep Bhatia | Lyle Ungar
Findings of the Association for Computational Linguistics: EMNLP 2020

Psychologists routinely assess people’s emotions and traits, such as their personality, by collecting their responses to survey questionnaires. Such assessments can be costly in terms of both time and money, and often lack generalizability, as existing data cannot be used to predict responses for new survey questions or participants. In this study, we propose a method for predicting a participant’s questionnaire response using their social media texts and the text of the survey question they are asked. Specifically, we use Natural Language Processing (NLP) tools such as BERT embeddings to represent both participants (via the text they write) and survey questions as embeddings vectors, allowing us to predict responses for out-of-sample participants and questions. Our novel approach can be used by researchers to integrate new participants or new questions into psychological studies without the constraint of costly data collection, facilitating novel practical applications and furthering the development of psychological theory. Finally, as a side contribution, the success of our model also suggests a new approach to study survey questions using NLP tools such as text embeddings rather than response data used in traditional methods.

Item Response Theory for Efficient Human Evaluation of Chatbots
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.

Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments
Muhammad Abdul-Mageed | Chiyu Zhang | AbdelRahim Elmadany | Lyle Ungar
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Although prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties. Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (as small as that of a city) given a single, short message. For modeling, we offer a range of novel spatially and linguistically-motivated multi-task learning models. To showcase the utility of our models, we introduce a new, large-scale dataset of Arabic micro-varieties (low-resource) suited to our tasks. MARBERT predicts micro-dialects with 9.9% F1, 76 better than a majority class baseline. Our new language model also establishes new state-of-the-art on several external tasks.

2019

The Role of Protected Class Word Lists in Bias Identification of Contextualized Word Representations
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Systemic bias in word embeddings has been widely reported and studied, and efforts made to debias them; however, new contextualized embeddings such as ELMo and BERT are only now being similarly studied. Standard debiasing methods require heterogeneous lists of target words to identify the “bias subspace”. We show show that using new contextualized word embeddings in conceptor debiasing allows us to more accurately debias word embeddings by breaking target word lists into more homogeneous subsets and then combining (”Or’ing”) the debiasing conceptors of the different subsets.

Conceptor Debiasing of Word Representations Evaluated on WEAT
Saket Karve | Lyle Ungar | João Sedoc
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Bias in word representations, such as Word2Vec, has been widely reported and investigated, and efforts made to debias them. We apply the debiasing conceptor for post-processing both traditional and contextualized word embeddings. Our method can simultaneously remove racial and gender biases from word representations. Unlike standard debiasing methods, the debiasing conceptor can utilize heterogeneous lists of biased words without loss in performance. Finally, our empirical experiments show that the debiasing conceptor diminishes racial and gender bias of word representations as measured using the Word Embedding Association Test (WEAT) of Caliskan et al. (2017).

ChatEval: A Tool for Chatbot Evaluation
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Open-domain dialog systems (i.e. chatbots) are difficult to evaluate. The current best practice for analyzing and comparing these dialog systems is the use of human judgments. However, the lack of standardization in evaluation procedures, and the fact that model parameters and code are rarely published hinder systematic human evaluation experiments. We introduce a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems. Researchers can submit their trained models to the ChatEval web interface and obtain comparisons with baselines and prior work. The evaluation code is open-source to ensure standardization and transparency. In addition, we introduce open-source baseline models and evaluation datasets. ChatEval can be found at https://chateval.org.

Continual Learning for Sentence Representations Using Conceptors
Tianlin Liu | Lyle Ungar | João Sedoc
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Distributed representations of sentences have become ubiquitous in natural language processing tasks. In this paper, we consider a continual learning scenario for sentence representations: Given a sequence of corpora, we aim to optimize the sentence encoder with respect to the new corpus while maintaining its accuracy on the old corpora. To address this problem, we propose to initialize sentence encoders with the help of corpus-independent features, and then sequentially update sentence encoders using Boolean operations of conceptor matrices to learn corpus-dependent features. We evaluate our approach on semantic textual similarity tasks and show that our proposed sentence encoder can continually learn features from new corpora while retaining its competence on previously encountered corpora.

2018

ChatEval: A Tool for the Systematic Evaluation of Chatbots
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

Enabling Deep Learning of Emotion With First-Person Seed Expressions
Hassan Alhuzali | Muhammad Abdul-Mageed | Lyle Ungar
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

The computational treatment of emotion in natural language text remains relatively limited, and Arabic is no exception. This is partly due to lack of labeled data. In this work, we describe and manually validate a method for the automatic acquisition of emotion labeled data and introduce a newly developed data set for Modern Standard and Dialectal Arabic emotion detection focused at Robert Plutchik’s 8 basic emotion types. Using a hybrid supervision method that exploits first person emotion seeds, we show how we can acquire promising results with a deep gated recurrent neural network. Our best model reaches 70% F-score, significantly (i.e., 11%, p < 0.05) outperforming a competitive baseline. Applying our method and data on an external dataset of 4 emotions released around the same time we finalized our work, we acquire 7% absolute gain in F-score over a linear SVM classifier trained on gold data, thus validating our approach.

Current and Future Psychological Health Prediction using Language and Socio-Demographics of Children for the CLPysch 2018 Shared Task
Sharath Chandra Guntuku | Salvatore Giorgi | Lyle Ungar
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

This article is a system description and report on the submission of a team from the University of Pennsylvania in the ’CLPsych 2018’ shared task. The goal of the shared task was to use childhood language as a marker for both current and future psychological health over individual lifetimes. Our system employs multiple textual features derived from the essays written and individuals’ socio-demographic variables at the age of 11. We considered several word clustering approaches, and explore the use of linear regression based on different feature sets. Our approach showed best results for predicting distress at the age of 42 and for predicting current anxiety on Disattenuated Pearson Correlation, and ranked fourth in the future health prediction task. In addition to the subtasks presented, we attempted to provide insight into mental health aspects at different ages. Our findings indicate that misspellings, words with illegible letters and increased use of personal pronouns are correlated with poor mental health at age 11, while descriptions about future physical activity, family and friends are correlated with good mental health.

Diachronic degradation of language models: Insights from social media
Kokil Jaidka | Niyati Chhaya | Lyle Ungar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Natural languages change over time because they evolve to the needs of their users and the socio-technological environment. This study investigates the diachronic accuracy of pre-trained language models for downstream tasks in machine learning and user profiling. It asks the question: given that the social media platform and its users remain the same, how is language changing over time? How can these differences be used to track the changes in the affect around a particular topic? To our knowledge, this is the first study to show that it is possible to measure diachronic semantic drifts within social media and within the span of a few years.

Modeling Empathy and Distress in Reaction to News Stories
Sven Buechel | Anneke Buffone | Barry Slaff | Lyle Ungar | João Sedoc
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Computational detection and understanding of empathy is an important factor in advancing human-computer interaction. Yet to date, text-based empathy prediction has the following major limitations: It underestimates the psychological complexity of the phenomenon, adheres to a weak notion of ground truth where empathic states are ascribed by third parties, and lacks a shared corpus. In contrast, this contribution presents the first publicly available gold standard for empathy prediction. It is constructed using a novel annotation methodology which reliably captures empathy assessments by the writer of a statement using multi-item scales. This is also the first computational work distinguishing between multiple forms of empathy, empathic concern, and personal distress, as recognized throughout psychology. Finally, we present experimental results for three different predictive models, of which a CNN performs the best.

The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions
Salvatore Giorgi | Daniel Preoţiuc-Pietro | Anneke Buffone | Daniel Rieman | Lyle Ungar | H. Andrew Schwartz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes show large and consistent improvements in prediction accuracies (e.g. from Pearson r=.73 to .82 for median income prediction or r=.37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets – over 1 billion of which were mapped to counties, available for research.

Identifying Locus of Control in Social Media Language
Masoud Rouhizadeh | Kokil Jaidka | Laura Smith | H. Andrew Schwartz | Anneke Buffone | Lyle Ungar
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Individuals express their locus of control, or “control”, in their language when they identify whether or not they are in control of their circumstances. Although control is a core concept underlying rhetorical style, it is not clear whether control is expressed by how or by what authors write. We explore the roles of syntax and semantics in expressing users’ sense of control –i.e. being “controlled by” or “in control of” their circumstances– in a corpus of annotated Facebook posts. We present rich insights into these linguistic aspects and find that while the language signaling control is easy to identify, it is more challenging to label it is internally or externally controlled, with lexical features outperforming syntactic features at the task. Our findings could have important implications for studying self-expression in social media.

User-Level Race and Ethnicity Predictors from Twitter Text
Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

User demographic inference from social media text has the potential to improve a range of downstream applications, including real-time passive polling or quantifying demographic bias. This study focuses on developing models for user-level race and ethnicity prediction. We introduce a data set of users who self-report their race/ethnicity through a survey, in contrast to previous approaches that use distantly supervised data or perceived labels. We develop predictive models from text which accurately predict the membership of a user to the four largest racial and ethnic groups with up to .884 AUC and make these available to the research community.

Unsupervised Morphology Learning with Statistical Paradigms
Hongzhi Xu | Mitchell Marcus | Charles Yang | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes an unsupervised model for morphological segmentation that exploits the notion of paradigms, which are sets of morphological categories (e.g., suffixes) that can be applied to a homogeneous set of words (e.g., nouns or verbs). Our algorithm identifies statistically reliable paradigms from the morphological segmentation result of a probabilistic model, and chooses reliable suffixes from them. The new suffixes can be fed back iteratively to improve the accuracy of the probabilistic model. Finally, the unreliable paradigms are subjected to pruning to eliminate unreliable morphological relations between words. The paradigm-based algorithm significantly improves segmentation accuracy. Our method achieves start-of-the-art results on experiments using the Morpho-Challenge data, including English, Turkish, and Finnish.

2017

Personality Driven Differences in Paraphrase Preference
Daniel Preoţiuc-Pietro | Jordan Carpenter | Lyle Ungar
Proceedings of the Second Workshop on NLP and Computational Social Science

Personality plays a decisive role in how people behave in different scenarios, including online social media. Researchers have used such data to study how personality can be predicted from language use. In this paper, we study phrase choice as a particular stylistic linguistic difference, as opposed to the mostly topical differences identified previously. Building on previous work on demographic preferences, we quantify differences in paraphrase choice from a massive Facebook data set with posts from over 115,000 users. We quantify the predictive power of phrase choice in user profiling and use phrase choice to study psycholinguistic hypotheses. This work is relevant to future applications that aim to personalize text generation to specific personality types.

Recognizing Counterfactual Thinking in Social Media Texts
Youngseo Son | Anneke Buffone | Joe Raso | Allegra Larche | Anthony Janocko | Kevin Zembroski | H Andrew Schwartz | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Counterfactual statements, describing events that did not occur and their consequents, have been studied in areas including problem-solving, affect management, and behavior regulation. People with more counterfactual thinking tend to perceive life events as more personally meaningful. Nevertheless, counterfactuals have not been studied in computational linguistics. We create a counterfactual tweet dataset and explore approaches for detecting counterfactuals using rule-based and supervised statistical approaches. A combined rule-based and statistical approach yielded the best results (F1 = 0.77) outperforming either approach used alone.

On the Distribution of Lexical Features at Multiple Levels of Analysis
Fatemeh Almodaresi | Lyle Ungar | Vivek Kulkarni | Mohsen Zakeri | Salvatore Giorgi | H. Andrew Schwartz
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Natural language processing has increasingly moved from modeling documents and words toward studying the people behind the language. This move to working with data at the user or community level has presented the field with different characteristics of linguistic data. In this paper, we empirically characterize various lexical distributions at different levels of analysis, showing that, while most features are decidedly sparse and non-normal at the message-level (as with traditional NLP), they follow the central limit theorem to become much more Log-normal or even Normal at the user- and county-levels. Finally, we demonstrate that modeling lexical features for the correct level of analysis leads to marked improvements in common social scientific prediction tasks.

Semantic Word Clusters Using Signed Spectral Clustering
João Sedoc | Jean Gallier | Dean Foster | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Vector space representations of words capture many aspects of word similarity, but such methods tend to produce vector spaces in which antonyms (as well as synonyms) are close to each other. For spectral clustering using such word embeddings, words are points in a vector space where synonyms are linked with positive weights, while antonyms are linked with negative weights. We present a new signed spectral normalized graph cut algorithm, signed clustering, that overlays existing thesauri upon distributionally derived vector representations of words, so that antonym relationships between word pairs are represented by negative weights. Our signed clustering algorithm produces clusters of words that simultaneously capture distributional and synonym relations. By using randomized spectral decomposition (Halko et al., 2011) and sparse matrices, our method is both fast and scalable. We validate our clusters using datasets containing human judgments of word pair similarities and show the benefit of using our word clusters for sentiment prediction.

Beyond Binary Labels: Political Ideology Prediction of Twitter Users
Daniel Preoţiuc-Pietro | Ye Liu | Daniel Hopkins | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic political orientation prediction from social media posts has to date proven successful only in distinguishing between publicly declared liberals and conservatives in the US. This study examines users’ political ideology using a seven-point scale which enables us to identify politically moderate and neutral users – groups which are of particular interest to political scientists and pollsters. Using a novel data set with political ideology labels self-reported through surveys, our goal is two-fold: a) to characterize the groups of politically engaged users through language use on Twitter; b) to build a fine-grained model that predicts political ideology of unseen users. Our results identify differences in both political leaning and engagement and the extent to which each group tweets using political keywords. Finally, we demonstrate how to improve ideology prediction accuracy by exploiting the relationships between the user groups.

EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks
Muhammad Abdul-Mageed | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Accurate detection of emotion from natural language has applications ranging from building emotional chatbots to better understanding individuals and their lives. However, progress on emotion detection has been hampered by the absence of large labeled datasets. In this work, we build a very large dataset for fine-grained emotions and develop deep learning models on it. We achieve a new state-of-the-art on 24 fine-grained types of emotions (with an average accuracy of 87.58%). We also extend the task beyond emotion types to model Robert Plutick’s 8 primary emotion dimensions, acquiring a superior accuracy of 95.68%.

Domain Adaptation from User-level Facebook Models to County-level Twitter Predictions
Daniel Rieman | Kokil Jaidka | H. Andrew Schwartz | Lyle Ungar
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Several studies have demonstrated how language models of user attributes, such as personality, can be built by using the Facebook language of social media users in conjunction with their responses to psychology questionnaires. It is challenging to apply these models to make general predictions about attributes of communities, such as personality distributions across US counties, because it requires 1. the potentially inavailability of the original training data because of privacy and ethical regulations, 2. adapting Facebook language models to Twitter language without retraining the model, and 3. adapting from users to county-level collections of tweets. We propose a two-step algorithm, Target Side Domain Adaptation (TSDA) for such domain adaptation when no labeled Twitter/county data is available. TSDA corrects for the different word distributions between Facebook and Twitter and for the varying word distributions across counties by adjusting target side word frequencies; no changes to the trained model are made. In the case of predicting the Big Five county-level personality traits, TSDA outperforms a state-of-the-art domain adaptation method, gives county-level predictions that have fewer extreme outliers, higher year-to-year stability, and higher correlation with county-level outcomes.

Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering
João Sedoc | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering of vector space word representations along with affect ratings. We use our method to determine a word’s valence and arousal, which determine its position on the circumplex model of affect, the most popular dimensional model of emotion. Our method achieves superior out-of-sample word rating prediction on both affective dimensions across three different languages when compared to state-of-the-art word similarity based methods. Our method can assist building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.

DLATK: Differential Language Analysis ToolKit
H. Andrew Schwartz | Salvatore Giorgi | Maarten Sap | Patrick Crutchley | Lyle Ungar | Johannes Eichstaedt
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present Differential Language Analysis Toolkit (DLATK), an open-source python package and command-line tool developed for conducting social-scientific language analyses. While DLATK provides standard NLP pipeline steps such as tokenization or SVM-classification, its novel strengths lie in analyses useful for psychological, health, and social science: (1) incorporation of extra-linguistic structured information, (2) specified levels and units of analysis (e.g. document, user, community), (3) statistical metrics for continuous outcomes, and (4) robust, proven, and accurate pipelines for social-scientific prediction problems. DLATK integrates multiple popular packages (SKLearn, Mallet), enables interactive usage (Jupyter Notebooks), and generally follows object oriented principles to make it easy to tie in additional libraries or storage technologies.

Assessing Objective Recommendation Quality through Political Forecasting
H. Andrew Schwartz | Masoud Rouhizadeh | Michael Bishop | Philip Tetlock | Barbara Mellers | Lyle Ungar
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Recommendations are often rated for their subjective quality, but few researchers have studied comment quality in terms of objective utility. We explore recommendation quality assessment with respect to both subjective (i.e. users’ ratings) and objective (i.e., did it influence? did it improve decisions?) metrics in a massive online geopolitical forecasting system, ultimately comparing linguistic characteristics of each quality metric. Using a variety of features, we predict all types of quality with better accuracy than the simple yet strong baseline of comment length. Looking at the most predictive content illustrates rater biases; for example, forecasters are subjectively biased in favor of comments mentioning business transactions or dealings as well as material things, even though such comments do not indeed prove any more useful objectively. Additionally, more complex sentence constructions, as evidenced by subordinate conjunctions, are characteristic of comments leading to objective improvements in forecasting.

Controlling Human Perception of Basic User Traits
Daniel Preoţiuc-Pietro | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Much of our online communication is text-mediated and, lately, more common with automated agents. Unlike interacting with humans, these agents currently do not tailor their language to the type of person they are communicating to. In this pilot study, we measure the extent to which human perception of basic user trait information – gender and age – is controllable through text. Using automatic models of gender and age prediction, we estimate which tweets posted by a user are more likely to mis-characterize his traits. We perform multiple controlled crowdsourcing experiments in which we show that we can reduce the human prediction accuracy of gender to almost random – an over 20% drop in accuracy. Our experiments show that it is practically feasible for multiple applications such as text generation, text summarization or machine translation to be tailored to specific traits and perceived as such.

2016

Modelling Valence and Arousal in Facebook posts
Daniel Preoţiuc-Pietro | H. Andrew Schwartz | Gregory Park | Johannes Eichstaedt | Margaret Kern | Lyle Ungar | Elisabeth Shulman
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology
Kristy Hollingshead | Lyle Ungar
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Exploring Stylistic Variation with Age and Income on Twitter
Lucie Flekova | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Analyzing Biases in Human Perception of User Age and Gender from Text
Lucie Flekova | Jordan Carpenter | Salvatore Giorgi | Lyle Ungar | Daniel Preoţiuc-Pietro
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

An Empirical Exploration of Moral Foundations Theory in Partisan News Sources
Dean Fulgoni | Jordan Carpenter | Lyle Ungar | Daniel Preoţiuc-Pietro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

News sources frame issues in different ways in order to appeal or control the perception of their readers. We present a large scale study of news articles from partisan sources in the US across a variety of different issues. We first highlight that differences between sides exist by predicting the political leaning of articles of unseen political bias. Framing can be driven by different types of morality that each group values. We emphasize differences in framing of different news building on the moral foundations theory quantified using hand crafted lexicons. Our results show that partisan sources frame political issues differently both in terms of words usage and through the moral foundations they relate to.

Using Syntactic and Semantic Context to Explore Psychodemographic Differences in Self-reference
Masoud Rouhizadeh | Lyle Ungar | Anneke Buffone | H Andrew Schwartz
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Does ‘well-being’ translate on Twitter?
Laura Smith | Salvatore Giorgi | Rishi Solanki | Johannes Eichstaedt | H. Andrew Schwartz | Muhammad Abdul-Mageed | Anneke Buffone | Lyle Ungar
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Mental Illness Detection at the World Well-Being Project for the CLPsych 2015 Shared Task
Daniel Preoţiuc-Pietro | Maarten Sap | H. Andrew Schwartz | Lyle Ungar
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

The role of personality, age, and gender in tweeting about mental illness
Daniel Preoţiuc-Pietro | Johannes Eichstaedt | Gregory Park | Maarten Sap | Laura Smith | Victoria Tobolsky | H. Andrew Schwartz | Lyle Ungar
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Crowdsourcing for NLP
Chris Callison-Burch | Lyle Ungar | Ellie Pavlick
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Extracting Human Temporal Orientation from Facebook Language
H. Andrew Schwartz | Gregory Park | Maarten Sap | Evan Weingarten | Johannes Eichstaedt | Margaret Kern | David Stillwell | Michal Kosinski | Jonah Berger | Martin Seligman | Lyle Ungar
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Towards Assessing Changes in Degree of Depression through Facebook
H. Andrew Schwartz | Johannes Eichstaedt | Margaret L. Kern | Gregory Park | Maarten Sap | David Stillwell | Michal Kosinski | Lyle Ungar
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Developing Age and Gender Predictive Lexica over Social Media
Maarten Sap | Gregory Park | Johannes Eichstaedt | Margaret Kern | David Stillwell | Michal Kosinski | Lyle Ungar | Hansen Andrew Schwartz
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Choosing the Right Words: Characterizing and Reducing Error of the Word Count Approach
Hansen Andrew Schwartz | Johannes Eichstaedt | Eduardo Blanco | Lukasz Dziurzynski | Margaret L. Kern | Stephanie Ramones | Martin Seligman | Lyle Ungar
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

Spectral Learning Algorithms for Natural Language Processing
Shay Cohen | Michael Collins | Dean Foster | Karl Stratos | Lyle Ungar
NAACL HLT 2013 Tutorial Abstracts

Experiments with Spectral Learning of Latent-Variable PCFGs
Shay B. Cohen | Karl Stratos | Michael Collins | Dean P. Foster | Lyle Ungar
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Penn: Using Word Similarities to better Estimate Sentence Similarity
Sneha Jha | Hansen A. Schwartz | Lyle Ungar
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Spectral Learning of Latent-Variable PCFGs
Shay B. Cohen | Karl Stratos | Michael Collins | Dean P. Foster | Lyle Ungar
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Spectral Dependency Parsing with Latent Variables
Paramveer Dhillon | Jordan Rodu | Michael Collins | Dean Foster | Lyle Ungar
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

New Insights from Coarse Word Sense Disambiguation in the Crowd
Adam Kapelner | Krishna Kaliannan | H. Andrew Schwartz | Lyle Ungar | Dean Foster
Proceedings of COLING 2012: Posters

Improving Supervised Sense Disambiguation with Web-Scale Selectors
H. Andrew Schwartz | Fernando Gomez | Lyle Ungar
Proceedings of COLING 2012

2010

A New Approach to Lexical Disambiguation of Arabic Text
Rushin Shah | Paramveer S. Dhillon | Mark Liberman | Dean Foster | Mohamed Maamouri | Lyle Ungar
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Transfer Learning, Feature Selection and Word Sense Disambiguation
Paramveer S. Dhillon | Lyle H. Ungar
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2006

An Empirical Study of the Behavior of Active Learning for Word Sense Disambiguation
Jinying Chen | Andrew Schein | Lyle Ungar | Martha Palmer
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

2004

Integrated Annotation for Biomedical Information Extraction
Seth Kulick | Ann Bies | Mark Liberman | Mark Mandel | Ryan McDonald | Martha Palmer | Andrew Schein | Lyle Ungar | Scott Winters | Pete White
HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases

Co-authors

Johannes Eichstaedt 10

Anneke Buffone 7

Shreya Havaldar 6

Young Min Cho 5

Margaret Kern 5

Muhammad Abdul-Mageed 4

Michael Collins 4

Brenda Curtis 4

Garrick Sherman 4

Chris Callison-Burch 3

Jordan Carpenter 3

Shay B. Cohen 3

Paramveer S. Dhillon 3

Dan Goldwasser 3

Tunazzina Islam 3

Michal Kosinski 3

Syeda Mahwish 3

August Håkan Nilsson 3

María Leonor Pacheco 3

Masoud Rouhizadeh 3

David Stillwell 3

Ming Yin (尹明) 3

Niyati Chhaya 2

Lucie Flekova 2

Shirley Anugrah Hayati 2

Daphne Ippolito 2

Dongyeop Kang 2

Arun Kirubarajan 2

Mark Liberman 2

Martha Palmer 2

Daniel Rieman 2

Roshan Santhosh 2

Andrew Schein 2

Martin Seligman 2

Adithya V. Ganesan 2

Suhaib Abdurahman 1

Hassan Alhuzali 1

Fatemeh Almodaresi 1

Niranjan Balasubramanian 1

Douglas Bellew 1

Sudeep Bhatia 1

Michael Bishop 1

Eduardo Blanco 1

Andrea Ceolin 1

Pranav Chitale 1

Patrick Crutchley 1

Lukasz Dziurzynski 1

Abdelrahim Elmadany 1

Zachary Fried 1

Fernando Gomez 1

Sharath Guntuku 1

Samindara Hardikar-Sawant 1

Mckenzie Himelein-wachowiak 1

Kristy Hollingshead 1

Daniel Hopkins 1

Kelsey Jane Isman 1

Anthony Janocko 1

Krishna Kaliannan 1

Adam Kapelner 1

Ashwin Kishen 1

Vivek Kulkarni 1

Allegra Larche 1

Mohamed Maamouri 1

Monal Mahajan 1

Matthew Matero 1

Ryan McDonald 1

Barbara Mellers 1

Miranda Muqing Miao 1

Saif Mohammad 1

Yehonathan Nachmany 1

Emma O’neil 1

Ellie Pavlick 1

Sebastian Peralta 1

Matthew Pressimone 1

Dheeraj Rajagopal 1

Stephanie Ramones 1

Maitreyi Redkar 1

Ann-Katrin Reuel 1

Richard Rosenthal 1

Richard N. Rosenthal 1

Simone Schriger 1

Khushi Shelat 1

Elisabeth Shulman 1

Khushboo Singh 1

Bhumika Singhal 1

Rishi Solanki 1

Tejas Srivastava 1

Elizabeth Stade 1

Thomas Talhelm 1

Camillo Jose Taylor 1

Philip Tetlock 1

Victoria Tobolsky 1

Ilias Triantafyllopoulos 1

Vasudha Varadarajan 1

Evan Weingarten 1

Scott Winters 1

Mohsen Zakeri 1

Khushang Zaveri 1

Kevin Zembroski 1

Venues