Charles Welch - ACL Anthology

Charles Welch

2026

Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms
Kieran Henderson | Kian Omoomi | Vasudha Varadarajan | Allison Lahnala | Charles Welch
Findings of the Association for Computational Linguistics: EACL 2026

Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosures and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. Contrary to previous work, only a small number of comments related to the original post are needed. Lastly, a more diverse sample of annotator self-disclosures did not lead to the best performance. Sampling from a larger pool of comments without filtering still yields the best performance, suggesting that there is still much to uncover in terms of what information about an annotator is most useful for verdict prediction.

McMasters of Change: Predicting Well-Being States and Transitions from Longitudinal Language
Hongyi Zhang | Derron Li | Scarlett Cleary | Aadi Sanghani | Akshay Krishna Sirigana | Brian Miguel Pimentel | Kelsey Isman | Kian Omoomi | Vasudha Varadarajan | Charles Welch | Allison Lahnala
Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Most existing work on mental health prediction from language focuses on isolated posts, overlooking temporal dynamics in longitudinal timelines. We present McMaster NLP’s system for the CLPsych 2026 Shared Task, which centers on modeling mental health dynamics in social media timelines using the MIND framework. The task comprises: (1) identifying adaptive and maladaptive self-state components within posts, (2) detecting moments of change in well-being, and (3) generating structured summaries. For self-state prediction, we leverage LLM-generated archetypal representations of language use as semantic anchors within a dual-encoder architecture, enabling interpretable prediction of subelements and their intensities through alignment with prototypical expressions of psychological states. For temporal dynamics, we use BiLSTM-based sequence models to detect moments of change. For summarization, we employ a prompt-based LLM to generate grounded, structured summaries emphasizing causal interactions and temporal progression of self-states. Finally, we analyze model failure modes with respect to human evaluation and identify directions for reconciling the MIND framework with how state-assessment models encode meaning.

2025

NLP-ResTeam at LeWiDi-2025: Performance Shifts in Perspective Aware Models based on Evaluation Metrics
Olufunke O. Sarumi | Charles Welch | Daniel Braun
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP

Recent works in Natural Language Processing have focused on developing methods to model annotator perspectives within subjective datasets, aiming to capture opinion diversity. This has led to the development of various approaches that learn from disaggregated labels, leading to the question of what factors most influence the performance of these models. While dataset characteristics are a critical factor, the choice of evaluation metric is equally crucial, especially given the fluid and evolving concept of perspectivism. A model considered state-of-the-art under one evaluation scheme may not maintain its top-tier status when assessed with a different set of metrics, highlighting a potential challenge between model performance and the evaluation framework. This paper presents a performance analysis of annotator modeling approaches using the evaluation metrics of the 2025 Learning With Disagreement (LeWiDi) shared task and additional metrics. We evaluate five annotator-aware models under the same configurations. Our findings demonstrate a significant metric-induced shift in model rankings. Across four datasets, no single annotator modeling approach consistently outperformed others using a single metric, revealing that the “best” model is highly dependent on the chosen evaluation metric. This study systematically shows that evaluation metrics are not agnostic in the context of perspectivist model assessment.

McMaster at LeWiDi-2025: Demographic-Aware RoBERTa
Aadi Sanghani | Sarvin Azadi | Virendra Jethra | Charles Welch
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP

We present our submission to the Learning With Disagreements (LeWiDi) 2025 shared task. Our team implemented a variety of BERT-based models that encode annotator metadata in combination with text to predict soft-label distributions and individual annotator labels. We show across four tasks that a combination of demographic factors leads to improved performance; however, through ablations across all demographic variables, we find that in some cases a single variable performs best. Our approach placed 4th in the overall competition.

The Impact of Annotator Personas on LLM Behavior Across the Perspectivism Spectrum
Olufunke O. Sarumi | Charles Welch | Daniel Braun | Jörg Schlötterer
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)

The Practical Impacts of Theoretical Constructs on Empathy Modeling
Allison Lahnala | Charles Welch | David Jurgens | Lucie Flek
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Conceptual operationalizations of empathy in NLP are varied, with some having specific behaviors and properties, while others are more abstract. How these variations relate to one another and capture properties of empathy observable in text remains unclear. To provide insight into this, we analyze the transfer performance of empathy models adapted to empathy tasks with different theoretical groundings. We study (1) the dimensionality of empathy definitions, (2) the correspondence between the defined dimensions and measured/observed properties, and (3) the conduciveness of the data to represent them, finding they have a significant impact on performance compared to other transfer setting features. Characterizing the theoretical grounding of empathy tasks as direct, abstract, or adjacent further indicates that tasks that directly predict specified empathy components have higher transferability. Our work provides empirical evidence for the need for precise and multidimensional empathy operationalizations.

Funzac at CoMeDi Shared Task: Modeling Annotator Disagreement from Word-In-Context Perspectives
Olufunke O. Sarumi | Charles Welch | Lucie Flek | Jörg Schlötterer
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation

In this work, we evaluate annotator disagreement in Word-in-Context (WiC) tasks, exploring the relationship between contextual meaning and disagreement as part of the CoMeDi shared task competition. While prior studies have modeled disagreement by analyzing annotator attributes with single-sentence inputs, this shared task incorporates WiC to bridge the gap between sentence-level semantic representation and annotator judgment variability. We describe three different methods that we developed for the shared task, including a feature enrichment approach that combines concatenation, element-wise differences, products, cosine similarity, and Euclidean and Manhattan distances to extend contextual embedding representations, a transformation by Adapter blocks to obtain task-specific representations of contextual embeddings, and classifiers of varying complexities, including ensembles. The comparison of our methods demonstrates improved performance for methods that include enriched and task-specific features. While the performance of our method falls short in comparison to the best system in subtask 1 (OGWiC), it is competitive with the official evaluation results in subtask 2 (DisWiC).
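
The feature enrichment step named in the abstract above can be illustrated with a minimal sketch. It assumes two contextual embeddings of equal length, one per usage of the target word. The function name `enrich_pair`, the use of a signed difference, and the ordering of the output features are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def enrich_pair(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Build an enriched feature vector from two embeddings (hypothetical helper).

    Output layout (an assumption for illustration):
    [u, v, u - v, u * v, cosine, euclidean, manhattan]
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")

    diff = [a - b for a, b in zip(u, v)]   # element-wise difference
    prod = [a * b for a, b in zip(u, v)]   # element-wise product

    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Cosine similarity; defined as 0.0 here if either vector is all zeros.
    cosine = sum(prod) / (norm_u * norm_v) if norm_u and norm_v else 0.0

    euclidean = math.sqrt(sum(d * d for d in diff))
    manhattan = sum(abs(d) for d in diff)

    return list(u) + list(v) + diff + prod + [cosine, euclidean, manhattan]


features = enrich_pair([1.0, 0.0], [0.0, 1.0])
print(len(features))  # 4 * dimension + 3 scalar features
```

For embeddings of dimension d, the sketch yields 4d + 3 features, which a downstream classifier could consume in place of the raw concatenation.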

Analyzing Interview Questions via Bloom’s Taxonomy to Enhance the Design Thinking Process
Fatemeh Kazemi Vanhari | Christopher Anand | Charles Welch
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Interviews are central to the Empathy phase of Design Thinking, helping designers uncover user needs and experience. Although interviews are widely used to support human-centered innovation, evaluating their quality, especially from a cognitive perspective, remains underexplored. This study introduces a structured framework for evaluating interview quality in the context of Design Thinking, using Bloom’s Taxonomy as a foundation. We propose the Cognitive Interview Quality Score (CIQS), a composite metric that integrates three dimensions: Effectiveness Score, Bloom Coverage Score, and Distribution Balance Score. Using human annotations, we assessed 15 interviews across three domains to measure cognitive diversity and structure. We compared CIQS-based rankings with human experts and found that the Bloom Coverage Score aligned more closely with expert judgments. We evaluated the performance of LLaMA-3-8B-Instruct and GPT-4o-mini, using zero-shot, few-shot, and chain-of-thought prompting, finding that GPT-4o-mini, especially in zero-shot mode, showed the highest correlation with human annotations in all domains. Error analysis revealed that models struggled more with mid-level cognitive tasks (e.g., Apply, Analyze) and performed better on Create, likely due to clearer linguistic cues. These findings highlight both the promise and limitations of using NLP models for automated cognitive classification and underscore the importance of combining cognitive metrics with qualitative insights to comprehensively assess interview quality.

2024

Harnessing Personalization Methods to Identify and Predict Unreliable Information Spreader Behavior
Shaina Ashraf | Fabio Gruschka | Lucie Flek | Charles Welch
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Studies on detecting and understanding the spread of unreliable news on social media have identified key characteristic differences between reliable and unreliable posts. These differences in language use also vary in expression across individuals, making it important to consider personal factors in unreliable news detection. The application of personalization methods for this has been made possible by the recent publication of datasets with user histories, though this area is still largely unexplored. In this paper we present approaches to represent social media users in order to improve performance on three tasks: (1) classification of unreliable news posts, (2) classification of unreliable news spreaders, and (3) prediction of the spread of unreliable news. We compare the User2Vec method from previous work to two other approaches: a learnable user embedding layer trained with the downstream task, and a representation derived from an authorship attribution classifier. We demonstrate that the implemented strategies substantially improve classification performance over state-of-the-art and provide initial results on the task of unreliable news prediction.

A Perspectivist Corpus of Numbers in Social Judgements
Marlon May | Lucie Flek | Charles Welch
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

With growing interest in the use of large language models, it is becoming increasingly important to understand whose views they express. These models tend to generate output that conforms to majority opinion and are not representative of diverse views. As a step toward building models that can take differing views into consideration, we build a novel corpus of social judgements. We crowdsourced annotations of a subset of the Commonsense Norm Bank that contained numbers in the situation descriptions and asked annotators to replace the number with a range defined by a start and end value that, in their view, correspond to the given verdict. Our corpus contains unaggregated annotations and annotator demographics. We describe our annotation process for social judgements and will release our dataset to support future work on numerical reasoning and perspectivist approaches to natural language processing.

Corpus Considerations for Annotator Modeling and Scaling
Olufunke O. Sarumi | Béla Neuendorf | Joan Plepi | Lucie Flek | Jörg Schlötterer | Charles Welch
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent trends in natural language processing research and annotation tasks affirm a paradigm shift from the traditional reliance on a single ground truth to a focus on individual perspectives, particularly in subjective tasks. In scenarios where annotation tasks are meant to encompass diversity, models that solely rely on the majority class labels may inadvertently disregard valuable minority perspectives. This oversight could result in the omission of crucial information and, in a broader context, risk disrupting the balance within larger ecosystems. As the landscape of annotator modeling unfolds with diverse representation techniques, it becomes imperative to investigate their effectiveness with the fine-grained features of the datasets in view. This study systematically explores various annotator modeling techniques and compares their performance across seven corpora. From our findings, we show that the commonly used user token model consistently outperforms more complex models. We introduce a composite embedding approach and show distinct differences in which model performs best as a function of the agreement with a given dataset. Our findings shed light on the relationship between corpus statistics and annotator modeling performance, which informs future work on corpus construction and perspectivist NLP.

Appraisal Framework for Clinical Empathy: A Novel Application to Breaking Bad News Conversations
Allison Lahnala | Béla Neuendorf | Alexander Thomin | Charles Welch | Tina Stibane | Lucie Flek
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Empathy is essential in healthcare communication. We introduce an annotation approach that draws on well-established frameworks for clinical empathy and breaking bad news (BBN) conversations for considering the interactive dynamics of discourse relations. We construct Empathy in BBNs, a span-relation task dataset of simulated BBN conversations in German, using our annotation scheme, in collaboration with a large medical school to support research on educational tools for medical didactics. The annotation is based on 1) Pounds (2011)’s appraisal framework for clinical empathy, which is grounded in systemic functional linguistics, and 2) the SPIKES protocol for breaking bad news (Baile et al., 2000), commonly taught in medical didactics training. This approach presents novel opportunities to study clinical empathic behavior and enables the training of models to detect causal relations involving empathy, a highly desirable feature of systems that can provide feedback to medical professionals in training. We present illustrative examples, discuss applications of the annotation scheme, and insights we can draw from the framework.

Perspective Taking through Generating Responses to Conflict Situations
Joan Plepi | Charles Welch | Lucie Flek
Findings of the Association for Computational Linguistics: ACL 2024

Although language model performance across diverse tasks continues to improve, these models still struggle to understand and explain the beliefs of other people. This skill requires perspective-taking, the process of conceptualizing the point of view of another person. Perspective taking becomes challenging when the text reflects more personal and potentially more controversial beliefs. We explore this task through natural language generation of responses to conflict situations. We evaluate novel modifications to recent architectures for conditioning generation on an individual’s comments and self-disclosure statements. Our work extends the Social-Chem-101 corpus, using 95k judgements written by 6k authors from English Reddit data, for each of whom we obtained 20-500 self-disclosure statements. Our evaluation methodology borrows ideas from both personalized generation and theory of mind literature. Our proposed perspective-taking models outperform recent work, especially the twin encoder model conditioned on self-disclosures with high similarity to the conflict situation.

Research on psychological risk factors for suicide has developed for decades. However, combining explainable theory with modern data-driven language model approaches is non-trivial. In this study, we propose and evaluate methods for identifying language patterns aligned with theories of suicide risk by combining theory-driven suicidal archetypes with language model-based and relative entropy-based approaches. Archetypes are based on prototypical statements that evince risk of suicidality, while relative entropy considers the ratio of how unusual both a risk-familiar and unfamiliar model find the statements. While both approaches independently performed similarly, we find that combining the two significantly improved the performance in the shared task evaluations, yielding our combined system submission with a BERTScore Recall of 0.906. Consistent with the literature, we find that titles are highly informative as suicide risk evidence, despite their brevity. We conclude that a combination of theory- and data-driven methods is needed in the mental health space and can outperform more modern prompt-based methods.

Do Multilingual Large Language Models Mitigate Stereotype Bias?
Shangrui Nie | Michael Fromm | Charles Welch | Rebekka Görge | Akbar Karimi | Joan Plepi | Nazia Mowmita | Nicolas Flores-Herr | Mehdi Ali | Lucie Flek
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

While preliminary findings indicate that multilingual LLMs exhibit reduced bias compared to monolingual ones, a comprehensive understanding of the effect of multilingual training on bias mitigation is lacking. This study addresses this gap by systematically training six LLMs of identical size (2.6B parameters) and architecture: five monolingual models (English, German, French, Italian, and Spanish) and one multilingual model trained on an equal distribution of data across these languages, all using publicly available data. To ensure robust evaluation, standard bias benchmarks were automatically translated into the five target languages and verified for both translation quality and bias preservation by human annotators. Our results consistently demonstrate that multilingual training effectively mitigates bias. Moreover, we observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models with the same amount of training data, model architecture, and size.

2023

Domain Transfer for Empathy, Distress, and Personality Prediction
Fabio Gruschka | Allison Lahnala | Charles Welch | Lucie Flek
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This research contributes to the task of predicting empathy and personality traits within dialogue, an important aspect of natural language processing, as part of our experimental work for the WASSA 2023 Empathy and Emotion Shared Task. For predicting empathy, emotion polarity, and emotion intensity on turns within a dialogue, we employ adapters trained on social media interactions labeled with empathy ratings in a stacked composition with the target task adapters. Furthermore, we embed demographic information to predict Interpersonal Reactivity Index (IRI) subscales and Big Five Personality Traits utilizing BERT-based models. The results from our study provide valuable insights, contributing to advancements in understanding human behavior and interaction through text. Our team ranked 2nd on the personality and empathy prediction tasks, 4th on the interpersonal reactivity index, and 6th on the conversational task.

Style Locality for Controllable Generation with kNN Language Models
Gilles Nawezi | Lucie Flek | Charles Welch
Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!

Recent language models have been improved by the addition of external memory. Nearest neighbor language models retrieve similar contexts to assist in word prediction. The addition of locality levels allows a model to learn how to weight neighbors based on their relative location to the current text in source documents, and has been shown to further improve model performance. Nearest neighbor models have been explored for controllable generation, but the use of locality levels for this purpose has not been examined. We present a novel approach for this purpose and evaluate it using automatic and human evaluation on politeness, formality, supportiveness, and toxicity textual data. We find that our model is successfully able to control style and provides a better fluency-style trade-off than previous work.

Challenges of GPT-3-Based Conversational Agents for Healthcare
Fabian Lechner | Allison Lahnala | Charles Welch | Lucie Flek
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The potential of medical domain dialogue agents lies in their ability to provide patients with faster information access while enabling medical specialists to concentrate on critical tasks. However, the integration of large-language models (LLMs) into these agents presents certain limitations that may result in serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question-answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles. We provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.

2022

CAISA at WASSA 2022: Adapter-Tuning for Empathy Prediction
Allison Lahnala | Charles Welch | Lucie Flek
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

We build a system that leverages adapters, a lightweight and efficient method for leveraging large language models, to perform the Empathy and Distress prediction tasks for WASSA 2022. In our experiments, we find that stacking our empathy and distress adapters on a pre-trained emotion classification adapter performs best compared to full fine-tuning approaches and emotion feature concatenation. We make our experimental code publicly available.

Understanding Interpersonal Conflict Types and their Impact on Perception Classification
Charles Welch | Joan Plepi | Béla Neuendorf | Lucie Flek
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Studies on interpersonal conflict have a long history and contain many suggestions for conflict typology. We use this as the basis of a novel annotation scheme and release a new dataset of situations and conflict aspect annotations. We then build a classifier to predict whether someone will perceive the actions of one individual as right or wrong in a given situation. Our analyses include conflict aspects, but also generated clusters, which are human validated, and show differences in conflict content based on the relationship of participants to the author. Our findings have important implications for understanding conflict and social norms.

Mitigating Toxic Degeneration with Empathetic Data: Exploring the Relationship Between Toxicity and Empathy
Allison Lahnala | Charles Welch | Béla Neuendorf | Lucie Flek
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Large pre-trained neural language models have supported the effectiveness of many NLP tasks, yet are still prone to generating toxic language, hindering the safety of their use. Using empathetic data, we improve over recent work on controllable text generation that aims to reduce the toxicity of generated text. We find we are able to dramatically reduce the size of fine-tuning data to 7.5-30k samples while at the same time making significant improvements over state-of-the-art toxicity mitigation of up to 3.4% absolute reduction (26% relative) from the original work on 2.3m samples, by strategically sampling data based on empathy scores. We observe that the degree of improvements is subject to specific communication components of empathy. In particular, the more cognitive components of empathy significantly beat the original dataset in almost all experiments, while emotional empathy was tied to less improvement and even underperformed random samples of the original data. This is a particularly implicative insight for NLP work concerning empathy, as until recently the research and resources built for it have exclusively considered empathy as an emotional concept.

Nearest Neighbor Language Models for Stylistic Controllable Generation
Severino Trotta | Lucie Flek | Charles Welch
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Recent language modeling performance has been greatly improved by the use of external memory. This memory encodes the context so that similar contexts can be recalled during decoding. This similarity depends on how the model learns to encode context, which can be altered to include other attributes, such as style. We construct and evaluate an architecture for this purpose, using corpora annotated for politeness, formality, and toxicity. Through extensive experiments and human evaluation we demonstrate the potential of our method to generate text while controlling style. We find that style-specific datastores improve generation performance, though results vary greatly across styles, and the effect of pretraining data and specific styles should be explored in future work.

A Critical Reflection and Forward Perspective on Empathy and Natural Language Processing
Allison Lahnala | Charles Welch | David Jurgens | Lucie Flek
Findings of the Association for Computational Linguistics: EMNLP 2022

We review the state of research on empathy in natural language processing and identify the following issues: (1) empathy definitions are absent or abstract, which (2) leads to low construct validity and reproducibility. Moreover, (3) emotional empathy is overemphasized, skewing our focus to a narrow subset of simplified tasks. We believe these issues hinder research progress and argue that current directions will benefit from a clear conceptualization that includes operationalizing cognitive empathy components. Our main objectives are to provide insight and guidance on empathy conceptualization for NLP research objectives and to encourage researchers to pursue the overlooked opportunities in this area, highly relevant, e.g., for clinical and educational sectors.

Unifying Data Perspectivism and Personalization: An Application to Social Norms
Joan Plepi | Béla Neuendorf | Lucie Flek | Charles Welch
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Instead of using a single ground truth for language processing tasks, several recent studies have examined how to represent and predict the labels of the set of annotators. However, often little or no information about annotators is known, or the set of annotators is small. In this work, we examine a corpus of social media posts about conflict from a set of 13k annotators and 210k judgements of social norms. We provide a novel experimental setup that applies personalization methods to the modeling of annotators and compare their effectiveness for predicting the perception of social norms. We further provide an analysis of performance across subsets of social situations that vary by the closeness of the relationship between parties in conflict, and assess where personalization helps the most.

Knowledge Enhanced Reflection Generation for Counseling Dialogues
Siqi Shen | Veronica Perez-Rosas | Charles Welch | Soujanya Poria | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we study the effect of commonsense and domain knowledge while generating responses in counseling conversations using retrieval and generative methods for knowledge integration. We propose a pipeline that collects domain knowledge through web mining, and show that retrieval from both domain-specific and commonsense knowledge bases improves the quality of generated responses. We also present a model that incorporates knowledge generated by COMET using soft positional encoding and masked self-attention. We show that both retrieved and COMET-generated knowledge improve the system’s performance as measured by automatic metrics and also by human evaluation. Lastly, we present a comparative study on the types of knowledge encoded by our system showing that causal and intentional relationships benefit the generation task more than other types of commonsense relations.

Leveraging Similar Users for Personalized Language Modeling with Limited Data
Charles Welch | Chenxi Gu | Jonathan K. Kummerfeld | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Personalized language models are designed and trained to capture language patterns specific to individual users. This makes them more accurate at predicting what a user will write. However, when a new user joins a platform and not enough text is available, it is harder to build effective personalized language models. We propose a solution for this problem, using a model trained on users that are similar to a new user. In this paper, we explore strategies for finding the similarity between new users and existing ones and methods for using the data from existing users who are a good match. We further explore the trade-off between available data for new users and how well their language can be modeled.

2021

Exploring Self-Identified Counseling Expertise in Online Support Forums
Allison Lahnala | Yuntian Zhao | Charles Welch | Jonathan K. Kummerfeld | Lawrence C An | Kenneth Resnicow | Rada Mihalcea | Verónica Pérez-Rosas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Counseling-Style Reflection Generation Using Generative Pretrained Transformers with Augmented Context
Siqi Shen | Charles Welch | Rada Mihalcea | Verónica Pérez-Rosas
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue

We introduce a counseling dialogue system that seeks to assist counselors while they are learning and refining their counseling skills. The system generates counselors’ reflections – i.e., responses that reflect back on what the client has said given the dialogue history. Our method builds upon the new generative pretrained transformer architecture and enhances it with context augmentation techniques inspired by traditional strategies used during counselor training. Through a set of comparative experiments, we show that the system that incorporates these strategies performs better in the reflection generation task than a system that is just fine-tuned with counseling conversations. To confirm our findings, we present a human evaluation study that shows that our system generates natural-looking reflections that are also stylistically and grammatically correct.

Expressive Interviewing: A Conversational System for Coping with COVID-19
Charles Welch | Allison Lahnala | Veronica Perez-Rosas | Siqi Shen | Sarah Seraj | Larry An | Kenneth Resnicow | James Pennebaker | Rada Mihalcea
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The ongoing COVID-19 pandemic has raised concerns for many regarding personal and public health implications, financial security and economic stability. Alongside many other unprecedented challenges, there are increasing concerns over social isolation and mental health. We introduce Expressive Interviewing – an interview-style conversational system that draws on ideas from motivational interviewing and expressive writing. Expressive Interviewing seeks to encourage users to express their thoughts and feelings through writing by asking them questions about how COVID-19 has impacted their lives. We present relevant aspects of the system’s design and implementation as well as quantitative and qualitative analyses of user interactions with the system. In addition, we conduct a comparative evaluation with a general purpose dialogue system for mental health that shows our system’s potential in helping users to cope with COVID-19 issues.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
Charles Welch | Rada Mihalcea | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initialising with embeddings trained on in-domain data.

Compositional Demographic Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that uses demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.
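
To make the idea of composing a representation from partial demographic information concrete, here is a minimal sketch that averages whichever attribute-specific vectors are available for a word and falls back to a generic vector when none are. Averaging as the composition function, the name `compose_embedding`, and the toy vectors are all assumptions for illustration; the abstract does not state how the paper combines the demographic-specific representations.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def compose_embedding(
    word: str,
    user_attributes: Dict[str, Optional[str]],
    demographic_embeddings: Dict[Tuple[str, str], Dict[str, Vector]],
    generic_embeddings: Dict[str, Vector],
) -> Vector:
    """Average the demographic-specific vectors known for this user.

    Attributes whose value is None are treated as unknown, which models
    the "partial demographic information" case. Averaging is an assumed
    composition function, not necessarily the one used in the paper.
    """
    vectors = []
    for attribute, value in user_attributes.items():
        if value is None:
            continue
        table = demographic_embeddings.get((attribute, value), {})
        if word in table:
            vectors.append(table[word])
    if not vectors:
        # No usable demographic vector: fall back to the generic one.
        return list(generic_embeddings[word])
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional vectors, invented for the example.
generic = {"football": [0.0, 0.0]}
demographic = {
    ("location", "UK"): {"football": [1.0, 0.0]},
    ("age", "18-25"): {"football": [0.0, 1.0]},
}
full = compose_embedding(
    "football", {"location": "UK", "age": "18-25", "gender": None}, demographic, generic
)
partial = compose_embedding("football", {"location": "UK"}, demographic, generic)
unknown = compose_embedding("football", {"location": None}, demographic, generic)
```

With two known attributes the result is the mean of both vectors, with one it is that attribute's vector, and with none it is the generic vector.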

Exploring the Value of Personalized Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we introduce personalized word embeddings, and examine their value for language modeling. We compare the performance of our proposed prediction model when using personalized versus generic word representations, and study how these representations can be leveraged for improved performance. We provide insight into what types of words can be more accurately predicted when building personalized models. Our results show that a subset of words belonging to specific psycholinguistic categories tend to vary more in their representations across users and that combining generic and personalized word embeddings yields the best performance, with a 4.7% relative reduction in perplexity. Additionally, we show that a language model using personalized word embeddings can be effectively used for authorship attribution.
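
For readers unfamiliar with the metric, the following sketch shows how perplexity is computed from per-token log-probabilities and how a relative reduction is derived from two perplexity values. The function names and the numbers in the example are hypothetical; only the 4.7% figure comes from the abstract, and the baseline and improved values below are invented to match it.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# A model assigning probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)

# Invented values: a drop from 100.0 to 95.3 is a 4.7% relative reduction.
reduction = relative_reduction(100.0, 95.3)
```

A relative reduction is scale-free, so it can be compared across datasets whose absolute perplexities differ.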

2018

World Knowledge for Abstract Meaning Representation Parsing
Charles Welch | Jonathan K. Kummerfeld | Song Feng | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Targeted Sentiment to Understand Student Comments
Charles Welch | Rada Mihalcea
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We address the task of targeted sentiment as a means of understanding the sentiment that students hold toward courses and instructors, as expressed by students in their comments. We introduce a new dataset consisting of student comments annotated for targeted sentiment and describe a system that can both identify the courses and instructors mentioned in student comments and label the students’ sentiment toward those entities. Through several comparative evaluations, we show that our system outperforms previous work on a similar task.

Co-authors

Béla Neuendorf 5

Olufunke O. Sarumi 4

Jörg Schlötterer 3

Vasudha Varadarajan 3

Fabio Gruschka 2

David Jurgens 2

Kenneth Resnicow 2

Aadi Sanghani 2

Lawrence C An 1

Christopher Anand 1

Shaina Ashraf 1

Ana-Maria Bucur 1

Scarlett Cleary 1

Nicolas Flores-Herr 1

Michael Fromm 1

Rebekka Görge 1

Kieran Henderson 1

Virendra Jethra 1

Fatemeh Kazemi Vanhari 1

Kevin Lanning 1

Fabian Lechner 1

Siddharth Mangalik 1

Nazia Mowmita 1

Gilles Nawezi 1

James Pennebaker 1

Brian Miguel Pimentel 1

Soujanya Poria 1

H. Andrew Schwartz 1

Akshay Krishna Sirigana 1

Alexander Thomin 1

Severino Trotta 1

Adithya V. Ganesan 1

Isabella Vallejo 1

Venues