Rada Mihalcea - ACL Anthology

Rada Mihalcea

Also published as: Rada F. Mihalcea

2026

Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions
Angana Borah | Rada Mihalcea | Veronica Perez-Rosas
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing challenges in misinformation exposure and susceptibility vary across demographics, as some populations are more vulnerable to misinformation than others. Large language models (LLMs) introduce new dimensions to these challenges through their ability to generate persuasive content at scale and to reinforce existing biases. Our study introduces PANDORA, a framework that investigates the bidirectional persuasion dynamics between LLMs and humans when exposed to misinformative content. We use a multi-agent LLM framework to analyze the spread of misinformation under persuasion among demographic-oriented LLM agents. Our findings show that demographic factors influence LLM susceptibility, with up to 15 percentage point differences in misinformation correctness across groups. Multi-agent LLMs also exhibit echo chamber behavior, aligning with human-like group polarization patterns. Therefore, this work highlights demographic divides in misinformation dynamics and offers insights for future interventions.

When Do Language Models Endorse Limitations on Human Rights Principles?
Keenan Samway | Miu Nicole Takagi | Rada Mihalcea | Bernhard Schölkopf | Ilias Chalkidis | Daniel Hershcovich | Zhijing Jin
Findings of the Association for Computational Linguistics: EACL 2026

As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are abided by in high-stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.

Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models
David Guzman Piedrahita | Irene Strauss | Rada Mihalcea | Zhijing Jin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicitly political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes.

Natural language processing (NLP) now shapes many aspects of our world, yet its potential for positive social impact is underexplored. This paper surveys work in “NLP for Social Good” (NLP4SG) across nine domains relevant to global development and risk agendas, summarizing principal tasks and challenges. We analyze ACL Anthology trends, finding that inclusion and AI harms attract the most research, while domains such as poverty, peacebuilding, and environmental protection remain underexplored. Guided by our review, we outline opportunities for responsible and equitable NLP and conclude with a call for cross-disciplinary partnerships and human-centered approaches to ensure that future NLP technologies advance the public good.

2025

Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries
Roberto Ceraolo | Dmitrii Kharlapenko | Ahmad Khan | Amélie Reymond | Rada Mihalcea | Bernhard Schölkopf | Mrinmaya Sachan | Zhijing Jin
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Recent progress in Large Language Model (LLM) technology has changed our role in interacting with these models. Instead of primarily testing these models with questions we already know answers to, we are now using them for queries where the answers are unknown to us, driven by human curiosity. This shift highlights the growing need to understand curiosity-driven human questions – those that are more complex, open-ended, and reflective of real-world needs. To this end, we present Quriosity, a collection of 13K naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) in the dataset, for which we develop an iterative prompt improvement framework to identify all causal queries and examine their unique linguistic properties, cognitive complexity and source distribution. We also lay the groundwork for exploring efficient identifiers of causal questions, providing six efficient classification models.

MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind
Emilio Villa-Cueva | S M Masrur Ahmed | Rendi Chevi | Jan Christian Blaise Cruz | Kareem Elzeky | Fermin Cristobal | Alham Fikri Aji | Skyler Wang | Rada Mihalcea | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025

Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, models that process dialogues as audio do not consistently outperform transcript-based inputs. Our findings highlight the need to improve multimodal integration and point to open challenges that must be addressed to advance AI’s social understanding.

Chumor 2.0: Towards Better Benchmarking Chinese Humor Understanding from 弱智吧 (Ruo Zhi Ba)
Ruiqi He | Yushu He | Longju Bai | Jiarui Liu | Zhenjie Sun | Zenghao Tang | He Wang | Hanchen Xia | Rada Mihalcea | Naihao Deng
Findings of the Association for Computational Linguistics: ACL 2025

Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct **Chumor**, the first and the largest Chinese humor explanation dataset. **Chumor** is sourced from Ruo Zhi Ba (RZB, 弱智吧), a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that **Chumor** poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE4-turbo. We release **Chumor** at https://huggingface.co/datasets/MichiganNLP/Chumor, our project page is at https://github.com/MichiganNLP/Chumor-2.0, our leaderboard is at https://huggingface.co/spaces/MichiganNLP/Chumor-leaderboard, and our codebase is at https://github.com/MichiganNLP/Chumor-2.0.

Towards Region-aware Bias Evaluation Metrics
Angana Borah | Aparna Garimella | Rada Mihalcea
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

When exposed to human-generated data, language models are known to learn and amplify societal biases. While previous works introduced metrics that can be used to assess the bias in these models, they rely on assumptions that may not be universally true. For instance, a gender bias dimension commonly used by these metrics is that of family–career, but this may not be the only common bias in certain regions of the world. In this paper, we identify topical differences in gender bias across different regions and propose a region-aware bottom-up approach for bias assessment. Several of our proposed region-aware gender bias dimensions are found to be aligned with the human perception of gender biases in these regions.

Mind the (Belief) Gap: Group Identity in the World of LLMs
Angana Borah | Marwa Houalla | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025

Social biases and belief-driven behaviors can significantly impact Large Language Models’ (LLMs’) decisions on several tasks. As LLMs are increasingly used in multi-agent systems for societal simulations, their ability to model fundamental group psychological characteristics remains critical yet under-explored. In this study, we present a multi-agent framework that simulates belief congruence, a classical group psychology theory that plays a crucial role in shaping societal interactions and preferences. Our findings reveal that LLMs exhibit amplified belief congruence compared to humans, across diverse contexts. We further investigate the implications of this behavior on two downstream tasks: (1) misinformation dissemination and (2) LLM learning, finding that belief congruence in LLMs increases misinformation dissemination and impedes learning. To mitigate these negative impacts, we propose strategies inspired by: (1) contact hypothesis, (2) accuracy nudges, and (3) global citizenship framework. Our results show that the best strategies reduce misinformation dissemination by up to 37% and enhance learning by 11%. Bridging social psychology and AI, our work provides insights to navigate real-world interactions using LLMs while addressing belief-driven biases.

InspAIred: Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data
Oana Ignat | Gayathri Ganesh Lakshmy | Rada Mihalcea
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.

Uplifting Lower-Income Data: Strategies for Socioeconomic Perspective Shifts in Large Multi-modal Models
Joan Nwatu | Oana Ignat | Rada Mihalcea
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent work has demonstrated that the unequal representation of cultures and socioeconomic groups in training data leads to biased Large Multi-modal Models (LMMs). To improve LMM performance on underrepresented data, we propose and evaluate several prompting strategies using non-English, geographic, and socioeconomic attributes. We show that these geographic and socioeconomic integrated prompts favor retrieving topic appearances commonly found in data from low-income households across different countries, leading to improved LMM performance on lower-income data. Our analyses identify and highlight contexts where these strategies yield the most improvements.

Evaluating LLMs’ Mathematical and Coding Competency through Ontology-guided Interventions
Pengfei Hong | Navonil Majumder | Deepanway Ghosal | Somak Aditya | Rada Mihalcea | Soujanya Poria
Findings of the Association for Computational Linguistics: ACL 2025

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, GSMore and HumanEval-Core, respectively, of perturbed math and coding problems to probe LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology.

MAiDE-up: Multilingual Deception Detection of AI-generated Hotel Reviews
Oana Ignat | Xiaomeng Xu | Rada Mihalcea
Findings of the Association for Computational Linguistics: NAACL 2025

Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.

Rethinking Table Instruction Tuning
Naihao Deng | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025

Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing, GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.

Voices of Her: Analyzing Gender Differences in the AI Publication World
Yiwen Ding | Jiarui Liu | Zhiheng Lyu | Kun Zhang | Bernhard Schölkopf | Zhijing Jin | Rada Mihalcea
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

While several previous studies have analyzed gender bias in research, we are still missing a comprehensive analysis of gender differences in the AI community, covering diverse topics and different development trends. Using the AI Scholar dataset of 78K researchers in the field of AI, we identify several gender differences: (1) Although female researchers tend to have fewer overall citations than males, this citation difference does not hold for all academic-age groups; (2) There exists large gender homophily in co-authorship on AI papers; (3) Female first-authored papers show distinct linguistic styles, such as longer text, more positive emotion words, and more catchy titles than male first-authored papers. Our analysis provides a window into the current demographic trends in our AI community, and encourages more gender equality and diversity in the future.

Speech-Integrated Modeling for Behavioral Coding in Counseling
Do June Min | Verónica Pérez-Rosas | Kenneth Resnicow | Rada Mihalcea
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Computational models of psychotherapy often ignore vocal cues by relying solely on text. To address this, we propose MISQ, a framework that integrates speech features directly into language models using a speech encoder and lightweight adapter. MISQ improves behavioral analysis in counseling conversations, achieving ~5% relative gains over text-only or indirect speech methods—underscoring the value of vocal signals like tone and prosody.

Eeyore: Realistic Depression Simulation via Expert-in-the-Loop Supervised and Preference Optimization
Siyang Liu | Bianca Brie | Wenda Li | Laura Biester | Andrew Lee | James Pennebaker | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have been previously explored for mental healthcare training and therapy client simulation, but they still fall short in authentically capturing diverse client traits and psychological conditions. We introduce Eeyore, an 8B model optimized for realistic depression simulation through a structured alignment framework, incorporating expert input at every stage. First, we systematically curate real-world depression-related conversations, extracting depressive traits to guide data filtering and psychological profile construction, and use this dataset to instruction-tune Eeyore for profile adherence. Next, to further enhance realism, Eeyore undergoes iterative preference optimization—first leveraging model-generated preferences and then calibrating with a small set of expert-annotated preferences. Throughout the entire pipeline, we actively collaborate with domain experts, developing interactive interfaces to validate trait extraction and iteratively refine structured psychological profiles for clinically meaningful role-play customization. Despite its smaller model size, the Eeyore depression simulation outperforms GPT-4o with SOTA prompting strategies, both in linguistic authenticity and profile adherence.

Are Language Models Consequentialist or Deontological Moral Reasoners?
Keenan Samway | Max Kleiman-Weiner | David Guzman Piedrahita | Rada Mihalcea | Bernhard Schölkopf | Zhijing Jin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments.

Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?
Siqi Shen | Mehar Singh | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Rada Mihalcea
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The value orientation of Large Language Models (LLMs) has been extensively studied, as it can shape user experiences across demographic groups. However, two key challenges remain: (1) the lack of systematic comparison across value probing strategies, despite the Multiple Choice Question (MCQ) setting being vulnerable to perturbations, and (2) the uncertainty over whether probed values capture in-context information or predict models’ real-world actions. In this paper, we systematically compare three widely used value probing methods: token likelihood, sequence perplexity, and text generation. Our results show that all three methods exhibit large variances under non-semantic perturbations in prompts and option formats, with sequence perplexity being the most robust overall. We further introduce two tasks to assess expressiveness: demographic prompting, testing whether probed values adapt to cultural context; and value–action agreement, testing the alignment of probed values with value-based actions. We find that demographic context has little effect on the text generation method, and probed values only weakly correlate with action preferences across all methods. Our work highlights the instability and the limited expressive power of current value probing methods, calling for more reliable LLM value representations.

Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
Neemesh Yadav | Jiarui Liu | Francesco Ortu | Roya Ensafi | Zhijing Jin | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025

The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work.

Benchmarking and Improving LLM Robustness for Personalized Generation
Chimaobi Okite | Naihao Deng | Kiran Bodipati | Huaidian Hou | Joyce Chai | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness of LLMs in personalization, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.

CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation
Naihao Deng | Kapotaksha Das | Rada Mihalcea | Vitaliy Popov | Mohamed Abouelenien
Findings of the Association for Computational Linguistics: ACL 2025

In clinical operations, teamwork can be the crucial factor that determines the final outcome. Prior studies have shown that sufficient collaboration is the key factor that determines the outcome of an operation. To understand how the team practices teamwork during the operation, we collected **CliniDial** from simulations of medical operations. **CliniDial** includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for **CliniDial**. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs’ capabilities on handling data with these characteristics. Experimental results show that **CliniDial** poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial.

R³: “This is My SQL, Are You With Me?” A Consensus-Based Multi-Agent System for Text-to-SQL Tasks
Hanchen Xia | Feng Jiang | Naihao Deng | Cunxiang Wang | Guojiang Zhao | Rada Mihalcea | Yue Zhang
Proceedings of the 4th Table Representation Learning Workshop

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks. To harness their capabilities for Text-to-SQL, we introduce R3 (Review-Rebuttal-Revision), a consensus-based multi-agent system for Text-to-SQL tasks. R3 achieves the new state-of-the-art performance of 89.9 on the Spider test set. In the meantime, R3 achieves 61.80 on the Bird development set. R3 outperforms existing single-LLM and multi-agent Text-to-SQL systems by 1.3% to 8.1% on Spider and Bird, respectively. Surprisingly, we find that for Llama-3-8B, R3 outperforms chain-of-thought prompting by over 20%, even outperforming GPT-3.5 on the Spider development set. We open-source our codebase at https://github.com/1ring2rta/R3.

The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
Longju Bai | Angana Borah | Oana Ignat | Rada Mihalcea
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research.

Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings
Álvaro Vega-Hidalgo | Artem Abzaliev | Thore Bergman | Rada Mihalcea
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Acoustic individual identification of wild animals is an essential task for understanding animal vocalizations within their social contexts, and for facilitating conservation and wildlife monitoring efforts. However, most of the work in this space relies on human efforts, as the development of methods for automatic individual identification is hindered by the lack of data. In this paper, we explore cross-species pre-training to address the task of individual classification in white-faced capuchin monkeys. Using acoustic embeddings from birds and humans, we find that they can be effectively used to identify the calls from individual monkeys. Moreover, we find that joint multi-species representations can lead to further improvements over the use of one representation at a time. Our work demonstrates the potential of cross-species data transfer and multi-species representations, as strategies to address tasks on species with very limited data.

Examining Spanish Counseling with MIDAS: a Motivational Interviewing Dataset in Spanish
Aylin Ece Gunal | Bowen Yi | John D. Piette | Rada Mihalcea | Veronica Perez-Rosas
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Cultural and language factors significantly influence counseling, but Natural Language Processing research has not yet examined whether the findings of conversational analysis for counseling conducted in English apply to other languages. This paper presents a first step towards this direction. We introduce MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset created from public video sources that contains expert annotations for counseling reflections and questions. Using this dataset, we explore language-based differences in counselor behavior in English and Spanish and develop classifiers in monolingual and multilingual settings, demonstrating its applications in counselor behavioral coding tasks.

Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias
Yuen Chen | Vethavikashini Chithrra Raghuram | Justus Mattern | Rada Mihalcea | Zhijing Jin
Findings of the Association for Computational Linguistics: NAACL 2025

Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework.

2024

Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat | Longju Bai | Joan C. Nwatu | Rada Mihalcea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at https://github.com/MichiganNLP/visual_diversity_budget.

Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
Siqi Shen | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Soujanya Poria | Rada Mihalcea
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on cultural-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally-aware language models.

Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs
Naihao Deng | Zhenjie Sun | Ruiqi He | Aman Sikka | Yulong Chen | Lin Ma | Yue Zhang | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2024

Tables differ from unstructured text in that they use structure to organize information. In this paper, we investigate the efficiency of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We pioneer the assessment of LLMs’ performance on image-based table representations. Specifically, we compare five text-based and three image-based table representations, revealing the influence of representation and prompting on LLM performance. We hope our study provides researchers insights into optimizing LLMs’ application in table-related tasks.

Analyzing Occupational Distribution Representation in Japanese Language Models
Katsumi Ibaraki | Winston Wu | Lu Wang | Rada Mihalcea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent advances in large language models (LLMs) have enabled users to generate fluent and seemingly convincing text. However, these models have uneven performance in different languages, which is also associated with undesirable societal biases toward marginalized populations. Specifically, there is relatively little work on Japanese models, despite it being the thirteenth most widely spoken language. In this work, we first develop three Japanese language prompts to probe LLMs’ understanding of Japanese names and their association between gender and occupations. We then evaluate a variety of English, multilingual, and Japanese models, correlating the models’ outputs with occupation statistics from the Japanese Census Bureau from the last 100 years. Our findings indicate that models can associate Japanese names with the correct gendered occupations when using constrained decoding. However, with sampling or greedy decoding, Japanese language models have a preference for a small set of stereotypically gendered occupations, and multilingual models, though trained on Japanese, are not always able to understand Japanese prompts.

Whose wife is it anyway? Assessing bias against same-gender relationships in machine translation
Ian Stewart | Rada Mihalcea
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g., “the lawyer kissed her wife.” We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g., Spanish) and comprised of popular occupation nouns. We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between entities of the same gender. The error rate varies considerably based on the context, and same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems with respect to social relationships.

Proceedings of the Third Workshop on NLP for Positive Impact
Daryna Dementieva | Oana Ignat | Zhijing Jin | Rada Mihalcea | Giorgio Piatti | Joel Tetreault | Steven Wilson | Jieyu Zhao
Proceedings of the Third Workshop on NLP for Positive Impact

Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that “it’s all been solved.” Not surprisingly, this has, in turn, made many NLP researchers – especially those at the beginning of their careers – worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs. While we identify many research areas, many others exist; we do not cover areas currently addressed by LLMs, but where LLMs lag behind in performance or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm.

A Comparative Multidimensional Analysis of Empathetic Systems
Andrew Lee | Jonathan K. Kummerfeld | Larry An | Rada Mihalcea
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, empathetic dialogue systems have received significant attention. While some researchers have noted limitations, e.g., that these systems tend to generate generic utterances, no study has systematically verified these issues. We survey 21 systems, asking what progress has been made on the task. We observe multiple limitations of current evaluation procedures. Most critically, studies tend to rely on a single non-reproducible empathy score, which inadequately reflects the multidimensional nature of empathy. To better understand the differences between systems, we comprehensively analyze each system with automated methods that are grounded in a variety of aspects of empathy. We find that recent systems lack three important aspects of empathy: specificity, reflection levels, and diversity. Based on our results, we discuss problematic behaviors that may have gone undetected in prior evaluations, and offer guidance for developing future systems.

Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
Shinka Mori | Oana Ignat | Andrew Lee | Rada Mihalcea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HeadRoom, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.

EmoBench: Evaluating the Emotional Intelligence of Large Language Models
Sahand Sabour | Siyang Liu | Zheyuan Zhang | June Liu | Jinfeng Zhou | Alvionna Sunaryo | Tatia Lee | Rada Mihalcea | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in Large Language Models (LLMs) have highlighted the need for robust, comprehensive, and challenging benchmarks. Yet, research on evaluating their Emotional Intelligence (EI) is considerably limited. Existing benchmarks have two major shortcomings: first, they mainly focus on emotion recognition, neglecting essential EI capabilities such as emotion management and thought facilitation through emotion understanding; second, they are primarily constructed from existing datasets, which include frequent patterns, explicit information, and annotation errors, leading to unreliable evaluation. We propose EmoBench, a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine EI, including Emotional Understanding and Emotional Application. EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding. Our findings reveal a considerable gap between the EI of existing LLMs and the average human, highlighting a promising direction for future research. Our code and data are publicly available at https://github.com/Sahandfer/EmoBench.

Do LLMs Think Fast and Slow? A Causal Study on Sentiment Analysis
Zhiheng Lyu | Zhijing Jin | Fernando Gonzalez Adauto | Rada Mihalcea | Bernhard Schölkopf | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: EMNLP 2024

Sentiment analysis (SA) aims to identify the sentiment expressed in a piece of text, often in the form of a review. Assuming a review and the sentiment associated with it, in this paper we formulate SA as a combination of two tasks: (1) a causal discovery task that distinguishes whether a review “primes” the sentiment (Causal Hypothesis C1), or the sentiment “primes” the review (Causal Hypothesis C2); and (2) the traditional prediction task to model the sentiment using the review as input. Using the peak-end rule in psychology, we classify a sample as C1 if its overall sentiment score approximates an average of all the sentence-level sentiments in the review, and as C2 if the overall sentiment score approximates an average of the peak and end sentiments. For the prediction task, we use the discovered causal mechanisms behind the samples to improve the performance of LLMs by proposing causal prompts that give the models an inductive bias of the underlying causal graph, leading to substantial improvements by up to 32.13 F1 points on zero-shot five-class SA.
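The C1/C2 rule described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the neutral midpoint of 3.0 on a 1-5 scale, the definition of "peak" as the sentence score farthest from neutral, and the tie-breaking toward C1 are all assumptions made for illustration.

```python
from statistics import mean


def classify_causal_direction(overall, sentence_scores, neutral=3.0):
    """Label a review as "C1" or "C2" (hypothetical helper, not from the paper).

    C1: the overall score is closer to the mean of all sentence-level scores.
    C2: the overall score is closer to the mean of the peak and end scores.
    """
    if not sentence_scores:
        raise ValueError("sentence_scores must not be empty")

    avg_all = mean(sentence_scores)

    # Assumption: the "peak" is the sentence score farthest from neutral.
    peak = max(sentence_scores, key=lambda s: abs(s - neutral))
    end = sentence_scores[-1]
    avg_peak_end = (peak + end) / 2

    # Assumption: ties are resolved in favor of C1.
    if abs(overall - avg_all) <= abs(overall - avg_peak_end):
        return "C1"
    return "C2"


scores = [4.0, 4.0, 1.0, 3.0]  # made-up sentence-level sentiments
print(classify_causal_direction(3.0, scores))  # overall matches the plain average
print(classify_causal_direction(2.0, scores))  # overall matches the peak-end average
```

With these made-up scores, the plain average is 3.0 and the peak-end average is 2.0, so an overall score of 3.0 falls under C1 and an overall score of 2.0 falls under C2.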

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions
Angana Borah | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2024

As Large Language Models (LLMs) continue to evolve, they are increasingly being employed in numerous studies to simulate societies and execute diverse social tasks. However, LLMs are susceptible to societal biases due to their exposure to human-generated data. Given that LLMs are being used to gain insights into various societal aspects, it is essential to mitigate these biases. To that end, our study investigates the presence of implicit gender biases in multi-agent LLM interactions and proposes two strategies to mitigate these biases. We begin by creating a dataset of scenarios where implicit gender biases might arise, and subsequently develop a metric to assess the presence of biases. Our empirical analysis reveals that LLMs generate outputs characterized by strong implicit bias associations (≥ ≈ 50% of the time). Furthermore, these biases tend to escalate following multi-agent interactions. To mitigate them, we propose two strategies: self-reflection with in-context examples (ICE); and supervised fine-tuning. Our research demonstrates that both methods effectively mitigate implicit biases, with the ensemble of fine-tuning and self-reflection proving to be the most successful.

Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
Do June Min | Veronica Perez-Rosas | Ken Resnicow | Rada Mihalcea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-arm bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.

Unsupervised Discrete Representations of American Sign Language
Artem Abzaliev | Rada Mihalcea
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Many modalities are naturally represented as continuous signals, making it difficult to use them with models that expect discrete units, such as LLMs. In this paper, we explore the use of audio compression techniques for the discrete representation of the gestures used in sign language. We train a tokenizer for American Sign Language (ASL) fingerspelling, which discretizes sequences of fingerspelling signs into tokens. We also propose a loss function to improve the interpretability of these tokens such that they preserve both the semantic and the visual information of the signal. We show that the proposed method improves the performance of the discretized sequence on downstream tasks.

Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification
Artem Abzaliev | Humberto Perez-Espinosa | Rada Mihalcea
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.

The Generation Gap: Exploring Age Bias in the Value Systems of Large Language Models
Siyang Liu | Trisha Maturi | Bowen Yi | Siqi Shen | Rada Mihalcea
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We explore the alignment of values in Large Language Models (LLMs) with specific age groups, leveraging data from the World Value Survey across thirteen categories. Through a diverse set of prompts tailored to ensure response robustness, we find a general inclination of LLM values towards younger demographics, especially when compared to the US population. Although a general inclination can be observed, we also found that this inclination toward younger groups can be different across different value categories. Additionally, we explore the impact of incorporating age identity information in prompts and observe challenges in mitigating value discrepancies with different age cohorts. Our findings highlight the age bias in LLMs and provide insights for future work. Materials for our analysis will be available via https://github.com/anonymous

Implicit Personalization in Language Models: A Systematic Study
Zhijing Jin | Nils Heil | Jiarui Liu | Shehzaad Dhuliawala | Yahang Qi | Bernhard Schölkopf | Rada Mihalcea | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: EMNLP 2024

Implicit Personalization (IP) is a phenomenon of language models inferring a user’s background from the implicit cues in the input prompts and tailoring the response based on this inference. While previous work has touched upon various instances of this problem, a unified framework to study this behavior is lacking. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research.

Learning Human Action Representations from Temporal Context in Lifestyle Vlogs
Oana Ignat | Santiago Castro | Weiji Li | Rada Mihalcea
Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing

We address the task of human action representation and show how the approach to generating word representations based on co-occurrence can be adapted to generate human action representations by analyzing their co-occurrence in videos. To this end, we formalize the new task of human action co-occurrence identification in online videos, i.e., determine whether two human actions are likely to co-occur in the same interval of time. We create and make publicly available the Co-Act (Action Co-occurrence) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains.

2023

Empathy Identification Systems are not Accurately Accounting for Context
Andrew Lee | Jonathan K. Kummerfeld | Larry An | Rada Mihalcea
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Understanding empathy in text dialogue data is a difficult, yet critical, skill for effective human-machine interaction. In this work, we ask whether systems are making meaningful progress on this challenge. We consider a simple model that checks if an input utterance is similar to a small set of empathetic examples. Crucially, the model does not look at what the utterance is a response to, i.e., the dialogue context. This model performs comparably to other work on standard benchmarks and even outperforms state-of-the-art models for empathetic rationale extraction by 16.7 points on T-F1 and 4.3 on IOU-F1. This indicates that current systems rely on the surface form of the response, rather than whether it is suitable in context. To confirm this, we create examples with dialogue contexts that change the interpretation of the response and show that current systems continue to label utterances as empathetic. We discuss the implications of our findings, including improvements for empathetic benchmarks and how our model can be an informative baseline.

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu | Naihao Deng | Sahand Sabour | Yilin Jia | Minlie Huang | Rada Mihalcea
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model’s tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
Deepanway Ghosal | Navonil Majumder | Roy Lee | Rada Mihalcea | Soujanya Poria
Findings of the Association for Computational Linguistics: EMNLP 2023

Visual question answering (VQA) is the task of answering questions about an image. The task assumes an understanding of both the image and the question to provide a natural language answer. VQA has gained popularity in recent years due to its potential applications in a wide range of fields, including robotics, education, and healthcare. In this paper, we focus on knowledge-augmented VQA, where answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image. We propose a multimodal framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately. We benchmark our method on the multi-choice question-answering task of the A-OKVQA, Science-QA, VSR, and IconQA datasets using CLIP and BLIP models. We show that the use of language guidance is a simple but powerful and effective strategy for visual question answering. Our language guidance improves the performance of CLIP by 7.6% and BLIP-2 by 4.8% in the challenging A-OKVQA dataset. We also observe consistent improvement in performance on the Science-QA, VSR, and IconQA datasets when using the proposed language guidance. The implementation of LG-VQA is publicly available at https://github.com/declare-lab/LG-VQA.

Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models
Joan Nwatu | Oana Ignat | Rada Mihalcea
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Despite the impressive performance of current AI models reported across various tasks, performance reports often do not include evaluations of how these models perform on the specific groups that will be impacted by these technologies. Among the minority groups under-represented in AI, data from low-income households are often overlooked in data collection and model evaluation. We evaluate the performance of a state-of-the-art vision-language model (CLIP) on a geo-diverse dataset containing household images associated with different income values (DollarStreet) and show that performance inequality exists among households of different income levels. Our results indicate that performance for the poorer groups is consistently lower than the wealthier groups across various topics and countries. We highlight insights that can help mitigate these issues and propose actionable steps for economic-level inclusive AI development.

Cross-Cultural Analysis of Human Values, Morals, and Biases in Folk Tales
Winston Wu | Lu Wang | Rada Mihalcea
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Folk tales are strong cultural and social influences in children’s lives, and they are known to teach morals and values. However, existing studies on folk tales are largely limited to European tales. In our study, we compile a large corpus of over 1,900 tales originating from 27 diverse cultures across six continents. Using a range of lexicons and correlation analyses, we examine how human values, morals, and gender biases are expressed in folk tales across cultures. We discover differences between cultures in prevalent values and morals, as well as cross-cultural trends in problematic gender biases. Furthermore, we find trends of reduced value expression when examining public-domain fiction stories, extrinsically validate our analyses against the multicultural Schwartz Survey of Cultural Values and the Global Gender Gap Report, and find traditional gender biases associated with values, morals, and agency. This large-scale cross-cultural study of folk tales paves the way towards future studies on how literature influences and reflects cultural norms.

VERVE: Template-based ReflectiVE Rewriting for MotiVational IntErviewing
Do June Min | Veronica Perez-Rosas | Ken Resnicow | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2023

Reflective listening is a fundamental skill that counselors must acquire to achieve proficiency in motivational interviewing (MI). It involves responding in a manner that acknowledges and explores the meaning of what the client has expressed in the conversation. In this work, we introduce the task of counseling response rewriting, which transforms non-reflective statements into reflective responses. We introduce VERVE, a template-based rewriting system with paraphrase-augmented training and adaptive template updating. VERVE first creates a template by identifying and filtering out tokens that are not relevant to reflections and constructs a reflective response using the template. Paraphrase-augmented training allows the model to learn less-strict fillings of masked spans, and adaptive template updating helps discover effective templates for rewriting without significantly removing the original content. Using both automatic and human evaluations, we compare our method against text rewriting baselines and show that our framework is effective in turning non-reflective statements into more reflective responses while achieving a good content preservation-reflection style trade-off.

Query Rewriting for Effective Misinformation Discovery
Ashkan Kazemi | Artem Abzaliev | Naihao Deng | Rui Hou | Scott A. Hale | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Scalable Performance Analysis for Vision-Language Models
Santiago Castro | Oana Ignat | Rada Mihalcea
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Joint vision-language models have shown great performance over a diverse set of tasks. However, little is known about their limitations, as the high dimensional space learned by these models makes it difficult to identify semantic errors. Recent work has addressed this problem by designing highly controlled probing task benchmarks. Our paper introduces a more scalable solution that relies on already annotated benchmarks. Our method consists of extracting a large set of diverse features from a vision-language benchmark and measuring their correlation with the output of the target model. We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs; we also uncover novel insights such as CLIP getting confused by concrete words. Our framework is available at https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other multimodal models and benchmarks.
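
As a rough illustration of the correlation step described in this abstract, the sketch below computes a Pearson correlation between one binary feature and per-instance model scores. The feature, the scores, and the helper `pearson` are hypothetical and invented for illustration; they are not taken from the paper or its released code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: 1 if the caption contains a concrete word, else 0,
# paired with a made-up model score for each benchmark instance.
has_concrete_word = [1, 0, 1, 1, 0, 0]
model_score = [0.2, 0.7, 0.3, 0.1, 0.8, 0.6]

r = pearson(has_concrete_word, model_score)
print(f"correlation = {r:.3f}")
```

A strongly negative value for a feature would suggest that the model tends to score lower when that feature is present, which is the kind of signal the abstract describes looking for.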

Beyond Good Intentions: Reporting the Research Landscape of NLP for Social Good
Fernando Adauto | Zhijing Jin | Bernhard Schölkopf | Tom Hope | Mrinmaya Sachan | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2023

With the recent advances in natural language processing (NLP), a vast number of applications have emerged across various use cases. Among the plethora of NLP applications, many academic researchers are motivated to do work that has a positive social impact, in line with the recent initiatives of NLP for Social Good (NLP4SG). However, it is not always obvious to researchers how their research efforts are tackling today’s big social problems. Thus, in this paper, we introduce NLP4SGPapers, a scientific dataset with three associated tasks that can help identify NLP4SG papers and characterize the NLP4SG landscape by: (1) identifying the papers that address a social problem, (2) mapping them to the corresponding UN Sustainable Development Goals (SDGs), and (3) identifying the task they are solving and the methods they are using. Using state-of-the-art NLP models, we address each of these tasks and use them on the entire ACL Anthology, resulting in a visualization workspace that gives researchers a comprehensive overview of the field of NLP4SG. Our website is available at https://nlp4sg.vercel.app. We released our data at https://huggingface.co/datasets/feradauto/NLP4SGPapers and code at https://github.com/feradauto/nlp4sg

You Are What You Annotate: Towards Better Models through Annotator Representations
Naihao Deng | Xinliang Zhang | Siyang Liu | Winston Wu | Lu Wang | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2023

Annotator disagreement is ubiquitous in natural language processing (NLP) tasks. There are multiple reasons for such disagreements, including the subjectivity of the task, difficult cases, unclear guidelines, and so on. Rather than simply aggregating labels to obtain data annotations, we instead try to directly model the diverse perspectives of the annotators, and explicitly account for annotators’ idiosyncrasies in the modeling process by creating representations for each annotator (annotator embeddings) and also their annotations (annotation embeddings). In addition, we propose TID-8, The Inherent Disagreement - 8 dataset, a benchmark that consists of eight existing language understanding datasets that have inherent annotator disagreement. We test our approach on TID-8 and show that our approach helps models learn significantly better from disagreements on six different datasets in TID-8 while increasing model size by fewer than 1% parameters. By capturing the unique tendencies and subjectivity of individual annotators through embeddings, our representations prime AI models to be inclusive of diverse viewpoints.

Word Category Arcs in Literature Across Languages and Genres
Winston Wu | Lu Wang | Rada Mihalcea
Proceedings of the 5th Workshop on Narrative Understanding

Word category arcs measure the progression of word usage across a story. Previous work on arcs has explored structural and psycholinguistic arcs through the course of narratives, but so far it has been limited to English narratives and a narrow set of word categories covering binary emotions and cognitive processes. In this paper, we expand over previous work by (1) introducing a novel, general approach to quantitatively analyze word usage arcs for any word category through a combination of clustering and filtering; and (2) exploring narrative arcs in literature in eight different languages across multiple genres. Through multiple experiments and analyses, we quantify the nature of narratives across languages, corroborating existing work on monolingual narrative arcs as well as drawing new insights about the interpretation of arcs through correlation analyses.
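
To make the notion of a word category arc concrete, here is a minimal sketch that measures the share of category words in each consecutive segment of a story. It covers only the basic measurement, not the clustering and filtering the abstract mentions; the function `category_arc`, the word list, and the toy story are assumptions made for illustration.

```python
def category_arc(tokens, category_words, num_segments=5):
    """Fraction of tokens belonging to a category in each story segment."""
    if not tokens:
        return [0.0] * num_segments
    arc = []
    for i in range(num_segments):
        # Integer boundaries split the story into nearly equal segments.
        start = i * len(tokens) // num_segments
        end = (i + 1) * len(tokens) // num_segments
        segment = tokens[start:end]
        hits = sum(1 for tok in segment if tok.lower() in category_words)
        arc.append(hits / len(segment) if segment else 0.0)
    return arc

# Hypothetical category and toy story, invented for illustration.
fear_words = {"dark", "afraid", "scream"}
story = "the night was dark and she was afraid but dawn came and all was calm".split()
arc = category_arc(story, fear_words, num_segments=3)
print(arc)
```

Plotting such a list against segment position gives the arc; comparing arcs across stories or languages would then be a matter of comparing these lists.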

Reflection of Demographic Background on Word Usage
Aparna Garimella | Carmen Banea | Rada Mihalcea
Computational Linguistics, Volume 49, Issue 2 - June 2023

The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.

Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
Yufan Wu | Yinghui He | Yilin Jia | Rada Mihalcea | Yulong Chen | Naihao Deng
Findings of the Association for Computational Linguistics: EMNLP 2023

Theory of Mind (ToM) is the ability to reason about one’s own and others’ mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first and second-order ToM, we explore higher-order ToM, which involves recursive reasoning on others’ beliefs. We introduce Hi-ToM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of different failure cases of LLMs, and share our thoughts on the implications of our findings on the future of NLP.

Navigating Data Scarcity: Pretraining for Medical Utterance Classification
Do June Min | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 5th Clinical Natural Language Processing Workshop

Pretrained language models leverage self-supervised learning to use large amounts of unlabeled text for learning contextual representations of sequences. However, in the domain of medical conversations, the availability of large, public datasets is limited due to issues of privacy and data management. In this paper, we study the effectiveness of dialog-aware pretraining objectives and multiphase training in using unlabeled data to improve LM training for medical utterance classification. The objectives of pretraining for dialog awareness involve tasks that take into account the structure of conversations, including features such as turn-taking and the roles of speakers. The multiphase training process uses unannotated data in a sequence that prioritizes similarities and connections between different domains. We empirically evaluate these methods on conversational dialog classification tasks in the medical and counseling domains, and find that multiphase training can help achieve higher performance than standard pretraining or finetuning.

2022

Leveraging Similar Users for Personalized Language Modeling with Limited Data
Charles Welch | Chenxi Gu | Jonathan K. Kummerfeld | Veronica Perez-Rosas | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Personalized language models are designed and trained to capture language patterns specific to individual users. This makes them more accurate at predicting what a user will write. However, when a new user joins a platform and not enough text is available, it is harder to build effective personalized language models. We propose a solution for this problem, using a model trained on users that are similar to a new user. In this paper, we explore strategies for finding the similarity between new users and existing ones and methods for using the data from existing users who are a good match. We further explore the trade-off between available data for new users and how well their language can be modeled.

How Well Do You Know Your Audience? Toward Socially-aware Question Generation
Ian Stewart | Rada Mihalcea
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

When writing, a person may need to anticipate questions from their audience, but different social groups may ask very different types of questions. If someone is writing about a problem they want to resolve, what kind of follow-up question will a domain expert ask, and could the writer better address the expert’s information needs by rewriting their original post? In this paper, we explore the task of socially-aware question generation. We collect a data set of questions and posts from social media, including background information about the question-askers’ social groups. We find that different social groups, such as experts and novices, consistently ask different types of questions. We train several text-generation models that incorporate social information, and we find that a discrete social-representation model outperforms the text-only model when different social groups ask highly different questions from one another. Our work provides a framework for developing text generation models that can help writers anticipate the information expectations of highly different social groups.

In-the-Wild Video Question Answering
Santiago Castro | Naihao Deng | Pingxuan Huang | Mihai Burzo | Rada Mihalcea
Proceedings of the 29th International Conference on Computational Linguistics

Existing video understanding datasets mostly focus on human interactions, with little attention being paid to the “in the wild” settings, where the videos are recorded outdoors. We propose WILDQA, a video understanding dataset of videos recorded in outside settings. In addition to video question answering (Video QA), we also introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WILDQA poses new challenges to the vision and language research communities. The dataset is available at https://lit.eecs.umich.edu/wildqa/.

Modality-specific Learning Rates for Effective Multimodal Additive Late-fusion
Yiqun Yao | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2022

In multimodal machine learning, additive late-fusion is a straightforward approach to combine the feature representations from different modalities, in which the final prediction can be formulated as the sum of unimodal predictions. While it has been found that certain late-fusion models can achieve competitive performance with lower computational costs compared to complex multimodal interactive models, how to effectively search for a good late-fusion model is still an open question. Moreover, for different modalities, the best unimodal models may work under significantly different learning rates due to the nature of the modality and the computational flow of the model; thus, selecting a global learning rate for late-fusion models can result in a vanishing gradient for some modalities. To help address these issues, we propose a Modality-Specific Learning Rate (MSLR) method to effectively build late-fusion multimodal models from fine-tuned unimodal models. We investigate three different strategies to assign learning rates to different modalities. Our experiments show that MSLR outperforms global learning rates on multiple tasks and settings, and enables the models to effectively learn each modality.
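
The idea of additive late fusion with a separate learning rate per modality can be sketched as follows. This is a toy with linear unimodal predictors and plain SGD, not the MSLR method itself; the modality names, feature values, and learning rates are all assumptions made for illustration.

```python
def predict(weights, features):
    """Additive late fusion: the prediction is the sum of unimodal predictions."""
    return sum(
        sum(w * x for w, x in zip(weights[m], features[m])) for m in weights
    )

def sgd_step(weights, features, target, learning_rates):
    """One squared-error SGD step with a separate learning rate per modality."""
    error = predict(weights, features) - target
    for m in weights:
        lr = learning_rates[m]
        # Gradient of 0.5 * error**2 with respect to each weight is error * x.
        weights[m] = [w - lr * error * x for w, x in zip(weights[m], features[m])]
    return 0.5 * error ** 2

# Hypothetical modalities, features, and learning rates.
weights = {"text": [0.0, 0.0], "audio": [0.0]}
learning_rates = {"text": 0.1, "audio": 0.01}
example = {"text": [1.0, 0.5], "audio": [2.0]}

losses = [
    sgd_step(weights, example, target=1.0, learning_rates=learning_rates)
    for _ in range(20)
]
print(losses[0], losses[-1])
```

In a deep learning framework, a comparable effect is usually obtained by giving each modality's parameters its own optimizer parameter group with its own learning rate.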

Towards Understanding the Relation between Gestures and Language
Artem Abzaliev | Andrew Owens | Rada Mihalcea
Proceedings of the 29th International Conference on Computational Linguistics

In this paper, we explore the relation between gestures and language. Using a multimodal dataset, consisting of TED talks where the language is aligned with the gestures made by the speakers, we adapt a semi-supervised multimodal model to learn gesture embeddings. We show that gestures are predictive of the native language of the speaker, and that gesture embeddings further improve language prediction results. In addition, gesture embeddings might contain some linguistic information, as we show by probing embeddings for psycholinguistic categories. Finally, we analyze the words that lead to the most expressive gestures and find that function words drive the expressiveness of gestures.

Demographic-Aware Language Model Fine-tuning as a Bias Mitigation Technique
Aparna Garimella | Rada Mihalcea | Akhash Amarnath
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

BERT-like language models (LMs), when exposed to large unstructured datasets, are known to learn and sometimes even amplify the biases present in such data. These biases generally reflect social stereotypes with respect to gender, race, age, and others. In this paper, we analyze the variations in gender and racial biases in BERT, a large pre-trained LM, when exposed to different demographic groups. Specifically, we investigate the effect of fine-tuning BERT on text authored by historically disadvantaged demographic groups in comparison to that by advantaged groups. We show that simply fine-tuning BERT-like LMs on text authored by certain demographic groups can result in the mitigation of social biases in these LMs against various target groups.

Analyzing the Effects of Annotator Gender across NLP Tasks
Laura Biester | Vanita Sharma | Ashkan Kazemi | Naihao Deng | Steven Wilson | Rada Mihalcea
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

Recent studies have shown that for subjective annotation tasks, the demographics, lived experiences, and identity of annotators can have a large impact on how items are labeled. We expand on this work, hypothesizing that gender may correlate with differences in annotations for a number of NLP benchmarks, including those that are fairly subjective (e.g., affect in text) and those that are typically considered to be objective (e.g., natural language inference). We develop a robust framework to test for differences in annotation across genders for four benchmark datasets. While our results largely show a lack of statistically significant differences in annotation by males and females for these tasks, the framework can be used to analyze differences in annotation between various other demographic groups in future work. Finally, we note that most datasets are collected without annotator demographics and released only in aggregate form; we call on the community to consider annotator demographics as data is collected, and to release dis-aggregated data to allow for further work analyzing variability among annotators.

PAIR: Prompt-Aware margIn Ranking for Counselor Reflection Scoring in Motivational Interviewing
Do June Min | Verónica Pérez-Rosas | Kenneth Resnicow | Rada Mihalcea
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Counselor reflection is a core verbal skill used by mental health counselors to express understanding and affirmation of the client’s experience and concerns. In this paper, we propose a system for the analysis of counselor reflections. Specifically, our system takes as input one dialog turn containing a client prompt and a counselor response, and outputs a score indicating the level of reflection in the counselor response. We compile a dataset consisting of different levels of reflective listening skills, and propose the Prompt-Aware margIn Ranking (PAIR) framework that contrasts positive and negative prompt and response pairs using specially designed multi-gap and prompt-aware margin ranking losses. Through empirical evaluations and deployment of our system in a real-life educational environment, we show that our analysis model outperforms several baselines on different metrics, and can be used to provide useful feedback to counseling trainees.
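
A margin ranking loss of the general kind named in this abstract can be sketched as below. The level names, scores, and margin values are hypothetical, and the multi-gap scheme shown (a larger margin for a larger quality gap) is an assumed reading of "multi-gap", not the paper's exact formulation; the prompt-aware contrast with mismatched prompts is not modeled here.

```python
def margin_ranking_loss(score_high, score_low, margin):
    """Hinge loss that is zero once score_high exceeds score_low by the margin."""
    return max(0.0, margin - (score_high - score_low))

# Hypothetical scores for three responses to the same client prompt,
# ordered by annotated reflection level (complex > simple > none).
scores = {"complex": 0.9, "simple": 0.6, "none": 0.5}

# Assumed multi-gap scheme: a larger quality gap demands a larger margin.
pairs = [
    ("complex", "simple", 0.2),
    ("simple", "none", 0.2),
    ("complex", "none", 0.4),
]
total = sum(
    margin_ranking_loss(scores[high], scores[low], margin)
    for high, low, margin in pairs
)
print(round(total, 3))
```

Only the pair whose score gap falls short of its margin contributes to the loss, so training pressure concentrates on responses that are not yet separated clearly enough.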

FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework
Santiago Castro | Ruoyao Wang | Pingxuan Huang | Ian Stewart | Oana Ignat | Nan Liu | Jonathan Stroud | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER – a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model’s understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBER benchmark does not share the weaknesses of the current state-of-the-art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation, thus making our framework challenging for the current state-of-the-art systems to solve; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. The FIBER dataset and our code are available at https://lit.eecs.umich.edu/fiber/.

CICERO: A Dataset for Contextualized Commonsense Inference in Dialogues
Deepanway Ghosal | Siqi Shen | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper addresses the problem of dialogue reasoning with contextualized commonsense inference. We curate CICERO, a dataset of dyadic conversations with five types of utterance-level reasoning-based inferences: cause, subsequent event, prerequisite, motivation, and emotional reaction. The dataset contains 53,105 such inferences from 5,672 dialogues. We use this dataset to solve relevant generative and discriminative tasks: generation of cause and subsequent event; generation of prerequisite, motivation, and listener’s emotional reaction; and selection of plausible alternatives. Our results ascertain the value of such dialogue-centric commonsense knowledge datasets. It is our hope that CICERO will open new research avenues into commonsense-based dialogue reasoning.

Using Paraphrases to Study Properties of Contextual Embeddings
Laura Burdick | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We use paraphrases as a unique source of data to analyze contextualized embeddings, with a particular focus on BERT. Because paraphrases naturally encode consistent word and phrase semantics, they provide a unique lens for investigating properties of embeddings. Using the Paraphrase Database’s alignments, we study words within paraphrases as well as phrase representations. We find that contextual embeddings effectively handle polysemous words, but give synonyms surprisingly different representations in many cases. We confirm previous findings that BERT is sensitive to word order, but find slightly different patterns than prior work in terms of the level of contextualization across BERT’s layers.

Logical Fallacy Detection
Zhijing Jin | Abhinav Lalwani | Tejas Vaidhya | Xiaoyu Shen | Yiwen Ding | Zhiheng Lyu | Mrinmaya Sachan | Rada Mihalcea | Bernhard Schölkopf
Findings of the Association for Computational Linguistics: EMNLP 2022

Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of logical fallacy detection, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% F1 scores on Logic and 4.51% on LogicClimate. We encourage future work to explore this task since (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. Our dataset and code are available at https://github.com/causalNLP/logical-fallacy

Text-Aware Graph Embeddings for Donation Behavior Prediction
MeiXing Dong | Xueming Xu | Rada Mihalcea
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Predicting user behavior is essential for a large number of applications including recommender and dialog systems, and more broadly in domains such as healthcare, education, and economics. In this paper, we show that we can effectively predict donation behavior by using text-aware graph models, building upon graphs that connect user behaviors and their interests. Using a university donation dataset, we show that the graph representation significantly improves over learning from textual representations. Moreover, we show how incorporating implicit information inferred from text associated with the graph entities brings additional improvements. Our results demonstrate the role played by text-aware graph representations in predicting donation behavior.

Knowledge Enhanced Reflection Generation for Counseling Dialogues
Siqi Shen | Veronica Perez-Rosas | Charles Welch | Soujanya Poria | Rada Mihalcea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we study the effect of commonsense and domain knowledge while generating responses in counseling conversations using retrieval and generative methods for knowledge integration. We propose a pipeline that collects domain knowledge through web mining, and show that retrieval from both domain-specific and commonsense knowledge bases improves the quality of generated responses. We also present a model that incorporates knowledge generated by COMET using soft positional encoding and masked self-attention. We show that both retrieved and COMET-generated knowledge improve the system’s performance as measured by automatic metrics and also by human evaluation. Lastly, we present a comparative study on the types of knowledge encoded by our system showing that causal and intentional relationships benefit the generation task more than other types of commonsense relations.

Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022)
Lingfei Wu | Bang Liu | Rada Mihalcea | Jian Pei | Yue Zhang | Yunyao Li
Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022)

Deep Learning for Text Style Transfer: A Survey
Di Jin | Zhijing Jin | Zhiting Hu | Olga Vechtomova | Rada Mihalcea
Computational Linguistics, Volume 48, Issue 1 - March 2022

Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and recently has re-gained significant attention thanks to the promising performance brought by deep neural models. In this article, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task.

Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering
Deepanway Ghosal | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We propose a simple refactoring of multi-choice question answering (MCQA) tasks as a series of binary classifications. The MCQA task is generally performed by scoring each (question, answer) pair normalized over all the pairs, and then selecting the answer from the pair that yields the highest score. For n answer choices, this is equivalent to an n-class classification setup where only one class (true answer) is correct. We instead show that classifying (question, true answer) as positive instances and (question, false answer) as negative instances is significantly more effective across various models and datasets. We show the efficacy of our proposed approach in different tasks – abductive reasoning, commonsense question answering, science question answering, and sentence completion. Our DeBERTa binary classification model reaches the top or close to the top performance on public leaderboards for these tasks. The source code of the proposed approach is available at https://github.com/declare-lab/TEAM.
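
The refactoring described in this abstract can be illustrated with a small sketch: each multi-choice item becomes one binary instance per answer choice. The helper names, the toy question, and the word-overlap scorer are hypothetical stand-ins, and choosing the highest-scoring pair at inference is an assumption rather than something the abstract specifies.

```python
def to_binary_instances(question, choices, correct_index):
    """Turn one multi-choice item into one binary instance per answer choice."""
    return [
        {"question": question, "answer": choice, "label": int(i == correct_index)}
        for i, choice in enumerate(choices)
    ]

def select_answer(question, choices, score_fn):
    """At inference, score each pair independently and pick the highest."""
    scores = [score_fn(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def overlap_score(q, a):
    """Stand-in scorer (word overlap); a real system would use a trained classifier."""
    q_words = set(q.lower().rstrip("?").split())
    return len(q_words & set(a.lower().split()))

# Toy item, invented for illustration.
question = "What do plants need for photosynthesis?"
choices = ["sunlight for photosynthesis", "darkness", "sand"]

instances = to_binary_instances(question, choices, correct_index=0)
labels = [inst["label"] for inst in instances]
picked = select_answer(question, choices, overlap_score)
print(labels, picked)
```

Because each pair is labeled on its own, the training signal no longer depends on normalizing scores across the other choices of the same question.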

2021

WhyAct: Identifying Action Reasons in Lifestyle Vlogs
Oana Ignat | Santiago Castro | Hanwen Miao | Weiji Li | Rada Mihalcea
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We aim to automatically identify human action reasons in online videos. We focus on the widespread genre of lifestyle vlogs, in which people perform actions while verbally describing them. We introduce and make publicly available the WhyAct dataset, consisting of 1,077 visual actions manually annotated with their reasons. We describe a multimodal model that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.

CIDER: Commonsense Inference for Dialogue Explanation and Reasoning
Deepanway Ghosal | Pengfei Hong | Siqi Shen | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Commonsense inference to understand and explain human language is a fundamental research problem in natural language processing. Explaining human conversations poses a great challenge as it requires contextual understanding, planning, inference, and several aspects of reasoning including causal, temporal, and commonsense reasoning. In this work, we introduce CIDER – a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference. Extracting such rich explanations from conversations can be conducive to improving several downstream applications. The annotated triplets are categorized by the type of commonsense knowledge present (e.g., causal, conditional, temporal). We set up three different tasks conditioned on the annotated dataset: Dialogue-level Natural Language Inference, Span Extraction, and Multi-choice Span Selection. Baseline results obtained with transformer-based models reveal that the tasks are difficult, paving the way for promising future research. The dataset and the baseline implementations are publicly available at https://github.com/declare-lab/CIDER.

Hitting your MARQ: Multimodal ARgument Quality Assessment in Long Debate Video
Md Kamrul Hasan | James Spann | Masum Hasan | Md Saiful Islam | Kurtis Haut | Rada Mihalcea | Ehsan Hoque
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The combination of gestures, intonations, and textual content plays a key role in argument delivery. However, the current literature mostly considers textual content while assessing the quality of an argument, and it is limited to datasets containing short sequences (18-48 words). In this paper, we study argument quality assessment in a multimodal context, and experiment on DBATES, a publicly available dataset of long debate videos. First, we propose a set of interpretable debate centric features such as clarity, content variation, body movement cues, and pauses, inspired by theories of argumentation quality. Second, we design the Multimodal ARgument Quality assessor (MARQ) – a hierarchical neural network model that summarizes the multimodal signals on long sequences and enriches the multimodal embedding with debate centric features. Our proposed MARQ model achieves an accuracy of 81.91% on the argument quality prediction task and outperforms established baseline models with an error rate reduction of 22.7%. Through ablation studies, we demonstrate the importance of multimodal cues in modeling argument quality.

Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health
Andrew Lee | Jonathan K. Kummerfeld | Larry An | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2021

Many statistical models have high accuracy on test benchmarks, but are not explainable, struggle in low-resource scenarios, cannot be reused for multiple tasks, and cannot easily integrate domain expertise. These factors limit their use, particularly in settings such as mental health, where it is difficult to annotate datasets and model outputs have significant impact. We introduce a micromodel architecture to address these challenges. Our approach allows researchers to build interpretable representations that embed domain knowledge and provide explanations throughout the model’s decision process. We demonstrate the idea on multiple mental health tasks: depression classification, PTSD classification, and suicidal risk assessment. Our systems consistently produce strong results, even in low-resource scenarios, and are more interpretable than alternative methods.

Room to Grow: Understanding Personal Characteristics Behind Self Improvement Using Social Media
MeiXing Dong | Xueming Xu | Yiwei Zhang | Ian Stewart | Rada Mihalcea
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Many people aim for change, but not everyone succeeds. While there are a number of social psychology theories that propose motivation-related characteristics of those who persist with change, few computational studies have explored the motivational stage of personal change. In this paper, we investigate a new dataset consisting of the writings of people who manifest intention to change, some of whom persist while others do not. Using a variety of linguistic analysis techniques, we first examine the writing patterns that distinguish the two groups of people. Persistent people tend to reference more topics related to long-term self-improvement and use a more complicated writing style. Drawing on these consistent differences, we build a classifier that can reliably identify the people more likely to persist, based on their language. Our experiments provide new insights into the motivation-related behavior of people who persist with their intention to change.

STaCK: Sentence Ordering with Temporal Commonsense Knowledge
Deepanway Ghosal | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence order prediction is the task of finding the correct order of sentences in a randomly ordered document. Correctly ordering the sentences requires an understanding of coherence with respect to the chronological sequence of events described in the text. Document-level contextual understanding and commonsense knowledge centered around these events are often essential in uncovering this coherence and predicting the exact chronological order. In this paper, we introduce STaCK — a framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences. Our graph network accumulates temporal evidence using knowledge of ‘past’ and ‘future’ and formulates sentence ordering as a constrained edge classification problem. We report results on five different datasets, and empirically show that the proposed method is naturally suitable for order prediction. The implementation of this work is available at: https://github.com/declare-lab/sentence-ordering.
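As a rough illustration of how pairwise "relative order" predictions can be turned into a full ordering, the sketch below decodes a sequence with a topological sort. This is not the STaCK implementation (which is in the repository linked above); the function name `order_from_pairs` and the hand-written pair predictions are hypothetical, and the graph neural network that would produce such predictions is omitted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def order_from_pairs(num_sentences, precedes):
    """Recover a sentence order from pairwise precedence predictions.

    precedes: iterable of (i, j) pairs meaning "sentence i comes before j".
    Raises graphlib.CycleError if the predictions are contradictory.
    """
    sorter = TopologicalSorter({i: set() for i in range(num_sentences)})
    for before, after in precedes:
        sorter.add(after, before)  # `after` must come later than `before`
    return list(sorter.static_order())


# Hypothetical predictions for four shuffled sentences (indices 0..3).
pairs = [(2, 0), (2, 1), (2, 3), (0, 3), (0, 1), (3, 1)]
predicted_order = order_from_pairs(4, pairs)
```

With a complete and consistent set of pairwise predictions, as in this toy input, the ordering is unique. Real predictions may contain cycles, which a practical decoder would have to resolve.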

Mining the Cause of Political Decision-Making from Social Media: A Case Study of COVID-19 Policies across the US States
Zhijing Jin | Zeyu Peng | Tejas Vaidhya | Bernhard Schoelkopf | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2021

Mining the causes of political decision-making is an active research area in the field of political science. In the past, most studies have focused on long-term policies that are collected over several decades of time, and have primarily relied on surveys as the main source of predictors. However, the recent COVID-19 pandemic has given rise to a new political phenomenon, where political decision-making consists of frequent short-term decisions, all on the same controlled topic—the pandemic. In this paper, we focus on the question of how public opinion influences policy decisions, while controlling for confounders such as COVID-19 case increases or unemployment rates. Using a dataset consisting of Twitter data from the 50 US states, we classify the sentiments toward governors of each state, and conduct controlled studies and comparisons. Based on the compiled samples of sentiments, policies, and confounders, we conduct causal inference to discover trends in political decision-making across different states.

Exploring the Role of Context in Utterance-level Emotion, Act and Intent Classification in Conversations: An Empirical Study
Deepanway Ghosal | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

MUSER: MUltimodal Stress detection using Emotion Recognition as an Auxiliary Task
Yiqun Yao | Michalis Papakostas | Mihai Burzo | Mohamed Abouelenien | Rada Mihalcea
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The capability to automatically detect human stress can benefit artificial intelligent agents involved in affective computing and human-computer interaction. Stress and emotion are both human affective states, and stress has proven to have important implications on the regulation and expression of emotion. Although a series of methods have been established for multimodal stress detection, limited steps have been taken to explore the underlying inter-dependence between stress and emotion. In this work, we investigate the value of emotion recognition as an auxiliary task to improve stress detection. We propose MUSER – a transformer-based model architecture and a novel multi-task learning algorithm with speed-based dynamic sampling strategy. Evaluation on the Multimodal Stressed Emotion (MuSE) dataset shows that our model is effective for stress detection with both internal and external auxiliary tasks, and achieves state-of-the-art results.

Extractive and Abstractive Explanations for Fact-Checking and Evaluation of News
Ashkan Kazemi | Zehua Li | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

In this paper, we explore the construction of natural language explanations for news claims, with the goal of assisting fact-checking and news evaluation applications. We experiment with two methods: (1) an extractive method based on Biased TextRank – a resource-effective unsupervised graph-based algorithm for content extraction; and (2) an abstractive method based on the GPT-2 language model. We perform comparative evaluations on two misinformation datasets in the political and health news domains, and find that the extractive method shows the most promise.

Evaluating Automatic Speech Recognition Quality and Its Impact on Counselor Utterance Coding
Do June Min | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Automatic speech recognition (ASR) is a crucial step in many natural language processing (NLP) applications, as often available data consists mainly of raw speech. Since the result of the ASR step is considered as a meaningful, informative input to later steps in the NLP pipeline, it is important to understand the behavior and failure mode of this step. In this work, we analyze the quality of ASR in the psychotherapy domain, using motivational interviewing conversations between therapists and clients. We conduct domain agnostic and domain-relevant evaluations using standard evaluation metrics and also identify domain-relevant keywords in the ASR output. Moreover, we empirically study the effect of mixing ASR and manual data during the training of a downstream NLP model, and also demonstrate how additional local context can help alleviate the error introduced by noisy ASR transcripts.

Exploring Self-Identified Counseling Expertise in Online Support Forums
Allison Lahnala | Yuntian Zhao | Charles Welch | Jonathan K. Kummerfeld | Lawrence C An | Kenneth Resnicow | Rada Mihalcea | Verónica Pérez-Rosas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact
Zhijing Jin | Geeticka Chauhan | Brian Tse | Mrinmaya Sachan | Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Analyzing the Surprising Variability in Word Embedding Stability Across Languages
Laura Burdick | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Word embeddings are powerful representations that form the foundation of many natural language processing architectures, both in English and in other languages. To gain further insight into word embeddings, we explore their stability (e.g., overlap between the nearest neighbors of a word in different embedding spaces) in diverse languages. We discuss linguistic properties that are related to stability, drawing out insights about correlations with affixing, language gender systems, and other features. This has implications for embedding use, particularly in research that uses them to study language trends.
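The stability notion mentioned above (overlap between a word's nearest neighbors in different embedding spaces) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity, a small neighborhood size `k`, and two invented toy spaces. It is not the paper's experimental setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(word, space, k):
    """Return the k words closest to `word` in `space` (a dict word -> vector)."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(others[:k])


def stability(word, space_a, space_b, k=2):
    """Fraction of the k nearest neighbors shared by the two spaces."""
    shared = nearest_neighbors(word, space_a, k) & nearest_neighbors(word, space_b, k)
    return len(shared) / k


# Two toy 2-dimensional spaces over the same vocabulary (invented values).
space_a = {"cat": (1.0, 0.0), "dog": (0.9, 0.1), "pet": (0.8, 0.3),
           "car": (0.0, 1.0), "bus": (0.1, 0.9)}
space_b = {"cat": (1.0, 0.0), "dog": (0.95, 0.05), "pet": (0.1, 0.9),
           "car": (0.7, 0.4), "bus": (0.0, 1.0)}

cat_stability = stability("cat", space_a, space_b, k=2)
```

In this toy example, "cat" keeps "dog" as a neighbor in both spaces but swaps "pet" for "car", so half of its neighborhood is shared.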

2020

Biased TextRank: Unsupervised Graph-Based Content Extraction
Ashkan Kazemi | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

We introduce Biased TextRank, a graph-based content extraction method inspired by the popular TextRank algorithm that ranks text spans according to their importance for language processing tasks and according to their relevance to an input “focus.” Biased TextRank enables focused content extraction for text by modifying the random restarts in the execution of TextRank. The random restart probabilities are assigned based on the relevance of the graph nodes to the focus of the task. We present two applications of Biased TextRank: focused summarization and explanation extraction, and show that our algorithm leads to improved performance on two different datasets by significant ROUGE-N score margins. Much like its predecessor, Biased TextRank is unsupervised, easy to implement and orders of magnitude faster and lighter than current state-of-the-art Natural Language Processing methods for similar tasks.
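The idea of modifying the random restarts can be sketched as a PageRank power iteration whose restart distribution is proportional to each node's relevance to the focus. This is a minimal sketch, not the authors' implementation: the function name `biased_rank`, the damping value, the toy graph, and the relevance scores are all assumptions made for illustration.

```python
def biased_rank(weights, bias, damping=0.85, iterations=100):
    """PageRank-style power iteration with non-uniform restart probabilities.

    weights[i][j]: non-negative edge weight between text spans i and j.
    bias[i]: relevance of span i to the focus (hypothetical scores).
    Assumes every node has at least one outgoing edge with positive weight.
    """
    n = len(weights)
    total_bias = sum(bias)
    restart = [b / total_bias for b in bias]   # biased restart distribution
    out_sum = [sum(row) for row in weights]    # total outgoing weight per node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) * restart[i]
            + damping * sum(weights[j][i] / out_sum[j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy fully connected graph over three text spans (invented weights).
graph = [[0.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]

uniform = biased_rank(graph, bias=[1.0, 1.0, 1.0])  # uniform restarts
focused = biased_rank(graph, bias=[1.0, 0.0, 0.0])  # restarts favor span 0
```

With uniform restart probabilities the three spans in this symmetric graph receive equal scores. Concentrating the restart mass on span 0 raises its score relative to the others, which is the effect the focus is meant to have.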

KinGDOM: Knowledge-Guided DOMain Adaptation for Sentiment Analysis
Deepanway Ghosal | Devamanyu Hazarika | Abhinaba Roy | Navonil Majumder | Rada Mihalcea | Soujanya Poria
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Cross-domain sentiment analysis has received significant attention in recent years, prompted by the need to combat the domain gap between different applications that make use of sentiment analysis. In this paper, we take a novel perspective on this task by exploring the role of external commonsense knowledge. We introduce a new framework, KinGDOM, which utilizes the ConceptNet knowledge graph to enrich the semantics of a document by providing both domain-specific and domain-general background concepts. These concepts are learned by training a graph convolutional autoencoder that leverages inter-domain concepts in a domain-invariant manner. Conditioning a popular domain-adversarial baseline method with these learned concepts helps improve its performance over state-of-the-art approaches, demonstrating the efficacy of our proposed framework.

COSMIC: COmmonSense knowledge for eMotion Identification in Conversations
Deepanway Ghosal | Navonil Majumder | Alexander Gelbukh | Rada Mihalcea | Soujanya Poria
Findings of the Association for Computational Linguistics: EMNLP 2020

In this paper, we address the task of utterance level emotion recognition in conversations using commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations, and builds upon them to learn interactions between interlocutors participating in a conversation. Current state-of-the-art methods often encounter difficulties in context propagation, emotion shift detection, and differentiating between related emotion classes. By learning distinct commonsense representations, COSMIC addresses these challenges and achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets. Our code is available at https://github.com/declare-lab/conv-emotion.

Exploring the Value of Personalized Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we introduce personalized word embeddings, and examine their value for language modeling. We compare the performance of our proposed prediction model when using personalized versus generic word representations, and study how these representations can be leveraged for improved performance. We provide insight into what types of words can be more accurately predicted when building personalized models. Our results show that a subset of words belonging to specific psycholinguistic categories tend to vary more in their representations across users and that combining generic and personalized word embeddings yields the best performance, with a 4.7% relative reduction in perplexity. Additionally, we show that a language model using personalized word embeddings can be effectively used for authorship attribution.

Compositional Demographic Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.

MIME: MIMicking Emotions for Empathetic Response Generation
Navonil Majumder | Pengfei Hong | Shanshan Peng | Jiankun Lu | Deepanway Ghosal | Alexander Gelbukh | Rada Mihalcea | Soujanya Poria
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Current approaches to empathetic response generation view the set of emotions expressed in the input text as a flat structure, where all the emotions are treated uniformly. We argue that empathetic responses often mimic the emotion of the user to a varying degree, depending on its positivity or negativity and content. We show that the consideration of these polarity-based emotion clusters and emotional mimicry results in improved empathy and contextual relevance of the response as compared to the state-of-the-art. Also, we introduce stochasticity into the emotion mixture that yields emotionally more varied empathetic responses than the previous work. We demonstrate the importance of these factors to empathetic response generation using both automatic- and human-based evaluations. The implementation of MIME is publicly available at https://github.com/declare-lab/MIME.

“Judge me by my size (noun), do you?” YodaLib: A Demographic-Aware Humor Generation Framework
Aparna Garimella | Carmen Banea | Nabil Hossain | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

The subjective nature of humor makes computerized humor generation a challenging task. We propose an automatic humor generation framework for filling the blanks in Mad Libs® stories, while accounting for the demographic backgrounds of the desired audience. We collect a dataset consisting of such stories, which are filled in and judged by carefully selected workers on Amazon Mechanical Turk. We build upon the BERT platform to predict location-biased word fillings in incomplete sentences, and we fine-tune BERT to classify location-specific humor in a sentence. We leverage these components to produce YodaLib, a fully-automated Mad Libs style humor generation framework, which selects and ranks appropriate candidate words and sentences in order to generate a coherent and funny story tailored to certain demographics. Our experimental results indicate that YodaLib outperforms a previous semi-automated approach proposed for this task, while also surpassing human annotators in both qualitative and quantitative analyses.

Building Location Embeddings from Physical Trajectories and Textual Representations
Laura Biester | Carmen Banea | Rada Mihalcea
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Word embedding methods have become the de-facto way to represent words, having been successfully applied to a wide array of natural language processing tasks. In this paper, we explore the hypothesis that embedding methods can also be effectively used to represent spatial locations. Using a new dataset consisting of the location trajectories of 729 students over a seven month period and text data related to those locations, we implement several strategies to create location embeddings, which we then use to create embeddings of the sequences of locations a student has visited. To identify the surface level properties captured in the representations, we propose a number of probing tasks such as the presence of a specific location in a sequence or the type of activities that take place at a location. We then leverage the representations we generated and employ them in more complex downstream tasks ranging from predicting a student’s area of study to a student’s depression level, showing the effectiveness of these location embeddings.

Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
Karin Verspoor | Kevin Bretonnel Cohen | Michael Conway | Berry de Bruijn | Mark Dredze | Rada Mihalcea | Byron Wallace
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

MuSE: a Multimodal Dataset of Stressed Emotion
Mimansa Jaiswal | Cristian-Paul Bara | Yuanhang Luo | Mihai Burzo | Rada Mihalcea | Emily Mower Provost
Proceedings of the Twelfth Language Resources and Evaluation Conference

Endowing automated agents with the ability to provide support, entertainment and interaction with human beings requires sensing of the users’ affective state. These affective states are impacted by a combination of emotion inducers, current psychological state, and various conversational factors. Although emotion classification in both singular and dyadic settings is an established area, the effects of these additional factors on the production and perception of emotion is understudied. This paper presents a new dataset, Multimodal Stressed Emotion (MuSE), to study the multimodal interplay between the presence of stress and expressions of affect. We describe the data collection protocol, the possible areas of use, and the annotations for the emotional content of the recordings. The paper also presents several baselines to measure the performance of multimodal features for emotion and stress classification.

LifeQA: A Real-life Dataset for Video Question Answering
Santiago Castro | Mahmoud Azab | Jonathan Stroud | Cristina Noujaim | Ruoyao Wang | Jia Deng | Rada Mihalcea
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce LifeQA, a benchmark dataset for video question answering that focuses on day-to-day real-life situations. Current video question answering datasets consist of movies and TV shows. However, it is well-known that these visual domains are not representative of our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA, and we apply several state-of-the-art video question answering models to provide benchmarks for future research. The full dataset is publicly available at https://lit.eecs.umich.edu/lifeqa/.

Counseling-Style Reflection Generation Using Generative Pretrained Transformers with Augmented Context
Siqi Shen | Charles Welch | Rada Mihalcea | Verónica Pérez-Rosas
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We introduce a counseling dialogue system that seeks to assist counselors while they are learning and refining their counseling skills. The system generates counselors’ reflections – i.e., responses that reflect back on what the client has said given the dialogue history. Our method builds upon the new generative pretrained transformer architecture and enhances it with context augmentation techniques inspired by traditional strategies used during counselor training. Through a set of comparative experiments, we show that the system that incorporates these strategies performs better in the reflection generation task than a system that is just fine-tuned with counseling conversations. To confirm our findings, we present a human evaluation study that shows that our system generates natural-looking reflections that are also stylistically and grammatically correct.

Inferring Social Media Users’ Mental Health Status from Multimodal Information
Zhentao Xu | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the Twelfth Language Resources and Evaluation Conference

Worldwide, an increasing number of people are suffering from mental health disorders such as depression and anxiety. In the United States alone, one in every four adults suffers from a mental health condition, which makes mental health a pressing concern. In this paper, we explore the use of multimodal cues present in social media posts to predict users’ mental health status. Specifically, we focus on identifying social media activity that either indicates a mental health condition or its onset. We collect posts from Flickr and apply a multimodal approach that consists of jointly analyzing language, visual, and metadata cues and their relation to mental health. We conduct several classification experiments aiming to discriminate between (1) healthy users and users affected by a mental health illness; and (2) healthy users and users prone to mental illness. Our experimental results indicate that using multiple modalities can improve the performance of this classification task as compared to the use of one modality at a time, and can provide important cues into a user’s mental status.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
Charles Welch | Rada Mihalcea | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.

Small Town or Metropolis? Analyzing the Relationship between Population Size and Language
Amy Rechkemmer | Steven Wilson | Rada Mihalcea
Proceedings of the Twelfth Language Resources and Evaluation Conference

The variance in language used by different cultures has been a topic of study for researchers in linguistics and psychology, but often times, language is compared across multiple countries in order to show a difference in culture. As a geographically large country that is diverse in population in terms of the background and experiences of its citizens, the U.S. also contains cultural differences within its own borders. Using a set of over 2 million posts from distinct Twitter users around the country dating back as far as 2014, we ask the following question: is there a difference in how Americans express themselves online depending on whether they reside in an urban or rural area? We categorize Twitter users as either urban or rural and identify ideas and language that are more commonly expressed in tweets written by one population over the other. We take this further by analyzing how the language from specific cities of the U.S. compares to the language of other cities and by training predictive models to predict whether a user is from an urban or rural area. We publicly release the tweet and user IDs that can be used to reconstruct the dataset for future studies in this direction.

Expressive Interviewing: A Conversational System for Coping with COVID-19
Charles Welch | Allison Lahnala | Veronica Perez-Rosas | Siqi Shen | Sarah Seraj | Larry An | Kenneth Resnicow | James Pennebaker | Rada Mihalcea
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The ongoing COVID-19 pandemic has raised concerns for many regarding personal and public health implications, financial security and economic stability. Alongside many other unprecedented challenges, there are increasing concerns over social isolation and mental health. We introduce Expressive Interviewing – an interview-style conversational system that draws on ideas from motivational interviewing and expressive writing. Expressive Interviewing seeks to encourage users to express their thoughts and feelings through writing by asking them questions about how COVID-19 has impacted their lives. We present relevant aspects of the system’s design and implementation as well as quantitative and qualitative analyses of user interactions with the system. In addition, we conduct a comparative evaluation with a general purpose dialogue system for mental health that shows our system’s potential in helping users to cope with COVID-19 issues.

Quantifying the Effects of COVID-19 on Mental Health Support Forums
Laura Biester | Katie Matton | Janarthanan Rajendran | Emily Mower Provost | Rada Mihalcea
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The COVID-19 pandemic, like many of the disease outbreaks that have preceded it, is likely to have a profound effect on mental health. Understanding its impact can inform strategies for mitigating negative consequences. In this work, we seek to better understand the effects of COVID-19 on mental health by examining discussions within mental health support communities on Reddit. First, we quantify the rate at which COVID-19 is discussed in each community, or subreddit, in order to understand levels of pandemic-related discussion. Next, we examine the volume of activity in order to determine whether the number of people discussing mental health has risen. Finally, we analyze how COVID-19 has influenced language use and topics of discussion within each subreddit.

2019

Identifying Visible Actions in Lifestyle Vlogs
Oana Ignat | Laura Burdick | Jia Deng | Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We consider the task of identifying human actions visible in online videos. We focus on the widely spread genre of lifestyle vlogs, which consist of videos of people performing actions while verbally describing them. Our goal is to identify if actions mentioned in the speech description of a video are visually present. We construct a dataset with crowdsourced manual annotations of visible actions, and introduce a multimodal algorithm that leverages information derived from visual and linguistic clues to automatically infer which actions are visible in a video.

Towards Extracting Medical Family History from Natural Language Interactions: A New Dataset and Baselines
Mahmoud Azab | Stephane Dadian | Vivi Nastase | Larry An | Rada Mihalcea
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We introduce a new dataset consisting of natural language interactions annotated with medical family histories, obtained during interactions with a genetic counselor and through crowdsourcing, following a questionnaire created by experts in the domain. We describe the data collection process and the annotations performed by medical professionals, including illness and personal attributes (name, age, gender, family relationships) for the patient and their family members. An initial system that performs argument identification and relation extraction shows promising results – average F-score of 0.87 on complex sentences on the targeted relations.

Multi-Label Transfer Learning for Multi-Relational Semantic Similarity
Li Zhang | Steven Wilson | Rada Mihalcea
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Multi-relational semantic similarity datasets define the semantic relations between two short texts in multiple ways, e.g., similarity, relatedness, and so on. Yet, all the systems to date designed to capture such relations target one relation at a time. We propose a multi-label transfer learning approach based on LSTM to make predictions for several relations simultaneously and aggregate the losses to update the parameters. This multi-label regression approach jointly learns the information provided by the multiple relations, rather than treating them as separate tasks. Not only does this approach outperform the single-task approach and the traditional multi-task learning approach, but it also achieves state-of-the-art performance on all but one relation of the Human Activity Phrase dataset.

What Makes a Good Counselor? Learning to Distinguish between High-quality and Low-quality Counseling Conversations
Verónica Pérez-Rosas | Xinyi Wu | Kenneth Resnicow | Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The quality of a counseling intervention relies highly on the active collaboration between clients and counselors. In this paper, we explore several linguistic aspects of the collaboration process occurring during counseling conversations. Specifically, we address the differences between high-quality and low-quality counseling. Our approach examines participants’ turn-by-turn interaction, their linguistic alignment, the sentiment expressed by speakers during the conversation, as well as the different topics being discussed. Our results suggest important language differences in low- and high-quality counseling, which we further use to derive linguistic features able to capture the differences between the two groups. These features are then used to build automatic classifiers that can predict counseling quality with accuracies of up to 88%.

Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)
Santiago Castro | Devamanyu Hazarika | Verónica Pérez-Rosas | Roger Zimmermann | Rada Mihalcea | Soujanya Poria
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight looking face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD.

Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)
Rada Mihalcea | Ekaterina Shutova | Lun-Wei Ku | Kilian Evang | Soujanya Poria
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Women’s Syntactic Resilience and Men’s Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing
Aparna Garimella | Carmen Banea | Dirk Hovy | Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Several linguistic studies have shown the prevalence of various lexical and grammatical patterns in texts authored by a person of a particular gender, but models for part-of-speech tagging and dependency parsing have still not adapted to account for these differences. To address this, we annotate the Wall Street Journal part of the Penn Treebank with the gender information of the articles’ authors, and build taggers and parsers trained on this data that show performance differences in text written by men and women. Further analyses reveal numerous part-of-speech tags and syntactic relations whose prediction performances benefit from the prevalence of a specific gender in the training data. The results underscore the importance of accounting for gendered differences in syntactic tasks, and outline future venues for developing more accurate taggers and parsers. We release our data to the research community.

Representing Movie Characters in Dialogues
Mahmoud Azab | Noriyuki Kojima | Jia Deng | Rada Mihalcea
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We introduce a new embedding model to represent movie characters and their interactions in a dialogue by encoding in the same representation the language used by these characters as well as information about the other participants in the dialogue. We evaluate the performance of these new character embeddings on two tasks: (1) character relatedness, using a dataset we introduce consisting of a dense character interaction matrix for 4,378 unique character pairs over 22 hours of dialogue from eighteen movies; and (2) character relation classification, for fine- and coarse-grained relations, as well as sentiment relations. Our experiments show that our model significantly outperforms the traditional Word2Vec continuous bag-of-words and skip-gram models, demonstrating the effectiveness of the character embeddings we introduce. We further show how these embeddings can be used in conjunction with a visual question answering system to improve over previous results.

Predicting Human Activities from User-Generated Content
Steven Wilson | Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The activities we do are linked to our interests, personality, political preferences, and decisions we make about the future. In this paper, we explore the task of predicting human activities from user-generated content. We collect a dataset containing instances of social media users writing about a range of everyday activities. We then use a state-of-the-art sentence embedding framework tailored to recognize the semantics of human activities and perform an automatic clustering of these activities. We train a neural network model to make predictions about which clusters contain activities that were performed by a given user based on the text of their previous posts and self-description. Additionally, we explore the degree to which incorporating inferred user traits into our model helps with this prediction task.

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
Soujanya Poria | Devamanyu Hazarika | Navonil Majumder | Gautam Naik | Erik Cambria | Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Emotion recognition in conversations is a challenging task that has recently gained popularity due to its potential applications. Until now, however, a large-scale multimodal multi-party emotional conversational database containing more than two speakers per dialogue was missing. Thus, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues from the TV-series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available for use at http://affective-meld.github.io.

Box of Lies: Multimodal Deception Detection in Dialogues
Felix Soldner | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Deception often takes place during everyday conversations, yet conversational dialogues remain largely unexplored by current work on automatic deception detection. In this paper, we address the task of detecting multimodal deceptive cues during conversational dialogues. We introduce a multimodal dataset containing deceptive conversations between participants playing the Box of Lies game from The Tonight Show Starring Jimmy Fallon, in which they try to guess whether an object description provided by their opponent is deceptive or not. We conduct annotations of multimodal communication behaviors, including facial and linguistic behaviors, and derive several learning features based on these annotations. Initial classification experiments show promising results, performing well above both a random and a human baseline, and reaching up to 69% accuracy in distinguishing deceptive and truthful behaviors.

2018

Analyzing the Quality of Counseling Conversations: the Tell-Tale Signs of High-quality Counseling
Verónica Pérez-Rosas | Xuetong Sun | Christy Li | Yuchen Wang | Kenneth Resnicow | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection
Devamanyu Hazarika | Soujanya Poria | Rada Mihalcea | Erik Cambria | Roger Zimmermann
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Emotion recognition in conversations is crucial for building empathetic machines. Present works in this domain do not explicitly consider the inter-personal influences that thrive in the emotional dynamics of dialogues. To this end, we propose Interactive COnversational memory Network (ICON), a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter-speaker emotional influences into global memories. Such memories generate contextual summaries which aid in predicting the emotional orientation of utterance-videos. Our model outperforms state-of-the-art networks on multiple classification and regression tasks in two benchmark datasets.

World Knowledge for Abstract Meaning Representation Parsing
Charles Welch | Jonathan K. Kummerfeld | Song Feng | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Automatic Detection of Fake News
Verónica Pérez-Rosas | Bennett Kleinberg | Alexandra Lefevre | Rada Mihalcea
Proceedings of the 27th International Conference on Computational Linguistics

The proliferation of misleading information in everyday access media outlets such as social media feeds, news blogs, and online newspapers has made it challenging to identify trustworthy news sources, thus increasing the need for computational tools able to provide insights into the reliability of online content. In this paper, we focus on the automatic identification of fake content in online news. Our contribution is twofold. First, we introduce two novel datasets for the task of fake news detection, covering seven different news domains. We describe the collection, annotation, and validation process in detail and present several exploratory analyses on the identification of linguistic differences in fake and legitimate news content. Second, we conduct a set of learning experiments to build accurate fake news detectors, and show that we can achieve accuracies of up to 76%. In addition, we provide comparative analyses of the automatic and manual identification of fake news.

Factors Influencing the Surprising Instability of Word Embeddings
Laura Wendlandt | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.

Speaker Naming in Movies
Mahmoud Azab | Mingzhe Wang | Max Smith | Noriyuki Kojima | Jia Deng | Rada Mihalcea
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

CASCADE: Contextual Sarcasm Detection in Online Discussion Forums
Devamanyu Hazarika | Soujanya Poria | Sruthi Gorantla | Erik Cambria | Roger Zimmermann | Rada Mihalcea
Proceedings of the 27th International Conference on Computational Linguistics

The literature in automated sarcasm detection has mainly focused on lexical-, syntactic- and semantic-level analysis of text. However, a sarcastic sentence can be expressed with contextual presumptions, background and commonsense knowledge. In this paper, we propose a ContextuAl SarCasm DEtector (CASCADE), which adopts a hybrid approach of both content- and context-driven modeling for sarcasm detection in online social media discussions. For the latter, CASCADE aims at extracting contextual information from the discourse of a discussion thread. Also, since the sarcastic nature and form of expression can vary from person to person, CASCADE utilizes user embeddings that encode stylometric and personality features of users. When used along with content-based feature extractors such as convolutional neural networks, we see a significant boost in the classification performance on a large Reddit corpus.

2017

A Computational Analysis of the Language of Drug Addiction
Carlo Strapparava | Rada Mihalcea
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a computational analysis of the language of drug users when talking about their drug experiences. We introduce a new dataset of over 4,000 descriptions of experiences reported by users of four main drug types, and show that we can predict with an F1-score of up to 88% the drug behind a certain experience. We also perform an analysis of the dominant psycholinguistic processes and dominant emotions associated with each drug type, which sheds light on the characteristics of drug users.

Identity Deception Detection
Verónica Pérez-Rosas | Quincy Davenport | Anna Mengdan Dai | Mohamed Abouelenien | Rada Mihalcea
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper addresses the task of detecting identity deception in language. Using a novel identity deception dataset, consisting of real and portrayed identities from 600 individuals, we show that we can build accurate identity detectors targeting both age and gender, with accuracies of up to 88%. We also perform an analysis of the linguistic patterns used in identity deception, which leads to interesting insights into identity portrayers.

Predicting Counselor Behaviors in Motivational Interviewing Encounters
Verónica Pérez-Rosas | Rada Mihalcea | Kenneth Resnicow | Satinder Singh | Lawrence An | Kathy J. Goggin | Delwyn Catley
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

As the number of people receiving psycho-therapeutic treatment increases, the automatic evaluation of counseling practice arises as an important challenge in the clinical domain. In this paper, we address the automatic evaluation of counseling performance by analyzing counselors’ language during their interaction with clients. In particular, we present a model towards the automation of Motivational Interviewing (MI) coding, which is the current gold standard to evaluate MI counseling. First, we build a dataset of hand labeled MI encounters; second, we use text-based methods to extract and analyze linguistic patterns associated with counselor behaviors; and third, we develop an automatic system to predict these behaviors. We introduce a new set of features based on semantic information and syntactic patterns, and show that they lead to accuracy figures of up to 90%, which represent a significant improvement with respect to features used in the past.

Measuring Semantic Relations between Human Activities
Steven Wilson | Rada Mihalcea
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The things people do in their daily lives can provide valuable insights into their personality, values, and interests. Unstructured text data on social media platforms are rich in behavioral content, and automated systems can be deployed to learn about human activity on a broad scale if these systems are able to reason about the content of interest. In order to aid in the evaluation of such systems, we introduce a new phrase-level semantic textual similarity dataset comprised of human activity phrases, providing a testbed for automated systems that analyze relationships between phrasal descriptions of people’s actions. Our set of 1,000 pairs of activities is annotated by human judges across four relational dimensions including similarity, relatedness, motivational alignment, and perceived actor congruence. We evaluate a set of strong baselines for the task of generating scores that correlate highly with human ratings, and we introduce several new approaches to the phrase-level similarity task in the domain of human activities.

Understanding and Predicting Empathic Behavior in Counseling Therapy
Verónica Pérez-Rosas | Rada Mihalcea | Kenneth Resnicow | Satinder Singh | Lawrence An
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Counselor empathy is associated with better outcomes in psychology and behavioral counseling. In this paper, we explore several aspects pertaining to counseling interaction dynamics and their relation to counselor empathy during motivational interviewing encounters. Particularly, we analyze aspects such as participants’ engagement, participants’ verbal and nonverbal accommodation, as well as topics being discussed during the conversation, with the final goal of identifying linguistic and acoustic markers of counselor empathy. We also show how we can use these findings alongside other raw linguistic and acoustic features to build accurate counselor empathy classifiers with accuracies of up to 80%.

Demographic-aware word associations
Aparna Garimella | Carmen Banea | Rada Mihalcea
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Variations of word associations across different groups of people can provide insights into people’s psychologies and their world views. To capture these variations, we introduce the task of demographic-aware word associations. We build a new gold standard dataset consisting of word association responses for approximately 300 stimulus words, collected from more than 800 respondents of different gender (male/female) and from different locations (India/United States), and show that there are significant variations in the word associations made by these groups. We also introduce a new demographic-aware word association model based on a neural net skip-gram architecture, and show how computational methods for measuring word associations that specifically account for writer demographics can outperform generic methods that are agnostic to such information.

Identifying Usage Expression Sentences in Consumer Product Reviews
Shibamouli Lahiri | V. G. Vinod Vydiswaran | Rada Mihalcea
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper we introduce the problem of identifying usage expression sentences in a consumer product review. We create a human-annotated gold standard dataset of 565 reviews spanning five distinct product categories. Our dataset consists of more than 3,000 annotated sentences. We further introduce a classification system to label sentences according to whether or not they describe some “usage”. The system combines lexical, syntactic, and semantic features in a product-agnostic fashion to yield good classification performance. We show the effectiveness of our approach using importance ranking of features, error analysis, and cross-product classification experiments.

Computational Sociolinguistics – An Emerging Partnership
Rada Mihalcea
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

2016

Finding Optimists and Pessimists on Twitter
Xianzhi Ruan | Steven Wilson | Rada Mihalcea
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Disentangling Topic Models: A Cross-cultural Analysis of Personal Values through Words
Steven Wilson | Rada Mihalcea | Ryan Boyd | James Pennebaker
Proceedings of the First Workshop on NLP and Computational Social Science

Zooming in on Gender Differences in Social Media
Aparna Garimella | Rada Mihalcea
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

Men are from Mars and women are from Venus - or so the genre of relationship literature would have us believe. But there is some truth in this idea, and researchers in fields as diverse as psychology, sociology, and linguistics have explored ways to better understand the differences between genders. In this paper, we take another look at the problem of gender discrimination and attempt to move beyond the typical surface-level text classification approach, by (1) identifying semantic and psycholinguistic word classes that reflect systematic differences between men and women and (2) finding differences between genders in the ways they use the same words. We describe several experiments and report results on a large collection of blogs authored by men and women.

Building a Dataset for Possessions Identification in Text
Carmen Banea | Xi Chen | Rada Mihalcea
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Just as industrialization matured from mass production to customization and personalization, so has the Web migrated from generic content to public disclosures of one’s most intimately held thoughts, opinions and beliefs. This relatively new type of data is able to represent finer and more narrowly defined demographic slices. If until now researchers have primarily focused on leveraging personalized content to identify latent information such as gender, nationality, location, or age of the author, this study seeks to establish a structured way of extracting possessions, or items that people own or are entitled to, as a way to ultimately provide insights into people’s behaviors and characteristics. In order to promote more research in this area, we are releasing a set of 798 possessions extracted from blog genre, where possessions are marked at different confidence levels, as well as a detailed set of guidelines to help in future annotation studies.

Identifying Cross-Cultural Differences in Word Usage
Aparna Garimella | Rada Mihalcea | James Pennebaker
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Personal writings have inspired researchers in the fields of linguistics and psychology to study the relationship between language and culture to better understand the psychology of people across different cultures. In this paper, we explore this relation by developing cross-cultural word models to identify words with cultural bias – i.e., words that are used in significantly different ways by speakers from different cultures. Focusing specifically on two cultures: United States and Australia, we identify a set of words with significant usage differences, and further investigate these words through feature analysis and topic modeling, shedding light on the attributes of language that contribute to these differences.

SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre | Carmen Banea | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Targeted Sentiment to Understand Student Comments
Charles Welch | Rada Mihalcea
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We address the task of targeted sentiment as a means of understanding the sentiment that students hold toward courses and instructors, as expressed by students in their comments. We introduce a new dataset consisting of student comments annotated for targeted sentiment and describe a system that can both identify the courses and instructors mentioned in student comments, as well as label the students’ sentiment toward those entities. Through several comparative evaluations, we show that our system outperforms previous work on a similar task.

Building a Motivational Interviewing Dataset
Verónica Pérez-Rosas | Rada Mihalcea | Kenneth Resnicow | Satinder Singh | Lawrence An
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

2015

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Rada Mihalcea | Joyce Chai | Anoop Sarkar
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Iñigo Lopez-Gazpio | Montse Maritxalar | Rada Mihalcea | German Rigau | Larraitz Uria | Janyce Wiebe
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Co-Training for Topic Classification of Scholarly Data
Cornelia Caragea | Florin Bulgarov | Rada Mihalcea
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Experiments in Open Domain Deception Detection
Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Verbal and Nonverbal Clues for Real-life Deception Detection
Verónica Pérez-Rosas | Mohamed Abouelenien | Rada Mihalcea | Yao Xiao | CJ Linton | Mihai Burzo
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Using Word Semantics To Assist English as a Second Language Learners
Mahmoud Azab | Chris Hokamp | Rada Mihalcea
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

A Multimodal Dataset for Deception Detection
Verónica Pérez-Rosas | Rada Mihalcea | Alexis Narvaez | Mihai Burzo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the construction of a multimodal dataset for deception detection, including physiological, thermal, and visual responses of human subjects under three deceptive scenarios. We present the experimental protocol, as well as the data acquisition process. To evaluate the usefulness of the dataset for the task of deception detection, we present a statistical analysis of the physiological and thermal modalities associated with the deceptive and truthful conditions. Initial results show that physiological and thermal responses can differentiate between deceptive and truthful states.

SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

SimCompass: Using Deep Learning Word Embeddings to Assess Cross-level Similarity
Carmen Banea | Di Chen | Rada Mihalcea | Claire Cardie | Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Iterative Constrained Clustering for Subjectivity Word Sense Disambiguation
Cem Akkaya | Janyce Wiebe | Rada Mihalcea
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Cross-cultural Deception Detection
Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Modeling Language Proficiency Using Implicit Feedback
Chris Hokamp | Rada Mihalcea | Peter Schuelke
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe the results of several experiments with interactive interfaces for native and L2 English students, designed to collect implicit feedback from students as they complete a reading activity. In this study, implicit means that all data is obtained without asking the user for feedback. To test the value of implicit feedback for assessing student proficiency, we collect features of user behavior and interaction, which are then used to train classification models. Based upon the feedback collected during these experiments, a student's performance on a quiz and proficiency relative to other students can be accurately predicted, which is a step on the path to our goal of providing automatic feedback and unintrusive evaluation in interactive learning environments.

Building a Dataset for Summarization and Keyword Extraction from Emails
Vanessa Loza | Shibamouli Lahiri | Rada Mihalcea | Po-Hsiang Lai
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces a new email dataset, consisting of both single and thread emails, manually annotated with summaries and keywords. A total of 349 emails and threads have been annotated. The dataset is our first step toward developing automatic methods for summarization and keyword extraction from emails. We describe the email corpus, along with the annotation interface, annotator guidelines, and agreement studies.

2013

Multilingual Word Sense Disambiguation Using Wikipedia
Bharath Dandala | Rada Mihalcea | Razvan Bunescu
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Sense Clustering Using Wikipedia
Bharath Dandala | Chris Hokamp | Rada Mihalcea | Razvan Bunescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

CPN-CORE: A Text Semantic Similarity System Infused with Opinion Knowledge
Carmen Banea | Yoonjung Choi | Lingjia Deng | Samer Hassan | Michael Mohler | Bishan Yang | Claire Cardie | Rada Mihalcea | Jan Wiebe
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

Utterance-Level Multimodal Sentiment Analysis
Verónica Pérez-Rosas | Rada Mihalcea | Louis-Philippe Morency
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using N-gram and Word Network Features for Native Language Identification
Shibamouli Lahiri | Rada Mihalcea
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Coarse to Fine Grained Sense Disambiguation in Wikipedia
Hui Shen | Razvan Bunescu | Rada Mihalcea
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

Word Epoch Disambiguation: Finding How Words Change Over Time
Rada Mihalcea | Vivi Nastase
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Learning Sentiment Lexicons in Spanish
Verónica Pérez-Rosas | Carmen Banea | Rada Mihalcea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we present a framework to derive sentiment lexicons in a target language by using manually or automatically annotated data available in an electronic resource rich language, such as English. We show that bridging the language gap using the multilingual sense-level aligned WordNet structure allows us to generate a high accuracy (90%) polarity lexicon comprising 1,347 entries, and a disjoint lower accuracy (74%) one encompassing 2,496 words. By using an LSA-based vectorial expansion for the generated lexicons, we are able to obtain an average F-measure of 66% in the target language. This implies that the lexicons could be used to bootstrap higher-coverage lexicons using in-language resources.

A Parallel Corpus of Music and Lyrics Annotated with Emotions
Carlo Strapparava | Rada Mihalcea | Alberto Battocchi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we introduce a novel parallel corpus of music and lyrics, annotated with emotions at line level. We first describe the corpus, consisting of 100 popular songs, each of them including a music component, provided in the MIDI format, as well as a lyrics component, made available as raw text. We then describe our work on enhancing this corpus with emotion annotations using crowdsourcing. We also present some initial experiments on emotion classification using the music and the lyrics representations of the songs, which lead to encouraging results, thus demonstrating the promise of using joint music-lyric models for song processing.

Multimodal Sentiment Analysis
Rada Mihalcea
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

Sense and Reference Disambiguation in Wikipedia
Hui Shen | Razvan Bunescu | Rada Mihalcea
Proceedings of COLING 2012: Posters

SemEval-2012 Task 1: English Lexical Simplification
Lucia Specia | Sujay Kumar Jauhar | Rada Mihalcea
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Lyrics, Music, and Emotions
Rada Mihalcea | Carlo Strapparava
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Multilingual Natural Language Processing
Rada Mihalcea
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

Measuring Semantic Relatedness using Multilingual Representations
Samer Hassan | Carmen Banea | Rada Mihalcea
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Multilingual Subjectivity and Sentiment Analysis
Rada Mihalcea | Carmen Banea | Janyce Wiebe
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Unsupervised Word Sense Disambiguation with Multilingual Representations
Erwin Fernandez-Ordoñez | Rada Mihalcea | Samer Hassan
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we investigate the role of multilingual features in improving word sense disambiguation. In particular, we explore the use of semantic clues derived from context translation to enrich the intended sense and therefore reduce ambiguity. Our experiments demonstrate an increase of up to 26% in disambiguation accuracy by utilizing multilingual features as compared to the monolingual baseline.

Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature
David Elson | Anna Kazantseva | Rada Mihalcea | Stan Szpakowicz
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

UNT: A Supervised Synergistic Approach to Semantic Text Similarity
Carmen Banea | Samer Hassan | Michael Mohler | Rada Mihalcea
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia
Bharath Dandala | Rada Mihalcea | Razvan Bunescu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

Sense-level Subjectivity in a Multilingual Setting
Carmen Banea | Rada Mihalcea | Janyce Wiebe
Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011)

Word Sense Disambiguation with Multilingual Features
Carmen Banea | Rada Mihalcea
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

Measuring the semantic relatedness between words and images
Chee Wee Leong | Rada Mihalcea
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Dekang Lin | Yuji Matsumoto | Rada Mihalcea
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Topic Modeling on Historical Newspapers
Tze-I Yang | Andrew Torget | Rada Mihalcea
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Going Beyond Text: A Hybrid Image-Text Approach for Measuring Word Relatedness
Chee Wee Leong | Rada Mihalcea
Proceedings of 5th International Joint Conference on Natural Language Processing

Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
Michael Mohler | Razvan Bunescu | Rada Mihalcea
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

An Efficient Indexer for Large N-Gram Corpora
Hakan Ceylan | Rada Mihalcea
Proceedings of the ACL-HLT 2011 System Demonstrations

Improving the Impact of Subjectivity Word Sense Disambiguation on Contextual Opinion Analysis
Cem Akkaya | Janyce Wiebe | Alexander Conrad | Rada Mihalcea
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

2010

Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation
Cem Akkaya | Alexander Conrad | Janyce Wiebe | Rada Mihalcea
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

SemEval-2010 Task 2: Cross-Lingual Lexical Substitution
Rada Mihalcea | Ravi Sinha | Diana McCarthy
Proceedings of the 5th International Workshop on Semantic Evaluation

Multilingual Subjectivity: Are More Languages Better?
Carmen Banea | Rada Mihalcea | Janyce Wiebe
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Quantifying the Limits and Success of Extractive Summarization Systems Across Domains
Hakan Ceylan | Rada Mihalcea | Umut Özertem | Elena Lloret | Manuel Palomar
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Text Mining for Automatic Image Tagging
Chee Wee Leong | Rada Mihalcea | Samer Hassan
Coling 2010: Posters

Cross Language Text Classification by Model Translation and Semi-Supervised Learning
Lei Shi | Rada Mihalcea | Mingjun Tian
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Explorations in Automatic Image Annotation using Textual Features
Chee Wee Leong | Rada Mihalcea
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

Combining Lexical Resources for Contextual Synonym Expansion
Ravi Sinha | Rada Mihalcea
Proceedings of the International Conference RANLP-2009

Learning to Identify Educational Materials
Samer Hassan | Rada Mihalcea
Proceedings of the International Conference RANLP-2009

SemEval-2010 Task 2: Cross-Lingual Lexical Substitution
Ravi Sinha | Diana McCarthy | Rada Mihalcea
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge
Samer Hassan | Rada Mihalcea
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Subjectivity Word Sense Disambiguation
Cem Akkaya | Janyce Wiebe | Rada Mihalcea
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Text-to-Text Semantic Similarity for Automatic Short Answer Grading
Michael Mohler | Rada Mihalcea
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Topic Identification Using Wikipedia Graph Centrality
Kino Coursey | Rada Mihalcea
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Using Encyclopedic Knowledge for Automatic Topic Identification
Kino Coursey | Rada Mihalcea | William Moen
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language
Rada Mihalcea | Carlo Strapparava
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Integrating Knowledge for Subjectivity Sense Labeling
Yaw Gyamfi | Janyce Wiebe | Rada Mihalcea | Cem Akkaya
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
Philipp Koehn | Rada Mihalcea
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

Linguistically Motivated Features for Enhanced Back-of-the-Book Indexing
Andras Csomai | Rada Mihalcea
Proceedings of ACL-08: HLT

How to Add a New Language on the NLP Map: Building Resources and Tools for Languages with Scarce Resources
Rada Mihalcea | Vivi Nastase
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea | Rada Mihalcea | Janyce Wiebe
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper introduces a method for creating a subjectivity lexicon for languages with scarce resources. The method is able to build a subjectivity lexicon by using a small seed set of subjective words, an online dictionary, and a small raw corpus, coupled with a bootstrapping process that ranks new candidate words based on a similarity measure. Experiments performed with a rule-based sentence level subjectivity classifier show an 18% absolute improvement in F-measure as compared to previously proposed semi-supervised methods.

Multilingual Subjectivity Analysis Using Machine Translation
Carmen Banea | Rada Mihalcea | Janyce Wiebe | Samer Hassan
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Book Reviews: The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data by Ronen Feldman and James Sanger
Rada Mihalcea
Computational Linguistics, Volume 34, Number 1, March 2008

Babylon Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages
Michael Mohler | Rada Mihalcea
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes Babylon, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to the identification of entire Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out on the Quechua-Spanish language pair show that the system is successful in automatically identifying a significant amount of parallel texts on the Web. Evaluations of a machine translation system trained on this corpus indicate that the Web-gathered parallel texts can supplement manually compiled parallel texts and perform significantly better than the manually compiled texts when tested on other Web-gathered data.

2007

Learning Multilingual Subjective Language via Cross-Lingual Projections
Rada Mihalcea | Carmen Banea | Janyce Wiebe
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

SemEval-2007 Task 14: Affective Text
Carlo Strapparava | Rada Mihalcea
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing
Chris Biemann | Irina Matveeva | Rada Mihalcea | Dragomir Radev
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

Explorations in Automatic Book Summarization
Rada Mihalcea | Hakan Ceylan
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Using Wikipedia for Automatic Word Sense Disambiguation
Rada Mihalcea
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

UNT: SubFinder: Combining Knowledge Sources for Automatic Lexical Substitution
Samer Hassan | Andras Csomai | Carmen Banea | Ravi Sinha | Rada Mihalcea
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

UNT-Yahoo: SuperSenseLearner: Combining SenseLearner with SuperSense and other Coarse Semantic Features
Rada Mihalcea | Andras Csomai | Massimiliano Ciaramita
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Toward Communicating Simple Sentences Using Pictorial Representations
Rada Mihalcea | Ben Leong
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

This paper evaluates the hypothesis that pictorial representations can be used to effectively convey simple sentences across language barriers. Comparative evaluations show that a considerable amount of understanding can be achieved using visual descriptions of information, with evaluation figures within a comparable range of those obtained with linguistic representations produced by an automatic machine translation system.

Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing
Rada Mihalcea | Dragomir Radev
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

Word Sense and Subjectivity
Janyce Wiebe | Rada Mihalcea
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Graph-based Algorithms for Natural Language Processing and Information Retrieval
Rada Mihalcea | Dragomir Radev
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

2005

Making Computers Laugh: Investigations in Automatic Humor Recognition
Rada Mihalcea | Carlo Strapparava
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Proceedings of the ACL Workshop on Building and Using Parallel Texts
Philipp Koehn | Joel Martin | Rada Mihalcea | Christof Monz | Ted Pedersen
Proceedings of the ACL Workshop on Building and Using Parallel Texts

A Language Independent Algorithm for Single and Multiple Document Summarization
Rada Mihalcea | Paul Tarau
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

Language Independent Extractive Summarization
Rada Mihalcea
Proceedings of the ACL Interactive Poster and Demonstration Sessions

Measuring the Semantic Similarity of Texts
Courtney Corley | Rada Mihalcea
Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment

Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling
Rada Mihalcea
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

SenseLearner: Word Sense Disambiguation for All Words in Unrestricted Text
Rada Mihalcea | Andras Csomai
Proceedings of the ACL Interactive Poster and Demonstration Sessions

Word Alignment for Languages with Scarce Resources
Joel Martin | Rada Mihalcea | Ted Pedersen
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

The Senseval-3 Multilingual English-Hindi lexical sample task
Timothy Chklovski | Rada Mihalcea | Ted Pedersen | Amruta Purandare
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

Open Text Semantic Parsing Using FrameNet and WordNet
Lei Shi | Rada Mihalcea
Demonstration Papers at HLT-NAACL 2004

Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization
Rada Mihalcea
Proceedings of the ACL Interactive Poster and Demonstration Sessions

SenseLearner: Minimally supervised Word Sense Disambiguation for all words in open text
Rada Mihalcea | Ehsanul Faruque
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

An evaluation exercise for Romanian Word Sense Disambiguation
Rada Mihalcea | Vivi Năstase | Timothy Chklovski | Doina Tătar | Dan Tufiş | Florentina Hristea
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

The Senseval-3 English lexical sample task
Rada Mihalcea | Timothy Chklovski | Adam Kilgarriff
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

PageRank on Semantic Networks, with Application to Word Sense Disambiguation
Rada Mihalcea | Paul Tarau | Elizabeth Figa
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Co-training and Self-training for Word Sense Disambiguation
Rada Mihalcea
Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004

TextRank: Bringing Order into Text
Rada Mihalcea | Paul Tarau
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

An algorithm for open text semantic parsing
Lei Shi | Rada Mihalcea
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)

Finding Semantic Associations on Express Lane
Vivi Năstase | Rada Mihalcea
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

An Evaluation Exercise for Word Alignment
Rada Mihalcea | Ted Pedersen
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users’ Help
Rada Mihalcea | Timothy Chklovski
Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003

2002

Letter Level Learning for Language Independent Diacritics Restoration
Rada Mihalcea | Vivi Nastase
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

Instance Based Learning with Automatic Feature Selection Applied to Word Sense Disambiguation
Rada Mihalcea
COLING 2002: The 19th International Conference on Computational Linguistics

Bootstrapping Large Sense Tagged Corpora
Rada F. Mihalcea
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Building a Sense Tagged Corpus with Open Mind Word Expert
Timothy Chklovski | Rada Mihalcea
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

2001

Pattern Learning and Active Feature Selection for Word Sense Disambiguation
Rada F. Mihalcea | Dan I. Moldovan
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

The Role of Lexico-Semantic Feedback in Open-Domain Textual Question-Answering
Sanda Harabagiu | Dan Moldovan | Marius Paşca | Rada Mihalcea | Mihai Surdeanu | Răzvan Bunescu | Roxana Gîrju | Vasile Rus | Paul Morărescu
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

2000

Semantic Indexing using WordNet Senses
Rada Mihalcea | Dan Moldovan
ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval

The Structure and Performance of an Open-Domain Question Answering System
Dan Moldovan | Sanda Harabagiu | Marius Pasca | Rada Mihalcea | Roxana Girju | Richard Goodrum | Vasile Rus
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

A Method for Word Sense Disambiguation of Unrestricted Text
Rada Mihalcea | Dan I. Moldovan
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Word Sense Disambiguation based on Semantic Density
Rada Mihalcea | Dan I. Moldovan
Usage of WordNet in Natural Language Processing Systems

Co-authors

Jonathan K. Kummerfeld 12

Navonil Majumder 11

Deepanway Ghosal 10

Charles Welch 10

Kenneth Resnicow 9

Santiago Castro 8

Aparna Garimella 8

Bernhard Schölkopf 8

Steven Wilson 8

Razvan Bunescu 7

Artem Abzaliev 6

Mrinmaya Sachan 6

Carlo Strapparava 6

Laura Biester 5

Timothy Chklovski 5

Devamanyu Hazarika 5

Ashkan Kazemi 5

Michael Mohler 5

Mohamed Abouelenien 4

Laura Burdick 4

Claire Cardie 4

Andras Csomai 4

Chee Wee Leong 4

James Pennebaker 4

Bharath Dandala 3

Aitor González-Agirre 3

Shibamouli Lahiri 3

Dragomir Radev 3

Keenan Samway 3

Satinder Singh 3

Roger Zimmermann 3

Alexander Conrad 2

Daryna Dementieva 2

Alexander Gelbukh 2

Aylin Ece Gunal 2

Sanda Harabagiu 2

Pingxuan Huang 2

Philipp Koehn 2

Noriyuki Kojima 2

Allison Lahnala 2

Lajanugen Logeswaran 2

Yuji Matsumoto 2

Diana McCarthy 2

Emily Mower Provost 2

Joan C. Nwatu 2

David Guzman Piedrahita 2

Sahand Sabour 2

Jonathan Stroud 2

Joel Tetreault 2

Tejas Vaidhya 2

Fernando Adauto 1

S M Masrur Ahmed 1

Alham Fikri Aji 1

Akhash Amarnath 1

Lawrence C An 1

Cristian-Paul Bara 1

Alberto Battocchi 1

Thore Bergman 1

Chris Biemann 1

Kiran Bodipati 1

Florin Bulgarov 1

Cornelia Caragea 1

Delwyn Catley 1

Roberto Ceraolo 1

Ilias Chalkidis 1

Geeticka Chauhan 1

Yoonjung Choi 1

Sagnik Ray Choudhury 1

Massimiliano Ciaramita 1

K. Bretonnel Cohen 1

Michael Conway 1

Courtney D. Corley 1

Fermin Cristobal 1

Jan Christian Blaise Cruz 1

Stephane Dadian 1

Anna Mengdan Dai 1

Kapotaksha Das 1

Quincy Davenport 1

Berry De Bruijn 1

Shehzaad Dhuliawala 1

Kareem Elzeky 1

Ehsanul Faruque 1

Erwin Fernandez-Ordoñez 1

Elizabeth Figa 1

Alexander Fraser 1

Martina Galletti 1

Kathy J. Goggin 1

Fernando Gonzalez Adauto 1

Richard Goodrum 1

Sruthi Gorantla 1

Scott A. Hale 1

Md Kamrul Hasan 1

Daniel Hershcovich 1

Nabil Hossain 1

Marwa Houalla 1

Florentina Hristea 1

Katsumi Ibaraki 1

Md. Saiful Islam 1

Mimansa Jaiswal 1

Sujay Kumar Jauhar 1

Feng Jiang (蒋峰) 1

Antonia Karamolegkou 1

Priyanka Kargupta 1

Anna Kazantseva 1

Muhammad Khalifa 1

Dmitrii Kharlapenko 1

Adam Kilgarriff 1

Max Kleiman-Weiner 1

Bennett Kleinberg 1

Neema Kotonya 1

Po-Hsiang Lai 1

Gayathri Ganesh Lakshmy 1

Abhinav Lalwani 1

Alexandra Lefevre 1

Iñigo Lopez-Gazpio 1

Arushi Mangla 1

Montse Maritxalar 1

Justus Mattern 1

Trisha Maturi 1

Irina Matveeva 1

Ishani Mondal 1

Christof Monz 1

Paul Morarescu 1

Louis-Philippe Morency 1

Fatima Zahra Moudakir 1

Alexis Narvaez 1

Cristina Noujaim 1

Chimaobi Okite 1

Francesco Ortu 1

Manuel Palomar 1

Michalis Papakostas 1

Shanshan Peng 1

Humberto Perez-Espinosa 1

Giorgio Piatti 1

John D. Piette 1

Dina Pisarevskaya 1

Vitaliy Popov 1

Amruta Purandare 1

Vethavikashini Chithrra Raghuram 1

Janarthanan Rajendran 1

Amy Rechkemmer 1

Amélie Reymond 1

Naquee Rizwan 1

Nazanin Sabri 1

Bernhard Schoelkopf 1

Peter Schuelke 1

Anna Steinberg Schulten 1

Vanita Sharma 1

Ekaterina Shutova 1

Felix Soldner 1

Thamar Solorio 1

Dominik Stammbach 1

Irene Strauss 1

Alvionna Sunaryo 1

Mihai Surdeanu 1

Stan Szpakowicz 1

Anders Søgaard 1

Miu Nicole Takagi 1

Andrew Torget 1

Larraitz Uria 1

Olga Vechtomova 1

Álvaro Vega-Hidalgo 1

Karin Verspoor 1

Emilio Villa-Cueva 1

V. G. Vinod Vydiswaran 1

Byron C. Wallace 1

Cunxiang Wang 1

Steven R Wilson 1

Neemesh Yadav 1

Xinliang Zhang 1

Zheyuan Zhang 1

Guojiang Zhao 1

Jessica H Zhu 1

Arkaitz Zubiaga 1

Umut Özertem 1

Venues

WS 18

NLPerspectives 1