João Sedoc - ACL Anthology

João Sedoc

Also published as: Joao Sedoc

2025

Overview of Dialog System Evaluation Track: Dimensionality, Language, Culture and Safety at DSTC 12
John Mendonça | Lining Zhang | Rahul Mallidi | Alon Lavie | Isabel Trancoso | Luis Fernando D’Haro | João Sedoc
Proceedings of the Twelfth Dialog System Technology Challenge

The rapid advancement of Large Language Models (LLMs) has intensified the need for robust dialogue system evaluation, yet comprehensive assessment remains challenging. Traditional metrics often prove insufficient, and safety considerations are frequently narrowly defined or culturally biased. The DSTC12 Track 1, “Dialog System Evaluation: Dimensionality, Language, Culture and Safety,” is part of the ongoing effort to address these critical gaps. The track comprised two subtasks: (1) Dialogue-level, Multi-dimensional Automatic Evaluation Metrics, and (2) Multilingual and Multicultural Safety Detection. For Task 1, focused on 10 dialogue dimensions, a Llama-3-8B baseline achieved the highest average Spearman’s correlation (0.1681), indicating substantial room for improvement. In Task 2, while participating teams significantly outperformed a Llama-Guard-3-1B baseline on the multilingual safety subset (top ROC-AUC 0.9648), the baseline proved superior on the cultural subset (0.5126 ROC-AUC), highlighting critical needs in culturally-aware safety. This paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.

Evolving Stances on Reproducibility: A Longitudinal Study of NLP and ML Researchers’ Views and Experience of Reproducibility
Craig Thomson | Ehud Reiter | João Sedoc | Anya Belz
Findings of the Association for Computational Linguistics: EMNLP 2025

Over the past 10 years in NLP/ML, as in other fields of science, there has been growing interest in, and work on, reproducibility and methods for improving it. Identical experiments producing different results can be due to variation between samples of evaluation items or evaluators, but it can also be due to poor experimental practice. Both can be mitigated by bringing multiple comparable studies together in systematic reviews that can draw conclusions beyond the level of the individual studies, but such systematic reviews barely exist in NLP/ML. The alternative is to focus on improving experimental practice and study-level reproducibility, and the first step in this direction is awareness of the importance of reproducibility and knowledge of how to improve it. Here we aim to assess (i) what NLP/ML practitioners’ current views and experience of reproducibility are, and (ii) to what extent they have changed over the past two years, a period of rapidly growing interest in reproducibility. We report, for the first time, results from two identical surveys, the first carried out in 2022 and the second in 2024, each time surveying 149 NLP and ML researchers. The results from the 2024 survey address (i) above. We then compare the results of the two surveys in order to address (ii) above. We find that views and experience overall are moving towards better practice and appreciation of reproducibility.

The 2024 GEM Shared Task on Multilingual Data-to-Text Generation: English and Spanish Qualitative Evaluation Results
João Sedoc | Simon Mille | Miruna Adriana Clinciu | Yixin Liu | Kaustubh Dhole | Saad Mahamood
Proceedings of the 18th International Natural Language Generation Conference: Generation Challenges

Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Ofir Arviv | Miruna Clinciu | Kaustubh Dhole | Rotem Dror | Sebastian Gehrmann | Eliya Habba | Itay Itzhak | Simon Mille | Yotam Perlitz | Enrico Santus | João Sedoc | Michal Shmueli Scheuer | Gabriel Stanovsky | Oyvind Tafjord
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

The Sixth Workshop on Insights from Negative Results in NLP
Aleksandr Drozd | João Sedoc | Shabnam Tafreshi | Arjun Akula | Raphael Shu
The Sixth Workshop on Insights from Negative Results in NLP

2024

Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Simone Balloccu | Anya Belz | Rudali Huidrom | Ehud Reiter | Joao Sedoc | Craig Thomson
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
Shabnam Tafreshi | Arjun Akula | João Sedoc | Aleksandr Drozd | Anna Rogers | Anna Rumshisky
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas
Salvatore Giorgi | Tingting Liu | Ankit Aich | Kelsey Jane Isman | Garrick Sherman | Zachary Fried | João Sedoc | Lyle Ungar | Brenda Curtis
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one’s environment, attitudes, beliefs, and lived experiences. Thus, it may be the case that employing LLMs (which do not have such human factors) in these tasks results in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) a subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that explicit LLM personas show mixed results when reproducing known human biases, but generally fail to demonstrate implicit biases. We conclude that LLMs may capture the statistical patterns of how people speak, but are generally unable to model the complex interactions and subtleties of human perceptions, potentially limiting their effectiveness in social science applications.

Large Human Language Models: A Need and the Challenges
Nikita Soni | H. Andrew Schwartz | João Sedoc | Niranjan Balasubramanian
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As research in human-centered NLP advances, there is a growing recognition of the importance of incorporating human and social factors into NLP models. At the same time, our NLP systems have become heavily reliant on LLMs, most of which do not model authors. To build NLP systems that can truly understand human language, we must better integrate human contexts into LLMs. This brings to the fore a range of design considerations and challenges in terms of what human aspects to capture, how to represent them, and what modeling strategies to pursue. To address these, we advocate for three positions toward creating large human language models (LHLMs) using concepts from psychological and behavioral sciences: First, LM training should include the human context. Second, LHLMs should recognize that people are more than their group(s). Third, LHLMs should be able to account for the dynamic and temporally-dependent nature of the human context. We refer to relevant advances and present open challenges that need to be addressed and their possible solutions in realizing these goals.

Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

The 2024 GEM Shared Task on Multilingual Data-to-Text Generation and Summarization: Overview and Preliminary Results
Simon Mille | João Sedoc | Yixin Liu | Elizabeth Clark | Agnes Johanna Axelsson | Miruna Adriana Clinciu | Yufang Hou | Saad Mahamood | Ishmael Nyunya Obonyo | Lining Zhang
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

We present an overview of the GEM 2024 shared task, which comprised both data-to-text generation and summarization. New datasets were compiled specifically for the task to reduce data contamination in the large language models, which the participants were likely to use. The paper describes the tasks, the datasets, the participating systems, the evaluation methods, and some preliminary results. The full results will be presented at INLG ’24.

Findings of WASSA 2024 Shared Task on Empathy and Personality Detection in Interactions
Salvatore Giorgi | João Sedoc | Valentin Barriere | Shabnam Tafreshi
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper presents the results of the WASSA 2024 shared task on predicting empathy, emotion, and personality in conversations and reactions to news articles. Participating teams were given access to a new, unpublished extension of the WASSA 2023 shared task dataset. This task is both multi-level and multi-modal: data is available at the person, essay, dialog, and dialog-turn levels and includes formal (news articles) and informal text (essays and dialogs), self-report data (personality and distress), and third-party annotations (empathy and emotion). The shared task included a new focus on conversations between humans and LLM-based virtual agents which occur immediately after reading and reacting to the news articles. Participants were encouraged to explore the multi-level and multi-modal nature of this data. Participation was encouraged in four tracks: (i) predicting the perceived empathy at the dialog level, (ii) predicting turn-level empathy, emotion polarity, and emotion intensity in conversations, (iii) predicting state empathy and distress scores, and (iv) predicting personality. In total, 14 teams participated in the shared task. We summarize the methods and resources used by the participating teams.

From Text to Context: Contextualizing Language with Humans, Groups, and Communities for Socially Aware NLP
Adithya V Ganesan | Siddharth Mangalik | Vasudha Varadarajan | Nikita Soni | Swanie Juhng | João Sedoc | H. Andrew Schwartz | Salvatore Giorgi | Ryan L Boyd
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)

Aimed at NLP researchers and practitioners who would like to integrate human factors at the individual, group, or societal level into their analyses, this tutorial will cover recent techniques and libraries for doing so at each level of analysis. Starting with human-centered techniques that provide benefit to traditional document- or word-level NLP tasks (Garten et al., 2019; Lynn et al., 2017), we undertake a thorough exploration of critical human-level aspects as they pertain to NLP, gradually moving up to higher levels of analysis: individual persons, individual with agent (chat/dialogue), groups of people, and finally communities or societies.

At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.

Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Orphée De Clercq | Valentin Barriere | Jeremy Barnes | Roman Klinger | João Sedoc | Shabnam Tafreshi
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Item Response Theory for Natural Language Processing
John P. Lalor | Pedro Rodriguez | João Sedoc | Jose Hernandez-Orallo
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial will introduce the NLP community to Item Response Theory (IRT; Baker 2001). IRT is a method from the field of psychometrics for model and dataset assessment. IRT has been used for decades to build test sets for human subjects and estimate latent characteristics of dataset examples. Recently, there has been an uptick in work applying IRT to tasks in NLP. It is our goal to introduce the wider NLP community to IRT and show its benefits for a number of NLP tasks. From this tutorial, we hope to encourage wider adoption of IRT among NLP researchers.

The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units
Anya Belz | João Sedoc | Craig Thomson | Simon Mille | Rudali Huidrom
Proceedings of the 17th International Natural Language Generation Conference: Tutorial Abstract

Following numerous calls in the literature for improved practices and standardisation in human evaluation in Natural Language Processing over the past ten years, we held a tutorial on the topic at the 2024 INLG Conference. The tutorial addressed the structure, development, design, implementation, execution and analysis of human evaluations of NLP system quality. Hands-on practical sessions were run, designed to facilitate assimilation of the material presented. Slides, lecture recordings, code and data have been made available on GitHub (https://github.com/Human-Evaluation-Tutorial/INLG-2024-Tutorial). In this paper, we provide summaries of the content of the eight units of the tutorial, alongside its research context and aims.

2023

A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang | Simon Mille | Yufang Hou | Daniel Deutsch | Elizabeth Clark | Yixin Liu | Saad Mahamood | Sebastian Gehrmann | Miruna Clinciu | Khyathi Raghavi Chandu | João Sedoc
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and CloudResearch workers, their alignment with expert judgments on a subset of the data is not as expected, and the workers need further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.

Conceptor-Aided Debiasing of Large Language Models
Li S. Yifei | Lyle Ungar | João Sedoc
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Pre-trained large language models (LLMs) reflect the inherent social biases of their training corpus. Many methods have been proposed to mitigate this issue, but they often fail to debias or they sacrifice model accuracy. We use *conceptors*–a soft projection method–to identify and remove the bias subspace in LLMs such as BERT and GPT. We propose two methods of applying conceptors: (1) bias subspace projection by post-processing with the conceptor NOT operation; and (2) a new architecture, conceptor-intervened BERT (CI-BERT), which explicitly incorporates the conceptor projection into all layers during training. We find that conceptor post-processing achieves state-of-the-art (SoTA) debiasing results while maintaining LLMs’ performance on the GLUE benchmark. Further, it is robust in various scenarios and can mitigate intersectional bias efficiently by its AND operation on the existing bias subspaces. Although CI-BERT’s training takes all layers’ bias into account and can beat its post-processing counterpart in bias mitigation, CI-BERT reduces the language model accuracy. We also show the importance of carefully constructing the bias subspace. The best results are obtained by removing outliers from the list of biased words, combining them (via the OR operation), and computing their embeddings using the sentences from a cleaner corpus.
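
As a rough illustration of the post-processing idea in the abstract above, the sketch below uses the standard conceptor definition C = R(R + α⁻²I)⁻¹, where R is the correlation matrix of the bias-word embeddings, and applies the NOT operation (I − C) to other embeddings. The function names, the aperture value `alpha=2.0`, and the toy three-dimensional vectors are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def conceptor(X, alpha=2.0):
    """Conceptor C = R (R + alpha^-2 I)^-1 for samples X of shape (n, d).

    R is the (uncentered) correlation matrix of the rows of X.
    alpha is the aperture; 2.0 is an arbitrary illustrative value.
    """
    n, d = X.shape
    R = X.T @ X / n
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(d))

def negate(C):
    """Conceptor NOT operation: the soft complement I - C."""
    return np.eye(C.shape[0]) - C

def debias(E, C):
    """Softly project out the subspace captured by C from row vectors E."""
    return E @ negate(C).T

# Toy "bias word" embeddings that vary mostly along the first axis.
bias_vectors = np.array([
    [3.0, 0.0, 0.0],
    [-3.0, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, -0.1, 0.0],
])
C = conceptor(bias_vectors)
cleaned = debias(np.array([[1.0, 1.0, 1.0]]), C)[0]
```

In this toy setting the first coordinate, which carries most of the variance of the bias vectors, is shrunk strongly, while the other two are left almost unchanged. That is the sense in which the projection is soft rather than a hard removal of a direction.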

Common Law Annotations: Investigating the Stability of Dialog System Output Annotations
Seunggun Lee | Alexandra DeLucia | Nikita Nangia | Praneeth Ganedi | Ryan Guan | Rubing Li | Britney Ngaw | Aditya Singhal | Shalaka Vaidya | Zijun Yuan | Lining Zhang | João Sedoc
Findings of the Association for Computational Linguistics: ACL 2023

Metrics for Inter-Annotator Agreement (IAA), like Cohen’s Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure validity or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.
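
For readers unfamiliar with the agreement metric named in the abstract above, here is a minimal sketch of Cohen’s Kappa for two annotators, computed as (p_o − p_e) / (1 − p_e) from observed and chance agreement. The function name and the toy labels are hypothetical and do not come from the paper or its dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical labels: two annotators within one group, one from another.
group1_a = ["good", "good", "bad", "bad"]
group1_b = ["good", "good", "bad", "bad"]
group2_a = ["bad", "good", "bad", "good"]

within = cohens_kappa(group1_a, group1_b)   # agreement inside group 1
across = cohens_kappa(group1_a, group2_a)   # agreement across groups
```

The toy labels mirror the abstract’s concern in miniature: agreement within a group can be perfect while agreement across groups is no better than chance.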

Benchmark Data and Evaluation Framework for Intent Discovery Around COVID-19 Vaccine Hesitancy
Shai Gretz | Assaf Toledo | Roni Friedman | Dan Lahav | Rose Weeks | Naor Bar-Zeev | João Sedoc | Pooja Sangha | Yoav Katz | Noam Slonim
Findings of the Association for Computational Linguistics: EACL 2023

The COVID-19 pandemic has made a huge global impact and cost millions of lives. As COVID-19 vaccines were rolled out, they were quickly met with widespread hesitancy. To address the concerns of hesitant people, we launched VIRA, a public dialogue system aimed at addressing questions and concerns surrounding the COVID-19 vaccines. Here, we release VIRADialogs, a dataset of over 8k dialogues conducted by actual users with VIRA, providing a unique real-world conversational dataset. In light of rapid changes in users’ intents, due to updates in guidelines or in response to new information, we highlight the important task of intent discovery in this use-case. We introduce a novel automatic evaluation framework for intent discovery, leveraging the existing intent classifier of VIRA. We use this framework to report baseline intent discovery results over VIRADialogs, that highlight the difficulty of this task.

Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
Mario Rodríguez-Cantelar | Chen Zhang | Chengguang Tang | Ke Shi | Sarik Ghazarian | João Sedoc | Luis Fernando D’Haro | Alexander I. Rudnicky
Proceedings of the Eleventh Dialog System Technology Challenge

The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of the attention of many researchers. Despite the consistent efforts to improve automatic metrics’ correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.

Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Anya Belz | Maja Popović | Ehud Reiter | Craig Thomson | João Sedoc
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
Shabnam Tafreshi | Arjun Akula | João Sedoc | Aleksandr Drozd | Anna Rogers | Anna Rumshisky
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

An Integrative Survey on Mental Health Conversational Agents to Bridge Computer Science and Medical Perspectives
Young Min Cho | Sunny Rai | Lyle Ungar | João Sedoc | Sharath Guntuku
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Mental health conversational agents (a.k.a. chatbots) are widely studied for their potential to offer accessible support to those experiencing mental health challenges. Previous surveys on the topic primarily consider papers published in either computer science or medicine, leading to a divide in understanding and hindering the sharing of beneficial knowledge between both domains. To bridge this gap, we conduct a comprehensive literature review using the PRISMA framework, reviewing 534 papers published in both computer science and medicine. Our systematic review reveals 136 key papers on building mental health-related conversational agents with diverse characteristics of modeling and experimental design techniques. We find that computer science papers focus on LLM techniques and evaluating response quality using automated metrics with little attention to the application while medical papers use rule-based conversational agents and outcome metrics to measure the health outcomes of participants. Based on our findings on transparency, ethics, and cultural heterogeneity in this review, we provide a few recommendations to help bridge the disciplinary divide and enable the cross-disciplinary development of mental health conversational agents.

Automatic Reflection Generation for Peer-to-Peer Counseling
Emma O’neil | João Sedoc | Diyi Yang | Haiyi Zhu | Lyle Ungar
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Online peer counseling platforms enable conversations between millions of people seeking and offering mental health support. Among counseling skills, reflective listening, i.e., capturing and returning to the client something the client has said, is important for positive therapeutic outcomes. We introduce a reflection generation system for online mental health support conversations leveraging GPT-3, a large language model. We compare few-shot learning against fine-tuning and assess the impact of the quality of training examples as measured by fluency, reflection resemblance, and overall preference. Fine-tuned GPT-3 generates responses that human evaluators rate as comparable in reflection quality to responses used for tuning. Models based on high-quality responses generate substantially better reflections than ones tuned on actual responses from a large online counseling service–and better reflections than the actual counselor responses. These results suggest the care needed in selecting examples for tuning generative models.

Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Sebastian Gehrmann | Alex Wang | João Sedoc | Elizabeth Clark | Kaustubh Dhole | Khyathi Raghavi Chandu | Enrico Santus | Hooman Sedghamiz
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Findings of WASSA 2023 Shared Task on Empathy, Emotion and Personality Detection in Conversation and Reactions to News Articles
Valentin Barriere | João Sedoc | Shabnam Tafreshi | Salvatore Giorgi
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper presents the results of the WASSA 2023 shared task on predicting empathy, emotion, and personality in conversations and reactions to news articles. Participating teams were given access to a new dataset from Omitaomu et al. (2022) comprising empathic and emotional reactions to news articles. The dataset included formal and informal text, self-report data, and third-party annotations. Specifically, the dataset contained news articles (where harm is done to a person, group, or other) and crowd-sourced essays written in reaction to the article. After reacting via essays, crowd workers engaged in conversations about the news articles. Finally, the crowd workers self-reported their empathic concern and distress, personality (using the Big Five), and multi-dimensional empathy (via the Interpersonal Reactivity Index). A third-party annotated both the conversational turns (for empathy, emotion polarity, and emotion intensity) and essays (for multi-label emotions). Thus, the dataset contained outcomes (self-reported or third-party annotated) at the turn level (within conversations) and the essay level. Participation was encouraged in five tracks: (i) predicting turn-level empathy, emotion polarity, and emotion intensity in conversations, (ii) predicting state empathy and distress scores, (iii) predicting emotion categories, (iv) predicting personality, and (v) predicting multi-dimensional trait empathy. In total, 21 teams participated in the shared task. We summarize the methods and resources used by the participating teams.

Conditioning on Dialog Acts improves Empathy Style Transfer
Renyi Qu | Lyle Ungar | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2023

We explore the role of dialog acts in style transfer, specifically empathy style transfer – rewriting a sentence to make it more empathetic without changing its meaning. Specifically, we use two novel few-shot prompting strategies: target prompting, which only uses examples of the target style (unlike traditional prompting with source/target pairs), and dialog-act-conditioned prompting, which first estimates the dialog act of the source sentence and then makes it more empathetic using few-shot examples of the same dialog act. Our study yields two key findings: (1) Target prompting typically improves empathy more effectively while maintaining the same level of semantic similarity; (2) Dialog acts matter. Dialog-act-conditioned prompting enhances empathy while preserving both semantics and the dialog-act type. Different dialog acts benefit differently from different prompting methods, highlighting the need for further investigation of the role of dialog acts in style transfer.

2022

Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory
Pedro Rodriguez | Phu Mon Htut | John Lalor | João Sedoc
Proceedings of the Third Workshop on Insights from Negative Results in NLP

In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
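
To make the idea of jointly modelling example and model properties concrete, the sketch below shows the two-parameter logistic (2PL) model, a common IRT variant. Here a model with ability θ answers an example with discrimination a and difficulty b correctly with probability σ(a(θ − b)). The abstract does not say which IRT variant the paper uses, so the choice of 2PL, the function names, and the toy values are all assumptions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a subject with ability theta gets an item right.

    a is the item's discrimination and b its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(responses, thetas, items):
    """Log-likelihood of a 0/1 response matrix under the 2PL model.

    responses[j][i] is 1 if model j answered example i correctly, else 0.
    thetas[j] is model j's ability; items[i] is (a, b) for example i.
    """
    total = 0.0
    for j, theta in enumerate(thetas):
        for i, (a, b) in enumerate(items):
            p = p_correct(theta, a, b)
            total += math.log(p) if responses[j][i] else math.log(1.0 - p)
    return total

# Toy setup: one strong model, one weak model, one medium-difficulty example.
thetas = [2.0, -2.0]
items = [(1.0, 0.0)]
plausible = log_likelihood([[1], [0]], thetas, items)
implausible = log_likelihood([[0], [1]], thetas, items)
```

Fitting an IRT model amounts to choosing abilities and item parameters that maximize this likelihood over the observed model-example scores. The sketch only evaluates the likelihood and does not perform that fit.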

Automatic Document Selection for Efficient Encoder Pretraining
Yukun Feng | Patrick Xia | Benjamin Van Durme | João Sedoc
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.

Measuring the Language of Self-Disclosure across Corpora
Ann-Katrin Reuel | Sebastian Peralta | João Sedoc | Garrick Sherman | Lyle Ungar
Findings of the Association for Computational Linguistics: ACL 2022

Being able to reliably estimate self-disclosure – a key component of friendship and intimacy – from language is important for many psychology studies. We build single-task models on five self-disclosure corpora, but find that these models generalize poorly; the within-domain accuracy of predicted message-level self-disclosure of the best-performing model (mean Pearson’s r=0.69) is much higher than the respective across data set accuracy (mean Pearson’s r=0.32), due to both variations in the corpora (e.g., medical vs. general topics) and labeling instructions (target variables: self-disclosure, emotional disclosure, intimacy). However, some lexical features, such as expression of negative emotions and use of first person personal pronouns such as ‘I’ reliably predict self-disclosure across corpora. We develop a multi-task model that yields better results, with an average Pearson’s r of 0.37 for out-of-corpora prediction.

Inducing Generalizable and Interpretable Lexica
Yilin Geng | Zetian Wu | Roshan Santhosh | Tejas Srivastava | Lyle Ungar | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2022

Lexica – words and associated scores – are widely used as simple, interpretable, generalizable language features to predict sentiment, emotions, mental health, and personality. They also provide insight into the psychological features behind those moods and traits. Such lexica, historically created by human experts, are valuable to linguists, psychologists, and social scientists, but they take years of refinement and have limited coverage. In this paper, we investigate how the lexica that provide psycholinguistic insights could be computationally induced and how they should be assessed. We identify generalizability and interpretability as two essential properties of such lexica. We induce lexica using both context-oblivious and context-aware approaches, compare their predictive performance both within the training corpus and across various corpora, and evaluate their quality using crowd-worker assessment. We find that lexica induced from context-oblivious models are more generalizable and interpretable than those from more accurate context-aware transformer models. In addition, lexicon scores can identify explanatory words more reliably than a high performing transformer with feature-importance measures like SHAP.

Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis
Jeremy Barnes | Orphée De Clercq | Valentin Barriere | Shabnam Tafreshi | Sawsan Alqahtani | João Sedoc | Roman Klinger | Alexandra Balahur
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

Gendered Language in Resumes and its Implications for Algorithmic Bias in Hiring
Prasanna Parasurama | João Sedoc
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Despite growing concerns around gender bias in NLP models used in algorithmic hiring, there is little empirical work studying the extent and nature of gendered language in resumes. Using a corpus of 709k resumes from IT firms, we train a series of models to classify the gender of the applicant, thereby measuring the extent of gendered information encoded in resumes. We also investigate whether it is possible to obfuscate gender from resumes by removing gender identifiers, hobbies, gender sub-space in embedding models, etc. We find that there is a significant amount of gendered information in resumes even after obfuscation. A simple Tf-Idf model can learn to classify gender with AUROC=0.75, and more sophisticated transformer-based models achieve AUROC=0.8. We further find that gender predictive values have low correlation with gender direction of embeddings – meaning that what is predictive of gender is much more than what is “gendered” in the masculine/feminine sense. We discuss the algorithmic bias and fairness implications of these findings in the hiring context.

WASSA 2022 Shared Task: Predicting Empathy, Emotion and Personality in Reaction to News Stories
Valentin Barriere | Shabnam Tafreshi | João Sedoc | Sawsan Alqahtani
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

This paper presents the results that were obtained from the WASSA 2022 shared task on predicting empathy, emotion, and personality in reaction to news stories. Participants were given access to a dataset comprising empathic reactions to news stories where harm is done to a person, group, or other. These reactions consist of essays and Batson’s empathic concern and personal distress scores. The dataset was further extended in the WASSA 2021 shared task to include news articles, person-level demographic information (e.g. age, gender), personality information, and Ekman’s six basic emotions at the essay level. Participation was encouraged in four tracks: predicting empathy and distress scores, predicting emotion categories, predicting personality, and predicting interpersonal reactivity. In total, 14 teams participated in the shared task. We summarize the methods and resources used by the participating teams.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina McMillan-Major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez Beltrachini | Leonardo F. R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondrej Dusek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

Proceedings of the Third Workshop on Insights from Negative Results in NLP
Shabnam Tafreshi | João Sedoc | Anna Rogers | Aleksandr Drozd | Anna Rumshisky | Arjun Akula
Proceedings of the Third Workshop on Insights from Negative Results in NLP

2021

Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Orphee De Clercq | Alexandra Balahur | Joao Sedoc | Valentin Barriere | Shabnam Tafreshi | Sven Buechel | Veronique Hoste
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Proceedings of the Second Workshop on Insights from Negative Results in NLP
João Sedoc | Anna Rogers | Anna Rumshisky | Shabnam Tafreshi
Proceedings of the Second Workshop on Insights from Negative Results in NLP

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann | Tosin Adewumi | Karmanya Aggarwal | Pawan Sasanka Ammanamanchi | Anuoluwapo Aremu | Antoine Bosselut | Khyathi Raghavi Chandu | Miruna-Adriana Clinciu | Dipanjan Das | Kaustubh Dhole | Wanyu Du | Esin Durmus | Ondřej Dušek | Chris Chinenye Emezue | Varun Gangal | Cristina Garbacea | Tatsunori Hashimoto | Yufang Hou | Yacine Jernite | Harsh Jhamtani | Yangfeng Ji | Shailza Jolly | Mihir Kale | Dhruv Kumar | Faisal Ladhak | Aman Madaan | Mounica Maddela | Khyati Mahajan | Saad Mahamood | Bodhisattwa Prasad Majumder | Pedro Henrique Martins | Angelina McMillan-Major | Simon Mille | Emiel van Miltenburg | Moin Nadeem | Shashi Narayan | Vitaly Nikolaev | Andre Niyongabo Rubungo | Salomey Osei | Ankur Parikh | Laura Perez-Beltrachini | Niranjan Ramesh Rao | Vikas Raunak | Juan Diego Rodriguez | Sashank Santhanam | João Sedoc | Thibault Sellam | Samira Shaikh | Anastasia Shimorina | Marco Antonio Sobrevilla Cabezudo | Hendrik Strobelt | Nishant Subramani | Wei Xu | Diyi Yang | Akhila Yerukola | Jiawei Zhou
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

Topic Modeling for Maternal Health Using Reddit
Shuang Gao | Shivani Pandya | Smisha Agarwal | João Sedoc
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

This paper applies topic modeling to understand maternal health topics, concerns, and questions expressed in online communities on social networking sites. We examine Latent Dirichlet Allocation (LDA) and two state-of-the-art methods: neural topic model with knowledge distillation (KD) and Embedded Topic Model (ETM) on maternal health texts collected from Reddit. The models are evaluated on topic quality and topic inference, using both auto-evaluation metrics and human assessment. We analyze a disconnect between automatic metrics and human evaluations. While LDA performs the best overall with the auto-evaluation metrics NPMI and Coherence, Neural Topic Model with Knowledge Distillation is favored by expert evaluation. We also create a new partially expert annotated gold-standard maternal health topic

Multi-Emotion Classification for Song Lyrics
Darren Edmonds | João Sedoc
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Song lyrics convey a multitude of emotions to the listener and powerfully portray the emotional state of the writer or singer. This paper examines a variety of modeling approaches to the multi-emotion classification problem for songs. We introduce the Edmonds Dance dataset, a novel emotion-annotated lyrics dataset from the reader’s perspective, and annotate the dataset of Mihalcea and Strapparava (2012) at the song level. We find that models trained on relatively small song datasets achieve marginally better performance than BERT (Devlin et al., 2018) fine-tuned on large social media or dialog datasets.

Decoding Methods for Neural Narrative Generation
Alexandra DeLucia | Aaron Mueller | Xiang Lisa Li | João Sedoc
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Narrative generation is an open-ended NLP task in which a model generates a story given a prompt. The task is similar to neural response generation for chatbots; however, innovations in response generation are often not applied to narrative generation, despite the similarity between these tasks. We aim to bridge this gap by applying and evaluating advances in decoding methods for neural response generation to neural narrative generation. In particular, we employ GPT-2 and perform ablations across nucleus sampling thresholds and diverse decoding hyperparameters—specifically, maximum mutual information—analyzing results over multiple criteria with automatic and human evaluation. We find that (1) nucleus sampling is generally best with thresholds between 0.7 and 0.9; (2) a maximum mutual information objective can improve the quality of generated stories; and (3) established automatic metrics do not correlate well with human judgments of narrative quality on any qualitative metric.

Measuring the ‘I don’t know’ Problem through the Lens of Gricean Quantity
Huda Khayrallah | João Sedoc
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We consider the intrinsic evaluation of neural generative dialog models through the lens of Grice’s Maxims of Conversation (1975). Based on the maxim of Quantity (be informative), we propose Relative Utterance Quantity (RUQ) to diagnose the ‘I don’t know’ problem, in which a dialog system produces generic responses. The linguistically motivated RUQ diagnostic compares the model score of a generic response to that of the reference response. We find that for reasonable baseline models, ‘I don’t know’ is preferred over the reference the majority of the time, but this can be reduced to less than 5% with hyperparameter tuning. RUQ allows for the direct analysis of the ‘I don’t know’ problem, which has been addressed but not analyzed by prior work.

WASSA 2021 Shared Task: Predicting Empathy and Emotion in Reaction to News Stories
Shabnam Tafreshi | Orphee De Clercq | Valentin Barriere | Sven Buechel | João Sedoc | Alexandra Balahur
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper presents the results that were obtained from the WASSA 2021 shared task on predicting empathy and emotions. The participants were given access to a dataset comprising empathic reactions to news stories where harm is done to a person, group, or other. These reactions consist of essays, Batson empathic concern, and personal distress scores, and the dataset was further extended with news articles, person-level demographic information (age, gender, ethnicity, income, education level), and personality information. Additionally, emotion labels, namely Ekman’s six basic emotions, were added to the essays at both the document and sentence level. Participation was encouraged in two tracks: predicting empathy and predicting emotion categories. In total five teams participated in the shared task. We summarize the methods and resources used by the participating teams.

2020

Proceedings of the First Workshop on Insights from Negative Results in NLP
Anna Rogers | João Sedoc | Anna Rumshisky
Proceedings of the First Workshop on Insights from Negative Results in NLP

Item Response Theory for Efficient Human Evaluation of Chatbots
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.

SMRT Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multiple Reference Training
Huda Khayrallah | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2020

Non-task-oriented dialog models suffer from poor quality and non-diverse responses. To overcome limited conversational data, we apply Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020), and use a paraphraser to simulate multiple responses per training prompt. We find SMRT improves over a strong Transformer baseline as measured by human and automatic quality scores and lexical diversity. We also find SMRT is comparable to pretraining in human evaluation quality, and outperforms pretraining on automatic quality and lexical diversity, without requiring related-domain dialog data.

We release a dataset of over 2,100 COVID19 related Frequently Asked Question-Answer pairs scraped from over 40 trusted websites. We include an additional 24,000 questions pulled from online sources that have been aligned by experts with existing answered questions from our dataset. This paper describes our efforts in collecting the dataset and summarizes the resulting data. Our dataset is automatically updated daily and available at https://github.com/JHU-COVID-QA/scraping-qas. So far, this data has been used to develop a chatbot providing users information about COVID-19. We encourage others to build analytics and tools upon this dataset as well.

Using the Poly-encoder for a COVID-19 Question Answering System
Seolhwa Lee | João Sedoc
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

To combat misinformation regarding COVID-19 during this unprecedented pandemic, we propose a conversational agent that answers questions related to COVID-19. We adapt the Poly-encoder (Humeau et al., 2020) model for informational retrieval from FAQs. We show that after fine-tuning, the Poly-encoder can achieve a higher F1 score. We make our code publicly available for other researchers to use.

Learning Emotion from 100 Observations: Unexpected Robustness of Deep Learning under Strong Data Limitations
Sven Buechel | João Sedoc | H. Andrew Schwartz | Lyle Ungar
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

One of the major downsides of Deep Learning is its supposed need for vast amounts of training data. As such, these techniques appear ill-suited for NLP areas where annotated data is limited, such as less-resourced languages or emotion analysis, with its many nuanced and hard-to-acquire annotation formats. We conduct a questionnaire study indicating that indeed the vast majority of researchers in emotion analysis deems neural models inferior to traditional machine learning when training data is limited. In stark contrast to those survey results, we provide empirical evidence for English, Polish, and Portuguese that commonly used neural architectures can be trained on surprisingly few observations, outperforming n-gram based ridge regression on only 100 data points. Our analysis suggests that high-quality, pre-trained word embeddings are a main factor for achieving those results.

Learning Word Ratings for Empathy and Distress from Document-Level User Responses
João Sedoc | Sven Buechel | Yehonathan Nachmany | Anneke Buffone | Lyle Ungar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Despite the excellent performance of black box approaches to modeling sentiment and emotion, lexica (sets of informative words and associated weights) that characterize different emotions are indispensable to the NLP community because they allow for interpretable and robust predictions. Emotion analysis of text is increasing in popularity in NLP; however, manually creating lexica for psychological constructs such as empathy has proven difficult. This paper automatically creates empathy word ratings from document-level ratings. The underlying problem of learning word ratings from higher-level supervision has to date only been addressed in an ad hoc fashion and has not used deep learning methods. We systematically compare a number of approaches to learning word ratings from higher-level supervision against a Mixed-Level Feed Forward Network (MLFFN), which we find performs best, and use the MLFFN to create the first-ever empathy lexicon. We then use Signed Spectral Clustering to gain insights into the resulting words. The empathy and distress lexica are publicly available at: http://www.wwbp.org/lexica.html.

Incremental Neural Coreference Resolution in Constant Memory
Patrick Xia | João Sedoc | Benjamin Van Durme
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We investigate modeling coreference resolution under a fixed memory constraint by extending an incremental clustering algorithm to utilize contextualized encoders and neural components. Given a new sentence, our end-to-end algorithm proposes and scores each mention span against explicit entity representations created from the earlier document context (if any). These spans are then used to update the entity’s representations before being forgotten; we only retain a fixed set of salient entities throughout the document. In this work, we successfully convert a high-performing model (Joshi et al., 2020), asymptotically reducing its memory usage to constant space with only a 0.3% relative loss in F1 on OntoNotes 5.0.

COD3S: Diverse Generation with Discrete Semantic Signatures
Nathaniel Weir | João Sedoc | Benjamin Van Durme
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present COD3S, a novel method for generating semantically diverse sentences using neural sequence-to-sequence (seq2seq) models. Conditioned on an input, seq2seqs typically produce semantically and syntactically homogeneous sets of sentences and thus perform poorly on one-to-many sequence generation tasks. Our two-stage approach improves output diversity by conditioning generation on locality-sensitive hash (LSH)-based semantic sentence codes whose Hamming distances highly correlate with human judgments of semantic textual similarity. Though it is generally applicable, we apply it to causal generation, the task of predicting a proposition’s plausible causes or effects. We demonstrate through automatic and human evaluation that responses produced using our method exhibit improved diversity without degrading task performance.

2019

Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification
Reno Kriz | João Sedoc | Marianna Apidianaki | Carolina Zheng | Gaurav Kumar | Eleni Miltsakaki | Chris Callison-Burch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sentence simplification is the task of rewriting texts so they are easier to understand. Recent research has applied sequence-to-sequence (Seq2Seq) models to this task, focusing largely on training-time improvements via reinforcement learning and memory augmentation. One of the main problems with applying generic Seq2Seq models for simplification is that these models tend to copy directly from the original sentence, resulting in outputs that are relatively long and complex. We aim to alleviate this issue through the use of two main techniques. First, we incorporate content word complexities, as predicted with a leveled word complexity model, into our loss function during training. Second, we generate a large set of diverse candidate simplifications at test time, and rerank these to promote fluency, adequacy, and simplicity. Here, we measure simplicity through a novel sentence complexity model. These extensions allow our models to perform competitively with state-of-the-art systems while generating simpler sentences. We report standard automatic and human evaluation metrics.

Conceptor Debiasing of Word Representations Evaluated on WEAT
Saket Karve | Lyle Ungar | João Sedoc
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Bias in word representations, such as Word2Vec, has been widely reported and investigated, and efforts made to debias them. We apply the debiasing conceptor for post-processing both traditional and contextualized word embeddings. Our method can simultaneously remove racial and gender biases from word representations. Unlike standard debiasing methods, the debiasing conceptor can utilize heterogeneous lists of biased words without loss in performance. Finally, our empirical experiments show that the debiasing conceptor diminishes racial and gender bias of word representations as measured using the Word Embedding Association Test (WEAT) of Caliskan et al. (2017).

The Role of Protected Class Word Lists in Bias Identification of Contextualized Word Representations
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Systemic bias in word embeddings has been widely reported and studied, and efforts made to debias them; however, new contextualized embeddings such as ELMo and BERT are only now being similarly studied. Standard debiasing methods require heterogeneous lists of target words to identify the “bias subspace”. We show that using new contextualized word embeddings in conceptor debiasing allows us to more accurately debias word embeddings by breaking target word lists into more homogeneous subsets and then combining (“Or’ing”) the debiasing conceptors of the different subsets.

Comparison of Diverse Decoding Methods from Conditional Language Models
Daphne Ippolito | Reno Kriz | João Sedoc | Maria Kustikova | Chris Callison-Burch
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While conditional language models have greatly improved in their ability to output high quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to, within a given-sized candidate list, cover as much of the space of high-quality outputs as possible, leading to improvements for tasks that rerank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. In this work, we perform an extensive survey of decoding-time strategies for generating diverse outputs from a conditional language model. In addition, we present a novel method where we over-sample candidates, then use clustering to remove similar sequences, thus achieving high diversity without sacrificing quality.

ChatEval: A Tool for Chatbot Evaluation
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Open-domain dialog systems (i.e. chatbots) are difficult to evaluate. The current best practice for analyzing and comparing these dialog systems is the use of human judgments. However, the lack of standardization in evaluation procedures, and the fact that model parameters and code are rarely published hinder systematic human evaluation experiments. We introduce a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems. Researchers can submit their trained models to the ChatEval web interface and obtain comparisons with baselines and prior work. The evaluation code is open-source to ensure standardization and transparency. In addition, we introduce open-source baseline models and evaluation datasets. ChatEval can be found at https://chateval.org.

Continual Learning for Sentence Representations Using Conceptors
Tianlin Liu | Lyle Ungar | João Sedoc
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Distributed representations of sentences have become ubiquitous in natural language processing tasks. In this paper, we consider a continual learning scenario for sentence representations: Given a sequence of corpora, we aim to optimize the sentence encoder with respect to the new corpus while maintaining its accuracy on the old corpora. To address this problem, we propose to initialize sentence encoders with the help of corpus-independent features, and then sequentially update sentence encoders using Boolean operations of conceptor matrices to learn corpus-dependent features. We evaluate our approach on semantic textual similarity tasks and show that our proposed sentence encoder can continually learn features from new corpora while retaining its competence on previously encountered corpora.

2018

Modeling Empathy and Distress in Reaction to News Stories
Sven Buechel | Anneke Buffone | Barry Slaff | Lyle Ungar | João Sedoc
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Computational detection and understanding of empathy is an important factor in advancing human-computer interaction. Yet to date, text-based empathy prediction has the following major limitations: It underestimates the psychological complexity of the phenomenon, adheres to a weak notion of ground truth where empathic states are ascribed by third parties, and lacks a shared corpus. In contrast, this contribution presents the first publicly available gold standard for empathy prediction. It is constructed using a novel annotation methodology which reliably captures empathy assessments by the writer of a statement using multi-item scales. This is also the first computational work distinguishing between multiple forms of empathy, empathic concern, and personal distress, as recognized throughout psychology. Finally, we present experimental results for three different predictive models, of which a CNN performs the best.

ChatEval: A Tool for the Systematic Evaluation of Chatbots
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

2017

Semantic Word Clusters Using Signed Spectral Clustering
João Sedoc | Jean Gallier | Dean Foster | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Vector space representations of words capture many aspects of word similarity, but such methods tend to produce vector spaces in which antonyms (as well as synonyms) are close to each other. For spectral clustering using such word embeddings, words are points in a vector space where synonyms are linked with positive weights, while antonyms are linked with negative weights. We present a new signed spectral normalized graph cut algorithm, signed clustering, that overlays existing thesauri upon distributionally derived vector representations of words, so that antonym relationships between word pairs are represented by negative weights. Our signed clustering algorithm produces clusters of words that simultaneously capture distributional and synonym relations. By using randomized spectral decomposition (Halko et al., 2011) and sparse matrices, our method is both fast and scalable. We validate our clusters using datasets containing human judgments of word pair similarities and show the benefit of using our word clusters for sentiment prediction.

Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering
João Sedoc | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering of vector space word representations along with affect ratings. We use our method to determine a word’s valence and arousal, which determine its position on the circumplex model of affect, the most popular dimensional model of emotion. Our method achieves superior out-of-sample word rating prediction on both affective dimensions across three different languages when compared to state-of-the-art word similarity based methods. Our method can assist building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.

Co-authors

Sebastian Gehrmann 6

Saad Mahamood 6

Craig Thomson 6

Kaustubh Dhole 5

Anna Rumshisky 5

Chris Callison-Burch 4

Khyathi Raghavi Chandu 4

Elizabeth Clark 4

Orphee De Clercq 4

Aleksandr Drozd 4

Salvatore Giorgi 4

Alexandra Balahur 3

Daniel Deutsch 3

Benjamin Van Durme 3

Rudali Huidrom 3

Daphne Ippolito 3

H. Andrew Schwartz 3

Tosin Adewumi 2

Smisha Agarwal 2

Sawsan Alqahtani 2

Pawan Sasanka Ammanamanchi 2

Jeremy Barnes 2

Anneke Buffone 2

Alexandra DeLucia 2

Ondřej Dušek 2

Luis Fernando D’Haro 2

Cristina Garbacea 2

Yacine Jernite 2

Shailza Jolly 2

Huda Khayrallah 2

Arun Kirubarajan 2

Roman Klinger 2

Faisal Ladhak 2

John P. Lalor 2

Angelina McMillan-Major 2

Vitaly Nikolaev 2

Shivani Pandya 2

Laura Perez-Beltrachini 2

Leonardo F. R. Ribeiro 2

Pedro Rodriguez 2

Enrico Santus 2

Garrick Sherman 2

Hendrik Strobelt 2

Nishant Subramani 2

Milind Agarwal 1

Karmanya Aggarwal 1

Marianna Apidianaki 1

Anuoluwapo Aremu 1

Agnes Johanna Axelsson 1

Niranjan Balasubramanian 1

Simone Balloccu 1

Naor Bar-Zeev 1

Chandra Bhagavatula 1

Abhik Bhattacharjee 1

Antoine Bosselut 1

Samuel Cahyawijaya 1

Ronald Cardenas 1

Khyathi Chandu 1

Young Min Cho 1

Cash Costello 1

Mathias Creutz 1

Brenda Curtis 1

Moussa Kamal Eddine 1

Darren Edmonds 1

Chris Chinenye Emezue 1

Zachary Fried 1

Roni Friedman 1

Praneeth Ganedi 1

Sarik Ghazarian 1

Dimitra Gkatzia 1

Sharath Guntuku 1

Xu Han (韩旭) 1

Tatsunori B. Hashimoto 1

Hiroaki Hayashi 1

Jose Hernandez-Orallo 1

Veronique Hoste 1

Kelsey Jane Isman 1

Nicola Ivanov 1

Harsh Jhamtani 1

Juraj Juraska 1

Mihir Sanjay Kale 1

Jenna Kanerva 1

Maria Kustikova 1

Xiang Lisa Li 1

Paul Pu Liang 1

Mounica Maddela 1

Khyati Mahajan 1

Abinaya Mahendiran 1

Bodhisattwa Prasad Majumder 1

Rahul Mallidi 1

Siddharth Mangalik 1

Pedro Henrique Martins 1

Joshua Maynez 1

John Mendonça 1

Eleni Miltsakaki 1

Sebastien Montella 1

Aaron Mueller 1

Kenton Murray 1

Yehonathan Nachmany 1

Nikita Nangia 1

Shashi Narayan 1

Marcel Nawrath 1

Jekaterina Novikova 1

Agnieszka Nowak 1

Ishmael Nyunya Obonyo 1

Emma O’neil 1

Alexandros Papangelis 1

Prasanna Parasurama 1

Sebastian Peralta 1

Yotam Perlitz 1

Maja Popović 1

Daniel Preoţiuc-Pietro 1

Ratish Puduppully 1

Mahim Pushkarna 1

Dragomir Radev 1

Niranjan Ramesh Rao 1

Ann-Katrin Reuel 1

Juan Diego Rodriguez 1

Mario Rodríguez-Cantelar 1

Andre Niyongabo Rubungo 1

Alexander Rudnicky 1

Sashank Santhanam 1

Roshan Santhosh 1

Hooman Sedghamiz 1

Thibault Sellam 1

Rifat Shahriyar 1

Samira Shaikh 1

Anastasia Shimorina 1

Michal Shmueli-Scheuer 1

Aditya Singhal 1

Marco Antonio Sobrevilla Cabezudo 1

Kaushik Srinivasan 1

Tejas Srivastava 1

Gabriel Stanovsky 1

Oyvind Tafjord 1

Chengguang Tang 1

Isabel Trancoso 1

Lewis Tunstall 1

Ashish Upadhyay 1

Adithya V. Ganesan 1

Shalaka Vaidya 1

Emiel Van Miltenburg 1

Vasudha Varadarajan 1

Danilo C. Walenta 1

Nathaniel Weir 1

Michael White 1

Genta Indra Winata 1

Deyi Xiong (德意熊) 1

Bingsheng Yao 1

Mahsa Yarmohammadi 1

Akhila Yerukola 1

Carolina Zheng 1

Sanja Štajner 1

Venues