Malihe Alikhani - ACL Anthology

Malihe Alikhani

2025

SignAlignLM: Integrating Multimodal Sign Language Processing into Large Language Models
Mert Inan | Anthony Sicilia | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2025

Deaf and Hard-of-Hearing (DHH) users increasingly utilize Large Language Models (LLMs), yet face significant challenges due to these models’ limited understanding of sign language grammar, multimodal sign inputs, and Deaf cultural contexts. Further, current approaches that try to address these limitations frequently reduce sign language processing (SLP) to traditional translation tasks, neglecting the multimodal and linguistic complexity inherent in signed languages. In this paper, we present an empirical investigation, informed by learning theory, into natively integrating sign language support within LLMs, directly addressing the documented needs of DHH users. We introduce SignAlignLM, the first text-based and multimodal LLMs capable of sign language processing, and propose new prompting and fine-tuning strategies incorporating sign linguistic rules and conventions. We show that LLMs can be generalized interfaces for both spoken and signed languages if trained with a multitasking paradigm. Our code and model checkpoints are open-source.

SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
Saki Imai | Mert Inan | Anthony B. Sicilia | Malihe Alikhani
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
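
The "discrimination between correct and random pairs" reported above can be illustrated with a small sketch. This is not the SiLVERScore implementation: the two-dimensional embeddings are made-up toy values, cosine similarity is assumed as the scoring function, and `cosine` and `roc_auc` are hypothetical helper names. It only shows how a ROC AUC over similarity scores in a shared embedding space might be computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy embeddings (assumed values, for illustration only).
sign_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.9]]

# Correct pairs share an index; "random" pairs are all mismatched indices.
correct = [cosine(s, t) for s, t in zip(sign_emb, text_emb)]
mismatched = [
    cosine(sign_emb[i], text_emb[j])
    for i in range(len(sign_emb))
    for j in range(len(text_emb))
    if i != j
]

auc = roc_auc(correct, mismatched)
print(f"ROC AUC on toy data: {auc:.2f}")
```

An AUC near 1 means correct pairs almost always score higher than mismatched ones; an AUC near 0.5 means the score cannot tell them apart.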

Measuring Bias and Agreement in Large Language Model Presupposition Judgments
Katherine Atwell | Mandy Simons | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2025

Identifying linguistic bias in text demands the identification not only of explicitly asserted content but also of implicit content including presuppositions. Large language models (LLMs) offer a promising automated approach to detecting presuppositions, yet the extent to which their judgments align with human intuitions remains unexplored. Moreover, LLMs may inadvertently reflect societal biases when identifying presupposed content. To empirically investigate this, we prompt multiple large language models to evaluate presuppositions across diverse textual domains, drawing from three distinct datasets annotated by human raters. We calculate the agreement between LLMs and human raters, and find several linguistic factors associated with fluctuations in human-model agreement. Our observations reveal discrepancies in human-model alignment, suggesting potential biases in LLMs, notably influenced by gender and political ideology.

An Active Learning Framework for Inclusive Generation by Large Language Models
Sabit Hassan | Anthony B. Sicilia | Malihe Alikhani
Proceedings of the 31st International Conference on Computational Linguistics

Ensuring that Large Language Models (LLMs) generate text representative of diverse sub-populations is essential, particularly when key concepts related to under-represented groups are scarce in the training data. We address this challenge with a novel clustering-based active learning framework, enhanced with knowledge distillation. The proposed framework transforms the intermediate outputs of the learner model, enabling effective active learning for generative tasks for the first time. Integration of clustering and knowledge distillation yields more representative models without prior knowledge of the underlying data distribution or overbearing human effort. We validate our approach in practice through case studies in counter-narration and style transfer. We construct two new datasets in tandem with model training, showing a performance improvement of 2%–10% over baseline models. Our results also show more consistent performance across various data subgroups and increased lexical diversity, underscoring our model’s resilience to skewness in available data. Further, our results show that the data acquired via our approach improves the performance of secondary models not involved in the learning loop, showcasing the practical utility of the framework.

Accounting for Sycophancy in Language Model Uncertainty Estimation
Anthony Sicilia | Mert Inan | Malihe Alikhani
Findings of the Association for Computational Linguistics: NAACL 2025

Effective human-machine collaboration requires machine learning models to externalize uncertainty, so users can reflect and intervene when necessary. For language models, these representations of uncertainty may be impacted by sycophancy bias: a proclivity to agree with users, even if they are wrong. For instance, models may be over-confident in (incorrect) problem solutions suggested by a user. We study the relationship between sycophancy and uncertainty estimation for the first time. We propose a generalization of the definition of sycophancy bias to measure downstream impacts on uncertainty estimation, and also propose a new algorithm (SyRoUP) to account for sycophancy in the uncertainty estimation process. Unlike previous works, we study a broad array of user behaviors, varying both correctness and confidence of user suggestions to see how model answers (and their certainty) change. Our experiments across conversation forecasting and question-answering tasks show that user confidence plays a critical role in modulating the effects of sycophancy, and that SyRoUP can better predict these effects. From these results, we argue that externalizing both model and user uncertainty can help to mitigate the impacts of sycophancy bias.

How to Align Multiple Signed Language Corpora for Better Sign-to-Sign Translations?
Mert Inan | Yang Zhong | Vidya Ganesh | Malihe Alikhani
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

There are more than 300 documented signed languages worldwide, which are indispensable avenues for computational linguists to study cross-cultural and cross-linguistic factors that affect automatic sign understanding and generation. Yet, these are studied under critically low-resource settings, especially when examining multiple signed languages simultaneously. In this work, we hypothesize that a linguistically informed alignment algorithm can improve the results of sign-to-sign translation models. To this end, we first conduct a qualitative analysis of similarities and differences across three signed languages: American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS). We then introduce a novel generation and alignment algorithm for translating one sign language to another, exploring Large Language Models (LLMs) as intermediary translators and paraphrasers. We also compile a dataset of sign-to-sign translation pairs between these signed languages. Our model trained on this dataset performs well on automatic metrics for sign-to-sign translation and generation. Our code and data will be available for the camera-ready version of the paper.

How LLMs Influence Perceived Bias in Journalism
Asteria Kaeberlein | Malihe Alikhani
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

As the use of generative AI tools in journalistic writing becomes more common, reporters have expressed growing concerns about how it may introduce bias to their works. This paper investigates how the integration of large language models (LLMs) into journalistic writing, both as editors and independent ‘authors’, can alter user perception of bias in media. We show novel insights into how human perception of media bias differs from automatic evaluations. Through human evaluations comparing original human-authored articles, AI-edited articles, and AI-generated articles, we show that while LLMs rarely introduce new bias and often trend towards neutrality, this supposedly ‘safe’ behavior can have harmful impacts. This is most observable in sensitive human rights contexts, where the AI’s neutral and measured tone can reduce the representation of relevant voices and present misinformation in a more convincing manner. Furthermore, we demonstrate the existence of previously unidentified patterns that existing automated bias detection methods fail to accurately capture. We underscore the critical need for human-centered evaluation frameworks in AI-assisted journalism by introducing human evaluations and contrasting against a state-of-the-art automated bias detection system.

Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others from Conversational Cues
Anthony Sicilia | Malihe Alikhani
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of participants in a dialogue. We design these tasks around conversation forecasting, where the goal is to predict the probability of an unobserved conversation outcome. Uniquely, we view conversation agents themselves as forecasters, asking an LM to predict the uncertainty of an individual from their language use. We experiment with scaling methods, bagging, and demographic context for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% of the variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in tasks that require explicit shifts in perspective.

Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI
Yuya Asano | Sabit Hassan | Paras Sharma | Anthony B. Sicilia | Katherine Atwell | Diane Litman | Malihe Alikhani
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated the system 0.8 to 1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
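
As a rough illustration of step (1) above, the sketch below reranks n-best ASR hypotheses by lexical overlap with context phrases. It is not the authors' method: token-level Jaccard overlap stands in for their lexical and semantic similarity, the phonetic ranking of step (2) is omitted, and the function names and example strings are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two strings (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def rerank(hypotheses, context_phrases):
    """Order hypotheses by their best overlap with any context phrase.

    sorted() is stable, so hypotheses with equal scores keep their
    original ASR order.
    """
    def best_overlap(hypothesis):
        return max((jaccard(hypothesis, c) for c in context_phrases), default=0.0)

    return sorted(hypotheses, key=best_overlap, reverse=True)

# Hypothetical n-best list and dialogue-state context for a cooking task.
n_best = ["at the flower", "add the flour", "and the floor"]
context = ["add the flour to the bowl", "preheat the oven"]

ranked = rerank(n_best, context)
print(ranked[0])
```

Because the context comes from the current dialogue state rather than from prior user data, a reranker of this shape needs no user history, which is the setting the abstract targets.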

Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
Mert Inan | Anthony Sicilia | Alex Xie | Saujas Vaduguru | Daniel Fried | Malihe Alikhani
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker’s intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views of the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and thereby improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.

Measuring How (Not Just Whether) VLMs Build Common Ground
Saki Imai | Mert Inan | Anthony B. Sicilia | Malihe Alikhani
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

Adaptive Platt Scaling with Causal Interpretations for Self-Reflective Language Model Uncertainty Estimates
Anthony Sicilia | Malihe Alikhani
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) are consumed by more users and deployed in increasingly autonomous capacities, their ability to self-monitor and ask for human intervention is of vital importance. Underlying this capability are fundamental skills like self-reflection and expression of uncertainty. In this work, we provide a formal analysis of LLM self-reflection for uncertainty estimation, using domain adaptation theory to model the shift between base predictions and reflective judgments. We use this to motivate a temperature scaling algorithm that calibrates uncertainty using comparisons between base predictions and LLM self-reflections. We evaluate our approach on challenging question-answering tasks requiring reasoning, demonstrating that our methods can improve calibration of uncertainty estimates and also offer improvements in human interpretation. More broadly, this use case shows how domain adaptation presents a promising analytical tool for understanding the underlying statistical properties of LLM self-reflections.
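
For readers unfamiliar with the calibration family named in the title, the sketch below shows plain Platt scaling: fitting a slope `a` and offset `b` so that `sigmoid(a * score + b)` better matches observed correctness. It is not the paper's adaptive algorithm, which calibrates using comparisons between base predictions and self-reflections; the scores, labels, learning rate, and step count here are all assumed toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(scores, labels, a, b):
    """Mean negative log-likelihood of labels under sigmoid(a * score + b)."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(a * s + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit (a, b) by batch gradient descent on the log loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated identity mapping
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy overconfident logits: large magnitudes, but several are wrong.
scores = [4.0, 3.5, 3.0, 2.5, -3.0, -2.0, 3.8, -3.5]
labels = [1, 0, 1, 0, 0, 1, 1, 0]

loss_before = log_loss(scores, labels, 1.0, 0.0)
a, b = fit_platt(scores, labels)
loss_after = log_loss(scores, labels, a, b)
print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Temperature scaling, the term used in the abstract, corresponds to the special case with `b` fixed at 0 and `a` equal to the inverse temperature.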

Reversing Causal Assumptions: Explainability in Online Sports Dialogues
Asteria Kaeberlein | Malihe Alikhani
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Prior XAI research often assumes inputs must be “causes” and outputs must be “effects”, severely limiting applicability to analyzing behaviors that emerge as reactions or consequences. Many linguistic tasks, such as dialogues and conversations, involve such behaviors. To address this, we propose that the assumed causality from inputs to outputs can be reversed and still remain valid by using outputs that cause changes in features. We show how this enables analysis of complex feature sets through simpler metrics, propose a framework that is generalizable to most linguistic tasks, and highlight best practices for applying our framework. By training a predictive model from complex effects to simple causes, we apply feature attributions to estimate how the inputs change with the outputs. We demonstrate an application of this by studying sports fans’ comments made during a game and comparing those comments to a simpler metric, win probability. We also expand on a prior study of intergroup bias, demonstrating how our framework can uncover behaviors that other XAI methods may overlook. We discuss the implications of these findings for advancing interpretability in computational linguistics and improving data-driven decision-making in social contexts.

2024

HumBEL: A Human-in-the-Loop Approach for Evaluating Demographic Factors of Language Models in Human-Machine Conversations
Anthony Sicilia | Jennifer Gates | Malihe Alikhani
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

While demographic factors like age and gender change the way people talk, and in particular, the way people talk to machines, there is little investigation into how large pre-trained language models (LMs) can adapt to these changes. To remedy this gap, we consider how demographic factors in LM language skills can be measured to determine compatibility with a target demographic. We suggest clinical techniques from Speech Language Pathology, which has norms for acquisition of language skills in humans. We conduct evaluation with a domain expert (i.e., a clinically licensed speech language pathologist), and also propose automated techniques to complement clinical evaluation at scale. Empirically, we focus on age, finding LM capability varies widely depending on task: GPT-3.5 mimics the ability of humans ranging from ages 6-15 at tasks requiring inference, and simultaneously, outperforms a typical 21-year-old at memorization. GPT-3.5 also has trouble with social language use, exhibiting less than 50% of the tested pragmatic skills. Findings affirm the importance of considering demographic alignment and conversational goals when using LMs as public-facing tools. Code, data, and a package will be available.

Combining Discourse Coherence with Large Language Models for More Inclusive, Equitable, and Robust Task-Oriented Dialogue
Katherine Atwell | Mert Inan | Anthony B. Sicilia | Malihe Alikhani
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) are capable of generating well-formed responses, but using LLMs to generate responses on the fly is not yet feasible for many task-oriented systems. Modular architectures are often still required for safety and privacy guarantees on the output. We hypothesize that an offline generation approach using discourse theories, formal grammar rules, and LLMs can allow us to generate human-like, coherent text in a more efficient, robust, and inclusive manner within a task-oriented setting. To this end, we present the first discourse-aware multimodal task-oriented dialogue system that combines discourse theories with offline LLM generation. We deploy our bot as an app to the general public and keep track of the user ratings for six months. Our user ratings show an improvement from 2.8 to 3.5 out of 5 with the introduction of discourse coherence theories. We also show that our model reduces misunderstandings in the dialect of African-American Vernacular English from 93% to 57%. While terms of use prevent us from releasing our entire codebase, we release our code in a format that can be integrated into most existing dialogue systems.

Proceedings of the 28th Conference on Computational Natural Language Learning
Libby Barak | Malihe Alikhani
Proceedings of the 28th Conference on Computational Natural Language Learning

Studying and Mitigating Biases in Sign Language Understanding Models
Katherine Atwell | Danielle Bragg | Malihe Alikhani
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Ensuring that the benefits of sign language technologies are distributed equitably among all community members is crucial. Thus, it is important to address potential biases and inequities that may arise from the design or use of these resources. Crowd-sourced sign language datasets, such as the ASL Citizen dataset, are great resources for improving accessibility and preserving linguistic diversity, but they must be used thoughtfully to avoid reinforcing existing biases. In this work, we utilize the rich information about participant demographics and lexical features present in the ASL Citizen dataset to study and document the biases that may result from models trained on crowd-sourced sign datasets. Further, we apply several bias mitigation techniques during model training, and find that these techniques reduce performance disparities without decreasing accuracy. With the publication of this work, we release the demographic information about the participants in the ASL Citizen dataset to encourage future bias mitigation work in this space.

Generating Signed Language Instructions in Large-Scale Dialogue Systems
Mert Inan | Katherine Atwell | Anthony Sicilia | Lorna Quandt | Malihe Alikhani
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at https://github.com/Merterm/signed-dialogue, and a demo of our signed instruction video retrieval system is available at https://huggingface.co/spaces/merterm/signed-instructions.

Seeing Eye-to-Eye: Cross-Modal Coherence Relations Inform Eye-gaze Patterns During Comprehension & Production
Mert Inan | Malihe Alikhani
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Context influences how we engage with multimodal documents. Describing and processing the content of images is highly correlated with the goals of the discourse. It is known that these underlying cognitive processes can be tapped into by looking at eye movements, but the connection between discourse goals and eye movements is a missing link. In this study, we carry out both augmented reality and webcam-based eye-tracking experiments during comprehension and production tasks. We build on computational frameworks of coherence in text and images that study causal, logical, elaborative, and temporal inferences to understand how eye gaze patterns and coherence relations influence each other. No state-of-the-art techniques exist to analyze eye movements in multimodal language settings. So, we introduce a new eye gaze pattern ranking algorithm and a semantic gaze visualization technique to study this phenomenon better. Our results demonstrate that eye gaze durations are person-dependent, and during comprehension and production, ranked gaze patterns are significantly different for different types of coherence relations. We also present a case study of how Multimodal Large Language Models represent this connection of eye gaze patterns and coherence relations. We make all of our code and novel analysis tools available through https://github.com/Merterm/eye-gaze-coherence.

Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models
Anthony Sicilia | Hyunwoo Kim | Khyathi Chandu | Malihe Alikhani | Jack Hessel
Findings of the Association for Computational Linguistics: ACL 2024

Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing “conversation forecasting” task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.

Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
Anthony Sicilia | Malihe Alikhani
Proceedings of the Third Workshop on NLP for Positive Impact

Conversation forecasting tasks a model with predicting the outcome of an unfolding conversation. For instance, it can be applied in social media moderation to predict harmful user behaviors before they occur, allowing for preventative interventions. While large language models (LLMs) have recently been proposed as an effective tool for conversation forecasting, it’s unclear what biases they may have, especially against forecasting the (potentially harmful) outcomes we request them to predict during moderation. This paper explores to what extent model uncertainty can be used as a tool to mitigate potential biases. Specifically, we ask three primary research questions: 1) how does LLM forecasting accuracy change when we ask models to represent their uncertainty; 2) how does LLM bias change when we ask models to represent their uncertainty; 3) how can we use uncertainty representations to reduce or completely mitigate biases without many training data points. We address these questions for 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.

Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios
Sabit Hassan | Anthony Sicilia | Malihe Alikhani
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Ensuring robust safety measures across a wide range of scenarios is crucial for user-facing systems. While Large Language Models (LLMs) can generate valuable data for safety measures, they often exhibit distributional biases, focusing on common scenarios and neglecting rare but critical cases. This can undermine the effectiveness of safety protocols developed using such data. To address this, we propose a novel framework that integrates active learning with clustering to guide LLM generation, enhancing its representativeness and robustness in safety scenarios. We demonstrate the effectiveness of our approach by constructing a dataset of 5.4K potential safety violations through an iterative process involving LLM generation and an active learner model’s feedback. Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution. Additionally, data acquired through our method improves the accuracy and F1 score of both the active learner model and models outside the scope of the active learning process, highlighting its broad applicability.

2023

SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Hyunwoo Kim | Jack Hessel | Liwei Jiang | Peter West | Ximing Lu | Youngjae Yu | Pei Zhou | Ronan Le Bras | Malihe Alikhani | Gunhee Kim | Maarten Sap | Yejin Choi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchats. We plan to make our data, model, and code public.

How people talk about each other: Modeling Generalized Intergroup Bias and Emotion
Venkata Subrahmanyan Govindarajan | Katherine Atwell | Barea Sinno | Malihe Alikhani | David I. Beaver | Junyi Jessy Li
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Current studies of bias in NLP rely mainly on identifying (unwanted or negative) bias towards a specific demographic group. While this has led to progress recognizing and mitigating negative bias, and having a clear notion of the targeted group is necessary, it is not always practical. In this work we extrapolate to a broader notion of bias, rooted in social science and psychology literature. We move towards predicting interpersonal group relationship (IGR) - modeling the relationship between the speaker and the target in an utterance - using fine-grained interpersonal emotions as an anchor. We build and release a dataset of English tweets by US Congress members annotated for interpersonal emotion - the first of its kind, and ‘found supervision’ for IGR labels; our analyses show that subtle emotional signals are indicative of different biases. While humans can perform better than chance at identifying IGR given an utterance, we show that neural models perform much better; furthermore, a shared encoding between IGR and interpersonal perceived emotion enabled performance gains in both tasks.

Learning Multimodal Cues of Children’s Uncertainty
Qi Cheng | Mert Inan | Rahma Mbarki | Grace Grmek | Theresa Choi | Yiming Sun | Kimele Persaud | Jenny Wang | Malihe Alikhani
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Understanding uncertainty plays a critical role in achieving common ground (Clark et al., 1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.

This paper presents the results of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23; https://www.wmt-slt.com/). This shared task is concerned with automatic translation between signed and spoken languages. The task is unusual in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task offers four tracks involving the following languages: Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), Italian Sign Language of Switzerland (LIS-CH), German, French and Italian. Four teams (including one working on a baseline submission) participated in this second edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora and reproducible baseline systems. Finally, the task also resulted in publicly available sets of system outputs and more human evaluation scores for sign language translation.

Proceedings of the 3rd Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP 2023)
Aishwarya Padmakumar | Mert Inan | Yue Fan | Xin Wang | Malihe Alikhani
Proceedings of the 3rd Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP 2023)

Learning to Generate Equitable Text in Dialogue from Biased Training Data
Anthony Sicilia | Malihe Alikhani
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ingrained principles of fairness in a dialogue system’s decision-making process and generated responses are crucial for user engagement, satisfaction, and task achievement. Absence of equitable and inclusive principles can hinder the formation of common ground, which in turn negatively impacts the overall performance of the system. For example, misusing pronouns in a user interaction may cause ambiguity about the intended subject. Yet, there is no comprehensive study of equitable text generation in dialogue. Aptly, in this work, we use theories of computational learning to study this problem. We provide formal definitions of equity in text generation, and further, prove formal connections between learning human-likeness and learning equity: algorithms for improving equity ultimately reduce to algorithms for improving human-likeness (on augmented data). With this insight, we also formulate reasonable conditions under which text generation algorithms can learn to generate equitable text without any modifications to the biased training data on which they learn. To exemplify our theory in practice, we look at a group of algorithms for the GuessWhat?! visual dialogue game and, using this example, test our theory empirically. Our theory accurately predicts relative-performance of multiple algorithms in generating equitable text as measured by both human and automated evaluation.

Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Svetlana Stoyanchev | Shafiq Joty | David Schlangen | Ondrej Dusek | Casey Kennington | Malihe Alikhani
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Including Facial Expressions in Contextual Embeddings for Sign Language Generation
Carla Viegas | Mert Inan | Lorna Quandt | Malihe Alikhani
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

State-of-the-art sign language generation frameworks lack expressivity and naturalness, which is the result of focusing only on manual signs and neglecting the affective, grammatical, and semantic functions of facial expressions. The purpose of this work is to augment the semantic representation of sign language by grounding facial expressions. We study the effect of modeling the relationship between text, gloss, and facial expressions on the performance of sign generation systems. In particular, we propose a Dual Encoder Transformer able to generate manual signs as well as facial expressions by capturing the similarities and differences found in text and sign gloss annotation. We take into consideration the role of facial muscle activity in expressing the intensities of manual signs by being the first to employ facial action units in sign language generation. We perform a series of experiments showing that our proposed model improves the quality of automatically generated sign language.

Multilingual Content Moderation: A Case Study on Reddit
Meng Ye | Karan Sikka | Katherine Atwell | Sabit Hassan | Ajay Divakaran | Malihe Alikhani
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderation decisions are based on violation of rules, which subsumes detection of offensive speech, and 2) such rules often differ across communities, which entails an adaptive solution. We propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 million Reddit comments spanning 56 subreddits in English, German, Spanish and French. We perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems such as cross-lingual transfer, learning under label noise (human biases), transfer of moderation models, and predicting the violated rule. Our dataset and analysis can help better prepare for the challenges and opportunities of auto moderation.

MedNgage: A Dataset for Understanding Engagement in Patient-Nurse Conversations
Yan Wang | Heidi Donovan | Sabit Hassan | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2023

Patients who effectively manage their symptoms often demonstrate higher levels of engagement in conversations and interventions with healthcare practitioners. This engagement is multifaceted, encompassing cognitive and social dimensions. Consequently, it is crucial for AI systems to understand the engagement in natural conversations between patients and practitioners to better contribute toward patient care. In this paper, we present a novel dataset (MedNgage), which consists of patient-nurse conversations about cancer symptom management. We manually annotate the dataset with a novel framework of categories of patient engagement from two different angles, namely: i) socio-affective engagement (3.1K spans), and ii) cognitive engagement (1.8K spans). Through statistical analysis of the data that is annotated using our framework, we show a positive correlation between patient symptom management outcomes and their engagement in conversations. Additionally, we demonstrate that pre-trained transformer models fine-tuned on our dataset can reliably predict engagement categories in patient-nurse conversations. Lastly, we use LIME (Ribeiro et al., 2016) to analyze the underlying challenges of the tasks that state-of-the-art transformer models encounter. The de-identified data is available for research purposes upon request.

PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically
Sedrick Scott Keh | Steven Y. Feng | Varun Gangal | Malihe Alikhani | Eduard Hovy
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called TT-Corp, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.

D-CALM: A Dynamic Clustering-based Active Learning Approach for Mitigating Bias
Sabit Hassan | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2023

Despite recent advancements, NLP models continue to be vulnerable to bias. This bias often originates from the uneven distribution of real-world data and can propagate through the annotation process. Escalated integration of these models in our lives calls for methods to mitigate bias without overbearing annotation costs. While active learning (AL) has shown promise in training models with a small amount of annotated data, AL’s reliance on the model’s behavior for selective sampling can lead to an accumulation of unwanted bias rather than bias mitigation. However, infusing clustering with AL can overcome the bias issue of both AL and traditional annotation methods while exploiting AL’s annotation efficiency. In this paper, we propose a novel adaptive clustering-based active learning algorithm, D-CALM, that dynamically adjusts clustering and annotation efforts in response to an estimated classifier error-rate. Experiments on eight datasets for a diverse set of text classification tasks, including emotion, hatespeech, dialog act, and book type detection, demonstrate that our proposed algorithm significantly outperforms baseline AL approaches with both pretrained transformers and traditional Support Vector Machines. D-CALM showcases robustness against different measures of information gain and, as evident from our analysis of label and error distribution, can significantly reduce unwanted model bias.

DisCGen: A Framework for Discourse-Informed Counterspeech Generation
Sabit Hassan | Malihe Alikhani
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Practical Tools from Domain Adaptation for Designing Inclusive, Equitable, and Robust Generative AI
Anthony Sicilia | Malihe Alikhani
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract

2022

Modeling Intensification for Sign Language Generation: A Computational Approach
Mert Inan | Yang Zhong | Sabit Hassan | Lorna Quandt | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2022

End-to-end sign language generation models do not accurately represent the prosody in sign language. A lack of temporal and spatial variations leads to poor-quality generated presentations that confuse human interpreters. In this paper, we aim to improve the prosody in generated sign languages by modeling intensification in a data-driven manner. We present different strategies grounded in the linguistics of sign language that inform how intensity modifiers can be represented in gloss annotations. To employ our strategies, we first annotate a subset of the benchmark PHOENIX-14T, a German Sign Language dataset, with different levels of intensification. We then use a supervised intensity tagger to extend the annotated dataset and obtain labels for the remaining portion of it. This enhanced dataset is then used to train state-of-the-art transformer models for sign language generation. We find that our efforts in intensification modeling yield better results when evaluated with automatic metrics. Human evaluation also indicates a higher preference for the videos generated using our model.

APPDIA: A Discourse-aware Transformer-based Style Transfer Model for Offensive Social Media Conversations
Katherine Atwell | Sabit Hassan | Malihe Alikhani
Proceedings of the 29th International Conference on Computational Linguistics

Using style-transfer models to reduce offensiveness of social media comments can help foster a more inclusive environment. However, there are no sizable datasets that contain offensive texts and their inoffensive counterparts, and fine-tuning pretrained models with limited labeled data can lead to the loss of original meaning in the style-transferred text. To address this issue, we provide two major contributions. First, we release the first publicly-available, parallel corpus of offensive Reddit comments and their style-transferred counterparts annotated by expert sociolinguists. Then, we introduce the first discourse-aware style-transfer models that can effectively reduce offensiveness in Reddit text while preserving the meaning of the original text. These models are the first to examine inferential links between the comment and the text it is replying to when transferring the style of offensive Reddit text. We propose two different methods of integrating discourse relations with pretrained transformer models and evaluate them on our dataset of offensive comments from Reddit and their inoffensive counterparts. Improvements over the baseline with respect to both automatic metrics and human evaluation indicate that our discourse-aware models are better at preserving meaning in style-transferred text when compared to the state-of-the-art discourse-agnostic models.

LEATHER: A Framework for Learning to Generate Human-like Text in Dialogue
Anthony Sicilia | Malihe Alikhani
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Algorithms for text-generation in dialogue can be misguided. For example, in task-oriented settings, reinforcement learning that optimizes only task-success can lead to abysmal lexical diversity. We hypothesize this is due to poor theoretical understanding of the objectives in text-generation and their relation to the learning process (i.e., model training). To this end, we propose a new theoretical framework for learning to generate text in dialogue. Compared to existing theories of learning, our framework allows for analysis of the multi-faceted goals inherent to text-generation. We use our framework to develop theoretical guarantees for learners that adapt to unseen data. As an example, we apply our theory to study data-shift within a cooperative learning algorithm proposed for the GuessWhat?! visual dialogue game. From this insight, we propose a new algorithm, and empirically, we demonstrate our proposal improves both task-success and human-likeness of the generated text. Finally, we show statistics from our theory are empirically predictive of multiple qualities of the generated dialogue, suggesting our theory is useful for model-selection when human evaluations are not available.

Modeling Non-Cooperative Dialogue: Theoretical and Empirical Insights
Anthony Sicilia | Tristan Maidment | Pat Healy | Malihe Alikhani
Transactions of the Association for Computational Linguistics, Volume 10

Investigating cooperativity of interlocutors is central in studying pragmatics of dialogue. Models of conversation that only assume cooperative agents fail to explain the dynamics of strategic conversations. Thus, we investigate the ability of agents to identify non-cooperative interlocutors while completing a concurrent visual-dialogue task. Within this novel setting, we study the optimality of communication strategies for achieving this multi-task objective. We use the tools of learning theory to develop a theoretical model for identifying non-cooperative interlocutors and apply this theory to analyze different communication strategies. We also introduce a corpus of non-cooperative conversations about images in the GuessWhat?! dataset proposed by De Vries et al. (2017). We use reinforcement learning to implement multiple communication strategies in this context and find that empirical results validate our theory.

Zero-shot Cross-Linguistic Learning of Event Semantics
Malihe Alikhani | Thomas Kober | Bashar Alhafni | Yue Chen | Mert Inan | Elizabeth Nielsen | Shahab Raji | Mark Steedman | Matthew Stone
Proceedings of the 15th International Conference on Natural Language Generation

The Change that Matters in Discourse Parsing: Estimating the Impact of Domain Shift on Parser Error
Katherine Atwell | Anthony Sicilia | Seong Jae Hwang | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2022

Discourse analysis allows us to attain inferences of a text document that extend beyond the sentence-level. The current performance of discourse models is very low on texts outside of the training distribution’s coverage, diminishing the practical utility of existing models. There is a need for a measure that can inform us to what extent our model generalizes from the training to the test sample when these samples may be drawn from distinct distributions. While this can be estimated via distribution shift, we argue that this does not directly correlate with change in the observed error of a classifier (i.e., error-gap). Thus, we propose to use a statistic from the theoretical domain adaptation literature which can be directly tied to error-gap. We study the bias of this statistic as an estimator of error-gap both theoretically and through a large-scale empirical study of over 2400 experiments on 6 discourse datasets from domains including, but not limited to: news, biomedical texts, TED talks, Reddit posts, and fiction. Our results not only motivate our proposal and help us to understand its limitations, but also provide insight on the properties of discourse models and datasets which improve performance in domain adaptation. For instance, we find that non-news datasets are slightly easier to transfer to than news datasets when the training and test sets are very different. Our code and an associated Python package are available to allow practitioners to make more informed model and dataset choices.

Political Ideology and Polarization: A Multi-dimensional Approach
Barea Sinno | Bernardo Oviedo | Katherine Atwell | Malihe Alikhani | Junyi Jessy Li
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Analyzing ideology and polarization is of critical importance in advancing our grasp of modern politics. Recent research has made great strides towards understanding the ideological bias (i.e., stance) of news media along the left-right spectrum. In this work, we instead take a novel and more nuanced approach for the study of ideology based on its left or right positions on the issue being discussed. Aligned with the theoretical accounts in political science, we treat ideology as a multi-dimensional construct, and introduce the first diachronic dataset of news articles whose ideological positions are annotated by trained political scientists and linguists at the paragraph level. We showcase that, by controlling for the author’s stance, our method allows for the quantitative and temporal measurement and analysis of polarization as a multidimensional ideological distance. We further present baseline models for ideology prediction, outlining a challenging task distinct from stance detection.

Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Oliver Lemon | Dilek Hakkani-Tur | Junyi Jessy Li | Arash Ashrafzadeh | Daniel Hernández Garcia | Malihe Alikhani | David Vandyke | Ondřej Dušek
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

The Role of Context and Uncertainty in Shallow Discourse Parsing
Katherine Atwell | Remi Choi | Junyi Jessy Li | Malihe Alikhani
Proceedings of the 29th International Conference on Computational Linguistics

Discourse parsing has proven to be useful for a number of NLP tasks that require complex reasoning. However, over a decade since the advent of the Penn Discourse Treebank, predicting implicit discourse relations in text remains challenging. There are several possible reasons for this, and we hypothesize that models should be exposed to more context as it plays an important role in accurate human annotation; meanwhile, adding uncertainty measures can improve model accuracy and calibration. To thoroughly investigate this phenomenon, we perform a series of experiments to determine 1) the effects of context on human judgments, and 2) the effect of quantifying uncertainty with annotator confidence ratings on model accuracy and calibration (which we measure using the Brier score (Brier et al., 1950)). We find that including annotator accuracy and confidence improves model accuracy, and incorporating confidence in the model’s temperature function can lead to models with significantly better-calibrated confidence measures. We also find some insightful qualitative results regarding human and model behavior on these datasets.
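
For readers unfamiliar with the calibration measure named in this abstract, the following is a minimal sketch of the multiclass Brier score: the mean squared difference between predicted class probabilities and one-hot gold labels. The function name `brier_score` and the example inputs are hypothetical illustrations, not taken from the paper or its code.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multiclass Brier score (lower is better; 0.0 is a perfect forecast).

    probs[i][k] is the predicted probability of class k for example i,
    and labels[i] is the index of the gold class for example i.
    """
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        # Squared error against the one-hot encoding of the gold label.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


# Hypothetical predictions over two relation classes for three examples.
example_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
example_labels = [0, 0, 1]
print(brier_score(example_probs, example_labels))
```

A confident wrong prediction is penalized more heavily than an uncertain one, which is why the score is sensitive to calibration and not only to accuracy.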

PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification Data for Learning Enhanced Generation
Sedrick Scott Keh | Kevin Lu | Varun Gangal | Steven Y. Feng | Harsh Jhamtani | Malihe Alikhani | Eduard Hovy
Proceedings of the 29th International Conference on Computational Linguistics

A personification is a figure of speech that endows inanimate entities with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation. To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative personifications that enhance the overall appeal of a sentence.

2021

Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics
Malihe Alikhani | Valts Blukis | Parisa Kordjamshidi | Aishwarya Padmakumar | Hao Tan
Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics

Signed Coreference Resolution
Kayo Yin | Kenneth DeHaan | Malihe Alikhani
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Coreference resolution is key to many natural language processing tasks and yet has been relatively unexplored in Sign Language Processing. In signed languages, space is primarily used to establish reference. Solving coreference resolution for signed languages would not only enable higher-level Sign Language Processing systems, but also enhance our understanding of language in different modalities and of situated references, which are key problems in studying grounded language. In this paper, we: (1) introduce Signed Coreference Resolution (SCR), a new challenge for coreference modeling and Sign Language Processing; (2) collect an annotated corpus of German Sign Language with gold labels for coreference together with an annotation software for the task; (3) explore features of hand gesture, iconicity, and spatial situated properties and move forward to propose a set of linguistically informed heuristics and unsupervised models for the task; (4) put forward several proposals about ways to address the complexities of this challenge effectively.

COSMic: A Coherence-Aware Generation Metric for Image Descriptions
Mert Inan | Piyush Sharma | Baber Khalid | Radu Soricut | Matthew Stone | Malihe Alikhani
Findings of the Association for Computational Linguistics: EMNLP 2021

Developers of text generation models rely on automated evaluation metrics as a stand-in for slow and expensive manual evaluations. However, image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text. We address this weakness by introducing the first discourse-aware learned generation metric for evaluating image descriptions. Our approach is inspired by computational theories of discourse for capturing information goals using coherence. We present a dataset of image–description pairs annotated with coherence relations. We then train a coherence-aware metric on a subset of the Conceptual Captions dataset and measure its effectiveness—its ability to predict human ratings of output captions—on a test set composed of out-of-domain images. We demonstrate a higher Kendall Correlation Coefficient for our proposed metric with the human judgments for the results of a number of state-of-the-art coherence-aware caption generation models when compared to several other metrics including recently proposed learned metrics such as BLEURT and BERTScore.
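
As a hedged illustration of how agreement between a metric and human judgments can be quantified, the sketch below computes Kendall's rank correlation in its simplest form (tau-a, which does not correct for ties). The function name `kendall_tau_a` and the sample scores are hypothetical; the paper's exact variant and evaluation setup are not specified in this abstract.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both sequences order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical metric scores and human ratings for four captions.
metric_scores = [0.2, 0.5, 0.7, 0.9]
human_ratings = [1, 3, 2, 5]
print(kendall_tau_a(metric_scores, human_ratings))
```

A value near 1 means the metric ranks captions in nearly the same order as human raters; a value near 0 means the two rankings are unrelated.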

Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on the Persian language, one of the widely spoken languages in the world, for which there are nonetheless few NLU datasets available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in the Persian language that includes a range of language understanding tasks (reading comprehension, textual entailment, and so on). These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results on state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.

Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)
Song Feng | Siva Reddy | Malihe Alikhani | He He | Yangfeng Ji | Mohit Iyyer | Zhou Yu
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

Examining Covert Gender Bias: A Case Study in Turkish and English Machine Translation Models
Chloe Ciora | Nur Iren | Malihe Alikhani
Proceedings of the 14th International Conference on Natural Language Generation

As Machine Translation (MT) has become increasingly powerful, accessible, and widespread, the potential for the perpetuation of bias has grown alongside its advances. While overt indicators of bias have been studied in machine translation, we argue that covert biases expose a problem that is further entrenched. Through the use of the gender-neutral language Turkish and the gendered language English, we examine cases of both overt and covert gender bias in MT models. Specifically, we introduce a method to investigate asymmetrical gender markings. We also assess bias in the attribution of personhood and examine occupational and personality stereotypes through overt bias indicators in MT models. Our work explores a deeper layer of bias in MT models and demonstrates the continued need for language-specific, interdisciplinary methodology in MT model development.

Including Signed Languages in Natural Language Processing
Kayo Yin | Amit Moryossef | Julie Hochgesang | Yoav Goldberg | Malihe Alikhani
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Signed languages are the primary means of communication for many deaf and hard of hearing individuals. Since signed languages exhibit all the fundamental linguistic properties of natural language, we believe that tools and theories of Natural Language Processing (NLP) are crucial to their modeling. However, existing research in Sign Language Processing (SLP) seldom attempts to explore and leverage the linguistic organization of signed languages. This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact. We first discuss the linguistic properties of signed languages to consider during their modeling. Then, we review the limitations of current SLP models and identify the open challenges to extend NLP to signed languages. Finally, we urge (1) the adoption of an efficient tokenization method; (2) the development of linguistically-informed models; (3) the collection of real-world signed language data; (4) the inclusion of local signed language communities as an active and leading voice in the direction of research.

Entheos: A Multimodal Dataset for Studying Enthusiasm
Carla Viegas | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Where Are We in Discourse Relation Recognition?
Katherine Atwell | Junyi Jessy Li | Malihe Alikhani
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Discourse parsers recognize the intentional and inferential relationships that organize extended texts. They have had a great influence on a variety of NLP tasks as well as theoretical studies in linguistics and cognitive science. However, it is often difficult to achieve good results from current discourse models, largely due to the difficulty of the task, particularly recognizing implicit discourse relations. Recent developments in transformer-based models have shown great promise on these analyses, but challenges still remain. We present a position paper which provides a systematic analysis of the state-of-the-art discourse parsers. We aim to examine the performance of current discourse parsing models via gradual domain shift: within the corpus, on in-domain texts, and on out-of-domain texts, and discuss the differences between the transformer-based models and the previous models in predicting different types of implicit relations, both inter- and intra-sentential. We conclude by describing several shortcomings of the existing models and discussing how future work should approach this problem.

2020

Combining Cognitive Modeling and Reinforcement Learning for Clarification in Dialogue
Baber Khalid | Malihe Alikhani | Matthew Stone
Proceedings of the 28th International Conference on Computational Linguistics

In many domains, dialogue systems need to work collaboratively with users to successfully reconstruct the meaning the user had in mind. In this paper, we show how cognitive models of users’ communicative strategies can be leveraged in a reinforcement learning approach to dialogue planning to enable interactive systems to give targeted, effective feedback about the system’s understanding. We describe a prototype system that collaborates on reference tasks that distinguish arbitrarily varying color patches from similar distractors, and use experiments with crowd workers and analyses of our learned policies to document that our approach leads to context-sensitive clarification strategies that focus on key missing information, elicit correct answers that the system understands, and contribute to increasing dialogue success.

Aspectuality Across Genre: A Distributional Semantics Approach
Thomas Kober | Malihe Alikhani | Matthew Stone | Mark Steedman
Proceedings of the 28th International Conference on Computational Linguistics

The interpretation of the lexical aspect of verbs in English plays a crucial role in tasks such as recognizing textual entailment and learning discourse-level inferences. We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics. We find that a verb’s local context is most indicative of its aspectual class, and we demonstrate that closed class words tend to be stronger discriminating contexts than content words. Our approach outperforms previous work on three datasets. Further, we present a new dataset of human-human conversations annotated with lexical aspects and present experiments that show the correlation of telicity with genre and discourse goals.

Cross-modal Coherence Modeling for Caption Generation
Malihe Alikhani | Piyush Sharma | Shengjie Li | Radu Soricut | Matthew Stone
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. Using an annotation protocol specifically devised for capturing image–caption coherence relations, we annotate 10,000 instances from publicly-available image–caption pairs. We introduce a new task for learning inferences in imagery and text, coherence relation prediction, and show that these coherence annotations can be exploited to learn relation classifiers as an intermediary step, and also train coherence-aware, controllable image captioning models. The results show a dramatic improvement in the consistency and quality of the generated captions with respect to information needs specified via coherence relations.

Achieving Common Ground in Multi-modal Dialogue
Malihe Alikhani | Matthew Stone
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

All communication aims at achieving common ground (grounding): interlocutors can work together effectively only with mutual beliefs about what the state of the world is, about what their goals are, and about how they plan to make their goals a reality. Computational dialogue research offers some classic results on grounding, which unfortunately offer scant guidance to the design of grounding modules and behaviors in cutting-edge systems. In this tutorial, we focus on three main topic areas: 1) grounding in human-human communication; 2) grounding in dialogue systems; and 3) grounding in multi-modal interactive systems, including image-oriented conversations and human-robot interactions. We highlight a number of achievements of recent computational research in coordinating complex content, show how these results lead to rich and challenging opportunities for doing grounding in more flexible and powerful ways, and canvass relevant insights from the literature on human–human conversation. We expect that the tutorial will be of interest to researchers in dialogue systems, computational semantics and cognitive modeling, and hope that it will catalyze research and system building that more directly explores the creative, strategic ways conversational agents might be able to seek and offer evidence about their understanding of their interlocutors.

Proceedings of the Third International Workshop on Spatial Language Understanding
Parisa Kordjamshidi | Archna Bhatia | Malihe Alikhani | Jason Baldridge | Mohit Bansal | Marie-Francine Moens
Proceedings of the Third International Workshop on Spatial Language Understanding

2019

“Caption” as a Coherence Relation: Evidence and Implications
Malihe Alikhani | Matthew Stone
Proceedings of the Second Workshop on Shortcomings in Vision and Language

We study verbs in image–text corpora, contrasting caption corpora, where texts are explicitly written to characterize image content, with depiction corpora, where texts and images may stand in more general relations. Captions show a distinctively limited distribution of verbs, with strong preferences for specific tense, aspect, lexical aspect, and semantic field. These limitations, which appear in data elicited by a range of methods, restrict the utility of caption corpora to inform image retrieval, multimodal document generation, and perceptually-grounded semantic models. We suggest that these limitations reflect the discourse constraints in play when subjects write texts to accompany imagery, so we argue that future development of image–text corpora should work to increase the diversity of event descriptions, while looking explicitly at the different ways text and imagery can be coherently related.

CITE: A Corpus of Image-Text Discourse Relations
Malihe Alikhani | Sreyasi Nag Chowdhury | Gerard de Melo | Matthew Stone
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper presents a novel crowd-sourced resource for multimodal discourse: our resource characterizes inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations. Like previous corpora annotating discourse structure between text arguments, such as the Penn Discourse Treebank, our new corpus aids in establishing a better understanding of natural communication and common-sense reasoning, while our findings have implications for a wide range of applications, such as understanding and generation of multimodal documents.

2018

Arrows are the Verbs of Diagrams
Malihe Alikhani | Matthew Stone
Proceedings of the 27th International Conference on Computational Linguistics

Arrows are a key ingredient of schematic pictorial communication. This paper investigates the interpretation of arrows through linguistic, crowdsourcing and machine-learning methodology. Our work establishes a novel analogy between arrows and verbs: we advocate representing arrows in terms of qualitatively different structural and semantic frames, and resolving frames to specific interpretations using shallow world knowledge.

Co-authors

Junyi Jessy Li 5

Anthony B. Sicilia 5

Ondřej Dušek 2

Steven Y. Feng 2

Asteria Kaeberlein 2

Sedrick Scott Keh 2

Parisa Kordjamshidi 2

Amit Moryossef 2

Aishwarya Padmakumar 2

Piyush Sharma 2

Mark Steedman 2

Bashar Alhafni 1

Moin Aminnaseri 1

Arash Ashrafzadeh 1

Eleftherios Avramidis 1

Erfan Sadeqi Azer 1

Jason Baldridge 1

David I. Beaver 1

Archna Bhatia 1

Marzieh Bitaab 1

Richard Bowden 1

Annelies Braffort 1

Danielle Bragg 1

Faeze Brahman 1

Necati Cihan Camgöz 1

Khyathi Chandu 1

Gerard De Melo 1

Kenneth DeHaan 1

Ajay Divakaran 1

Heidi Donovan 1

Cristina España-Bonet 1

Daniel Hernández Garcia 1

Jennifer Gates 1

Sarik Ghazarian 1

Mozhdeh Gheini 1

Yoav Goldberg 1

Venkata Subrahmanyan Govindarajan 1

Roman Grundkiewicz 1

Anne Göhring 1

Dilek Hakkani-Tur 1

Julie Hochgesang 1

Pedram Hosseini 1

Seong Jae Hwang 1

Harsh Jhamtani 1

Casey Kennington 1

Daniel Khashabi 1

Davy Van Landuyt 1

Rabeeh Karimi Mahabadi 1

Tristan Maidment 1

Omid Memarrast 1

Marie Francine Moens 1

Ahmadreza Mosallanezhad 1

Mathias Müller 1

Sreyasi Nag Chowdhury 1

Elizabeth Nielsen 1

Bernardo Oviedo 1

Kimele Persaud 1

Pouya Pezeshkpour 1

Mohammad Sadegh Rasooli 1

Annette Rios Gonzales 1

Sepideh Sadeghi 1

Niloofar Safi Samghabadi 1

David Schlangen 1

Mahsa Shafaei 1

Siamak Shakeri 1

Saber Sheybani 1

Dimitar Shterionov 1

Sandra Sidler-Miserez 1

Svetlana Stoyanchev 1

Saujas Vaduguru 1

David Vandyke 1

Yadollah Yaghoobzadeh 1

Venues