Ninareh Mehrabi - ACL Anthology

Ninareh Mehrabi

2025

DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning
Tanmay Parekh | Kartik Mehta | Ninareh Mehrabi | Kai-Wei Chang | Nanyun Peng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4–7% average F1 gains over the best baseline – establishing DiCoRe as a strong zero-shot ED framework.

Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Trista Cao | Anubrata Das | Tharindu Kumarage | Yixin Wan | Satyapriya Krishna | Ninareh Mehrabi | Jwala Dhamala | Anil Ramakrishna | Aram Galystan | Anoop Kumar | Rahul Gupta | Kai-Wei Chang
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

On Localizing and Deleting Toxic Memories in Large Language Models
Anubrata Das | Manoj Kumar | Ninareh Mehrabi | Anil Ramakrishna | Anna Rumshisky | Kai-Wei Chang | Aram Galstyan | Morteza Ziyadi | Rahul Gupta
Findings of the Association for Computational Linguistics: NAACL 2025

Warning: This paper contains offensive language.Ensuring that large language models (LLMs) do not generate harmful text is critical for their safe deployment. A common failure mode involves producing toxic responses to otherwise innocuous prompts. While various detoxification methods have been proposed, the underlying mechanisms that drive toxic generation in LLMs are not yet fully understood. Our work aims to provide a mechanistic understanding of toxic generation against innocuous-seeming adversarial prompts through the lens of memory localization. We find evidence of localization of toxic memories in the early Multilayer Perceptron (MLP) layers of GPT-2-XL. We further investigate the effects of editing and deleting these toxic memories in MLP layers to reduce toxic generation. Editing significantly reduces toxic generation, from 62.86% to 28.61%. However, this reduction comes with a trade-off in generation quality as perplexity increases from 78.18 on GPT2-XL against the adversarial prompts to 106.06 after editing. Localization-informed deletion achieves a better toxicity-perplexity tradeoff compared to random early layer editing, which reduces toxicity but leads to greater perplexity increases.

Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Huihan Li | You Chen | Siyuan Wang | Yixin He | Ninareh Mehrabi | Rahul Gupta | Xiang Ren
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources – local, mid-range, or long-range – based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Tharindu Kumarage | Ninareh Mehrabi | Anil Ramakrishna | Xinyan Zhao | Richard Zemel | Kai-Wei Chang | Aram Galstyan | Rahul Gupta | Charith Peris
Findings of the Association for Computational Linguistics: ACL 2025

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy.

2024

Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Fei Wang | Ninareh Mehrabi | Palash Goyal | Rahul Gupta | Kai-Wei Chang | Aram Galstyan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Data are crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.

Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
Anaelia Ovalle | Ninareh Mehrabi | Palash Goyal | Jwala Dhamala | Kai-Wei Chang | Richard Zemel | Aram Galstyan | Yuval Pinter | Rahul Gupta
Findings of the Association for Computational Linguistics: NAACL 2024

Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.

FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi | Palash Goyal | Christophe Dupuy | Qian Hu | Shalini Ghosh | Richard Zemel | Kai-Wei Chang | Aram Galstyan | Rahul Gupta
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Warning: this paper contains content that may be inappropriate or offensive.As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns on their robustness in practical uses. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.

The steerability of large language models toward data-driven personas
Junyi Li | Charith Peris | Ninareh Mehrabi | Palash Goyal | Kai-Wei Chang | Aram Galstyan | Richard Zemel | Rahul Gupta
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs, that can be leveraged to produce multiple perspectives and to reflect the diverse opinions. Moving beyond the traditional reliance on demographics like age, gender, or party affiliation, we introduce a data-driven notion of persona grounded in collaborative filtering, which is defined as either a single individual or a cohort of individuals manifesting similar views across specific inquiries. As individuals in the same demographic group may have different personas, our data-driven persona definition allows for a more nuanced understanding of different (latent) social groups present in the population. In addition to this, we also explore an efficient method to steer LLMs toward the personas that we define. We show that our data-driven personas significantly enhance model steerability, with improvements of between 57%-77% over our best performing baselines.

Prompt Perturbation Consistency Learning for Robust Language Models
Yao Qiang | Subhrangshu Nandi | Ninareh Mehrabi | Greg Ver Steeg | Anoop Kumar | Anna Rumshisky | Aram Galstyan
Findings of the Association for Computational Linguistics: EACL 2024

Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments show that PPCL can recover on an average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats data augmentation approach while using ten times fewer augmented data samples.

Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng | Ninareh Mehrabi | Palash Goyal | Anil Ramakrishna | Aram Galstyan | Richard Zemel | Kai-Wei Chang | Rahul Gupta | Charith Peris
Findings of the Association for Computational Linguistics: EMNLP 2024

We propose a constraint learning schema forfine-tuning Large Language Models (LLMs)with attribute control. Given a training corpusand control criteria formulated as a sequence-level constraint on model outputs, our methodfine-tunes the LLM on the training corpus whileenhancing constraint satisfaction with minimalimpact on its utility and generation quality.Specifically, our approach regularizes the LLMtraining by penalizing the KL divergence be-tween the desired output distribution, which sat-isfies the constraints, and the LLM’s posterior.This regularization term can be approximatedby an auxiliary model trained to decomposethe sequence-level constraints into token-levelguidance, allowing the term to be measuredby a closed-form formulation. To further im-prove efficiency, we design a parallel schemefor concurrently updating both the LLM andthe auxiliary model. We evaluate the empiricalperformance of our approach by controlling thetoxicity when training an LLM. We show thatour approach leads to an LLM that producesfewer inappropriate responses while achievingcompetitive performance on benchmarks and atoxicity detection task

Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Anaelia Ovalle | Kai-Wei Chang | Yang Trista Cao | Ninareh Mehrabi | Jieyu Zhao | Aram Galstyan | Jwala Dhamala | Anoop Kumar | Rahul Gupta
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs
Elan Markowitz | Anil Ramakrishna | Jwala Dhamala | Ninareh Mehrabi | Charith Peris | Rahul Gupta | Kai-Wei Chang | Aram Galstyan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge graphs (KGs) complement Large Language Models (LLMs) by providing reliable, structured, domain-specific, and up-to-date external knowledge. However, KGs and LLMs are often developed separately and must be integrated after training. We introduce Tree-of-Traversals, a novel zero-shot reasoning algorithm that enables augmentation of black-box LLMs with one or more KGs. The algorithm equips a LLM with actions for interfacing a KG and enables the LLM to perform tree search over possible thoughts and actions to find high confidence reasoning paths. Tree-of-Traversals significantly improves performance on question answering and KG question answering tasks. Code is available at https://github.com/amazon-science/tree-of-traversals

BELIEVE: Belief-Enhanced Instruction Generation and Augmentation for Zero-Shot Bias Mitigation
Lisa Bauer | Ninareh Mehrabi | Palash Goyal | Kai-Wei Chang | Aram Galstyan | Rahul Gupta
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Language models, pre-trained on large amounts of unmoderated content, have been shown to contain societal biases. Mitigating such biases typically requires access to model parameters and training schemas. In this work, we address bias mitigation at inference time, such that it can be applied to any black-box model. To this end, we propose a belief generation and augmentation framework, BELIEVE, that demonstrates effective bias mitigation for natural language generation by augmenting input prompts with automatically generated instruction-based beliefs. Our framework eases the bottleneck required for manually crafting these instruction-based beliefs, by extending a recently proposed iterative in-context learning framework to automatically generate beliefs via a language model. We assess the impact of this system on fairness, and demonstrate effective bias mitigation on pretrained and instruction-tuned models for both sentiment and regard with respect to multiple protected classes including race, gender, and political ideology.

MICo: Preventative Detoxification of Large Language Models through Inhibition Control
Roy Siegelmann | Ninareh Mehrabi | Palash Goyal | Prasoon Goyal | Lisa Bauer | Jwala Dhamala | Aram Galstyan | Rahul Gupta | Reza Ghanadan
Findings of the Association for Computational Linguistics: NAACL 2024

Large Language Models (LLMs) are powerful tools which have been both dominant and commonplace in the field of Artificial Intelligence. Yet, LLMs have a tendency to devolve into toxic degeneration, wherein otherwise safe and unproblematic models begin generating toxic content. For the sake of social responsibility and inspired by the biological mechanisms of inhibition control, we introduce the paradigm of Education for Societal Norms (ESN). By collecting and labeling examples as acceptable and unacceptable (in this case toxic and non-toxic), and including a corresponding acceptable rewrite with every unacceptable example, we introduce a new mechanism for LLM detoxification. We annotate a dataset of 2,850 entries and use it to fine-tune a model, which we call a Model with Inhibition Control (MICo). Evaluating this model on toxicity detection capability, rewrite detoxification, meaning preservation, and overall toxicity reduction, we discover significant improvements over the baseline model. In our experiments we show that overall toxicity of this model is more than 60% reduced, with over 75% reduction in severe toxicity.

2023

Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Anaelia Ovalle | Kai-Wei Chang | Ninareh Mehrabi | Yada Pruksachatkun | Aram Galystan | Jwala Dhamala | Apurv Verma | Trista Cao | Anoop Kumar | Rahul Gupta
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Resolving Ambiguities in Text-to-Image Generative Models
Ninareh Mehrabi | Palash Goyal | Apurv Verma | Jwala Dhamala | Varun Kumar | Qian Hu | Kai-Wei Chang | Richard Zemel | Aram Galstyan | Rahul Gupta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate the Text-to-image Ambiguity Benchmark (TAB) dataset to study different types of ambiguities in text-to-image generative models. We then propose the Text-to-ImagE Disambiguation (TIED) framework to disambiguate the prompts given to the text-to-image generative models by soliciting clarifications from the end user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with end user intention in the presence of ambiguities.

2022

Robust Conversational Agents against Imperceptible Toxicity Triggers
Ninareh Mehrabi | Ahmad Beirami | Fred Morstatter | Aram Galstyan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Warning: this paper contains content that maybe offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language from existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language and the defense against them. Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.

Attributing Fair Decisions with Attention Interventions
Ninareh Mehrabi | Umang Gupta | Fred Morstatter | Greg Ver Steeg | Aram Galstyan
Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022)

The widespread use of Artificial Intelligence (AI) in consequential domains, such as health-care and parole decision-making systems, has drawn intense scrutiny on the fairness of these methods. However, ensuring fairness is often insufficient as the rationale for a contentious decision needs to be audited, understood, and defended. We propose that the attention mechanism can be used to ensure fair outcomes while simultaneously providing feature attributions to account for how a decision was made. Toward this goal, we design an attention-based model that can be leveraged as an attribution framework. It can identify features responsible for both performance and fairness of the model through attention interventions and attention weight manipulation. Using this attribution framework, we then design a post-processing bias mitigation strategy and compare it with a suite of baselines. We demonstrate the versatility of our approach by conducting experiments on two distinct data types, tabular and textual.

Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)
Bill Yuchen Lin | Chaoyang He | Chulin Xie | Fatemehsadat Mireshghallah | Ninareh Mehrabi | Tian Li | Mahdi Soltanolkotabi | Xiang Ren
Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)

2021

Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources
Ninareh Mehrabi | Pei Zhou | Fred Morstatter | Jay Pujara | Xiang Ren | Aram Galstyan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Warning: this paper contains content that may be offensive or upsetting. Commonsense knowledge bases (CSKB) are increasingly used for various natural language processing tasks. Since CSKBs are mostly human-generated and may reflect societal biases, it is important to ensure that such biases are not conflated with the notion of commonsense. Here we focus on two widely used CSKBs, ConceptNet and GenericsKB, and establish the presence of bias in the form of two types of representational harms, overgeneralization of polarized perceptions and representation disparity across different demographic groups in both CSKBs. Next, we find similar representational harms for downstream models that use ConceptNet. Finally, we propose a filtering-based approach for mitigating such harms, and observe that our filtered-based approach can reduce the issues in both resources and models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.

Co-authors

Venues