2024
pdf
bib
abs
FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi
|
Palash Goyal
|
Christophe Dupuy
|
Qian Hu
|
Shalini Ghosh
|
Richard Zemel
|
Kai-Wei Chang
|
Aram Galstyan
|
Rahul Gupta
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Warning: this paper contains content that may be inappropriate or offensive.As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns on their robustness in practical uses. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.
pdf
bib
abs
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Fei Wang
|
Ninareh Mehrabi
|
Palash Goyal
|
Rahul Gupta
|
Kai-Wei Chang
|
Aram Galstyan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Data are crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.
pdf
bib
abs
MICo: Preventative Detoxification of Large Language Models through Inhibition Control
Roy Siegelmann
|
Ninareh Mehrabi
|
Palash Goyal
|
Prasoon Goyal
|
Lisa Bauer
|
Jwala Dhamala
|
Aram Galstyan
|
Rahul Gupta
|
Reza Ghanadan
Findings of the Association for Computational Linguistics: NAACL 2024
Large Language Models (LLMs) are powerful tools which have been both dominant and commonplace in the field of Artificial Intelligence. Yet, LLMs have a tendency to devolve into toxic degeneration, wherein otherwise safe and unproblematic models begin generating toxic content. For the sake of social responsibility and inspired by the biological mechanisms of inhibition control, we introduce the paradigm of Education for Societal Norms (ESN). By collecting and labeling examples as acceptable and unacceptable (in this case toxic and non-toxic), and including a corresponding acceptable rewrite with every unacceptable example, we introduce a new mechanism for LLM detoxification. We annotate a dataset of 2,850 entries and use it to fine-tune a model, which we call a Model with Inhibition Control (MICo). Evaluating this model on toxicity detection capability, rewrite detoxification, meaning preservation, and overall toxicity reduction, we discover significant improvements over the baseline model. In our experiments we show that overall toxicity of this model is more than 60% reduced, with over 75% reduction in severe toxicity.
pdf
bib
abs
Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
Anaelia Ovalle
|
Ninareh Mehrabi
|
Palash Goyal
|
Jwala Dhamala
|
Kai-Wei Chang
|
Richard Zemel
|
Aram Galstyan
|
Yuval Pinter
|
Rahul Gupta
Findings of the Association for Computational Linguistics: NAACL 2024
Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.
pdf
bib
abs
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng
|
Ninareh Mehrabi
|
Palash Goyal
|
Anil Ramakrishna
|
Aram Galstyan
|
Richard Zemel
|
Kai-Wei Chang
|
Rahul Gupta
|
Charith Peris
Findings of the Association for Computational Linguistics: EMNLP 2024
We propose a constraint learning schema forfine-tuning Large Language Models (LLMs)with attribute control. Given a training corpusand control criteria formulated as a sequence-level constraint on model outputs, our methodfine-tunes the LLM on the training corpus whileenhancing constraint satisfaction with minimalimpact on its utility and generation quality.Specifically, our approach regularizes the LLMtraining by penalizing the KL divergence be-tween the desired output distribution, which sat-isfies the constraints, and the LLM’s posterior.This regularization term can be approximatedby an auxiliary model trained to decomposethe sequence-level constraints into token-levelguidance, allowing the term to be measuredby a closed-form formulation. To further im-prove efficiency, we design a parallel schemefor concurrently updating both the LLM andthe auxiliary model. We evaluate the empiricalperformance of our approach by controlling thetoxicity when training an LLM. We show thatour approach leads to an LLM that producesfewer inappropriate responses while achievingcompetitive performance on benchmarks and atoxicity detection task
pdf
bib
abs
The steerability of large language models toward data-driven personas
Junyi Li
|
Charith Peris
|
Ninareh Mehrabi
|
Palash Goyal
|
Kai-Wei Chang
|
Aram Galstyan
|
Richard Zemel
|
Rahul Gupta
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs, that can be leveraged to produce multiple perspectives and to reflect the diverse opinions. Moving beyond the traditional reliance on demographics like age, gender, or party affiliation, we introduce a data-driven notion of persona grounded in collaborative filtering, which is defined as either a single individual or a cohort of individuals manifesting similar views across specific inquiries. As individuals in the same demographic group may have different personas, our data-driven persona definition allows for a more nuanced understanding of different (latent) social groups present in the population. In addition to this, we also explore an efficient method to steer LLMs toward the personas that we define. We show that our data-driven personas significantly enhance model steerability, with improvements of between 57%-77% over our best performing baselines.
pdf
bib
abs
BELIEVE: Belief-Enhanced Instruction Generation and Augmentation for Zero-Shot Bias Mitigation
Lisa Bauer
|
Ninareh Mehrabi
|
Palash Goyal
|
Kai-Wei Chang
|
Aram Galstyan
|
Rahul Gupta
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Language models, pre-trained on large amounts of unmoderated content, have been shown to contain societal biases. Mitigating such biases typically requires access to model parameters and training schemas. In this work, we address bias mitigation at inference time, such that it can be applied to any black-box model. To this end, we propose a belief generation and augmentation framework, BELIEVE, that demonstrates effective bias mitigation for natural language generation by augmenting input prompts with automatically generated instruction-based beliefs. Our framework eases the bottleneck required for manually crafting these instruction-based beliefs, by extending a recently proposed iterative in-context learning framework to automatically generate beliefs via a language model. We assess the impact of this system on fairness, and demonstrate effective bias mitigation on pretrained and instruction-tuned models for both sentiment and regard with respect to multiple protected classes including race, gender, and political ideology.
2023
pdf
bib
abs
Resolving Ambiguities in Text-to-Image Generative Models
Ninareh Mehrabi
|
Palash Goyal
|
Apurv Verma
|
Jwala Dhamala
|
Varun Kumar
|
Qian Hu
|
Kai-Wei Chang
|
Richard Zemel
|
Aram Galstyan
|
Rahul Gupta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate the Text-to-image Ambiguity Benchmark (TAB) dataset to study different types of ambiguities in text-to-image generative models. We then propose the Text-to-ImagE Disambiguation (TIED) framework to disambiguate the prompts given to the text-to-image generative models by soliciting clarifications from the end user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with end user intention in the presence of ambiguities.
pdf
bib
abs
Faithful Model Evaluation for Model-Based Metrics
Qian Hu
|
Palash Goyal
|
Rahul Gupta
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Statistical significance testing is used in natural language processing (NLP) to determine whether the results of a study or experiment are likely to be due to chance or if they reflect a genuine relationship. A key step in significance testing is the estimation of confidence interval which is a function of sample variance. Sample variance calculation is straightforward when evaluating against ground truth. However, in many cases, a metric model is often used for evaluation. For example, to compare toxicity of two large language models, a toxicity classifier is used for evaluation. Existing works usually do not consider the variance change due to metric model errors, which can lead to wrong conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics. With experiments on public benchmark datasets and a production system, we show that considering metric model errors to calculate sample variances for model-based metrics changes the conclusions in certain experiments.