2024
FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi | Palash Goyal | Christophe Dupuy | Qian Hu | Shalini Ghosh | Richard Zemel | Kai-Wei Chang | Aram Galstyan | Rahul Gupta
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns about their robustness in practical use. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.
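As a rough illustration of the feedback loop the abstract describes, the sketch below keeps a pool of in-context exemplars, generates a new adversarial prompt from them, scores the target model's output with a safety classifier, and swaps the new prompt into the pool when it beats the weakest exemplar. The callables (gen_adversarial, query_target, unsafety_score) and the greedy replacement policy are illustrative assumptions, not the exact FLIRT algorithm.

```python
from typing import Callable, List, Tuple

def feedback_loop_red_team(
    gen_adversarial: Callable[[List[str]], str],  # red-team LM prompted with in-context exemplars
    query_target: Callable[[str], str],           # black-box target model under test
    unsafety_score: Callable[[str], float],       # safety classifier over the target's output
    seed_exemplars: List[str],
    n_iters: int = 100,
) -> List[Tuple[str, float]]:
    """Greedy sketch of feedback-loop in-context red teaming (illustrative, not the FLIRT spec)."""
    # Score the seed exemplars once so the pool can be ranked.
    pool = [(p, unsafety_score(query_target(p))) for p in seed_exemplars]
    attempts = []
    for _ in range(n_iters):
        pool.sort(key=lambda x: x[1], reverse=True)        # strongest exemplars first
        candidate = gen_adversarial([p for p, _ in pool])  # in-context generation of a new prompt
        score = unsafety_score(query_target(candidate))
        attempts.append((candidate, score))
        if score > pool[-1][1]:                            # feedback: replace the weakest exemplar
            pool[-1] = (candidate, score)
    return attempts
```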
Toward Informal Language Processing: Knowledge of Slang in Large Language Models
Zhewei Sun | Qian Hu | Rahul Gupta | Richard Zemel | Yang Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent advances in large language models (LLMs) offer strong potential for natural language systems to process informal language. A representative form of informal language is slang, used commonly in daily conversations and online social media. To date, slang has not been comprehensively evaluated in LLMs due partly to the absence of a carefully designed and publicly accessible benchmark. Using movie subtitles, we construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang. For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications: 1) slang detection, and 2) identification of regional and historical sources of slang from natural sentences. We also show how our dataset can be used to probe the output distributions of LLMs for interpretive insights. We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance. Furthermore, we show that our dataset enables finetuning of LLMs such as GPT-3.5 that achieve substantially better performance than strong zero-shot baselines. Our work offers a comprehensive evaluation and a high-quality benchmark on English slang based on the OpenSubtitles corpus, serving both as a publicly accessible resource and a platform for applying tools for informal language processing.
Towards Multi-Modal Co-Reference Resolution in Conversational Shopping Agents
Samuel Osebe | Prashan Wanigasekara | Thomas Gueudre | Thanh Tran | Rahul Sharma | Fan Yang | Qian Hu | Weitong Ruan | Emre Barut | Chengwei Su
Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024
The context of modern smart voice assistants is often multi-modal, where images, audio and video content are consumed by users simultaneously. In such a setup, co-reference resolution is especially challenging, and runs across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human in a shopping use case. We propose a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a 4.9% absolute F1 improvement over a cross-attention baseline while reducing the number of parameters being trained by 4x.
Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought
Jooyoung Lee | Fan Yang | Thanh Tran | Qian Hu | Emre Barut | Kai-Wei Chang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., <1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance. The frozen large LM is then prompted to predict a task output based on the rationale generated by the lightweight LM. Our approach is resource-efficient in the sense that it only requires training the lightweight LM. We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals. We assess our method on two multi-hop extractive question answering (QA) benchmarks, HotpotQA and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines regarding answer prediction accuracy. We also find that reinforcement learning helps the model to produce higher-quality rationales with improved QA performance.
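The two-stage inference the abstract outlines can be sketched as below: a lightweight LM drafts a rationale, and the frozen large LM answers conditioned on it. The generic small_lm/large_lm callables and the prompt templates are assumptions for illustration rather than the paper's exact prompts.

```python
from typing import Callable

def lm_guided_cot(
    small_lm: Callable[[str], str],  # lightweight (<1B) rationale generator
    large_lm: Callable[[str], str],  # frozen black-box (>10B) answer predictor
    question: str,
    context: str,
) -> str:
    """Rationale from the small LM, answer from the frozen large LM (prompts are illustrative)."""
    rationale = small_lm(
        f"Context: {context}\nQuestion: {question}\n"
        "Write the step-by-step reasoning needed to answer the question:"
    )
    answer = large_lm(
        f"Context: {context}\nQuestion: {question}\n"
        f"Reasoning: {rationale}\nAnswer:"
    )
    return answer.strip()
```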
2023
Resolving Ambiguities in Text-to-Image Generative Models
Ninareh Mehrabi | Palash Goyal | Apurv Verma | Jwala Dhamala | Varun Kumar | Qian Hu | Kai-Wei Chang | Richard Zemel | Aram Galstyan | Rahul Gupta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate the Text-to-image Ambiguity Benchmark (TAB) dataset to study different types of ambiguities in text-to-image generative models. We then propose the Text-to-ImagE Disambiguation (TIED) framework to disambiguate the prompts given to the text-to-image generative models by soliciting clarifications from the end user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with end user intention in the presence of ambiguities.
Evaluating Large Language Models on Controlled Generation Tasks
Jiao Sun | Yufei Tian | Wangchunshu Zhou | Nan Xu | Qian Hu | Rahul Gupta | John Wieting | Nanyun Peng | Xuezhe Ma
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
While recent studies have looked into the abilities of large language models on various benchmark tasks, including question generation, reading comprehension, and multilingual tasks, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks, including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models. We conclude that *large language models struggle at meeting fine-grained hard constraints*.
Log-FGAER: Logic-Guided Fine-Grained Address Entity Recognition from Multi-Turn Spoken Dialogue
Xue Han | Yitong Wang | Qian Hu | Pengwei Hu | Chao Deng | Junlan Feng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Fine-grained address entity recognition (FGAER) from multi-turn spoken dialogues is particularly challenging. The major reason is that a full address is often formed over the course of a conversation: different parts of an address are distributed across multiple dialogue turns with spoken noise, and it is nontrivial to extract them turn by turn and combine them. This challenge has not been well emphasized by mainstream entity extraction algorithms. To address this issue, we propose a logic-guided fine-grained address recognition method (Log-FGAER), where we formulate the address hierarchy relationship as a logic rule and apply it softly in a probabilistic manner to improve the accuracy of FGAER. In addition, we provide an ontology-based data augmentation methodology that employs ChatGPT to augment a spoken dialogue dataset with labeled address entities. Experiments are conducted using datasets generated by the proposed data augmentation technique and derived from real-world scenarios. The results demonstrate the efficacy of our proposal.
Faithful Model Evaluation for Model-Based Metrics
Qian Hu | Palash Goyal | Rahul Gupta
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Statistical significance testing is used in natural language processing (NLP) to determine whether the results of a study or experiment are likely due to chance or reflect a genuine relationship. A key step in significance testing is the estimation of the confidence interval, which is a function of the sample variance. Sample variance calculation is straightforward when evaluating against ground truth. However, in many cases, a metric model is used for evaluation instead. For example, to compare the toxicity of two large language models, a toxicity classifier is used for evaluation. Existing works usually do not consider the variance change due to metric model errors, which can lead to wrong conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics. With experiments on public benchmark datasets and a production system, we show that accounting for metric model errors when calculating sample variances for model-based metrics changes the conclusions in certain experiments.
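As one concrete way to see why metric-model error matters, the sketch below applies a standard misclassification correction (a Rogan-Gladen rate estimate with a delta-method variance) to a rate measured by an imperfect classifier. This is an illustrative textbook construction under the stated assumptions, not necessarily the derivation used in the paper.

```python
import math

def corrected_rate_and_variance(q_hat: float, n: int,
                                se: float, n_se: int,
                                sp: float, n_sp: int):
    """Rate and variance corrected for metric-model (classifier) error.

    q_hat         : rate measured with the metric model (e.g. fraction flagged toxic)
    se, sp        : classifier sensitivity and specificity from a labeled calibration set
    n, n_se, n_sp : sizes of the evaluation sample and the calibration subsets
    """
    j = se + sp - 1.0                        # Youden's index; the correction requires j > 0
    p_hat = (q_hat + sp - 1.0) / j           # Rogan-Gladen corrected rate
    var = (q_hat * (1.0 - q_hat) / n
           + p_hat ** 2 * se * (1.0 - se) / n_se
           + (1.0 - p_hat) ** 2 * sp * (1.0 - sp) / n_sp) / j ** 2
    return p_hat, var

# The naive variance q_hat*(1-q_hat)/n ignores classifier error and understates uncertainty.
rate, var = corrected_rate_and_variance(q_hat=0.12, n=5000, se=0.90, n_se=800, sp=0.95, n_sp=800)
print(f"corrected rate {rate:.3f}, standard error {math.sqrt(var):.4f}")
```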
2021
Collaborative Data Relabeling for Robust and Diverse Voice Apps Recommendation in Intelligent Personal Assistants
Qian Hu | Thahir Mohamed | Wei Xiao | Zheng Gao | Xibin Gao | Radhika Arava | Xiyao Ma | Mohamed AbdelHady
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI
Intelligent personal assistants (IPAs) such as Amazon Alexa, Google Assistant and Apple Siri extend their built-in capabilities by supporting voice apps developed by third-party developers. Sometimes the smart assistant is not able to successfully respond to user voice commands (aka utterances). There are many reasons, including automatic speech recognition (ASR) errors, natural language understanding (NLU) errors, routing utterances to an irrelevant voice app, or simply that the user is asking for a capability that is not supported yet. The failure to handle a voice command leads to customer frustration. In this paper, we introduce a fallback skill recommendation system that suggests a voice app to a customer for an unhandled voice command. One of the prominent challenges of developing a skill recommender system for IPAs is partial observation. To solve the partial observation problem, we propose a collaborative data relabeling (CDR) method. In addition, CDR also improves the diversity of the recommended skills. We evaluate the proposed method both offline and online. The offline evaluation results show that the proposed system outperforms the baselines, and the online A/B testing results show significant gains in customer experience metrics.
2004
Audio Hot Spotting and Retrieval using Multiple Features
Qian Hu | Fred Goodman | Stanley Boykin | Randy Fish | Warren Greiff
Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004