Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui


Abstract
Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.
Anthology ID:
2024.findings-emnlp.188
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3287–3301
URL:
https://aclanthology.org/2024.findings-emnlp.188
Cite (ACL):
Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, and Zhifang Sui. 2024. Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3287–3301, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming (Li et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.188.pdf