AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs
Jiawei Chen | Xiao Yang | Zhengwei Fang | Yu Tian | Yinpeng Dong | Zhaoxia Yin | Hang Su
Findings of the Association for Computational Linguistics: NAACL 2025
Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. We therefore rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs' summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further propose sentence compression and chain-of-thought-based mapping rules that correct errors and wordplay misinterpretations in target LLMs. To enhance efficiency, we also propose a two-stage mapping-rule optimization that refines mapping rules before querying target LLMs. Experimental results indicate that AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts AutoBreach generates for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.
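The abstract outlines the pipeline at a high level without implementation details. The sketch below is a minimal illustration of the two-stage mapping-rule optimization it describes, assuming generic attacker/judge/target query callables; every function name, prompt string, and the success check are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of AutoBreach's two-stage loop as described in the abstract.
# All names, prompts, and the scoring scheme below are illustrative
# assumptions, not the authors' implementation.
from typing import Callable, Optional

LLMFn = Callable[[str], str]  # takes a prompt, returns the model's reply


def propose_mapping_rule(attacker: LLMFn, goal: str) -> str:
    """Sample a wordplay-guided mapping rule with an attacker LLM."""
    return attacker(
        "Invent a wordplay mapping rule (e.g., a letter substitution) that "
        f"rewrites the following request so its intent is obscured: {goal}"
    )


def refine_rule(attacker: LLMFn, rule: str, critique: str) -> str:
    """Ask the attacker LLM to improve a rule given feedback."""
    return attacker(
        f"Improve this mapping rule.\nRule: {rule}\nCritique: {critique}"
    )


def jailbreak(attacker: LLMFn, judge: LLMFn, target: LLMFn,
              goal: str, max_queries: int = 10) -> Optional[str]:
    rule = propose_mapping_rule(attacker, goal)

    # Stage 1: pre-optimize the mapping rule against a judge model,
    # so no queries are spent on the target yet (the efficiency claim).
    for _ in range(3):
        critique = judge(f"Is this rule clear and hard to detect? {rule}")
        if "ok" in critique.lower():
            break
        rule = refine_rule(attacker, rule, critique)

    # Stage 2: query the target with the encoded prompt. A chain-of-thought
    # preamble helps the target decode the wordplay without misinterpretation.
    for _ in range(max_queries):
        prompt = (f"Apply this rule step by step before answering: {rule}\n"
                  f"Encoded request: {goal}")
        reply = target(prompt)
        if judge(f"Did this reply comply? Answer yes or no. {reply}") \
                .strip().lower().startswith("yes"):
            return reply
        rule = refine_rule(attacker, rule, "Target refused; adjust the rule.")
    return None
```

In this reading, Stage 1 is entirely target-free, which is what keeps the query budget on the target model under the roughly ten queries reported in the abstract.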