AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs

Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, Hang Su

Abstract
Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.
Anthology ID:
2025.findings-naacl.378
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6777–6798
URL:
https://aclanthology.org/2025.findings-naacl.378/
Cite (ACL):
Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, and Hang Su. 2025. AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6777–6798, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs (Chen et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-naacl.378.pdf