Experience-Driven Multi-Agent Optimization for Black-Box Jailbreak Attacks on Large Language Models

Zhaoyang Han; Yihe Liu; Kai Zhang; Ping Li

Abstract

The rapid discovery of jailbreak prompts has revealed the alarming fragility of safety alignment in frontier large language models (LLMs). While jailbreak techniques play a critical role in red-teaming and safety evaluation, existing methods exhibit three key limitations: (i) poor transferability across model families, which forces model-specific manual tuning; (ii) heavy reliance on large-scale prompt enumeration or exhaustive search, which causes prohibitive query costs and poor scalability; and (iii) high sensitivity to input preprocessing and refusal-oriented fine-tuning, so that attacks fail once the underlying model is updated. To address these limitations, we propose Experience-driven Multi-agent Jailbreak Optimization (EMJO), which couples three collaborating agents (Attacker, Analyzer, and Judge) into a closed-loop “probe–evaluate–revise” process, together with a dynamic experience bank that accumulates high-quality successful prompts and reusable strategy patterns across iterations and tasks. This design enables query-efficient and transferable jailbreak optimization under black-box access. Extensive experiments on diverse LLMs demonstrate that EMJO consistently outperforms existing black-box jailbreak baselines, achieving up to an 11% absolute improvement in attack success rate while reducing the average query cost by up to 7.9× across two benchmark datasets. These results indicate that EMJO offers an effective and scalable paradigm for systematic jailbreak discovery.

Anthology ID: 2026.findings-acl.1188
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23729–23747
URL: https://aclanthology.org/2026.findings-acl.1188/
Cite (ACL): Zhaoyang Han, Yihe Liu, Kai Zhang, and Ping Li. 2026. Experience-Driven Multi-Agent Optimization for Black-Box Jailbreak Attacks on Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23729–23747, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Experience-Driven Multi-Agent Optimization for Black-Box Jailbreak Attacks on Large Language Models (Han et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1188.pdf
Checklist: 2026.findings-acl.1188.checklist.pdf
