Jailbreaking? One Step Is Enough!

Weixiong Zheng; Peijian Zeng; Yiwei Li; Hongyan Wu; Nankai Lin (林楠铠); Junhao Chen; Aimin Yang (阳爱民); Yongmei Zhou (周咏梅)

doi:10.18653/v1/2025.acl-long.570

Abstract

Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model’s defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and for redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as a “defense” intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers that it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model’s confidence and guidance in “defensive” intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without the need to redesign attack strategies for different models, achieves a successful jailbreak in one iteration, and outperforms existing methods on both open-source and closed-source models.

Anthology ID: 2025.acl-long.570
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 11623–11642
URL: https://aclanthology.org/2025.acl-long.570/
DOI: 10.18653/v1/2025.acl-long.570
Cite (ACL): Weixiong Zheng, Peijian Zeng, YiWei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, and Yongmei Zhou. 2025. Jailbreaking? One Step Is Enough!. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11623–11642, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Jailbreaking? One Step Is Enough! (Zheng et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.570.pdf
