Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction

Yuting Huang; Chengyuan Liu; Yifeng Feng; Yiquan Wu; Chao Wu; Fei Wu; Kun Kuang

doi:10.18653/v1/2025.findings-acl.189

Abstract

As Large Language Models (LLMs) are widely applied in various domains, their safety is attracting increasing attention, since their powerful capabilities can be misused. Existing jailbreak methods either create a forced instruction-following scenario or search, manually or automatically, for adversarial prompts with prefix or suffix tokens that achieve a specific representation. However, they suffer from low efficiency and explicit jailbreak patterns, which makes them far from the real deployment of mass attacks on LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the **R**ewrite to **J**ailbreak (R2J) approach, a transferable black-box jailbreak method that attacks LLMs by iteratively exploring their weaknesses and automatically improving the attacking strategy. The jailbreak is more efficient and harder to identify since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak is also transferable to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at [https://github.com/ythuang02/R2J/](https://github.com/ythuang02/R2J/).

Anthology ID: 2025.findings-acl.189
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3669–3690
URL: https://aclanthology.org/2025.findings-acl.189/
DOI: 10.18653/v1/2025.findings-acl.189
Cite (ACL): Yuting Huang, Chengyuan Liu, Yifeng Feng, Yiquan Wu, Chao Wu, Fei Wu, and Kun Kuang. 2025. Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3669–3690, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction (Huang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.189.pdf
