DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

Hao Wang; Hao Li; Junda Zhu; Xinyuan Wang; Chengwei Pan; Minlie Huang; Lei Sha

doi:10.18653/v1/2025.emnlp-main.1128

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Chengwei Pan, Minlie Huang, Lei Sha

Abstract

Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model’s output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.

Anthology ID:: 2025.emnlp-main.1128
Volume:: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 22182–22194
Language:
URL:: https://aclanthology.org/2025.emnlp-main.1128/
DOI:: 10.18653/v1/2025.emnlp-main.1128
Bibkey:
Cite (ACL):: Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Chengwei Pan, Minlie Huang, and Lei Sha. 2025. DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22182–22194, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak (Wang et al., EMNLP 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.emnlp-main.1128.pdf
Checklist:: 2025.emnlp-main.1128.checklist.pdf

PDF Cite Search Checklist Fix data