Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution

Shuangjie Fu; Du Su; Xin Chen; Fei Sun; Huawei Shen (沈华伟); Xueqi Cheng (程学旗)

Abstract

Investigating black-box jailbreak attacks is crucial for revealing the actual security risks faced by operational Large Language Models (LLMs). The primary challenge in black-box jailbreak attacks is the absence of direct optimization signals, such as gradients, to guide the refinement of adversarial prompts. Current mainstream methods such as PAIR and TAP attempt to leverage the model’s textual output as feedback, but they face a critical limitation: when a model consistently generates static refusal responses, the attacker is deprived of any actionable signal for distinguishing better prompts from worse ones. To overcome this bottleneck, and to reveal whether open access to partial logprobs information poses a potential risk, we investigate the LLM output distribution. Our empirical analysis reveals that refusal responses exhibit a highly consistent distributional pattern at the first generated token, suggesting that deviation from this standard pattern can serve as a quantifiable metric for whether the LLM generates a refusal response. Based on this insight, we propose Distribution Jailbreak (DJ), an attack method that selects effective jailbreak templates and then iteratively optimizes adversarial suffixes by maximizing the KL divergence from the standard refusal distribution. Extensive experiments demonstrate that DJ achieves a state-of-the-art Attack Success Rate (ASR). Notably, DJ achieves over 90% ASR on all tested open-source models and over 94% ASR on GPT-4.1. Our code is publicly available at https://github.com/Zed630/DistributionJailbreak.

Anthology ID: 2026.findings-acl.1294
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25969–25979
URL: https://aclanthology.org/2026.findings-acl.1294/
Cite (ACL): Shuangjie Fu, Du Su, Xin Chen, Fei Sun, Huawei Shen, and Xueqi Cheng. 2026. Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25969–25979, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Steering Away from Refusal: A Black-box Jailbreak Method Based on First-Token Distribution (Fu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1294.pdf
Checklist: 2026.findings-acl.1294.checklist.pdf
