Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking

Yuhang Wu, Yu-Jie Xiong, Hao Zhang, Jia-Chen Zhang, Zheng Zhou


Abstract
With the increasingly deep integration of large language models (LLMs) across diverse domains, the effectiveness of their safety mechanisms is encountering severe challenges. Currently, jailbreak attacks based on prompt engineering, which induce models to generate potentially harmful content, have become a major security threat. However, existing methods primarily rely on black-box manipulation of prompt templates, resulting in high costs and poor generalizability. To break through the bottleneck, this study reveals the potential impact of the generation of LLMs on safety for the first time that Defense Threshold Decay (DTD) phenomena: as benign content generation increases, the model’s attention to input instructions progressively diminishes. Building on this insight, we propose the Sugar-Coated Poison (SCP) attack paradigm, using a “semantic reversal” strategy, where benign inputs that are opposite in meaning to malicious intent are crafted to induce the model into a safety response mode. When the defense threshold decays, an adversarial reasoning mechanism easily bypasses safety mechanisms. Experiments show SCP outperforms existing baselines. For defense, we propose Part-of-Speech Defense (POSD), leveraging verb-noun dependencies for syntactic analysis to enhance robustness and security of LLMs. Our code is available at https://anonymous.4open.science/r/SCP-9092.
Anthology ID:
2025.findings-emnlp.512
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9645–9665
Language:
URL:
https://aclanthology.org/2025.findings-emnlp.512/
DOI:
Bibkey:
Cite (ACL):
Yuhang Wu, Yu-Jie Xiong, Hao Zhang, Jia-Chen Zhang, and Zheng Zhou. 2025. Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9645–9665, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking (Wu et al., Findings 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.findings-emnlp.512.pdf
Checklist:
 2025.findings-emnlp.512.checklist.pdf