Intention Analysis Makes LLMs A Good Jailbreak Defender

Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao


Abstract
Aligning large language models (LLMs) with human values, particularly against complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis (IA). IA works by triggering LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, IA is an inference-only method and can thus enhance LLM safety without compromising helpfulness. Extensive experiments on various jailbreak benchmarks across a wide range of LLMs show that IA consistently and significantly reduces the harmfulness of responses (a 48.2% reduction in attack success rate on average). Encouragingly, with our IA, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, IA is robust to errors in the generated intention analyses. Further analyses reveal the underlying principle of IA: suppressing the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
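The two-stage process described in the abstract can be sketched as a simple inference-time prompting loop. The sketch below is an illustration, not the authors' released implementation: the `chat` callable is a hypothetical placeholder for any chat-style LLM interface, and the exact prompt wordings are assumptions paraphrased from the two steps above.

```python
from typing import Callable, List, Dict

# Hypothetical chat interface: takes a message list, returns the model's reply.
# Any OpenAI-style chat API or local chat model can be plugged in here.
ChatFn = Callable[[List[Dict[str, str]]], str]

# Stage instructions paraphrased from the paper's two-step description;
# the exact wording here is an assumption, not the authors' prompt.
IA_STAGE1 = (
    "Please identify the essential intention behind the following user "
    "query. Do not answer the query itself; only analyze its intention:\n\n"
)
IA_STAGE2 = (
    "Given the intention you just analyzed, now respond to the original "
    "query with a policy-aligned answer: refuse if the intention is "
    "harmful, otherwise answer helpfully."
)

def intention_analysis_defense(chat: ChatFn, user_query: str) -> str:
    """Two-stage Intention Analysis (IA) at inference time.

    Stage 1 asks the model to analyze the intention of the (possibly
    jailbreak-wrapped) query; Stage 2 asks for a final response
    conditioned on that first-round conversation.
    """
    # Stage 1: intention analysis of the raw user input.
    messages = [{"role": "user", "content": IA_STAGE1 + user_query}]
    intention = chat(messages)

    # Stage 2: final response, with the stage-1 turn kept in context.
    messages += [
        {"role": "assistant", "content": intention},
        {"role": "user", "content": IA_STAGE2},
    ]
    return chat(messages)
```

Because the method only rearranges the conversation at inference time, it requires no fine-tuning of the underlying model, which is what the abstract means by "inference-only."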
Anthology ID:
2025.coling-main.199
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
2947–2968
URL:
https://aclanthology.org/2025.coling-main.199/
Cite (ACL):
Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. 2025. Intention Analysis Makes LLMs A Good Jailbreak Defender. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2947–2968, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Intention Analysis Makes LLMs A Good Jailbreak Defender (Zhang et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.199.pdf