Unraveling the Mystery: Defending Against Jailbreak Attacks Via Unearthing Real Intention

Yanhao Li, Hongshen Chen, Heng Zhang, Zhiwei Ge, Tianhao Li, Sulong Xu, Guibo Luo


Abstract
As Large Language Models (LLMs) become more advanced, the security risks they pose also increase. Ensuring that LLM behavior aligns with human values, particularly in mitigating jailbreak attacks with elusive and implicit intentions, has become a significant challenge. To address this issue, we propose a jailbreak defense method called Real Intentions Defense (RID), which involves two phases: soft extraction and hard deletion. In the soft extraction phase, LLMs are leveraged to extract unbiased, genuine intentions, while in the hard deletion phase, a greedy gradient-based algorithm is used to remove the least important parts of a sentence, based on the insight that words with smaller gradients have less impact on its meaning. We conduct extensive experiments on Vicuna and Llama2 models using eight state-of-the-art jailbreak attacks and six benchmark datasets. Our results show a significant reduction in both Attack Success Rate (ASR) and Harmful Score of jailbreak attacks, while maintaining overall model performance. Further analysis sheds light on the underlying mechanisms of our approach.
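The hard deletion idea summarized in the abstract (greedily removing the words that matter least) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `greedy_delete` and `toy_score_fn`, the `n_keep` stopping rule, and the numeric scores are all hypothetical. In particular, the scores here are a hand-written stand-in for the per-word gradient magnitudes that the paper would compute from an LLM.

```python
from typing import Callable, List, Sequence

# A scoring function maps the current token list to one importance score per token.
# In the paper's setting this would come from model gradients; here it is an assumption.
ScoreFn = Callable[[List[str]], List[float]]


def greedy_delete(tokens: Sequence[str], score_fn: ScoreFn, n_keep: int) -> List[str]:
    """Repeatedly drop the lowest-scoring token until at most n_keep tokens remain."""
    if n_keep < 0:
        raise ValueError("n_keep must be non-negative")
    kept = list(tokens)
    while len(kept) > n_keep:
        # Re-score after every deletion, since importance may depend on context.
        scores = score_fn(kept)
        if len(scores) != len(kept):
            raise ValueError("score_fn must return one score per token")
        # Pick the index of the smallest score (ties resolve to the earliest token).
        victim = min(range(len(kept)), key=scores.__getitem__)
        del kept[victim]
    return kept


# Hypothetical importance values standing in for gradient magnitudes.
TOY_SCORES = {
    "please": 0.10, "kindly": 0.05, "explain": 0.60,
    "how": 0.30, "to": 0.20, "bake": 0.90, "bread": 0.80,
}


def toy_score_fn(tokens: List[str]) -> List[float]:
    """Look up a fixed toy score per token; unknown tokens score 0.0."""
    return [TOY_SCORES.get(tok, 0.0) for tok in tokens]


if __name__ == "__main__":
    sentence = ["please", "kindly", "explain", "how", "to", "bake", "bread"]
    print(greedy_delete(sentence, toy_score_fn, n_keep=3))
```

Passing the scorer as a callable keeps the greedy loop independent of how importance is measured, so a gradient-based scorer could be swapped in without changing the deletion logic.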
Anthology ID:
2025.coling-main.560
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
8374–8384
URL:
https://aclanthology.org/2025.coling-main.560/
Cite (ACL):
Yanhao Li, Hongshen Chen, Heng Zhang, Zhiwei Ge, Tianhao Li, Sulong Xu, and Guibo Luo. 2025. Unraveling the Mystery: Defending Against Jailbreak Attacks Via Unearthing Real Intention. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8374–8384, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Unraveling the Mystery: Defending Against Jailbreak Attacks Via Unearthing Real Intention (Li et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.560.pdf