Guibo Luo


2025

Unraveling the Mystery: Defending Against Jailbreak Attacks Via Unearthing Real Intention
Yanhao Li | Hongshen Chen | Heng Zhang | Zhiwei Ge | Tianhao Li | Sulong Xu | Guibo Luo
Proceedings of the 31st International Conference on Computational Linguistics

As Large Language Models (LLMs) become more advanced, the security risks they pose also increase. Ensuring that LLM behavior aligns with human values, particularly when defending against jailbreak attacks whose intentions are elusive and implicit, has become a significant challenge. To address this issue, we propose a jailbreak defense method called Real Intentions Defense (RID), which involves two phases: soft extraction and hard deletion. In the soft extraction phase, LLMs are leveraged to extract unbiased, genuine intentions, while in the hard deletion phase, a greedy gradient-based algorithm removes the least important parts of a sentence, based on the insight that words with smaller gradients contribute less to the sentence's meaning. We conduct extensive experiments on Vicuna and Llama2 models using eight state-of-the-art jailbreak attacks and six benchmark datasets. Our results show a significant reduction in both the Attack Success Rate (ASR) and Harmful Score of jailbreak attacks, while overall model performance is maintained. Further analysis sheds light on the underlying mechanisms of our approach.
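
To make the hard-deletion idea concrete, here is a minimal sketch of how gradient-based token importance could be scored and low-gradient tokens dropped. This is an illustration inferred from the abstract alone, not the paper's implementation: the HuggingFace causal LM, the function names, the language-modeling loss as the gradient signal, and the single-pass deletion (rather than a fully iterative greedy loop) are all assumptions.

```python
# Sketch (assumption): score each token by the L2 norm of the loss gradient
# w.r.t. its input embedding, then drop the lowest-scoring tokens, in the
# spirit of the "hard deletion" phase described in the abstract.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_gradient_scores(model, tokenizer, text):
    """Return one importance score per token: the gradient norm of the
    language-modeling loss with respect to that token's embedding."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    # The LM loss on the text itself serves as the (assumed) gradient signal.
    out = model(inputs_embeds=embeds, labels=ids)
    out.loss.backward()
    return embeds.grad.norm(dim=-1).squeeze(0)  # shape: [seq_len]

def hard_delete(model, tokenizer, text, fraction=0.2):
    """Drop the `fraction` of tokens with the smallest gradient scores
    (i.e., the least important words), keeping the rest in order."""
    ids = tokenizer(text, return_tensors="pt").input_ids.squeeze(0)
    scores = token_gradient_scores(model, tokenizer, text)
    k = int(len(ids) * fraction)
    drop = set(scores.argsort()[:k].tolist())  # smallest-gradient positions
    kept = [t for i, t in enumerate(ids.tolist()) if i not in drop]
    return tokenizer.decode(kept, skip_special_tokens=True)

# Hypothetical usage with one of the models named in the abstract:
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
print(hard_delete(model, tokenizer, "Some suspicious prompt to filter.", 0.2))
```

A fully greedy variant would instead remove one token at a time and recompute the gradients after each deletion; the single-pass version above trades that fidelity for one forward/backward pass.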