SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Kaiwen Zhou; Xuandong Zhao; Jayanth Srinivasa; Gaowen Liu; Aosong Feng; Dawn Song; Xin Eric Wang

Abstract

Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements on complex tasks. However, they pose great safety risks when faced with harmful queries and adversarial attacks. While supervised fine-tuning (SFT), the recent mainstream safety effort on LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the ‘key sentence’ that follows the model’s query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model’s internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model’s attention on its query understanding, which contains important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

Anthology ID: 2025.emnlp-main.1291
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 25407–25423
URL: https://aclanthology.org/2025.emnlp-main.1291/
Cite (ACL): Kaiwen Zhou, Xuandong Zhao, Jayanth Srinivasa, Gaowen Liu, Aosong Feng, Dawn Song, and Xin Eric Wang. 2025. SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25407–25423, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning (Zhou et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1291.pdf
Checklist: 2025.emnlp-main.1291.checklist.pdf
