When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao; Chunkang Zhang; Junxiang Wang; Xinyan Guan; Boxi Cao; Yaojie Lu; Hongyu Lin; Xianpei Han; Le Sun

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.

Anthology ID:: 2026.findings-acl.1118
Volume:: Findings of the Association for Computational Linguistics: ACL 2026
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 22274–22302
Language:
URL:: https://aclanthology.org/2026.findings-acl.1118/
DOI:
Bibkey:
Cite (ACL):: Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2026. When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 22274–22302, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models (Mao et al., Findings 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.findings-acl.1118.pdf
Checklist:: 2026.findings-acl.1118.checklist.pdf

PDF Cite Search Checklist Fix data