SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang; Zhangchen Xu; Yuetai Li; Luyao Niu; Zhen Xiang; Bo Li; Bill Yuchen Lin; Radha Poovendran

doi:10.18653/v1/2025.findings-acl.1197

Abstract

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT-style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 13 state-of-the-art LRMs on the StrongReject and WildJailbreak datasets. Our results show that the safety of LRMs has not kept pace with their advances in reasoning. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Anthology ID: 2025.findings-acl.1197
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23303–23320
URL: https://aclanthology.org/2025.findings-acl.1197/
DOI: 10.18653/v1/2025.findings-acl.1197
Cite (ACL): Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. 2025. SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23303–23320, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities (Jiang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.1197.pdf
