SecDecoding: Steerable Decoding for Safer LLM Generation

Jiayou Wang, Rundong Liu, Yue Hu, Huijia Wu, Zhaofeng He


Abstract
Large language models (LLMs) have achieved remarkable performance across diverse tasks, yet ensuring output safety remains a fundamental challenge. Existing defense methods often suffer from limited generalization, high computational overhead, or significant utility degradation. In this work, we present SecDecoding, a lightweight decoding-time defense framework that significantly improves output safety without compromising model helpfulness. SecDecoding leverages a pair of small contrastive models, namely a base model and a safety fine-tuned expert, to estimate token-level safety signals by measuring divergence in their output distributions. These signals dynamically steer the target model’s generation toward safer trajectories, effectively suppressing unsafe content. Experimental results show that SecDecoding achieves near-zero attack success rates against a wide spectrum of advanced jailbreak attacks across multiple LLMs, while maintaining the model’s helpfulness with minimal degradation. Additionally, SecDecoding is a modular and resource-efficient approach that requires only an auxiliary 1-billion-parameter model and is compatible with speculative decoding, offering up to 1.5× inference speedup.
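To make the mechanism described above concrete, the following is a minimal PyTorch-style sketch of contrastive decoding-time steering, under the assumption that the token-level safety signal is the log-probability gap between the small safety-tuned expert and the small base model. All names, the function signature, and the single steering coefficient alpha are hypothetical illustrations; the precise signal and steering schedule used by SecDecoding are defined in the paper itself.

import torch
import torch.nn.functional as F

def steered_next_token_logits(target_logits: torch.Tensor,
                              base_logits: torch.Tensor,
                              expert_logits: torch.Tensor,
                              alpha: float = 1.0) -> torch.Tensor:
    # Token-level safety signal: divergence between the safety fine-tuned
    # expert and the untuned base model over the next-token distribution.
    safety_signal = (F.log_softmax(expert_logits, dim=-1)
                     - F.log_softmax(base_logits, dim=-1))
    # Shift the large target model's logits toward tokens the expert
    # prefers over the base; alpha controls steering strength.
    return target_logits + alpha * safety_signal

# Usage with a hypothetical vocabulary size:
vocab = 32000
target = torch.randn(vocab)   # next-token logits from the large target model
base = torch.randn(vocab)     # logits from the small base model
expert = torch.randn(vocab)   # logits from the small safety expert
next_token = torch.argmax(steered_next_token_logits(target, base, expert))

Because the two auxiliary models are small (the abstract cites a 1-billion-parameter model), this extra forward pass adds little overhead relative to the target model, which is also what makes the approach compatible with speculative decoding.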
Anthology ID:
2025.findings-emnlp.1118
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20504–20521
URL:
https://aclanthology.org/2025.findings-emnlp.1118/
Cite (ACL):
Jiayou Wang, Rundong Liu, Yue Hu, Huijia Wu, and Zhaofeng He. 2025. SecDecoding: Steerable Decoding for Safer LLM Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20504–20521, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SecDecoding: Steerable Decoding for Safer LLM Generation (Wang et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1118.pdf
Checklist:
2025.findings-emnlp.1118.checklist.pdf