@inproceedings{ghosh-etal-2025-simple,
title = "A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in {LLM}s",
author = "Ghosh, Shaona and
Bhattacharjee, Amrita and
Ziser, Yftah and
Parisien, Christopher",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1781/",
pages = "35116--35136",
ISBN = "979-8-89176-332-6",
abstract = "Fine-tuning large language models (LLMs) to meet evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, but its potential for precise, customizable safety adjustments remains underexplored. We propose SafeSteer, a simple and effective method to guide LLM outputs by (i) leveraging category-specific steering vectors for fine-grained control, (ii) applying a gradient-free, unsupervised approach that enhances safety while preserving text quality and topic relevance without forcing explicit refusals, and (iii) eliminating the need for contrastive safe data. Across multiple LLMs, datasets, and risk categories, SafeSteer provides precise control, avoids blanket refusals, and directs models to generate safe, relevant content, aligning with recent findings that simple activation-steering techniques often outperform more complex alternatives."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="ghosh-etal-2025-simple">
<titleInfo>
<title>A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs</title>
</titleInfo>
<name type="personal">
<namePart type="given">Shaona</namePart>
<namePart type="family">Ghosh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amrita</namePart>
<namePart type="family">Bhattacharjee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yftah</namePart>
<namePart type="family">Ziser</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Christopher</namePart>
<namePart type="family">Parisien</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>Fine-tuning large language models (LLMs) to meet evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, but its potential for precise, customizable safety adjustments remains underexplored. We propose SafeSteer, a simple and effective method to guide LLM outputs by (i) leveraging category-specific steering vectors for fine-grained control, (ii) applying a gradient-free, unsupervised approach that enhances safety while preserving text quality and topic relevance without forcing explicit refusals, and (iii) eliminating the need for contrastive safe data. Across multiple LLMs, datasets, and risk categories, SafeSteer provides precise control, avoids blanket refusals, and directs models to generate safe, relevant content, aligning with recent findings that simple activation-steering techniques often outperform more complex alternatives.</abstract>
<identifier type="citekey">ghosh-etal-2025-simple</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.1781/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>35116</start>
<end>35136</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs
%A Ghosh, Shaona
%A Bhattacharjee, Amrita
%A Ziser, Yftah
%A Parisien, Christopher
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F ghosh-etal-2025-simple
%X Fine-tuning large language models (LLMs) to meet evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, but its potential for precise, customizable safety adjustments remains underexplored. We propose SafeSteer, a simple and effective method to guide LLM outputs by (i) leveraging category-specific steering vectors for fine-grained control, (ii) applying a gradient-free, unsupervised approach that enhances safety while preserving text quality and topic relevance without forcing explicit refusals, and (iii) eliminating the need for contrastive safe data. Across multiple LLMs, datasets, and risk categories, SafeSteer provides precise control, avoids blanket refusals, and directs models to generate safe, relevant content, aligning with recent findings that simple activation-steering techniques often outperform more complex alternatives.
%U https://aclanthology.org/2025.emnlp-main.1781/
%P 35116-35136

Markdown (Informal)
[A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs](https://aclanthology.org/2025.emnlp-main.1781/) (Ghosh et al., EMNLP 2025)
ACL:
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, and Christopher Parisien. 2025. A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35116–35136, Suzhou, China. Association for Computational Linguistics.
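
As a companion to the abstract above, here is a minimal sketch of the general inference-time activation-steering technique it refers to. Everything concrete below (the model, the layer index, the steering strength, and the difference-of-means construction of the steering vector) is an illustrative assumption, not SafeSteer's actual method; in particular, the abstract notes SafeSteer eliminates the need for contrastive safe data, whereas this simple stand-in uses a safe/unsafe contrast.

```python
# Illustrative sketch of inference-time activation steering (NOT the
# SafeSteer implementation): add a category-specific steering vector to a
# transformer layer's hidden states via a forward hook during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder model choice
LAYER = 6        # which transformer block to steer (assumed)
ALPHA = 4.0      # steering strength (assumed)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_activation(texts, layer):
    """Average hidden state at `layer` across a set of example prompts."""
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# A per-category steering vector: difference of mean activations between
# safe and unsafe examples for one risk category. This difference-of-means
# construction is the simplest stand-in; the paper's method differs.
steer = (mean_activation(["example of safe text for this category"], LAYER)
         - mean_activation(["example of unsafe text for this category"], LAYER))

def add_steering(module, inputs, output):
    # Shift every token position's hidden state along the steering direction.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * steer
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Tell me about this topic:", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
finally:
    handle.remove()  # always detach the hook to restore unsteered behavior
```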