SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Maithili Joshi; Palash Nandi; Tanmoy Chakraborty

doi:10.18653/v1/2025.emnlp-main.825

Abstract

Large Language Models (LLMs) with safety-alignment training are powerful instruments with robust language comprehension capabilities. Typically, LLMs undergo careful alignment training involving human feedback to ensure that they accept safe inputs while rejecting harmful or unsafe ones. However, these very large models are still vulnerable to jailbreak attacks, in which malicious users attempt to elicit harmful outputs that safety-aligned LLMs are trained to avoid. In this study, we find that the safety mechanisms in LLMs are concentrated predominantly in the middle-to-late layers. Based on this observation, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), that connects two intermediate layers s and e, with s < e, through a residual connection, achieving an improvement of 51% over the best-performing baseline, GCG, on the HarmBench test set. Moreover, the model demonstrates only a marginal shift in perplexity when evaluated on the validation set of HarmBench.
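
To make the phrase "connects two intermediate layers s and e such that s < e with a residual connection" concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. The abstract does not specify where the connection is injected or whether it is scaled, so the choice to add the output of layer s to the input of layer e, the `scale` parameter, and the names `forward`, `add_one`, `double`, and `toy_layers` are all assumptions made for illustration. The "layers" here are simple arithmetic functions standing in for transformer blocks.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward(
    layers: List[Layer],
    x: Vector,
    s: Optional[int] = None,
    e: Optional[int] = None,
    scale: float = 1.0,  # hypothetical scaling factor, not taken from the abstract
) -> Vector:
    """Run x through the layers in order.

    If both s and e are given, the output of layer s is added (times `scale`)
    to the input of layer e. This injection point is an assumption.
    """
    use_skip = s is not None and e is not None
    if use_skip and not (0 <= s < e < len(layers)):
        raise ValueError("expected 0 <= s < e < len(layers)")

    h = list(x)
    saved: Optional[Vector] = None
    for i, layer in enumerate(layers):
        if use_skip and i == e and saved is not None:
            # Extra cross-layer residual: reuse the hidden state saved at layer s.
            h = [a + scale * b for a, b in zip(h, saved)]
        h = layer(h)
        if use_skip and i == s:
            saved = list(h)
    return h


# Toy stand-ins for transformer blocks (illustrative only).
def add_one(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


def double(h: Vector) -> Vector:
    return [v * 2.0 for v in h]


toy_layers: List[Layer] = [add_one, double, add_one, double]

baseline = forward(toy_layers, [1.0])              # [10.0]
with_skip = forward(toy_layers, [1.0], s=0, e=2)   # [14.0]
```

In this toy setting, the extra connection lets the state from layer 0 bypass layer 1 and re-enter at layer 2, which changes the final output from `[10.0]` to `[14.0]`. In a real model the same idea would operate on hidden-state tensors rather than lists of floats.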

Anthology ID: 2025.emnlp-main.825
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 16299–16314
URL: https://aclanthology.org/2025.emnlp-main.825/
DOI: 10.18653/v1/2025.emnlp-main.825
Cite (ACL): Maithili Joshi, Palash Nandi, and Tanmoy Chakraborty. 2025. SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16299–16314, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection (Joshi et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.825.pdf
Checklist: 2025.emnlp-main.825.checklist.pdf
