Peijian Gu

2025

Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization
Peijian Gu | Quan Wang | Zhendong Mao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid advancement of large language models (LLMs) has brought about increased concerns regarding their safety, especially as adversaries develop jailbreak techniques to bypass LLMs’ safety mechanism. Although recent work on safety training with modules such as low-rank adaptation (LoRA) to resist jailbreaks shows promise, these approaches can inadvertently degrade a model’s general utility. In this paper, we propose a novel plug-and-play method that mitigates the impact of safety training on model utility by explicitly locating and leveraging safety-critical singular vectors, which only contribute to safety, within the model’s parameter space. We quantify the safety-criticality of each singular vector as the difference of their importance for safety and utility measured by a corresponding low-rank projection. The top scored singular vectors are located as safety-critical and are used to initialize the LoRA modules within existing safety training methods in a plug-and-play manner, thereby constraining the training updates within safety-critical parameters. Additionally, we propose a dynamic rank number determination strategy to further reduce parameter overhead. Experiments on HarmBench with multiple jailbreak methods validate the effectiveness of our approach in safety training, while evaluations on several utility benchmarks demonstrate that our method successfully mitigates the adverse impact of safety training on model utility, enhancing the utility performance of the evaluated safety training baselines.

2023

pdf bib abs

Instance attribution (IA) aims to identify the training instances leading to the prediction of a test example, helping researchers understand the dataset better and optimize data processing. While many IA methods have been proposed recently, how to evaluate them still remains open. Previous evaluations of IA only focus on one or two dimensions and are not comprehensive. In this work, we introduce IAEval for IA methods, a systematic and comprehensive evaluation scheme covering four significant requirements: sufficiency, completeness, stability and plausibility. We elaborately design novel metrics to measure these requirements for the first time. Three representative IA methods are evaluated under IAEval on four natural language understanding datasets. Extensive experiments confirmed the effectiveness of IAEval and exhibited its ability to provide comprehensive comparison among IA methods. With IAEval, researchers can choose the most suitable IA methods for applications like model debugging.

Co-authors

Venues

ACL1
Findings1

Fix author