Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

Wang Cai; Yilin Wen; Jinchang Hou; Du Su; Guoqiu Wang; Zhonghou Lv; Chenfu Bao; Yunfang Wu

Abstract

Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of “high-conflict” heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
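The head-level selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define Optimization Conflict or Functional Sensitivity, so the sketch assumes, purely for illustration, that conflict is the negative cosine similarity between a head's safety and utility gradients, that sensitivity is the norm of the utility gradient, and that the two are combined by multiplication. The names `conflict_scores`, `trainable_heads`, and `num_skip` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def conflict_scores(safety_grads, utility_grads):
    """Per-head score = gradient opposition * utility gradient norm.

    Both factors are illustrative stand-ins (assumptions), not the
    paper's definitions of conflict and sensitivity.
    """
    scores = {}
    for head, g_s in safety_grads.items():
        g_u = utility_grads[head]
        opposition = max(0.0, -cosine(g_s, g_u))  # positive only when gradients oppose
        sensitivity = math.sqrt(sum(x * x for x in g_u))
        scores[head] = opposition * sensitivity
    return scores

def trainable_heads(scores, num_skip):
    """Return the heads to update, freezing the num_skip highest-scoring heads."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    frozen = set(ranked[:num_skip])
    return [h for h in scores if h not in frozen]

# Toy per-head gradients, keyed by (layer, head); values are made up.
safety_grads = {(0, 0): [1.0, 0.0], (0, 1): [1.0, 1.0], (1, 0): [0.0, 2.0]}
utility_grads = {(0, 0): [-2.0, 0.0], (0, 1): [1.0, 1.0], (1, 0): [1.0, 0.0]}

scores = conflict_scores(safety_grads, utility_grads)
update = trainable_heads(scores, num_skip=1)
print(update)  # [(0, 1), (1, 0)]
```

In this toy case head (0, 0) has directly opposed gradients, so it is the one frozen; in an actual fine-tuning run the returned list would decide which head parameters receive gradient updates.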

Anthology ID: 2026.acl-long.340
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7478–7496
URL: https://aclanthology.org/2026.acl-long.340/
Cite (ACL): Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, and Yunfang Wu. 2026. Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7478–7496, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis (Cai et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.340.pdf
Checklist: 2026.acl-long.340.checklist.pdf