@inproceedings{thakkar-etal-2025-combining,
title = "Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in {LLM}s",
author = "Thakkar, Megh and
Fournier, Quentin and
Riemer, Matthew and
Chen, Pin-Yu and
Zouaq, Amal and
Das, Payel and
Chandar, Sarath",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-short.22/",
doi = "10.18653/v1/2025.acl-short.22",
pages = "268--277",
ISBN = "979-8-89176-252-7",
abstract = "There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models are not either explicitly trained to be safe, or experience a loss in their safety abilities in the process, making them capable of generating harmful content. We observe that simple interpolation between the domain and alignment delta parameters leads to safer domain-specific models that preserve their utility. Building on this, we introduce MergeAlign, a simple, efficient, and effective model merging-based alignment method. We apply MergeAlign on Llama3 models that are experts in medicine and finance, obtaining substantial safety alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged, as well as the applicability of MergeAlign on more general code and math expert models using the Qwen-2.5 series of models. We hope our findings open new research avenues towards efficient development and deployment of safe expert LLMs."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="thakkar-etal-2025-combining">
<titleInfo>
<title>Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs</title>
</titleInfo>
<name type="personal">
<namePart type="given">Megh</namePart>
<namePart type="family">Thakkar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Quentin</namePart>
<namePart type="family">Fournier</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Matthew</namePart>
<namePart type="family">Riemer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pin-Yu</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amal</namePart>
<namePart type="family">Zouaq</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Payel</namePart>
<namePart type="family">Das</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sarath</namePart>
<namePart type="family">Chandar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-252-7</identifier>
</relatedItem>
<abstract>There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models either are not explicitly trained to be safe or experience a loss in their safety abilities in the process, making them capable of generating harmful content. We observe that simple interpolation between the domain and alignment delta parameters leads to safer domain-specific models that preserve their utility. Building on this, we introduce MergeAlign, a simple, efficient, and effective model merging-based alignment method. We apply MergeAlign on Llama3 models that are experts in medicine and finance, obtaining substantial safety alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and the contributions of the individual models being merged, as well as the applicability of MergeAlign to more general code and math expert models using the Qwen-2.5 series of models. We hope our findings open new research avenues towards efficient development and deployment of safe expert LLMs.</abstract>
<identifier type="citekey">thakkar-etal-2025-combining</identifier>
<identifier type="doi">10.18653/v1/2025.acl-short.22</identifier>
<location>
<url>https://aclanthology.org/2025.acl-short.22/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>268</start>
<end>277</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs
%A Thakkar, Megh
%A Fournier, Quentin
%A Riemer, Matthew
%A Chen, Pin-Yu
%A Zouaq, Amal
%A Das, Payel
%A Chandar, Sarath
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-252-7
%F thakkar-etal-2025-combining
%X There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models either are not explicitly trained to be safe or experience a loss in their safety abilities in the process, making them capable of generating harmful content. We observe that simple interpolation between the domain and alignment delta parameters leads to safer domain-specific models that preserve their utility. Building on this, we introduce MergeAlign, a simple, efficient, and effective model merging-based alignment method. We apply MergeAlign on Llama3 models that are experts in medicine and finance, obtaining substantial safety alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and the contributions of the individual models being merged, as well as the applicability of MergeAlign to more general code and math expert models using the Qwen-2.5 series of models. We hope our findings open new research avenues towards efficient development and deployment of safe expert LLMs.
%R 10.18653/v1/2025.acl-short.22
%U https://aclanthology.org/2025.acl-short.22/
%U https://doi.org/10.18653/v1/2025.acl-short.22
%P 268-277
Markdown (Informal)
[Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs](https://aclanthology.org/2025.acl-short.22/) (Thakkar et al., ACL 2025)
ACL
Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, and Sarath Chandar. 2025. Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 268–277, Vienna, Austria. Association for Computational Linguistics.