Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Hua Farn, Hsuan Su, Shachi H. Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee


Abstract
Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety datasets are generally inaccessible, making it difficult to fully recover the model’s original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method’s practicality and effectiveness.
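The abstract does not spell out the exact merging rule. A common instantiation of pre-/post-tuning weight merging is a simple linear interpolation of parameters; the sketch below illustrates that under this assumption, with placeholder model paths and an illustrative coefficient alpha. It is a minimal sketch, not necessarily the paper's exact procedure.

    # Minimal sketch: merge a safety-aligned base model with its fine-tuned
    # counterpart by linear interpolation of weights (assumed scheme; model
    # paths and alpha below are illustrative placeholders).
    import torch
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("path/to/aligned-base-model")      # pre-fine-tuning, safety-aligned
    tuned = AutoModelForCausalLM.from_pretrained("path/to/finetuned-task-model")   # post-fine-tuning, task-adapted

    alpha = 0.5  # interpolation weight: higher alpha favors the task-tuned model

    merged_state = {}
    with torch.no_grad():
        base_state = base.state_dict()
        tuned_state = tuned.state_dict()
        for name, base_param in base_state.items():
            # Interpolate each tensor: alpha * fine-tuned + (1 - alpha) * original
            merged_state[name] = alpha * tuned_state[name] + (1.0 - alpha) * base_param

    base.load_state_dict(merged_state)
    base.save_pretrained("path/to/merged-model")

The merged checkpoint can then be evaluated on both the downstream task and a safety benchmark to pick a suitable alpha.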
Anthology ID:
2025.findings-emnlp.901
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16589–16602
URL:
https://aclanthology.org/2025.findings-emnlp.901/
Cite (ACL):
Hua Farn, Hsuan Su, Shachi H. Kumar, Saurav Sahay, Shang-Tse Chen, and Hung-yi Lee. 2025. Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16589–16602, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging (Farn et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.901.pdf
Checklist:
https://aclanthology.org/2025.findings-emnlp.901.checklist.pdf