Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki


Abstract
Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red-teaming and safety-alignment efforts show that fine-tuning models even on benign (non-harmful) data can compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including the fine-tuning task, model calibrations, etc. This paper explores task-wise safety degradation caused by fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) fine-tuning LLMs for code generation and translation leads to the greatest degradation in safety guardrails; 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories; 3) current solutions, including guards and safety-tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset that effectively reduces attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
Anthology ID:
2025.coling-main.606
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
9025–9043
URL:
https://aclanthology.org/2025.coling-main.606/
Cite (ACL):
Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, and Yasir Zaki. 2025. Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9025–9043, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Multitask-Bench: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning (Jan et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.606.pdf