Making Harmful Behaviors Unlearnable for Large Language Models

Xin Zhou, Yi Lu, Ruotian Ma, Yujian Wei, Tao Gui, Qi Zhang, Xuanjing Huang


Abstract
Large language models (LLMs) have shown great potential to empower various domains and are often customized by fine-tuning for the requirements of different applications. However, the powerful learning ability of LLMs not only enables them to learn new tasks but also makes them vulnerable to learning undesired behaviors, such as harmfulness and hallucination, because the fine-tuning data often implicitly or explicitly contains such content. Can we fine-tune LLMs on harmful data without learning harmful behaviors? This paper proposes a controllable training framework that makes undesired behaviors unlearnable during the fine-tuning process. Specifically, we introduce security vectors that control the model’s behavior so that it is consistent with the undesired behavior. The security vectors are activated during fine-tuning: because the model’s behavior already matches the undesired behavior, the model treats that behavior as already learned and requires no further optimization on it, while data inconsistent with this behavior can still be learned. After fine-tuning, the security vectors are deactivated to restore the LLM’s normal behavior. Our experiments show that security vectors can prevent an LLM from learning harmful and hallucination behaviors while preserving its ability to learn other information.
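
The abstract does not specify implementation details, but the mechanism can be illustrated with a minimal sketch. The code below is not the authors' code: the forward-hook design, the single-layer placement, and the HuggingFace-style layer path in the usage comments are all assumptions. It models a security vector as a frozen steering vector added to one decoder layer's hidden states, switched on during fine-tuning so that harmful examples already appear "learned" and yield little gradient, and switched off afterwards to restore normal behavior.

```python
# Minimal sketch, assuming a PyTorch causal LM whose decoder blocks return a tuple
# with hidden states as the first element. The vector is assumed to have been
# trained beforehand to elicit the undesired behavior and is frozen here.
import torch
import torch.nn as nn


class SecurityVector(nn.Module):
    """Adds a frozen steering vector to a layer's output when active."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Frozen during fine-tuning; only the base model's trainable weights update.
        self.vector = nn.Parameter(torch.zeros(hidden_size), requires_grad=False)
        self.active = False

    def hook(self, module, inputs, output):
        # Assumption: hidden states are output[0] when the block returns a tuple.
        hidden = output[0] if isinstance(output, tuple) else output
        if self.active:
            hidden = hidden + self.vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


# Hypothetical usage (layer path and index are illustrative, not from the paper):
# sv = SecurityVector(model.config.hidden_size)
# handle = model.model.layers[-1].register_forward_hook(sv.hook)
# sv.active = True    # during fine-tuning: undesired behavior looks already learned
# ... run the standard fine-tuning loop on the mixed data ...
# sv.active = False   # at inference: the security vector is deactivated
```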
Anthology ID:
2024.findings-acl.611
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10258–10273
URL:
https://aclanthology.org/2024.findings-acl.611
Cite (ACL):
Xin Zhou, Yi Lu, Ruotian Ma, Yujian Wei, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Making Harmful Behaviors Unlearnable for Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 10258–10273, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Making Harmful Behaviors Unlearnable for Large Language Models (Zhou et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.611.pdf