Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Yuanpu Cao, Bochuan Cao, Jinghui Chen


Abstract
Recent developments in Large Language Models (LLMs) have brought significant advancements. To guard against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness: after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release or use; and (2) non-persistence: the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again on aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding of the relationship between backdoor persistence and activation patterns, and further offer guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass safety evaluation while maintaining strong persistence against re-alignment defense.
Anthology ID:
2024.naacl-long.276
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4920–4935
URL:
https://aclanthology.org/2024.naacl-long.276
DOI:
10.18653/v1/2024.naacl-long.276
Cite (ACL):
Yuanpu Cao, Bochuan Cao, and Jinghui Chen. 2024. Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4920–4935, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections (Cao et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.276.pdf