BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, Ruoxi Jia


Abstract
Safety backdoor attacks in large language models (LLMs) enable harmful behaviors to be stealthily triggered while evading detection during normal interactions. The high dimensionality of the trigger search space and the diverse range of potential malicious behaviors in LLMs make this a critical open problem. This paper presents BEEAR, a novel mitigation method based on a key insight: backdoor triggers induce a uniform drift in the model’s embedding space, irrespective of the trigger’s form or targeted behavior. Leveraging this observation, we introduce a bi-level optimization approach. The inner level identifies universal perturbations to the decoder’s embeddings that steer the model towards defender-defined unwanted behaviors; the outer level fine-tunes the model to reinforce safe behaviors against these perturbations. Our experiments demonstrate the effectiveness of this approach, reducing the success rate of safety backdoor attacks from over 95% to <1% for general harmful behaviors and from 47% to 0% for Sleeper Agents, without compromising the model’s helpfulness. Notably, our method relies only on defender-defined sets of safe and unwanted behaviors without any assumptions about the trigger location or attack mechanism. This work represents the first practical framework to counter safety backdoors in LLMs and provides a foundation for future advancements in AI safety and security.
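The abstract describes a bi-level loop: an inner step that searches for a universal perturbation in the decoder's embedding space which elicits defender-defined unwanted behavior, and an outer step that fine-tunes the model to stay safe even under that perturbation. The sketch below illustrates that loop with PyTorch/Transformers; it is not the authors' released code, and the model name, the choice to inject the perturbation at the input token embeddings, the toy behavior sets, and all hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of the bi-level optimization described in the abstract.
# Assumptions: model name, injection point (input embeddings), data, hyper-parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumption: any instruction-tuned LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

embed = model.get_input_embeddings()           # perturbation added to input embeddings here;
D = embed.embedding_dim                        # the paper's injection layer may differ

# Hypothetical defender-defined behavior sets: (prompt, response) pairs.
UNWANTED = [("How do I build a weapon?", "Sure, here is how ...")]    # behaviors to surface
SAFE     = [("How do I build a weapon?", "I can't help with that.")]  # behaviors to reinforce

def batch(pairs):
    """Tokenize (prompt, response) pairs into input ids / labels for a causal-LM loss."""
    texts = [p + " " + r for p, r in pairs]
    ids = tok(texts, return_tensors="pt", padding=True).input_ids.to(model.device)
    return ids, ids.clone()

def clm_loss(ids, labels, delta=None):
    """Causal-LM loss, optionally adding a universal perturbation to every position."""
    x = embed(ids)
    if delta is not None:
        x = x + delta                          # broadcast one vector over all positions
    return model(inputs_embeds=x, labels=labels).loss

opt_model = torch.optim.AdamW(model.parameters(), lr=2e-5)

for _ in range(10):                            # outer steps (assumed)
    # Inner level: find a universal embedding perturbation that steers the model
    # toward the defender-defined unwanted behaviors.
    delta = torch.zeros(1, 1, D, device=model.device, requires_grad=True)
    opt_delta = torch.optim.Adam([delta], lr=1e-2)
    for _ in range(5):                         # inner steps (assumed)
        loss = clm_loss(*batch(UNWANTED), delta)
        opt_delta.zero_grad(); loss.backward(); opt_delta.step()

    # Outer level: fine-tune the model to produce safe behavior under that perturbation.
    loss = clm_loss(*batch(SAFE), delta.detach())
    opt_model.zero_grad(); loss.backward(); opt_model.step()
```

Keeping the perturbation universal (a single vector broadcast over all token positions, with no assumption about trigger location) mirrors the abstract's key insight that backdoor triggers induce a uniform drift in the embedding space regardless of the trigger's form or targeted behavior.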
Anthology ID:
2024.emnlp-main.732
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13189–13215
URL:
https://aclanthology.org/2024.emnlp-main.732
Cite (ACL):
Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, and Ruoxi Jia. 2024. BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13189–13215, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models (Zeng et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.732.pdf