Yi Zeng
2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng
|
Weiyu Sun
|
Tran Huynh
|
Dawn Song
|
Bo Li
|
Ruoxi Jia
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Safety backdoor attacks in large language models (LLMs) enable harmful behaviors to be stealthily triggered while evading detection during normal interactions. The high dimensionality of the trigger search space and the diverse range of potential malicious behaviors in LLMs make this a critical open problem. This paper presents BEEAR, a novel mitigation method based on a key insight: backdoor triggers induce a uniform drift in the model’s embedding space, irrespective of the trigger’s form or targeted behavior. Leveraging this observation, we introduce a bi-level optimization approach. The inner level identifies universal perturbations to the decoder’s embeddings that steer the model towards defender-defined unwanted behaviors; the outer level fine-tunes the model to reinforce safe behaviors against these perturbations. Our experiments demonstrate the effectiveness of this approach, reducing the success rate of safety backdoor attacks from over 95% to <1% for general harmful behaviors and from 47% to 0% for Sleeper Agents, without compromising the model’s helpfulness. Notably, our method relies only on defender-defined sets of safe and unwanted behaviors without any assumptions about the trigger location or attack mechanism. This work represents the first practical framework to counter safety backdoors in LLMs and provides a foundation for future advancements in AI safety and security.
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng
|
Hongpeng Lin
|
Jingwen Zhang
|
Diyi Yang
|
Ruoxi Jia
|
Weiyan Shi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most traditional AI safety research views models as machines and centers on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. Observing this, we shift the perspective, by treating LLMs as human-like communicators to examine the interplay between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak risk across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama-2-7b-Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental solutions for AI safety.