BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Authors: Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, Ruoxi Jia
Published: 2024-11
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics
Location: Miami, Florida, USA
Type: conference publication
Citation key: zeng-etal-2024-beear
DOI: 10.18653/v1/2024.emnlp-main.732
URL: https://aclanthology.org/2024.emnlp-main.732/
Pages: 13189-13215