Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Authors: Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Date: 2024-08
Venue: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics
Location: Bangkok, Thailand
Type: conference publication
Citation key: zhou-etal-2024-emulated
DOI: 10.18653/v1/2024.acl-long.842
URL: https://aclanthology.org/2024.acl-long.842/
Pages: 15810-15830