Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield

Authors: Jinhwa Kim, Ali Derakhshan, Ian Harris
Published in: Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Editors: Yi-Ling Chung, Zeerak Talat, Debora Nozza, Flor Miriam Plaza-del-Arco, Paul Röttger, Aida Mostafazadeh Davani, Agostina Calabrese
Publisher: Association for Computational Linguistics
Location: Mexico City, Mexico
Date: 2024-06
Type: conference publication
Pages: 159-170
Citation key: kim-etal-2024-robust
DOI: 10.18653/v1/2024.woah-1.12
URL: https://aclanthology.org/2024.woah-1.12/