Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

Authors: Adib Hasan, Ileana Rugina, Alex Wang
Date: 2024-11
Type: conference publication
Venue: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Editors: Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Publisher: Association for Computational Linguistics
Location: Miami, Florida, US
Pages: 417-430
Citation key: hasan-etal-2024-pruning
DOI: 10.18653/v1/2024.blackboxnlp-1.26
URL: https://aclanthology.org/2024.blackboxnlp-1.26/