GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation

Govind Ramesh; Yao Dou; Wei Xu

doi:10.18653/v1/2024.emnlp-main.1235

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation

Abstract

Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that even well-aligned LLMs obey adversarial instructions. IRIS then rates and enhances the output given the refined prompt to increase its harmfulness. We find that IRIS achieves jailbreak success rates of 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B in under 7 queries. It significantly outperforms prior approaches in automatic, black-box, and interpretable jailbreaking, while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.

Anthology ID:: 2024.emnlp-main.1235
Volume:: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:: November
Year:: 2024
Address:: Miami, Florida, USA
Editors:: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 22139–22148
Language:
URL:: https://aclanthology.org/2024.emnlp-main.1235/
DOI:: 10.18653/v1/2024.emnlp-main.1235
Bibkey:
Cite (ACL):: Govind Ramesh, Yao Dou, and Wei Xu. 2024. GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22139–22148, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):: GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (Ramesh et al., EMNLP 2024)
Copy Citation:
PDF:: https://aclanthology.org/2024.emnlp-main.1235.pdf

PDF Cite Search Fix data