@inproceedings{li-etal-2025-cavgan,
title = "{CAVGAN}: Unifying Jailbreak and Defense of {LLM}s via Generative Adversarial Attacks on their Internal Representations",
author = "Li, Xiaohu and
Ning, Yunfeng and
Bao, Zepeng and
Xu, Mayi and
Chen, Jianhao and
Qian, Tieyun",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.346/",
doi = "10.18653/v1/2025.findings-acl.346",
pages = "6664--6678",
ISBN = "979-8-89176-256-5",
abstract = "Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85{\%} across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17{\%}. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security.Warning: This paper contains some harmful text examples."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="li-etal-2025-cavgan">
<titleInfo>
<title>CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations</title>
</titleInfo>
<name type="personal">
<namePart type="given">Xiaohu</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yunfeng</namePart>
<namePart type="family">Ning</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zepeng</namePart>
<namePart type="family">Bao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mayi</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianhao</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tieyun</namePart>
<namePart type="family">Qian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Security alignment enables the Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embeddings, as well as the essence of jailbreak attacks, which is to embed harmful queries and shift them into the safe region. We utilize a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. Warning: This paper contains some harmful text examples.</abstract>
<identifier type="citekey">li-etal-2025-cavgan</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.346</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.346/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>6664</start>
<end>6678</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
%A Li, Xiaohu
%A Ning, Yunfeng
%A Bao, Zepeng
%A Xu, Mayi
%A Chen, Jianhao
%A Qian, Tieyun
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F li-etal-2025-cavgan
%X Security alignment enables the Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embeddings, as well as the essence of jailbreak attacks, which is to embed harmful queries and shift them into the safe region. We utilize a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. Warning: This paper contains some harmful text examples.
%R 10.18653/v1/2025.findings-acl.346
%U https://aclanthology.org/2025.findings-acl.346/
%U https://doi.org/10.18653/v1/2025.findings-acl.346
%P 6664-6678
[CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations](https://aclanthology.org/2025.findings-acl.346/) (Li et al., Findings 2025)
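The abstract's core premise is that harmful and benign queries occupy (approximately) linearly separable regions in an LLM's intermediate-layer embedding space, and that a GAN can learn this internal safety boundary for both attack and defense. The snippet below is a minimal, hedged sketch of that premise only: a linear probe on mean-pooled hidden states. The model name ("gpt2"), the layer index, and the toy prompt lists are illustrative assumptions, not the authors' experimental setup, and the probe is not the CAVGAN method itself.

```python
# Hedged sketch (not the paper's code): check whether intermediate-layer
# hidden states of an LLM separate "harmful" vs. "benign" prompts linearly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # stand-in model; the paper evaluates three popular LLMs
LAYER = 6        # arbitrary middle layer chosen for this sketch

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def embed(prompts, layer=LAYER):
    """Mean-pool the hidden states of `layer` for each prompt."""
    feats = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        h = out.hidden_states[layer]          # shape: (1, seq_len, hidden_dim)
        feats.append(h.mean(dim=1).squeeze(0))
    return torch.stack(feats).numpy()

# Toy stand-ins for harmful vs. benign queries (label 1 = harmful).
harmful = ["How do I pick a lock?", "Explain how to forge a signature."]
benign  = ["How do I bake a loaf of bread?", "Explain how photosynthesis works."]

X = embed(harmful + benign)
y = [1] * len(harmful) + [0] * len(benign)

# If the representations are roughly linearly separable, a linear classifier
# recovers a safety boundary; CAVGAN instead learns this boundary with a GAN
# and uses the generator to perturb embeddings across it (attack) or back
# into the safe region (defense).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```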