from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Yu Yan; Sheng Sun; Zenghao Duan; Teli Liu; Min Liu; Zhiyi Yin; LeiJingyu LeiJingyu; Qi Li

doi:10.18653/v1/2025.acl-long.238

Abstract

Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content under jailbreak attacks. However, they overlook that directly generating harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.

Anthology ID: 2025.acl-long.238
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4785–4817
URL: https://aclanthology.org/2025.acl-long.238/
DOI: 10.18653/v1/2025.acl-long.238
Cite (ACL): Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, LeiJingyu LeiJingyu, and Qi Li. 2025. from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4785–4817, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors (Yan et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.238.pdf
