Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance

Ao Wang; Xinghao Yang; Yongshun Gong; Wei Liu; Bao-di Liu; Weifeng Liu

Abstract

Jailbreak attacks serve as a pivotal technique for evaluating the safety alignment of large language models. Current token-level attacks have shown remarkable efficacy on open-source models by leveraging gradient-based optimization. However, these attacks suffer from poor cross-model transferability, which severely limits their utility on proprietary models. To address this limitation, we propose Reparameterization Invariance Gradient-based Jailbreak (RIGJ), a natural-gradient-based framework designed to improve cross-model transferability. Unlike prior token-level methods, whose optimization paths are constrained by model-specific Euclidean geometry, RIGJ defines update directions according to differences in output distributions rather than parameter-space distances. Since language models are trained to capture similar dependency structures of natural language, their output distributions share a common geometry across architectures, yielding intrinsically model-agnostic optimization trajectories and substantially stronger jailbreak transferability. Extensive experiments demonstrate superior performance, increasing the cross-model Attack Success Rate and Average Harmfulness Score by 14.9 and 1.23, respectively. Our code is available at https://github.com/nohuma/AISafety_transfer_jailbreak_RIGJ_2026.

Anthology ID: 2026.acl-long.357
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7854–7865
URL: https://aclanthology.org/2026.acl-long.357/
Cite (ACL): Ao Wang, Xinghao Yang, Yongshun Gong, Wei Liu, Bao-di Liu, and Weifeng Liu. 2026. Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7854–7865, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance (Wang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.357.pdf
Checklist: 2026.acl-long.357.checklist.pdf
