Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh


Abstract
Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users. Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG). However, GCG is computationally inefficient, which limits further investigation of suffix transferability and scalability across models and data. In this work, we connect search efficiency with suffix transferability. We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching. Specifically, we employ direct first target token optimization in pre-searching to facilitate the search process. We apply our approach to cross-model, cross-data, and self-transfer scenarios. Furthermore, we introduce an interleaved variant of our approach, i-DeGCG, which iteratively leverages self-transferability to accelerate the search process. Experiments on HarmBench demonstrate the efficiency of our approach across various models and domains. Notably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with attack success rates (ASRs) of 43.9 (+22.2) and 39.0 (+19.5) on the validation and test sets, respectively. Further analysis on cross-model transfer indicates the pivotal role of first target token optimization in leveraging suffix transferability for efficient searching.
Anthology ID:
2024.emnlp-main.409
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7213–7224
URL:
https://aclanthology.org/2024.emnlp-main.409
Cite (ACL):
Hongfu Liu, Yuxi Xie, Ye Wang, and Michael Shieh. 2024. Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7213–7224, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models (Liu et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.409.pdf
Software:
 2024.emnlp-main.409.software.zip