ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Hao Wang, Hao Li, Minlie Huang, Lei Sha


Abstract
The safety defenses of large language models (LLMs) remain limited because the dangerous prompts are manually curated to cover only a few known attack types, which fails to keep pace with emerging varieties. Recent studies have found that attaching suffixes to harmful instructions can bypass the defenses of LLMs and lead to dangerous outputs. However, similar to traditional textual adversarial attacks, this approach, while effective, is limited by the challenge of optimizing over discrete tokens. This gradient-based discrete optimization attack requires over 100,000 LLM calls, and because the adversarial suffixes are unreadable, it can be detected relatively easily by common defense methods such as perplexity filters. To cope with this challenge, we propose the Adversarial Suffix Embedding Translation Framework (ASETF), which transforms continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead of the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen the security defenses of LLMs. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, using harmful directives sourced from the AdvBench dataset. The results indicate that our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs such as ChatGPT and Gemini.
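The perplexity filter named in the abstract as a common defense can be illustrated with a minimal sketch. It assumes that per-token natural-log probabilities have already been obtained from some language model; the function names, the threshold of 50, and the sample values are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs, threshold):
    """Return True when the prompt's perplexity is at or below the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical log-probabilities, for illustration only (not from the paper).
fluent = [math.log(0.25)] * 8     # every token is fairly likely under the model
garbled = [math.log(0.001)] * 8   # every token is very unlikely under the model

THRESHOLD = 50.0                  # assumed cut-off; real filters tune this value

print(passes_filter(fluent, THRESHOLD))
print(passes_filter(garbled, THRESHOLD))
```

Under these assumptions, a prompt whose tokens the model finds unlikely is rejected, which is why the abstract stresses the fluency of the generated suffixes.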
Anthology ID:
2024.emnlp-main.157
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2697–2711
URL:
https://aclanthology.org/2024.emnlp-main.157
Cite (ACL):
Hao Wang, Hao Li, Minlie Huang, and Lei Sha. 2024. ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2697–2711, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings (Wang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.157.pdf