Vahid Behzadan


2024

pdf bib
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay | Vahid Behzadan
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

A significant challenge in reliable deployment of Large Language Models (LLMs) is malicious manipulation via adversarial prompting techniques such as jailbreaks. Employing mechanisms such as safety training have proven useful in addressing this challenge. However, in multilingual LLMs, adversaries can exploit the imbalanced representation of low-resource languages in datasets used for pretraining and safety training. In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with five different models, namely Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to elicit harmful responses from these models. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse. Content Warning: This paper contains examples of harmful language.