Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen


Abstract
Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.
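The abstract describes RA-LLM only at a high level: a robust alignment checking function is wrapped around an existing aligned LLM, with no retraining or fine-tuning. As a rough illustration of that idea, the Python sketch below shows one plausible way such a wrapper could work, assuming the check queries the aligned model on randomly perturbed copies of the prompt and monitors its refusal behavior. The token-dropping strategy, refusal markers, function names, and parameter values are all assumptions made for this sketch, not details taken from the paper.

```python
import random

# Hypothetical sketch of a "robust alignment checking function" wrapped around
# an already-aligned LLM. All heuristics and thresholds below are illustrative
# assumptions, not the paper's actual settings.

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def is_refusal(response: str) -> bool:
    """Heuristically decide whether the aligned LLM refused the request."""
    return response.strip().startswith(REFUSAL_MARKERS)

def robust_alignment_check(llm, prompt: str, n_checks: int = 20,
                           drop_ratio: float = 0.3, threshold: float = 0.5) -> bool:
    """Return True if the prompt should be rejected as alignment-breaking.

    `llm` is any callable mapping a prompt string to a response string.
    Each check randomly drops a fraction of the prompt's tokens and queries
    the aligned model; if the model refuses often enough across the perturbed
    copies, the original prompt is flagged.
    """
    tokens = prompt.split()
    refusals = 0
    for _ in range(n_checks):
        kept = [t for t in tokens if random.random() > drop_ratio]
        perturbed = " ".join(kept) if kept else prompt
        if is_refusal(llm(perturbed)):
            refusals += 1
    return refusals / n_checks >= threshold

def ra_llm_respond(llm, prompt: str) -> str:
    """Wrap an existing aligned LLM without retraining or fine-tuning it."""
    if robust_alignment_check(llm, prompt):
        return "Request declined: the prompt failed the robust alignment check."
    return llm(prompt)
```

The intuition behind a check of this kind is that an adversarial or jailbreaking suffix is brittle: once parts of it are removed, the underlying aligned model tends to refuse the request again, whereas a benign prompt is answered normally under the same perturbations.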
Anthology ID: 2024.acl-long.568
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 10542–10560
URL: https://aclanthology.org/2024.acl-long.568
Cite (ACL):
Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2024. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10542–10560, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (Cao et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.568.pdf