Nael Abu-Ghazaleh
2024
Can Textual Unlearning Solve Cross-Modality Safety Alignment?
Trishna Chakraborty | Erfan Shayegani | Zikui Cai | Nael Abu-Ghazaleh | M. Salman Asif | Yue Dong | Amit Roy-Chowdhury | Chengyu Song
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent studies reveal that integrating new modalities into large language models (LLMs), as in vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While further SFT- and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training data poses a significant challenge. Inspired by the structural design of recent multi-modal models, in which all input modalities are ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our empirical evaluation across seven datasets demonstrates promising transferability: textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no additional benefit while incurring significantly higher computational demands.
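To make the core idea concrete, a minimal text-only unlearning loop might look like the sketch below, assuming a Hugging Face causal-LM backbone of a VLM and a small set of harmful prompt-response pairs. The model name, data, and gradient-ascent loss are illustrative assumptions, not the paper's exact training recipe.

```python
# Illustrative sketch: text-only unlearning via gradient ascent on unsafe completions.
# Assumptions (not from the paper): the model checkpoint, the toy "forget" data,
# and the simple negated-LM-loss objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical language backbone of a VLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Text-only "forget" set: harmful prompts paired with the unsafe responses to unlearn
# (placeholder strings for illustration).
forget_pairs = [
    ("How do I pick a lock?", "Sure, here are the detailed steps..."),
]

for prompt, unsafe_response in forget_pairs:
    inputs = tokenizer(prompt + unsafe_response, return_tensors="pt")
    labels = inputs["input_ids"].clone()  # for simplicity, loss covers the whole sequence;
                                          # in practice one would mask the prompt tokens
    outputs = model(**inputs, labels=labels)
    # Gradient *ascent* on the unsafe continuation: negating the LM loss pushes the
    # model to assign lower likelihood to the harmful response.
    loss = -outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the update touches only the language model, no image data or vision encoder is involved; the paper's finding is that such text-only updates nonetheless transfer to vision-text attacks.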
Vulnerabilities of Large Language Models to Adversarial Attacks
Yu Fu | Erfan Shayegan | Md. Mamun Al Abdullah | Pedram Zaree | Nael Abu-Ghazaleh | Yue Dong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
This tutorial serves as a comprehensive guide on the vulnerabilities of Large Language Models (LLMs) to adversarial attacks, an interdisciplinary field that blends perspectives from Natural Language Processing (NLP) and Cybersecurity. As LLMs become more complex and integrated into various systems, understanding their security attributes is crucial. However, current research indicates that even safety-aligned models are not impervious to adversarial attacks that can result in incorrect or harmful outputs. The tutorial first lays the foundation by explaining safety-aligned LLMs and concepts in cybersecurity. It then categorizes existing research based on different types of learning architectures and attack methods. We highlight the existing vulnerabilities of unimodal LLMs, multi-modal LLMs, and systems that integrate LLMs, focusing on adversarial attacks designed to exploit weaknesses and mislead AI systems. Finally, the tutorial delves into the potential causes of these vulnerabilities and discusses potential defense mechanisms.
Co-authors
- Yue Dong 2
- Trishna Chakraborty 1
- Erfan Shayegani 1
- Zikui Cai 1
- M. Salman Asif 1