Trishna Chakraborty
2024
Can Textual Unlearning Solve Cross-Modality Safety Alignment?
Trishna Chakraborty | Erfan Shayegani | Zikui Cai | Nael Abu-Ghazaleh | M. Salman Asif | Yue Dong | Amit Roy-Chowdhury | Chengyu Song
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent studies reveal that integrating new modalities into large language models (LLMs), as in vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where all input modalities are ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our empirical evaluation across seven datasets demonstrates promising transferability: textual unlearning in VLMs reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no additional benefit but incurs significantly higher computational cost.