Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs

Wei Zhao; Zhe Li; Yige Li; Jun Sun

doi:10.18653/v1/2025.findings-emnlp.767

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in multimodal comprehension, thanks to extensive pre-training and fine-tuning on large-scale visual datasets. However, despite their robust textual safety mechanisms, they remain vulnerable to harmful visual inputs. Existing safeguards—typically relying on pre-filtering or fine-tuning—incur high costs and diminish overall utility. To address this critical vulnerability, we introduce SafeCLIP, a lightweight method that leverages LVLMs’ inherent multimodal alignment for zero-shot toxic image detection. By projecting CLIP’s discarded CLS token into its text space and matching it with toxic descriptors, SafeCLIP detects harmful content without any architectural changes—adding minimal latency and enabling dynamic safety corrections during inference and fine-tuning. Experiments show that SafeCLIP achieves a 66.9% defense success rate with only 3.2% false positive rate and 7.2% overhead. In contrast, state-of-the-art methods achieve 52.9% success but have a 10.7% false positive rate and 210% overhead. Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost LVLM safety. Code is available at anonymous.4open.science/r/safeclip-2C01.
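
The detection step described in the abstract (project the image CLS token into the text space, then match it against toxic descriptors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the projection matrix, the descriptor embeddings, the temperature, and the decision threshold are all assumptions made for illustration. The sketch further assumes that the CLS vector, a vision-to-text projection matrix, and the descriptor text embeddings have already been extracted from a CLIP-style model.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_toxicity_scores(cls_token, visual_projection, descriptor_embeds,
                              temperature=0.01):
    """Score an image against text descriptors (hypothetical helper).

    cls_token:          (d_vision,) CLS vector from the vision encoder
    visual_projection:  (d_vision, d_text) vision-to-text projection (assumed given)
    descriptor_embeds:  (n, d_text) text embeddings of the descriptors
    temperature:        softmax temperature (assumed value, not from the paper)
    """
    # Project the CLS token into the text embedding space.
    image_embed = l2_normalize(cls_token @ visual_projection)
    text_embeds = l2_normalize(descriptor_embeds)

    # Cosine similarity between the image and each descriptor.
    sims = text_embeds @ image_embed

    # Numerically stable softmax over descriptors.
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return sims, probs


def is_toxic(probs, toxic_mask, threshold=0.5):
    """Flag the image if the probability mass on toxic descriptors reaches the threshold."""
    return bool(probs[toxic_mask].sum() >= threshold)
```

In this sketch, `toxic_mask` is a boolean array marking which descriptors are toxic; the remaining descriptors act as benign references. Because only one matrix product and one softmax are added per image, a check of this kind is cheap relative to a full LVLM forward pass, which is consistent with the low overhead the abstract reports.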

Anthology ID: 2025.findings-emnlp.767
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14232–14246
URL: https://aclanthology.org/2025.findings-emnlp.767/
DOI: 10.18653/v1/2025.findings-emnlp.767
Cite (ACL): Wei Zhao, Zhe Li, Yige Li, and Jun Sun. 2025. Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14232–14246, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs (Zhao et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.767.pdf
Checklist: 2025.findings-emnlp.767.checklist.pdf
