@inproceedings{verma-etal-2025-multiguard,
title = "{MULTIGUARD}: An Efficient Approach for {AI} Safety Moderation Across Languages and Modalities",
author = "Verma, Sahil and
Hines, Keegan and
Bilmes, Jeff and
Siska, Charlotte and
Zettlemoyer, Luke and
Gonen, Hila and
Singh, Chandan",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.819/",
pages = "16184--16198",
ISBN = "979-8-89176-332-6",
abstract = "The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57{\%} over the strongest baseline in a multilingual setting, by 20.44{\%} for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ({\ensuremath{\approx}} 120{\texttimes} faster than the next fastest baseline). Code and data are available at https://github.com/vsahil/OmniGuard"
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="verma-etal-2025-multiguard">
<titleInfo>
<title>MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities</title>
</titleInfo>
<name type="personal">
<namePart type="given">Sahil</namePart>
<namePart type="family">Verma</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Keegan</namePart>
<namePart type="family">Hines</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jeff</namePart>
<namePart type="family">Bilmes</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Charlotte</namePart>
<namePart type="family">Siska</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Luke</namePart>
<namePart type="family">Zettlemoyer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hila</namePart>
<namePart type="family">Gonen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chandan</namePart>
<namePart type="family">Singh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈ 120× faster than the next fastest baseline). Code and data are available at https://github.com/vsahil/OmniGuard</abstract>
<identifier type="citekey">verma-etal-2025-multiguard</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.819/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>16184</start>
<end>16198</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
%A Verma, Sahil
%A Hines, Keegan
%A Bilmes, Jeff
%A Siska, Charlotte
%A Zettlemoyer, Luke
%A Gonen, Hila
%A Singh, Chandan
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F verma-etal-2025-multiguard
%X The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈ 120× faster than the next fastest baseline). Code and data are available at https://github.com/vsahil/OmniGuard
%U https://aclanthology.org/2025.emnlp-main.819/
%P 16184-16198
Markdown (Informal)
[MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities](https://aclanthology.org/2025.emnlp-main.819/) (Verma et al., EMNLP 2025)
ACL
Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, and Chandan Singh. 2025. MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16184–16198, Suzhou, China. Association for Computational Linguistics.
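
Note: the abstract describes the method at a high level: identify an intermediate layer of an LLM/MLLM whose representations are aligned across languages or modalities, then train a lightweight classifier on those embeddings, reusing hidden states that are computed during generation anyway. The Python sketch below illustrates that general pipeline under stated assumptions; the model name, layer index, pooling strategy, and probe architecture here are illustrative choices, not the paper's settings (the authors' actual implementation is at https://github.com/vsahil/OmniGuard).

# Minimal sketch of the approach described in the abstract: reuse an LLM's
# internal (hidden-state) representations of a prompt as features for a
# lightweight harmfulness classifier. Model, layer, pooling, and probe
# architecture are hypothetical stand-ins, not the paper's configuration.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small stand-in; the paper targets larger (M)LLMs
LAYER = 6             # hypothetical choice of an "aligned" intermediate layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def prompt_embedding(text: str) -> torch.Tensor:
    """Mean-pool one intermediate layer's hidden states for a prompt.

    Because these states are produced during generation anyway, reusing
    them adds almost no overhead, which is the abstract's efficiency point.
    """
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)           # (dim,)

# A small probe on top of the frozen embeddings; it would be trained on
# labeled harmful/benign prompts (training loop omitted in this sketch).
probe = nn.Sequential(
    nn.Linear(model.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # logits for {benign, harmful}
)

emb = prompt_embedding("How do I bake a chocolate cake?")
logits = probe(emb)
print("harmful" if logits.argmax().item() == 1 else "benign")

Training the probe on labeled multilingual and multimodal prompts (not shown) is where the paper's cross-language/cross-modality alignment step would matter; this sketch only fixes the inference path.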