PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Hao Li; Xiaogeng Liu; Ning Zhang; Chaowei Xiao

doi:10.18653/v1/2025.acl-long.1468

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.4%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/leolee99/PIGuard.
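The over-defense measurement described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the trigger-word list, the four sample prompts, and the keyword-based `keyword_guard` are hypothetical and are not taken from NotInject or PIGuard. The sketch only shows the idea that, on an all-benign test set, accuracy is the fraction of prompts the guard leaves unflagged, so a guard biased toward trigger words scores poorly.

```python
from typing import Callable, Iterable

# Hypothetical trigger words; the actual NotInject vocabulary is not given here.
TRIGGER_WORDS = {"ignore", "override", "bypass"}

# Hypothetical benign prompts; two of them contain a trigger word.
BENIGN_SAMPLES = [
    "Can I ignore the warning light on my dashboard?",
    "How do I override a method in Java?",
    "What is the capital of Austria?",
    "Summarize this article about solar panels.",
]


def keyword_guard(prompt: str) -> bool:
    """Toy guard (an assumption, not PIGuard): flag any prompt with a trigger word."""
    tokens = {tok.strip(".,!?;:\"'").lower() for tok in prompt.split()}
    return bool(tokens & TRIGGER_WORDS)


def benign_accuracy(guard: Callable[[str], bool], benign_prompts: Iterable[str]) -> float:
    """Fraction of benign prompts that the guard correctly leaves unflagged."""
    prompts = list(benign_prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    correct = sum(1 for p in prompts if not guard(p))
    return correct / len(prompts)


if __name__ == "__main__":
    score = benign_accuracy(keyword_guard, BENIGN_SAMPLES)
    print(f"accuracy on benign prompts: {score:.2f}")
```

Any real guard model could be substituted for `keyword_guard` by wrapping it in a function that takes a prompt string and returns `True` when the prompt is flagged as an injection.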

Anthology ID: 2025.acl-long.1468
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 30420–30437
URL: https://aclanthology.org/2025.acl-long.1468/
DOI: 10.18653/v1/2025.acl-long.1468
Cite (ACL): Hao Li, Xiaogeng Liu, Ning Zhang, and Chaowei Xiao. 2025. PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30420–30437, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free (Li et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1468.pdf
