Defending against Indirect Prompt Injection by Instruction Detection

Tongyu Wen; Chenglong Wang; Xiyuan Yang; Haoyu Tang; Yueqi Xie; Lingjuan Lyu; Zhicheng Dou (窦志成); Fangzhao Wu

Abstract

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerability to Indirect Prompt Injection (IPI) attacks, in which hidden instructions embedded in external data manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.
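To make the idea of combining two feature types concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a per-input hidden-state vector and a per-input gradient vector have already been extracted from some intermediate layer; here both are replaced by synthetic toy vectors. The function names (`combine_features`, `train_detector`, `run_toy_demo`), the L2 normalization, the concatenation, and the logistic-regression classifier are all illustrative assumptions. The abstract does not specify how InstructDetector extracts or combines its features.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_features(hidden_state, gradient):
    """Assumed combination rule: normalize each feature type, then concatenate."""
    return l2_normalize(hidden_state) + l2_normalize(gradient)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Estimated probability that the input contains an instruction."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_detector(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression detector with plain stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = score(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def make_toy_example(has_instruction, rng, dim=4):
    """Synthetic stand-in for real model features (hypothetical data only)."""
    mu = 1.0 if has_instruction else -1.0
    hidden = [rng.gauss(mu, 0.5) for _ in range(dim)]
    grad = [rng.gauss(mu, 0.5) for _ in range(dim)]
    return combine_features(hidden, grad), int(has_instruction)


def run_toy_demo(seed=0, n=40):
    """Train on n toy examples and return accuracy on n held-out toy examples."""
    rng = random.Random(seed)
    train = [make_toy_example(i % 2 == 0, rng) for i in range(n)]
    test = [make_toy_example(i % 2 == 0, rng) for i in range(n)]
    w, b = train_detector([x for x, _ in train], [y for _, y in train])
    correct = sum((score(w, b, x) >= 0.5) == bool(y) for x, y in test)
    return correct / len(test)
```

Calling `run_toy_demo()` trains the toy detector and returns its held-out accuracy on the synthetic data. That number says nothing about the accuracy figures reported in the abstract; in a real pipeline the toy vectors would be replaced by features read from the model.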

Anthology ID: 2025.findings-emnlp.1060
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19472–19487
URL: https://aclanthology.org/2025.findings-emnlp.1060/
Cite (ACL): Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, and Fangzhao Wu. 2025. Defending against Indirect Prompt Injection by Instruction Detection. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19472–19487, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Defending against Indirect Prompt Injection by Instruction Detection (Wen et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1060.pdf
Checklist: 2025.findings-emnlp.1060.checklist.pdf
