Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

San Kim; Jonghwi Kim; Yejin Jeon; Gary Lee

doi:10.18653/v1/2025.findings-acl.1263

Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

San Kim, Jonghwi Kim, Yejin Jeon, Gary Lee

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk; attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.

Anthology ID:: 2025.findings-acl.1263
Volume:: Findings of the Association for Computational Linguistics: ACL 2025
Month:: July
Year:: 2025
Address:: Vienna, Austria
Editors:: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 24597–24614
Language:
URL:: https://aclanthology.org/2025.findings-acl.1263/
DOI:: 10.18653/v1/2025.findings-acl.1263
Bibkey:
Cite (ACL):: San Kim, Jonghwi Kim, Yejin Jeon, and Gary Lee. 2025. Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24597–24614, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):: Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection (Kim et al., Findings 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.findings-acl.1263.pdf

PDF Cite Search Fix data