Attacks by Content: Automated Fact-checking is an AI Security Issue

Michael Sejr Schlichtkrull


Abstract
When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents – attackers could instead supply biased, misleading, or false information. We term this an *attack by content*. Existing defences, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.
Anthology ID: 2025.emnlp-main.431
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 8561–8576
URL: https://aclanthology.org/2025.emnlp-main.431/
Cite (ACL): Michael Sejr Schlichtkrull. 2025. Attacks by Content: Automated Fact-checking is an AI Security Issue. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8561–8576, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Attacks by Content: Automated Fact-checking is an AI Security Issue (Schlichtkrull, EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.431.pdf
Checklist: 2025.emnlp-main.431.checklist.pdf