Has this Fact been Edited? Detecting Knowledge Edits in Language Models

Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer


Abstract
Knowledge editing methods (KEs) can update language models’ obsolete or inaccurate knowledge learned from pre-training. However, KEs can also be used for malicious purposes, e.g., inserting misinformation or toxic content. Knowing whether a generated output is based on edited knowledge or on first-hand knowledge from pre-training can increase users’ trust in generative models and provide more transparency. Motivated by this, we propose a novel task: detecting knowledge edits in language models. Given an edited model and a fact retrieved from it by a prompt, the objective is to classify the knowledge as either unedited (based on pre-training) or edited (based on subsequent editing). We instantiate the task with four KEs, two large language models (LLMs), and two datasets. Additionally, we propose using hidden state representations and probability distributions as features for the detection model. Our results reveal that using these features as inputs to a simple AdaBoost classifier establishes a strong baseline. This baseline classifier requires only a small amount of training data and maintains its performance even in cross-domain settings. Our work lays the groundwork for addressing potential malicious model editing, a critical challenge associated with the strong generative capabilities of LLMs.
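The detection setup described in the abstract (hidden states and output probability distributions fed to an AdaBoost classifier) can be sketched minimally. Note this is an illustrative toy, not the paper's implementation: the random "hidden state" and "probability" vectors below are stand-ins for features that would actually be extracted from an edited LLM, and the class-dependent shift is injected only to make the toy task learnable.

```python
# Toy sketch: edited-vs-unedited fact detection with AdaBoost over
# placeholder features. In the paper, features come from the edited
# model's hidden states and output probability distributions; here
# random vectors stand in for both.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, hidden_dim, vocab_dim = 400, 32, 64

# Placeholder "hidden state" features; edited facts (label 1) get a
# small mean shift so the synthetic task is learnable at all.
labels = rng.integers(0, 2, size=n)          # 1 = edited, 0 = unedited
hidden = rng.normal(size=(n, hidden_dim))
hidden[labels == 1] += 0.8

# Placeholder "probability distribution" features (softmax over logits).
logits = rng.normal(size=(n, vocab_dim))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Concatenate both feature types, as the abstract proposes.
X = np.hstack([hidden, probs])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"toy detection accuracy: {acc:.2f}")
```

With real features, the same pipeline applies: extract the model's last-layer hidden state and next-token distribution for the prompt, concatenate, and train the classifier on a small labeled set of edited and unedited facts.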
Anthology ID:
2025.naacl-long.492
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
9768–9784
URL:
https://aclanthology.org/2025.naacl-long.492/
Cite (ACL):
Paul Youssef, Zhixue Zhao, Christin Seifert, and Jörg Schlötterer. 2025. Has this Fact been Edited? Detecting Knowledge Edits in Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9768–9784, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Has this Fact been Edited? Detecting Knowledge Edits in Language Models (Youssef et al., NAACL 2025)
PDF:
https://aclanthology.org/2025.naacl-long.492.pdf