Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Minseo Kwak; Jaehyung Kim

Abstract

The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model’s top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model’s top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
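The abstract names three ingredients: a per-token log probability gap between the top-1 prediction and the target token, a sliding window over adjacent tokens, and a K% selection. The sketch below is one plausible way to combine them, not the authors' implementation. The function name `gap_k_score`, the mean-based window, the choice to average the largest K% of smoothed gaps, and the default values of `k` and `window` are all assumptions made for illustration; the paper itself defines the exact scoring rule.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def gap_k_score(logits_per_pos, target_ids, k=0.2, window=3):
    """Hypothetical Gap-K%-style score (assumed form, see lead-in).

    logits_per_pos: list of per-position logit lists (one per target token).
    target_ids:     index of the actual next token at each position.
    Returns a value <= 0; values closer to 0 suggest the text was seen.
    """
    if not logits_per_pos or len(logits_per_pos) != len(target_ids):
        raise ValueError("need one target id per non-empty logit row")

    # Gap between the top-1 log probability and the target's log probability.
    # It is 0 when the target is the model's top-1 prediction.
    gaps = []
    for logits, target in zip(logits_per_pos, target_ids):
        log_probs = log_softmax(logits)
        gaps.append(max(log_probs) - log_probs[target])

    # Assumption: smooth with a simple moving average over adjacent tokens.
    w = min(window, len(gaps))
    smoothed = [sum(gaps[i:i + w]) / w for i in range(len(gaps) - w + 1)]

    # Assumption: keep the K% largest smoothed gaps and average them.
    n = max(1, math.ceil(k * len(smoothed)))
    largest = sorted(smoothed, reverse=True)[:n]
    return -sum(largest) / n


# Toy logits over a 3-token vocabulary at 4 positions.
toy_logits = [[2.0, 0.0, -1.0], [0.5, 3.0, 0.0], [1.0, 0.0, 4.0], [0.0, 2.5, 0.0]]
score_matching = gap_k_score(toy_logits, [0, 1, 2, 1])   # targets are top-1
score_mismatch = gap_k_score(toy_logits, [1, 0, 0, 2])   # targets are not top-1
```

In practice the logits would come from a causal language model evaluated on the candidate text, and the score would be thresholded to decide membership. A sequence whose targets match the top-1 predictions everywhere scores highest under this sketch.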

Anthology ID: 2026.acl-long.1072
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 23391–23405
URL: https://aclanthology.org/2026.acl-long.1072/
Cite (ACL): Minseo Kwak and Jaehyung Kim. 2026. Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23391–23405, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data (Kwak & Kim, ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1072.pdf
Checklist: 2026.acl-long.1072.checklist.pdf
