Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Kent Chang, Mackenzie Cramer, Sandeep Soni, David Bamman


Abstract
In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
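The name cloze query mentioned in the abstract can be illustrated with a minimal sketch: one proper name in a short passage is replaced by a mask token, a model is asked to supply the name, and predictions are scored by exact match. The mask token, the function names, the matching rule (case-insensitive, whitespace-trimmed), and the example sentence are all assumptions made for illustration; this is not the authors' implementation, and no model is called here.

```python
import re

MASK = "[MASK]"  # assumed mask token, chosen for illustration


def make_name_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` in `passage` with MASK."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def name_cloze_accuracy(predictions, gold_names):
    """Fraction of predictions matching the gold name exactly.

    Matching ignores case and surrounding whitespace (an assumption).
    """
    if len(predictions) != len(gold_names):
        raise ValueError("predictions and gold_names must have equal length")
    if not gold_names:
        return 0.0
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold_names)
    )
    return hits / len(gold_names)


# Invented example sentence, not a quotation from any book.
cloze = make_name_cloze("Elizabeth looked out of the window.", "Elizabeth")
score = name_cloze_accuracy(["Elizabeth", "Jane"], ["Elizabeth", "Emma"])
```

Under this sketch, a high per-book accuracy would suggest the model has seen the book, since a masked name is hard to recover from a short passage alone.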
Anthology ID:
2023.emnlp-main.453
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7312–7327
URL:
https://aclanthology.org/2023.emnlp-main.453
DOI:
10.18653/v1/2023.emnlp-main.453
Cite (ACL):
Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7312–7327, Singapore. Association for Computational Linguistics.
Cite (Informal):
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 (Chang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.453.pdf
Video:
https://aclanthology.org/2023.emnlp-main.453.mp4