A Comprehensive Analysis of Memorization in Large Language Models

Hirokazu Kiyomaru, Issa Sugiura, Daisuke Kawahara, Sadao Kurohashi


Abstract
This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequently occurring texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts that are not seen during the latter stages of training, even if they frequently appear in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
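The "standard methodology for judging memorization" referenced in the abstract is commonly implemented as an extractability check: a training text is flagged as memorized if, when the model is prompted with the text's prefix, greedy decoding reproduces the true continuation verbatim. The sketch below illustrates this criterion with the Hugging Face transformers API and a Pythia checkpoint; the prompt and continuation lengths shown are placeholder assumptions, not the paper's exact settings.

```python
# Minimal sketch of the verbatim "extractable memorization" criterion,
# assuming hypothetical prompt/continuation lengths of 32 tokens each.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1b"  # any Pythia checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def is_memorized(text: str, prompt_len: int = 32, cont_len: int = 32) -> bool:
    """Return True if greedy decoding from the text's prefix reproduces
    its true continuation token-for-token."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < prompt_len + cont_len:
        return False  # text too short to apply the criterion
    prompt = ids[:prompt_len].unsqueeze(0)
    target = ids[prompt_len:prompt_len + cont_len]
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=cont_len,
            do_sample=False,  # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = out[0, prompt_len:prompt_len + cont_len]
    return torch.equal(generated, target)
```

Because the check accepts any exact continuation match, highly predictable or templated texts can pass it without having been memorized, which is the source of the false positives discussed in finding (3).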
Anthology ID:
2024.inlg-main.45
Volume:
Proceedings of the 17th International Natural Language Generation Conference
Month:
September
Year:
2024
Address:
Tokyo, Japan
Editors:
Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
584–596
URL:
https://aclanthology.org/2024.inlg-main.45
Cite (ACL):
Hirokazu Kiyomaru, Issa Sugiura, Daisuke Kawahara, and Sadao Kurohashi. 2024. A Comprehensive Analysis of Memorization in Large Language Models. In Proceedings of the 17th International Natural Language Generation Conference, pages 584–596, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
A Comprehensive Analysis of Memorization in Large Language Models (Kiyomaru et al., INLG 2024)
PDF:
https://aclanthology.org/2024.inlg-main.45.pdf