A Comprehensive Analysis of Memorization in Large Language Models

Hirokazu Kiyomaru; Issa Sugiura; Daisuke Kawahara; Sadao Kurohashi

Abstract

This paper presents a comprehensive study of memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and more frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts that are not trained on during the latter stages of training, even if they appear frequently in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
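The abstract does not spell out the "standard methodology for judging memorization". As a minimal sketch, assuming the prefix-continuation test commonly used in the literature (prompt the model with the first tokens of a training text and check whether its greedy continuation reproduces the following tokens verbatim), the check could look like the code below. The names `is_flagged_memorized`, `make_lookup_decoder`, and `repeat_last_token` are hypothetical, and the decoders are toy stand-ins for a real LLM, not the Pythia or LLM-jp models.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a token prefix and a length n,
# return the model's greedy continuation of n tokens.
GreedyDecoder = Callable[[Sequence[int], int], List[int]]


def is_flagged_memorized(
    tokens: Sequence[int],
    prompt_len: int,
    suffix_len: int,
    decode: GreedyDecoder,
) -> bool:
    """Return True if the greedy continuation of the prompt matches the true suffix.

    Assumption: this mirrors a common prefix-continuation test; the paper's
    exact criterion may differ.
    """
    if len(tokens) < prompt_len + suffix_len:
        raise ValueError("text is shorter than prompt_len + suffix_len")
    prompt = list(tokens[:prompt_len])
    target = list(tokens[prompt_len:prompt_len + suffix_len])
    return decode(prompt, suffix_len) == target


def make_lookup_decoder(corpus: Sequence[Sequence[int]]) -> GreedyDecoder:
    """Toy 'memorizing' model: copies the first corpus text that starts with the prompt."""
    def decode(prompt: Sequence[int], n: int) -> List[int]:
        for seq in corpus:
            if list(seq[:len(prompt)]) == list(prompt):
                return list(seq[len(prompt):len(prompt) + n])
        return [0] * n  # fallback when the prompt was never seen
    return decode


def repeat_last_token(prompt: Sequence[int], n: int) -> List[int]:
    """Toy 'non-memorizing' model: repeats the last prompt token n times."""
    return [prompt[-1]] * n


memorizer = make_lookup_decoder([[1, 2, 3, 4, 5, 6]])
seen = is_flagged_memorized([1, 2, 3, 4, 5, 6], 3, 3, memorizer)
unseen = is_flagged_memorized([9, 8, 7, 6, 5, 4], 3, 3, memorizer)
# A highly repetitive text is flagged even though this decoder stores no corpus.
repetitive = is_flagged_memorized([7, 7, 7, 7, 7, 7], 3, 3, repeat_last_token)
```

The last call illustrates, under these toy assumptions, how an exact-match test can flag a text that the model simply continues by pattern. It is one plausible kind of false positive and is not necessarily among the causes the paper identifies.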

Anthology ID: 2024.inlg-main.45
Volume: Proceedings of the 17th International Natural Language Generation Conference
Month: September
Year: 2024
Address: Tokyo, Japan
Editors: Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 584–596
URL: https://aclanthology.org/2024.inlg-main.45
Cite (ACL): Hirokazu Kiyomaru, Issa Sugiura, Daisuke Kawahara, and Sadao Kurohashi. 2024. A Comprehensive Analysis of Memorization in Large Language Models. In Proceedings of the 17th International Natural Language Generation Conference, pages 584–596, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): A Comprehensive Analysis of Memorization in Large Language Models (Kiyomaru et al., INLG 2024)
PDF: https://aclanthology.org/2024.inlg-main.45.pdf
