2024
A Comprehensive Analysis of Memorization in Large Language Models
Hirokazu Kiyomaru | Issa Sugiura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 17th International Natural Language Generation Conference
This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts that are not trained on during the latter stages of training, even if they appear frequently in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
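
The "standard methodology" referenced in finding (3) is commonly an extractability-style check: prompt the model with the first k tokens of a training sequence and test whether greedy decoding reproduces the next m tokens verbatim. The sketch below illustrates that check under stated assumptions; the Pythia checkpoint name, prompt and continuation lengths, and the `is_memorized` helper are illustrative choices, not the paper's exact setup.

```python
# Minimal sketch of an extractability-style memorization check.
# Assumptions: a Pythia checkpoint from the Hugging Face Hub and
# illustrative prompt/continuation lengths of 32 tokens each.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-1b"  # any Pythia checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def is_memorized(text: str, prompt_len: int = 32, cont_len: int = 32) -> bool:
    """Return True if greedy decoding from the first `prompt_len` tokens
    of `text` reproduces the next `cont_len` tokens verbatim."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if len(ids) < prompt_len + cont_len:
        return False  # sequence too short to test at these lengths
    prompt = ids[:prompt_len].unsqueeze(0)
    target = ids[prompt_len : prompt_len + cont_len]
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=cont_len,
            do_sample=False,  # greedy decoding, as in extractability tests
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = out[0, prompt_len : prompt_len + cont_len]
    return generated.shape == target.shape and torch.equal(generated, target)
```

Finding (3) points to a limitation of exactly this kind of verbatim-match test: highly predictable or templated text (e.g., boilerplate or number sequences) can pass the check even when the model has not memorized that specific training instance.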