Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG

William Merrill; Noah A. Smith; Yanai Elazar

doi:10.18653/v1/2024.emnlp-main.800

Abstract

How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
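As a rough illustration of the n-novelty definition in the abstract, the following minimal sketch computes the proportion of n-grams in a generated text that do not occur in a reference corpus. It is not the Rusty-DAWG implementation: it uses a plain Python set built for one fixed n, so it does not have the arbitrary-length, constant-time lookup the abstract describes. The function names `ngrams` and `n_novelty`, the whitespace tokenization, the toy sentences, and the choice to count every n-gram occurrence (rather than each distinct n-gram once) are all assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram in `tokens`, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def n_novelty(generated: Sequence[str], corpus: Sequence[str], n: int) -> float:
    """Fraction of n-gram occurrences in `generated` that never appear in `corpus`.

    Naive approach (an assumption of this sketch): materialize all corpus
    n-grams for this particular n in a set, then test membership.
    """
    seen: Set[Tuple[str, ...]] = set(ngrams(corpus, n))
    candidates = ngrams(generated, n)
    if not candidates:
        raise ValueError("generated text is shorter than n")
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data, invented for illustration only.
corpus = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()

for n in (1, 2, 5):
    print(n, n_novelty(generated, corpus, n))
```

On this toy pair the novel fraction grows with n, since a single unseen token makes every longer n-gram that contains it novel as well. Because the set must be rebuilt for each n and its size grows with the corpus, this approach is only practical for small inputs.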

Anthology ID: 2024.emnlp-main.800
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 14459–14473
URL: https://aclanthology.org/2024.emnlp-main.800/
DOI: 10.18653/v1/2024.emnlp-main.800
Cite (ACL): William Merrill, Noah A. Smith, and Yanai Elazar. 2024. Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14459–14473, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG (Merrill et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.800.pdf
