Nicholas Carlini


2023

pdf bib
Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy
Daphne Ippolito | Florian Tramer | Milad Nasr | Chiyuan Zhang | Matthew Jagielski | Katherine Lee | Christopher Choquette Choo | Nicholas Carlini
Proceedings of the 16th International Natural Language Generation Conference

Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works—and some recently deployed defenses—focus on “verbatim memorization”, defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that _perfectly_ prevents all verbatim memorization. And yet, we demonstrate that this “perfect” filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified “style-transfer” prompts—and in some cases even the non-modified original prompts—to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.

pdf bib
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Daphne Ippolito | Nicholas Carlini | Katherine Lee | Milad Nasr | Yun William Yu
Proceedings of the 16th International Natural Language Generation Conference

Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-_k_ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model’s predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).

2022

pdf bib
Deduplicating Training Data Makes Language Models Better
Katherine Lee | Daphne Ippolito | Andrew Nystrom | Chiyuan Zhang | Douglas Eck | Chris Callison-Burch | Nicholas Carlini
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets—for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. Code for deduplication is released at https://github.com/google-research/deduplicate-text-datasets.