Knowledge Unlearning for Mitigating Privacy Risks in Language Models

Pretrained Language Models (LMs) memorize a vast amount of knowledge during initial pretraining, including information that may violate the privacy of personal lives and identities. Previous work addressing privacy issues for LMs has mostly focused on data preprocessing and differential privacy methods, both requiring re-training the underlying LM. We propose knowledge unlearning as an alternative method to reduce privacy risks for LMs post hoc. We show that simply performing gradient ascent on target token sequences is effective at forgetting them with little to no degradation of general language modeling performance for larger-sized LMs. We also find that sequential unlearning is better than trying to unlearn all the data at once and that unlearning is highly dependent on which kind of data (domain) is forgotten. By comparing with previous methods known to mitigate privacy risks for LMs, we show that our approach can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori, while being much more efficient and robust.


INTRODUCTION
Recent work has shown that an adversary can extract training data from Pretrained Language Models (LMs), including Personally Identifiable Information (PII) such as names, phone numbers, and email addresses, and other information such as licensed code, private clinical notes, and 128-bit UUIDs (Carlini et al., 2021; Lee et al., 2022; Huang et al., 2022; Lehman et al., 2021). In 2021, the AI chatbot Iruda became the first AI system to be sued for violating the Personal Information Protection Act after unintentionally generating the exact home addresses and bank account numbers of actual individuals (Park, 2021). Heikkilä (2022) has also shown that GPT-3 (Brown et al., 2020), one of the most well-known LMs currently in commercial use, offered detailed private information about the Editor-in-Chief of MIT Technology Review, including his family members, work address, and phone number. Considering findings that extracting training data gets easier as LMs scale to larger sizes (Carlini et al., 2022) and that it is common practice for practitioners to release billion-parameter pretrained LMs for public use (Gao et al., 2020; Black et al., 2021; Zhang et al., 2022), it has become important to provide privacy guarantees for large LMs.
Practitioners are required to delete personal information from LMs at individuals' request because each individual has the "Right To Be Forgotten (RTBF)" (Mantelero, 2013; Graves et al., 2021) and can limit the direct and indirect commercial use of their personal information (Villaronga et al., 2018). Previous methods addressing privacy risks for language models attempt to remove all private information from the training data (data preprocessing) (Aura et al., 2006; Dernoncourt et al., 2017; Lison et al., 2021; Kandpal et al., 2022) or attempt to design algorithms that ensure differential privacy (DP) (Dwork, 2008; Dwork et al., 2006; Abadi et al., 2016; Anil et al., 2021; Li et al., 2022; Yu et al., 2022). Both approaches require retraining the underlying LM every time individuals want to exercise their RTBF, which makes them inadequate for large LMs that are extremely costly to retrain. Furthermore, as pointed out by Brown et al. (2022), data preprocessing methods assume private information can be easily identified, specified, and removed, and DP algorithms can only guarantee protection for information with clear privacy borders, which makes both inadequate in real-world scenarios where the standard of privacy may differ for each individual.
To this end, we propose knowledge unlearning (Figure 1) as an efficient solution that can be applied with just a few parameter updates instead of pretraining the underlying LM again. We perform experiments on GPT-Neo LMs (125M, 1.3B, 2.7B) (Black et al., 2021) and show that simply reversing the direction of gradient descent during language modeling (which can also be seen as maximizing instead of minimizing the loss function) is effective at protecting target sequences from extraction attacks with little to no degradation of the initial LM capabilities, measured via 9 common NLP benchmarks: Hellaswag (Zellers et al., 2019), Lambada (Paperno et al., 2016), Winogrande (Sakaguchi et al., 2021), COPA (Gordon et al., 2012), ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), Piqa (Bisk et al., 2020), MathQA (Amini et al., 2019), and PubMedQA (Jin et al., 2019). In some cases, knowledge unlearning unexpectedly yields significant improvements in LM performance on some of these benchmarks.
We compare our approach with a data deduplication method (Kandpal et al., 2022), which is known to mitigate privacy risks, and show the effectiveness of knowledge unlearning in providing a stronger privacy guarantee while being much more efficient. We also provide a general guideline that can be used to quantify the memorization and extraction likelihood of target token sequences and suggest when we can empirically consider them to have been "forgotten". Specifically, we introduce a novel metric that measures the extraction likelihood by varying the prefix length of the target token sequence and quantifying how much of the suffix is actually extracted from the LM.
Surprisingly, for knowledge unlearning, we find that it is easier to forget a chunk of instances sequentially rather than trying to forget them all at once. We provide further analysis and show that the difficulty of knowledge unlearning depends heavily on the target data being forgotten, especially the domain of the target data. We also provide empirical examples of performing extraction attacks and how exactly knowledge unlearning provides a privacy guarantee for the LM.
To summarize, our main contributions are fourfold:
• We compare knowledge unlearning with a data preprocessing approach and show that our approach results in little to no degradation of general language modeling performance (sometimes even resulting in improvement), while providing stronger empirical privacy guarantees in situations where individuals exercise their RTBF and being 3,500,000x more computationally efficient than the compared approach.
• We perform additional experiments to determine which factors contribute to the difficulty of knowledge unlearning and find that (1) trying to forget many samples at once results in substantial LM performance degradation which can be mitigated by sequentially forgetting chunks of data and that (2) the domain of the target data (Code, License, Wikipedia, etc.) plays a critical role in determining how hard they are to forget.
• We provide a novel metric and a general guideline for quantifying the privacy risks for LMs and determine when they should be considered to have "forgotten" a given target sequence.
• Knowledge unlearning surprisingly sometimes seems to make LMs stronger, with extreme cases bringing improvements of +8.0% on some benchmarks.

RELATED WORK

Prior work that tries to mitigate privacy risks for LMs can be divided mainly into data pre/post-processing methods and differential privacy methods.

(Data) Pre/Post-Processing Data preprocessing aims to sanitize the training data: it removes all data that might violate any kind of privacy prior to training. These methods mostly utilize measures such as parsers and classification models that try to identify patterns constituting private information. They are effective at identifying well-formatted private information such as social security numbers or special forms of medical notes (Aura et al., 2006; Dernoncourt et al., 2017; Lison et al., 2021; Kandpal et al., 2022). However, as pointed out by Brown et al. (2022), considering that private information is mostly context-dependent and sometimes non-specific in format, data preprocessing methods cannot fully claim to provide privacy guarantees, especially guarantees that match each individual's standards. Methods that apply post-processing, such as censoring LM outputs, face the same limitations.
In this work, we compare our proposed method with a data preprocessing approach proposed by Kandpal et al. (2022), which shows that deduplicating the training corpora before pretraining yields LMs that are more robust against extraction attacks than an LM pretrained under the same circumstances on the unduplicated corpora. However, we highlight that this approach, while still effective at mitigating overall privacy risks, is not the most suitable when considering a realistic scenario of individuals requesting the removal of their information from the implicit parameters of the LMs.
Differential Privacy Differential Privacy (DP) aims to guarantee that the effect of an individual input on the output of a specific function is bounded (Dwork, 2008; Dwork et al., 2006). In the context of deep neural networks, DP, which needs to be applied during the training phase, aims to construct models that ensure the individual information within the training data cannot be inferred (Abadi et al., 2016). While DP has been shown to be surprisingly effective at fine-tuning LMs (Li et al., 2022; Yu et al., 2022), pretraining LMs with DP still suffers from a substantial performance gap, expensive computation, and slow convergence (Anil et al., 2021). Furthermore, as pointed out by Brown et al. (2022), DP can only provide limited guarantees for LMs because DP requires a unified definition for privacy boundaries, which is inherently impossible for natural language data. Moreover, in a realistic scenario where individuals may exercise their Right-To-Be-Forgotten (RTBF) dynamically after model deployment, it is extremely difficult to define beforehand a notion of privacy that matches the requirements of each individual, which is required for training an LM with DP.

MACHINE UNLEARNING
Machine unlearning has received attention as an alternative approach to overcome data privacy issues in machine learning (Cao & Yang, 2015; Ginart et al., 2019; Bourtoule et al., 2021; Graves et al., 2021). Several studies attempt to explore machine unlearning for deep neural networks (Golatkar et al., 2020; Mehta et al., 2022). However, they mostly focus on proposing algorithms for image classification models where the goal is to forget a whole class, that is, achieve random performance on specific image classes such as "cats" or "ships". We are the first, to the best of our knowledge, to explore unlearning a specific sequence of tokens for LMs, which is quite a different set-up from traditional image classification models (∼tens of image classes vs. sequences of tokens where each token is drawn from a vocabulary of size |V| ≈ 50,000). In this work, we coin this approach knowledge unlearning since we are more focused on forgetting specific knowledge represented by sequences of tokens.
Zhou et al. (2022) focus on how forgetting can be leveraged to improve the performance of the underlying model. They propose "forget-and-relearn", which unifies existing iterative training algorithms by selectively removing undesirable information and re-learning good features, boosting performance on image classification and multi-agent emergent communication tasks. The underlying assumption is that it is often easier to define and stop unwanted behavior than to teach good behavior. We also observe this phenomenon in Section 4, where we unintentionally find that unlearning just a few sequences of tokens sometimes boosts general LM capabilities.

MEMORIZATION IN LANGUAGE MODELS
Previous work exploring the extent to which LMs have memorized their training data approaches the phenomenon from two different viewpoints. One line of work views memorization of LMs simply as a threat to individual privacy (Carlini et al., 2021; Jagielski et al., 2022) and utilizes metrics that quantify how susceptible LMs are to adversarial attacks. These metrics are mostly dependent on specific types of attacks, such as the membership inference attack (Shokri et al., 2017), and measure the privacy risks of LMs by quantifying the success rate of these attacks.
Another line of work simply quantifies how much knowledge is accumulated and forgotten during pretraining by extracting relational knowledge about the world (Petroni et al., 2019;Lazaridou et al., 2021;Jang et al., 2022b;a). This line of work does not view memorization as a negative trait, but as a positive one that can be leveraged to extract world knowledge from its implicit parameters and perform knowledge-intensive tasks such as question answering or training knowledgeable conversation agents.
Our work is highly related to Jagielski et al. (2022)'s work where they also assert that forgetting can be a relaxed version of differential privacy. However, there are two main differences between our work and theirs. First, they only analyze forgetting as a passive form of mitigating privacy, asserting that data seen early in large-scale training obtain privacy benefits, whereas we suggest a more active form of forgetting. Second, they only show analysis results with image classification and audio generation models while we specifically focus on large LMs.

METHODOLOGY
We propose simply negating the original training objective of minimizing the negative log-likelihood of the token sequences as our main method of knowledge unlearning in LMs. Specifically, given a sequence of tokens x = (x_1, ..., x_T), our unlearning training objective is simply maximizing the following loss function:

L_UL(f_θ, x) = − Σ_{t=1}^{T} log p_θ(x_t | x_<t)     (1)

where x_<t denotes the token sequence (x_1, ..., x_{t−1}) and p_θ(x_t | x_<t) denotes the conditional probability of predicting the next token to be x_t when x_<t is given to an LM f with parameters θ. Maximizing Equation 1 amounts to performing gradient ascent on the standard language modeling loss for the target sequence.
Prior work refers to this training objective as unlikelihood training and combines it with the original loss of minimizing the negative log-likelihood, for the final objective of enhancing language modeling quality (Welleck et al., 2020) and few-shot learning for downstream NLP tasks (Tam et al., 2021). In contrast, we optimize only the unlikelihood objective since we are concerned solely with forgetting. While this method seems simple, it is highly effective at forgetting specific token sequences without affecting overall LM capabilities, as shown in Section 4.
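As a concrete sketch, the objective amounts to flipping the sign of the usual gradient update. The toy model below (a single-layer bigram model with a hand-derived gradient, not the paper's GPT-Neo setup) illustrates that one gradient ascent step on a target sequence's negative log-likelihood makes that sequence strictly less likely:

```python
import numpy as np

# Toy "LM": a single softmax layer over a vocabulary of V tokens conditioned
# on the previous token (a bigram model). This is a minimal sketch of the
# unlearning objective, not the paper's GPT-Neo setup: only the update
# direction matters here.
rng = np.random.default_rng(0)
V = 8
W = rng.normal(size=(V, V)) * 0.1  # W[prev] = logits over the next token

def nll(W, seq):
    """Negative log-likelihood of a token sequence under the bigram model."""
    total = 0.0
    for prev, nxt in zip(seq[:-1], seq[1:]):
        logits = W[prev]
        log_probs = logits - np.log(np.exp(logits).sum())
        total -= log_probs[nxt]
    return total

def unlearn_step(W, seq, lr=0.5):
    """One knowledge-unlearning step: gradient *ascent* on the NLL of `seq`,
    i.e. ordinary gradient descent with the sign of the update flipped."""
    grad = np.zeros_like(W)
    for prev, nxt in zip(seq[:-1], seq[1:]):
        p = np.exp(W[prev] - W[prev].max())
        p /= p.sum()
        p[nxt] -= 1.0              # d(NLL)/d(logits) = softmax - one_hot
        grad[prev] += p
    return W + lr * grad           # '+' instead of '-' turns descent into ascent

target = [1, 3, 3, 7]              # token sequence to forget
before = nll(W, target)
W = unlearn_step(W, target)
after = nll(W, target)
print(after > before)              # True: the target is now harder to extract
```

Because the NLL is convex in the logits, a step in the gradient direction is guaranteed to increase the target sequence's loss, which is exactly the forgetting behavior the unlikelihood objective asks for.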

QUANTIFYING PRIVACY RISKS OF LANGUAGE MODELS
In this subsection, we introduce the two metrics we use to quantify the privacy risks of a specific token sequence and explain when we empirically consider a token sequence to be forgotten.
Extraction Likelihood (EL) We first introduce a new metric, EL. Given a sequence of tokens x = (x_1, ..., x_T) and an LM f with pretrained parameters θ, we define EL as follows:

EL_n(x) = (1 / (T − n)) Σ_{t=1}^{T−n} OVERLAP_n(f_θ(x_<t), x_≥t)     (2)

OVERLAP_n(a, b) = Σ_{c ∈ n-grams(a)} 1{c ∈ n-grams(b)} / |n-grams(a)|     (3)

where n-grams(·) denotes the list of n-grams in the given token sequence and f_θ(x_<t) denotes the output token sequence generated by the LM f_θ when given x_<t as input; the output can have max length |x_≥t| but may be shorter when the EOS (end-of-sequence) token is generated beforehand.
The process of varying the prefix length |x <t | can be seen as varying the strength of adversarial attacks. This is based on the assumption that the more prior information is provided about the target token sequence, the easier the LM will be able to extract it. Overall, EL can be seen as estimating the general extraction likelihood since we are measuring the average success rate of varying extraction attacks quantified via getting the n-gram overlap of generated and target token sequences. While previous metrics quantifying the privacy risks of LMs are dependent on specific adversarial attacks, this characteristic of EL allows it to quantify the general likelihood of extraction without any dependency on specific extraction attacks.
We regard n to be a hyper-parameter that can be varied depending on the stringency of privacy standards. The higher n is set, the stricter we set the standard for a successful extraction attack.
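A minimal sketch of EL_n in code, assuming a hypothetical `model_generate(prefix, max_len)` interface standing in for greedy decoding with f_θ:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_n(generated, reference, n):
    """Fraction of the n-grams in `generated` that also appear in `reference`."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)

def extraction_likelihood(model_generate, x, n):
    """EL_n(x): average n-gram overlap between model continuations and the
    true suffix, over all prefix lengths. `model_generate(prefix, max_len)`
    is a hypothetical stand-in for greedy decoding with f_theta."""
    T = len(x)
    scores = [
        overlap_n(model_generate(x[:t], max_len=T - t), x[t:], n)
        for t in range(1, T - n + 1)  # vary the adversarial prefix length
    ]
    return sum(scores) / len(scores)

x = list(range(12))
# A model that has fully memorized x reproduces every true suffix ...
memorized = lambda prefix, max_len: x[len(prefix):len(prefix) + max_len]
# ... while one that never saw x emits unrelated tokens.
clueless = lambda prefix, max_len: [99] * max_len

print(extraction_likelihood(memorized, x, n=3))  # 1.0
print(extraction_likelihood(clueless, x, n=3))   # 0.0
```

The two toy "models" bracket the metric's range: perfect memorization yields EL_n = 1.0 and no memorization yields 0.0, matching the intended reading of EL as an average extraction success rate.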
Memorization Accuracy (MA) We define Memorization Accuracy (MA) as follows:

MA(x) = Σ_{t=1}^{T−1} 1{argmax(p_θ(· | x_<t)) = x_t} / (T − 1)     (4)

MA quantifies how much f_θ has memorized the given token sequence and was proposed by Tirumala et al. (2022) to analyze the training dynamics of large LMs.
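MA admits an equally small sketch; `next_token_argmax(prefix)` is a hypothetical stand-in for the model's greedy next-token prediction:

```python
def memorization_accuracy(next_token_argmax, x):
    """MA(x): fraction of positions t where the model's greedy next-token
    prediction argmax p_theta(. | x_<t) equals the true token x_t.
    `next_token_argmax(prefix)` is a hypothetical model interface."""
    T = len(x)
    hits = sum(next_token_argmax(x[:t]) == x[t] for t in range(1, T))
    return hits / (T - 1)

# Toy model that always predicts "previous token + 1" (an illustrative
# assumption, not a real LM).
successor = lambda prefix: prefix[-1] + 1

print(memorization_accuracy(successor, [0, 1, 2, 3, 4]))  # 1.0
print(memorization_accuracy(successor, [0, 1, 5, 3, 9]))  # 0.25
```

On the first sequence every greedy prediction matches the true token; on the second only the first of four positions matches, hence MA = 0.25.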
Empirical Definition of Forgetting By utilizing both EL_n and MA, we empirically define a specific token sequence x to be forgotten, and no longer susceptible to extraction attacks, when the following conditions are met:

EL_n(x) ≤ (1 / |D′|) Σ_{x′ ∈ D′} EL_n(x′)  and  MA(x) ≤ (1 / |D′|) Σ_{x′ ∈ D′} MA(x′)     (5)

where D′ represents a validation corpus not seen during training. In other words, we define x to be forgotten when EL_n(x) and MA(x) reach values lower than the average EL_n and MA on token sequences that were not seen during training.

Evaluation We evaluate general LM capabilities on the 9 benchmarks listed in the introduction, which measure commonsense and scientific reasoning abilities. We use the test set for Lambada and the validation set for the rest of the benchmarks. We also show the results of measuring perplexity on the validation corpora of Pile and Wikitext in Appendix B. We do not include perplexity as one of the main evaluations because it might not be the most suitable metric for quantifying general LM performance, especially in the case of unlearning.
Configurations We set the learning rate to 5e-5; we show the effect of varying learning rates in Appendix D. We use constant learning rate scheduling throughout each run. We fix the global batch size to be the same as s (the number of samples forgotten at once) because global batch sizes smaller than s proved to degrade general LM capabilities. For EL_n, we set n = 10, which means EL measures the likelihood of extracting 10 consecutive tokens under extraction attacks of varying strength. For calculating EL_10 and MA, we use a naïve greedy decoding strategy. We set both the dropout and weight decay rates to 0. Lastly, while Section 3.2 provides a guideline for empirically deciding that a single token sequence is forgotten, for considering a chunk of s token sequences to be forgotten we use the average EL_10 and MA as an approximation of the individual EL_10 and MA.

MAIN EXPERIMENTS
Forgetting Threshold First, we show how we obtain the Forgetting Threshold for EL_10 and MA, the values at which we consider a token sequence to be forgotten and no longer susceptible to extraction attacks, for all model sizes of GPT-NEO LMs in Table 1. For D′, we perform weighted sampling (same domain distribution as the Pile training corpora) of 10,000 instances, each with token length 200, from the Pile validation corpora and measure the average EL_10 and MA (Equation 5), which are empirically set as the Forgetting Threshold values. Table 2 shows the main results of performing unlearning on LMs of varying sizes; in the table, OPT denotes the OPT LM, which was pretrained with deduplication applied, Avg. denotes the average accuracy on the 9 evaluation benchmark datasets, the best comparable performances are bolded, and the second best are underlined.
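The thresholding procedure itself is a simple comparison against unseen-data averages; the sketch below uses purely illustrative numbers, not the paper's measured thresholds:

```python
def forgetting_thresholds(el_valid, ma_valid):
    """Forgetting Thresholds: the average EL_n and MA measured on sequences
    from a held-out validation set D' never seen during training (Equation 5)."""
    return sum(el_valid) / len(el_valid), sum(ma_valid) / len(ma_valid)

def is_forgotten(el_x, ma_x, el_t, ma_t):
    """A target sequence counts as forgotten once BOTH metrics drop to or
    below the unseen-data averages."""
    return el_x <= el_t and ma_x <= ma_t

# Illustrative per-sequence metric values for D' (not real measurements).
el_t, ma_t = forgetting_thresholds([0.06, 0.02, 0.04], [0.40, 0.50, 0.45])

print(is_forgotten(el_x=0.30, ma_x=0.90, el_t=el_t, ma_t=ma_t))  # False: still memorized
print(is_forgotten(el_x=0.01, ma_x=0.35, el_t=el_t, ma_t=ma_t))  # True: below both thresholds
```

Requiring both conditions at once is the conservative choice: a sequence that is still greedily completed token-for-token (high MA) is not considered safe even if its n-gram extraction rate happens to be low.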

Main Results
In the table, NEO denotes the GPT-NEO LMs. NEO + UL denotes applying unlearning to the GPT-NEO LM until it reaches lower EL_10 and MA than OPT. Lastly, NEO + UL+ denotes performing unlearning on GPT-NEO until the target token sequences are forgotten (the EL_10 and MA values fall below the Forgetting Threshold). While we report the average performance over the 5 random samplings in Table 2, we provide each individual run in Appendix A for reference.
We highlight four main points regarding the results.
(1) OPT LMs show a much lower EL 10 and MA than GPT-NEO LMs, confirming that deduplicating the pretraining corpora is indeed helpful for mitigating privacy risks.
(2) NEO + UL + results in degradation of average LM performance for the 125M LM while retaining most of its previous capabilities for the 1.3B and 2.7B LMs. Interestingly, for some benchmarks such as Lambada and ARC-Challenge, it actually results in a boost of performance.
(3) As the LMs scale to larger sizes, it takes fewer epochs for the target sequences to be forgotten. Together with (2), this implies that larger LMs are stronger unlearners. (4) While NEO + UL+ provides a stronger privacy guarantee than OPT without sacrificing the average LM performance of NEO, it is also much more computationally efficient (3,500,000x) than re-training the underlying LM, which is required for all data preprocessing approaches. This characteristic makes knowledge unlearning very advantageous in scenarios where people exercise their RTBF and dynamic forgetting of specific token sequences is therefore required.
Overall, results show unlearning to be an effective approach to providing a stronger empirical privacy guarantee than existing LMs measured via EL 10 and MA while retaining and sometimes even improving general LM capabilities.
Sequential Unlearning is More Stable than Batch Unlearning. We show the effect of varying s (the number of data instances to be forgotten at once) in Figure 2a; we refer to this approach as batch unlearning. As shown by the s = 128 results, it is harder to forget more samples at once: doing so results in substantial degradation of average LM performance regardless of how large the LM is. Since s ≤ 32 does not show much degradation, we explore whether sequential unlearning can be a solution. In Figure 2b, we show the result of dividing the 128 samples into 4 chunks of 32 and performing sequential unlearning: we unlearn one chunk at a time until that chunk reaches the forgetting threshold. Surprisingly, as shown by the performance gap at s = 128 between the dotted lines (the s = 128 performance from Figure 2a) and the solid lines, the end result is vastly different even though exactly the same instances were forgotten: sequential unlearning shows almost no degradation of average LM performance. In Appendix G, we show that chunks once forgotten stay forgotten and that later chunks are forgotten much faster than the initial chunk. This result hints at the generalization of unlearning, which we do not explore further in this work; it also suggests that knowledge unlearning can be applied continually to LMs as needed.
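The chunked procedure can be sketched as a simple control loop. The callbacks and the toy "memorization score" model below are purely illustrative stand-ins for a gradient ascent epoch and the EL_10/MA threshold check:

```python
def sequential_unlearn(model, forget_set, chunk_size,
                       unlearn_step, chunk_forgotten, max_epochs=20):
    """Sketch of sequential unlearning: split the forget set into chunks and
    unlearn one chunk at a time until it crosses the forgetting threshold,
    instead of unlearning all samples in one batch. `unlearn_step` and
    `chunk_forgotten` are hypothetical callbacks standing in for a gradient
    ascent epoch and the EL/MA threshold check."""
    chunks = [forget_set[i:i + chunk_size]
              for i in range(0, len(forget_set), chunk_size)]
    for chunk in chunks:
        for _ in range(max_epochs):
            if chunk_forgotten(model, chunk):
                break
            model = unlearn_step(model, chunk)
    return model

# Toy stand-in: the "model" is a memorization score per sample; one unlearning
# epoch lowers the score of each sample in the chunk, and a sample counts as
# forgotten when its score reaches zero.
def unlearn_step(model, chunk):
    return {s: score - 1 if s in chunk else score for s, score in model.items()}

def chunk_forgotten(model, chunk):
    return all(model[s] <= 0 for s in chunk)

model = {s: 3 for s in range(8)}   # 8 memorized samples
model = sequential_unlearn(model, list(range(8)), chunk_size=4,
                           unlearn_step=unlearn_step,
                           chunk_forgotten=chunk_forgotten)
print(all(score <= 0 for score in model.values()))  # True: every chunk forgotten
```

Stopping each chunk as soon as it meets the threshold is what keeps the procedure gentle: no chunk receives more ascent steps than it needs, which mirrors why the sequential variant degrades average LM performance less than one large batch.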

ANALYSIS OF KNOWLEDGE UNLEARNING
Providing Better Intuition of What Exactly Happens During Knowledge Unlearning.
To show exactly what happens to the LM during knowledge unlearning, we show how the performance on each of the LM benchmarks changes as we perform 10 runs of unlearning on the GPT-NEO (1.3B) model (each run with s = 1) in Figure 3. As shown in the figure, the LM performance on each benchmark varies tremendously depending on which sample is chosen to be forgotten. Furthermore, the ending time of each run is different, indicating that some samples are forgotten faster than others.
To provide a better intuition of exactly how knowledge unlearning guarantees privacy, we perform an extraction attack with a token sequence sample in Table 3 where we show the model-generated text from the extraction attack before and after applying knowledge unlearning. While the extraction attack is extremely successful at extracting the rest of the suffix before unlearning (100% of the token sequence), only a small portion (∼3% of the token sequence) of the suffix is extracted after applying unlearning.
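The attack itself can be sketched as a simple prefix-continuation probe; the `model_generate` interface and the two stand-in models are hypothetical:

```python
def extraction_attack(model_generate, target, prefix_len):
    """Prefix-continuation attack: give the model the first `prefix_len`
    tokens of `target` and greedily decode, then measure the fraction of the
    true suffix it reproduces. `model_generate` is a hypothetical
    greedy-decoding interface."""
    prefix, suffix = target[:prefix_len], target[prefix_len:]
    generated = model_generate(prefix, max_len=len(suffix))
    return sum(g == s for g, s in zip(generated, suffix)) / len(suffix)

target = list(range(100, 140))  # stand-in for a memorized token sequence

# Before unlearning: the model regurgitates its training data verbatim.
before_ul = lambda prefix, max_len: target[len(prefix):len(prefix) + max_len]
# After unlearning: generations no longer track the target suffix.
after_ul = lambda prefix, max_len: [0] * max_len

print(extraction_attack(before_ul, target, prefix_len=10))  # 1.0 (100% extracted)
print(extraction_attack(after_ul, target, prefix_len=10))   # 0.0
```

The two extremes correspond to the before/after behavior reported in Table 3: near-complete suffix recovery before unlearning and only a small residual match afterward.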
Why Are Some Instances Harder to Forget? To investigate why some instances are harder to forget, we perform 5 random samplings of s = 8 from 8 different domains of the Training Data Extraction Challenge and perform unlearning on the GPT-NEO 1.3B LM. We also show the results of each individual run in Appendix A.

CLOSING
In this paper, we propose knowledge unlearning as a method for mitigating privacy risks in LMs that is not only orders of magnitude more efficient than previous methods, but also provides a stronger empirical privacy guarantee with little to no degradation of general LM capabilities, measured by evaluating on 9 common LM benchmarks. As large LMs expand their use cases, potentially affecting the daily lives of people, the research community should make sure that the privacy of individuals is not violated, intentionally or unintentionally, by the knowledge stored in the implicit parameters of LMs.

A FULL RESULTS
We provide all of the results for the 5 random samplings of our main experimental setting in Table 5 and the full results for the domain analysis setting in Table 6.

B PERPLEXITY RESULTS

Table 7 shows the results of measuring perplexity on 500 samples from the validation sets of the Pile and Wikitext corpora for the LMs from the main experimental setting (Table 2). Results show that LMs that underwent knowledge unlearning exhibit higher perplexity, while the main experimental table (Table 2) does not show degradation of performance on the 9 LM benchmarks. We believe this discrepancy is due to an inherent attribute of unlearning: since we are performing gradient ascent, we are likely softening the probability of generating each token from the vocabulary, pushing it toward a more uniform distribution that inevitably results in higher perplexity. However, since the LM benchmarks do not show much degradation, the argmax of the most likely next token has not changed much. Further exploration of what exactly knowledge unlearning does to the representations of the LM is left to future work.

Figure 4: Varying the learning rate for unlearning the GPT-NEO 1.3B with s = 32. We report the average of 3 random samplings and display the standard deviations as shaded regions. Red dotted lines denote the memorization accuracy forgetting threshold of the 1.3B model reported in Table 1.

D VARYING THE LEARNING RATE
In Figure 4, we show the results of varying the learning rate for knowledge unlearning, where we fix the total number of epochs to 10 and perform 3 random runs with s = 32 on the GPT-NEO 1.3B. Overall, we observe that higher learning rates lead to faster forgetting but substantial LM performance degradation. While lower learning rates retain the LM performance, they fail to meet the Forgetting Threshold within 10 epochs. Thus, we set the learning rate to 5e-5 for our experiments as the best trade-off.

E TEXT EXAMPLE FROM EACH DOMAIN
We show an example token sequence from each of the 8 domains used for the analysis section in Table 9.

F MORE EXAMPLES OF PERFORMING EXTRACTION ATTACKS
In addition to the extraction attack example shown in the analysis section, we provide 3 additional examples to provide readers with more empirical examples of how knowledge unlearning ensures protection against extraction attacks in Table 10.

G ADDITIONAL RESULTS OF SEQUENTIAL KNOWLEDGE UNLEARNING
We show how the EL_10 of each individual chunk and the average LM performance change as we perform sequential unlearning in Figure 5. Results show that chunks once forgotten stay forgotten and that later chunks are forgotten much faster (in one or two epochs) than the initial chunk. We hypothesize that this may be due to the similarity of the token sequences among the 15,000 examples of the Training Data Extraction Challenge benchmark. This result also hints at the generalization of unlearning, which we do not explore further as it is beyond the scope of this work.

H LIMITATIONS
While we provide an empirical privacy guarantee through unlearning, our Forgetting Threshold is dependent on which data samples are chosen as D′. Furthermore, varying the prefix length is a naïve way of varying the strength of extraction attacks; in a real-world scenario, extraction attacks may be more complicated and may require other prevention methods. Lastly, we could not directly compare our approach with a Differential Privacy (DP) (Anil et al., 2021) approach because there are no open-sourced LMs pretrained with a DP algorithm, and we did not replicate the pretraining phase ourselves because of the heavy computational resources needed to pretrain an LM with DP. We leave this comparison for future work.

Example token sequences from Table 9:

## Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: ## The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the

The pharmaceutical formulations of the present invention, which may conveniently be presented in unit dosage form, may be prepared according to conventional techniques well known in the pharmaceutical industry. Such techniques include the step of bringing into association the active ingredients with the pharmaceutical carrier(s) or excipient(s). In general the formulations are prepared by uniformly and intimately bringing into association the active ingredients with liquid carriers or finely divided solid carriers or both, and then, if necessary, shaping the product. The compositions of the present invention may be formulated into any of many possible dosage forms such as, but not limited to, tablets, capsules, gel capsules, liquid syrups, soft gels, suppositories, and enemas.

PUBMED CENTRAL I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon receive a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date.