Just Fine-tune Twice: Selective Differential Privacy for Large Language Models

Protecting large language models from privacy leakage is becoming increasingly crucial with their wide adoption in real-world products. Yet applying *differential privacy* (DP), a canonical notion with provable privacy guarantees for machine learning models, to those models remains challenging due to the trade-off between model utility and privacy loss. Utilizing the fact that sensitive information in language data tends to be sparse, Shi et al. (2021) formalized a DP notion extension called *Selective Differential Privacy* (SDP) to protect only the sensitive tokens defined by a policy function. However, their algorithm only works for RNN-based models. In this paper, we develop a novel framework, *Just Fine-tune Twice* (JFT), that achieves SDP for state-of-the-art large transformer-based models. Our method is easy to implement: it first fine-tunes the model with *redacted* in-domain data, and then fine-tunes it again with the *original* in-domain data using a private training mechanism. Furthermore, we study the scenario of imperfect implementation of policy functions that misses sensitive tokens and develop systematic methods to handle it. Experiments show that our method achieves strong utility compared to previous baselines. We also analyze the SDP privacy guarantee empirically with the canary insertion attack.


Introduction
With the rapid advancement in natural language processing (NLP), it has become increasingly important to protect NLP models from leaking private information. Previous work has attempted to tackle this challenge by applying differential privacy (DP, Dwork et al., 2014) to these models (McMahan et al., 2018; Li et al., 2021). However, existing DP learning algorithms suffer from limited user control and low utility, as they protect the entirety of each training example (e.g., one complete sentence) regardless of users' privacy preferences, and therefore tend to be overly pessimistic when only partial information in a training example is sensitive. This problem is particularly pertinent in NLP, as NLP training data are often mixed with sparse, domain-dependent private information, and not all tokens need to be protected. For example, for the sentence "My SSN is 123-45-6789", only the last few tokens of the actual SSN need to be protected. Our code and data are available at https://github.com/wyshi/sdp_transformers.
In fact, the definition of DP does not prevent us at all from protecting only the sensitive part of data. Specifically, DP ensures that the output of a data analysis algorithm stays roughly the same for neighboring datasets, while providing the flexibility to adapt the definition of the neighboring relation to specific application contexts. Shi et al. (2021) recently proposed an instantiation of DP, called Selective-DP (SDP), which defines neighboring datasets to differ only in the sensitive part of a training example; as a result, SDP selectively hides the difference in the sensitive part only. SDP is particularly suitable for NLP and many other unstructured, high-dimensional data, wherein sensitive information only accounts for a small part. But their privacy mechanism to achieve SDP suffers from three problems: 1) it requires substantial knowledge about the model to separate the private and public variables, and it is unclear how their algorithm, tailored to recurrent neural networks, could be extended to modern Transformer-based NLP models; 2) it has only been evaluated with explicit private entities but not with contextual sensitive information; 3) it does not provide protection for undetected sensitive tokens. These constraints limit the applicability of SDP in real-world scenarios.
Large language models (LLMs) (Vaswani et al., 2017) have achieved tremendous success in NLP. They are pretrained on a massive amount of public textual data, and thus excel at capturing general language structures. A common practice in NLP is to fine-tune these LLMs on downstream tasks. Such a fine-tuning process also works well in the private training context. Previously, Yu et al. (2021a) showed that privately fine-tuning an additional small set of parameters on top of off-the-shelf LLMs with private data achieves comparable performance to non-private baselines. Inspired by their findings, in this paper, we propose a two-phase fine-tuning privacy mechanism, Just fine-tune twice (JFT), to achieve SDP for LLMs. Instead of directly using off-the-shelf models and fine-tuning once, we have two fine-tuning steps: 1) we first redact the in-domain data of the downstream tasks, and fine-tune the model with these in-domain redacted data (redacted-fine-tune), and 2) then privately fine-tune the model on the original private data (private-fine-tune). This additional redacted-fine-tune step allows the model to directly learn information from the in-domain data and thus leads to a better model initialization for the second private-fine-tune step. Moreover, in the redacted-fine-tune step, we show that even with limited public data (where manual screening is possible), JFT achieves better utility than fine-tune-once baselines. Additionally, we can apply lightly-noised optimizers and privacy amplification to protect undetected sensitive tokens.
Our contributions are as follows. First, we propose an effective and generalizable privacy mechanism to achieve SDP for large language models across various NLP tasks. Second, we design secret detectors of different privacy levels (explicit and contextual sensitive data) and study their implications for the models. Third, our method can utilize even a small amount of public data to achieve better utility, and mitigates the missed-sensitive-token problem with a lightly-noised optimizer and privacy amplification. Finally, we show that, contrary to the common belief that privacy is at odds with utility, private learning does not have to conflict with utility, because the private information in the data can be irrelevant to the learning task.

Preliminary
A differentially private algorithm hides the difference between two neighboring datasets.
Definition 1 (Differential Privacy). Given a domain 𝒟, a randomized algorithm M : 𝒟 → R is (ϵ_w, δ_w)-differentially private if for all neighboring datasets D, D′ ⊆ 𝒟 and all T ⊆ R,

Pr[M(D) ∈ T] ≤ e^{ϵ_w} Pr[M(D′) ∈ T] + δ_w.

The neighboring relation captures what is protected. Traditional DP literature has considered neighboring datasets as those differing in one training example; thus, the corresponding DP protects each training example as a whole. We denote by ϵ_w and δ_w the privacy parameters achieved under this traditional neighboring relation. Given the sparsity of sensitive information in language data, this instantiation of the neighboring relation is overly pessimistic. Shi et al. (2021) proposed Selective-DP (SDP), which instantiates the neighboring datasets to be those that differ in the sensitive attributes of a training example; as a result, SDP selectively hides the difference in the sensitive part only. In the context of NLP, a training example could be a sentence or a paragraph depending on the task, and the attributes are individual tokens.
In this paper, we will focus on designing learning algorithms to achieve SDP. Formally, SDP relies on a policy function F that specifies the sensitive information in a training example to be protected, in an application-dependent fashion.
Detecting private information manually in a large corpus based on the policy function is often costly. In that case, one may resort to building automatic secret detectors to identify the sensitive attributes. A simple example of a secret detector is a regular expression that captures phone numbers. However, secret detectors could miss some private attributes and produce false negatives, which intuitively weakens the privacy guarantees. Existing work (Doudalis et al., 2017; Shi et al., 2021; Zhao et al., 2022) that selectively protects data either assumes a perfect detector or uses an overly conservative detector with a low false-negative rate at the cost of a high false-positive rate. In this paper, we provide alternative ways to address this issue with a better privacy-utility tradeoff (Section 3).
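As a toy illustration, a minimal regex-based detector for explicit secrets might look like the sketch below; the patterns (US-style SSNs and phone numbers) are illustrative assumptions only, and the detectors we actually use (described in the Secret Detectors section) rely on spaCy's NER and parsing instead.

```python
import re

# A toy rule-based secret detector; the patterns below are illustrative
# assumptions, not the paper's actual detector configuration.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str, mask: str = "<MASK>") -> str:
    """Replace every span matched by a secret pattern with a mask token."""
    for pattern in PATTERNS.values():
        text = pattern.sub(mask, text)
    return text

print(redact("My SSN is 123-45-6789 and my phone is 650-555-0199."))
# -> "My SSN is <MASK> and my phone is <MASK>."
```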
With F, SDP defines F-neighbors: informally, two datasets are F-neighbors if they differ only in attributes that F marks as sensitive.
Figure 1: The two-phase JFT mechanism. As pre-processing, we apply the secret detector to redact the private data D and obtain the redacted data D′. Next, depending on the detector's performance, we use different ways to fine-tune the language model on the redacted D′ and obtain a redacted model. Then we fine-tune the model again on the private data D with a private optimizer (e.g., DPSGD) to achieve an SDP-protected model.

Given the definition, the dataset with "My ID is 123" and the dataset with "My ID is 456" are F-neighbors (because except for the actual ID number, the other tokens are the same), while the dataset with "Hi there" and the dataset with "Hello there" are not F-neighbors, because the only token they differ in is not sensitive. An SDP algorithm guarantees that F-neighbors cannot be distinguished by attackers who observe the output. In this paper, we differentiate between SDP and DP for two reasons. First, we want to highlight that the privacy parameters associated with SDP (ϵ_s and δ_s) and DP (ϵ_w and δ_w) are incomparable. For instance, one cannot claim which of (1, 0.001)-SDP and (2, 0.001)-DP provides stronger privacy guarantees, because they are under different privacy notions. To meaningfully present the values of these privacy parameters, we need to specify under which definition they are calculated; to meaningfully compare them, we need to make sure that they are calculated under the same privacy definition. Second, we would like to retain the same terminology as our main reference, Shi et al. (2021), which also uses the terms SDP and DP to refer to the privacy guarantees under the two different neighboring relations. In the rest of the paper, we use different notations to distinguish the privacy parameters associated with SDP (ϵ_s and δ_s) and DP (ϵ_w and δ_w).
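As a toy illustration of the F-neighbor relation above, the following sketch assumes a policy function that marks any token containing a digit as sensitive; the policy functions used in our experiments are richer (see the Secret Detectors section).

```python
from typing import List

def policy_fn(tokens: List[str]) -> List[bool]:
    """Toy policy function F: a token is sensitive iff it contains a digit."""
    return [any(ch.isdigit() for ch in tok) for tok in tokens]

def are_f_neighbors(x: List[str], y: List[str]) -> bool:
    """Equal-length records are F-neighbors if they differ only at positions
    that the policy function marks as sensitive in either record."""
    if len(x) != len(y):
        return False
    sensitive = [a or b for a, b in zip(policy_fn(x), policy_fn(y))]
    return all(a == b or s for a, b, s in zip(x, y, sensitive))

print(are_f_neighbors("My ID is 123".split(), "My ID is 456".split()))  # True
print(are_f_neighbors("Hi there".split(), "Hello there".split()))       # False
```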

JFT: Just Fine-tune Twice
Now we describe JFT, a two-phase privacy mechanism to achieve SDP for large language models (Figure 1). In the first redacted-fine-tune phase, we redact the private data D with a secret detector to obtain the redacted version D′, and learn a redacted model from D′ in a privacy-preserving way. In the second private-fine-tune phase, we further fine-tune the redacted model (from phase one) on the private data D with a private optimizer to achieve SDP guarantees.

Phase 1: Redacted-fine-tune
JFT is built upon the observation that the public portion of the in-domain data does not require protection and can be utilized in various ways to help the model learn in-domain information. In this phase, we apply the secret detector to redact the private data D and obtain redacted in-domain data D′. Depending on the detector's performance, we propose the following three methods to use D′ to fine-tune off-the-shelf language models. Direct Usage. If the secret detector masks all the sensitive information in D (which is possible when D is small enough to support thorough inspection or when a detector is very conservative and removes most of the essential information, see examples in Table 1), we can use the redacted D′ directly to fine-tune the model with a public, unnoised optimizer like SGD. Selective Manual Screening. If the secret detector is imperfect, we can select an affordable subset from D′ and manually sanitize all the missed secrets. Then we fine-tune the model on this small sanitized subset with a public optimizer. Experiments show that even with a small amount of sanitized in-domain data, the resulting model still outperforms traditional DP learning algorithms that pessimistically protect every single token.
Lightly-Noised Fine-tuning. When the detector is imperfect, besides manually screening out the missed secrets, we can also employ a private optimizer to train on the D′ that contains missed sensitive tokens. Because missed tokens only account for a small portion of D′, intuitively, much less noise is needed to ensure the privacy of the missed tokens than would be required to ensure the privacy of the entire D′. We propose to leverage privacy amplification by subsampling (PAS) (Balle et al., 2018) to calculate the privacy parameters associated with the private optimizer. The intuition of PAS is that if we perform a DP mechanism on random subsamples of the data, and a data point is not included in the subsamples, nothing about it can be leaked; in this way, we can amplify the privacy guarantee. In our scenario, we need to protect the missed sensitive tokens. If we know the secret detector's missing rate m (i.e., m = number of missed sensitive tokens / total tokens, the probability of sampling a missed secret), we can calculate the privacy budget ϵ_s by privacy amplification using the subsampling ratio m.
Note that the application of PAS requires the number of missed tokens that appear in any batch to be the same, which does not necessarily hold. Hence, the privacy parameters calculated from privacy amplification are an empirical estimate of the actual privacy loss. In practice, the secret detector's missing rate is unknown and we need to estimate it (we denote the estimate by m̂). We then change the original sampling rate p_0 in the moments accountant-based privacy parameter calculation (Abadi et al., 2016) to p = p_0 · m̂, and calculate the noise injected into each private optimizer iteration according to a predefined privacy budget ϵ under p.
In our experiments, we sample 0.01% of the training data 10 times and estimate the 95% confidence interval of the missing rate, [m̂_low, m̂_high]. For both m̂_low and m̂_high, we can calculate an associated ϵ_low and ϵ_high according to Theorem 9 in Balle et al. (2018), and report both ϵ values.
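Below is a minimal sketch of this estimate-then-amplify step, under several assumptions: a synthetic token-level corpus with missed-secret labels standing in for manual inspection, a hypothetical (ϵ = 8, δ = 1e-5) guarantee for the lightly-noised release, and the generic amplification-by-subsampling bound for a single release. Our actual calculation instead folds the adjusted sampling rate p = p_0 · m̂ into the moments accountant across all training steps and uses Theorem 9 of Balle et al. (2018).

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
# Synthetic stand-in: (token, was_this_secret_missed_by_the_detector) pairs.
# In practice the labels come from manually inspecting small random samples.
corpus = [(f"tok{i}", random.random() < 0.005) for i in range(1_000_000)]

def missing_rate_ci(data, sample_frac=1e-4, n_rounds=10, z=1.96):
    """Estimate a 95% confidence interval for the detector's missing rate m."""
    rates = []
    for _ in range(n_rounds):
        sample = random.sample(data, max(1, int(sample_frac * len(data))))
        rates.append(sum(missed for _, missed in sample) / len(sample))
    half = z * stdev(rates) / math.sqrt(n_rounds)
    return max(mean(rates) - half, 0.0), mean(rates) + half

def amplified_eps_delta(eps, delta, m):
    """Generic amplification by subsampling: an (eps, delta)-DP release applied
    to a subsample drawn with rate m satisfies (log(1 + m(e^eps - 1)), m*delta)-DP."""
    return math.log(1.0 + m * (math.exp(eps) - 1.0)), m * delta

m_low, m_high = missing_rate_ci(corpus)
for m in (m_low, m_high):
    print(m, amplified_eps_delta(eps=8.0, delta=1e-5, m=m))
```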

Phase 2: Private-fine-tune

In the second phase, we initialize the model with the redacted model from phase one, and fine-tune it on the original private data D with a private optimizer (e.g., DPSGD (Abadi et al., 2016) or any other more advanced private optimizer that achieves DP).
Unlike the privacy mechanism in Shi et al. (2021), our algorithm does not require knowledge about the models or the tasks, and therefore can be easily applied to different models such as GPT2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019), and to different tasks such as language generation and natural language understanding. See Section A.2 for more implementation details. One-phase vs. two-phase. Compared to conventional differentially private training, our algorithm introduces an additional stage that involves redaction and regular unnoised training on redacted data. In fact, the computational cost of this additional stage is much lower than the cost originally incurred by DP learning, because the first phase does not need costly per-sample gradient clipping and noising operations. Moreover, redaction is a common first step that practitioners already perform and are familiar with, and there exist abundant off-the-shelf tools that allow redaction at scale.
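To make the two phases concrete, here is a minimal, self-contained sketch on a toy model and synthetic data; all names and hyper-parameters are illustrative assumptions. A real run would fine-tune a pretrained transformer (e.g., GPT2 or RoBERTa) in phase one and use an efficient DP optimizer such as the ghost-clipping DPSGD of Li et al. (2021) in phase two; the privacy accounting is omitted here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
model = nn.Linear(16, 2)                       # stand-in for the language model
loss_fn = nn.CrossEntropyLoss()

redacted = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))  # D'
private = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))   # D

# Phase 1: redacted-fine-tune on D' with an ordinary (unnoised) optimizer.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for x, y in DataLoader(redacted, batch_size=32, shuffle=True):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Phase 2: private-fine-tune on D with DPSGD-style per-example clipping + noise.
clip_norm, noise_multiplier, lr = 1.0, 1.0, 0.1
for x, y in DataLoader(private, batch_size=32, shuffle=True):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for xi, yi in zip(x, y):                   # per-example gradients
        model.zero_grad()
        loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (norm + 1e-6))
        for s, g in zip(summed, grads):
            s.add_(g * scale)                  # clip, then accumulate
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(x)) * (s + noise))  # noisy averaged update
```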

Privacy Analysis
We provide Theorem 1 for privacy analysis. It ensures that, if the user has a secret detector with 100% recall, JFT-trained models achieve (0, 0)-SDP after phase one and (ϵ, δ)-SDP after phase two. A secret detector with 100% recall is possible if the user can afford manual inspection or has enough domain knowledge. When a detector with 100% recall is not possible, we use lightly-noised fine-tuning to empirically protect the missed secrets, as mentioned in Section 3.1.
Theorem 1. Given that 1) in the first phase, the data used for fine-tuning do not contain sensitive tokens and a public optimizer is used, and 2) in the second phase, the private optimizer achieves (ϵ, δ)-DP, JFT achieves (ϵ, δ)-SDP.
The proofs are deferred to Section A.1. The theorem shows that under direct usage or selective screening of D′, JFT achieves SDP with the same privacy parameter values as those of the private optimizer used in the second phase.

Secret Detectors of Different Levels
Typical private information includes personally identifiable information (PII) such as names and birthdays. But as pointed out in Brown et al. (2022), one key challenge in NLP is that private information is often contextual. For example, they presented a dialogue between Alice and Bob about Alice's divorce (Table 1): none of the tokens in "What are you going to do about the custody of the kids?" are PII by themselves, but combined together, the semantics reveal private information.
To build generalizable secret detectors, we utilize the off-the-shelf NER, dependency parser, and POS tagger in spaCy (Honnibal and Montani, 2017) to label each token, and redact different sets of tokens to achieve the different privacy levels below (entity level and contextual level). To qualitatively show their protection levels, we apply them to redact two sentences from the divorce dialogue in Brown et al. (2022). The results are in Table 1. Low entity redacts four types of named entities (person, organization, date, and location), which are considered PII as defined by the US Department of Labor. We use the NER in spaCy to detect them. If we apply this detector, "Did you hear Alice is getting divorced?" becomes "Did you hear <PERSON> is getting divorced?" An attacker who attacks a model trained on the latter sentence can at best learn about the divorce but cannot know who. High entity redacts all 18 entity types in spaCy, including the four above and more, such as the time entity.
The two secret detectors above rely on named entities, so they are more explicit than the two detectors below, which consider the overall sentence structure and thus are more contextual. Low contextual protects all 18 entity types plus proper nouns, pronouns, and sentence subjects and objects. This detector drastically increases the privacy level: we cannot get any useful information from the left example in Table 1. High contextual further redacts all the verbs, in addition to the tokens redacted by the low contextual detector. It increases the privacy level even further, and we cannot learn anything from either example. This detector is meant to stress-test JFT and examine model utility when the majority of the tokens are redacted.
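A minimal sketch of such spaCy-based detectors is shown below; the exact label sets are illustrative rather than our full configuration, and it assumes the en_core_web_sm pipeline has been downloaded (python -m spacy download en_core_web_sm).

```python
import spacy

nlp = spacy.load("en_core_web_sm")

LOW_ENTITY = {"PERSON", "ORG", "DATE", "GPE", "LOC"}    # PII-style entity labels
CONTEXTUAL_POS = {"PROPN", "PRON"}                      # proper nouns, pronouns
CONTEXTUAL_DEP = {"nsubj", "nsubjpass", "dobj", "obj"}  # subjects and objects

def is_sensitive(tok, level):
    if level == "low_entity":
        return tok.ent_type_ in LOW_ENTITY
    if level == "high_entity":
        return bool(tok.ent_type_)                      # any named entity
    contextual = (bool(tok.ent_type_) or tok.pos_ in CONTEXTUAL_POS
                  or tok.dep_ in CONTEXTUAL_DEP)
    if level == "high_contextual":
        return contextual or tok.pos_ == "VERB"
    return contextual                                   # low_contextual

def redact(text, level="low_entity", mask="<MASK>"):
    return " ".join(mask if is_sensitive(t, level) else t.text for t in nlp(text))

print(redact("Did you hear Alice is getting divorced?", level="low_entity"))
print(redact("Did you hear Alice is getting divorced?", level="high_contextual"))
```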
Human language is diverse, and private information can take various forms. So instead of designing sophisticated algorithms, we intentionally rely on common NLP tools to build easy-to-use, domain-agnostic secret detectors with high recall to protect privacy as much as possible. As shown in Table 1, these detectors tend to over-sanitize the sentences. But as we will show later, even with over-redaction, JFT still achieves good performance. Private textual information could be treated more sophisticatedly, but how to better detect private information is not the focus of this paper. Our goal is to show that simply redacting tokens achieves promising performance, and that JFT is compatible with better private-information detection algorithms to further improve the results.

Experimental Setup

We use the memory-efficient implementation of DPSGD in Li et al. (2021). Based on previous studies (Li et al., 2021; Yu et al., 2021a), larger DP models usually achieve better results, and thus we expect that larger SDP models will achieve even better performance. Baselines. 1) No-DP: the model is fine-tuned using the regular Adam optimizer (Kingma and Ba, 2014) without extra noise and hence does not have any privacy guarantees (i.e., ϵ_w = ϵ_s = ∞).
2) DPSGD: the model is fine-tuned with traditional DPSGD (Abadi et al., 2016), where the gradient is clipped and noised in every gradient descent iteration (we employ the DP-Adam variant, where the optimizer is Adam but the gradient privatization is the same as in DPSGD; we keep the term DPSGD as it is more accessible to the community). While DPSGD was originally proposed to achieve DP guarantees that protect a training example as a whole, it also achieves SDP guarantees with the same privacy parameters (i.e., ϵ_s = ϵ_w and δ_s = δ_w). 3) CRT: the model is trained with the recently proposed Confidentially Redacted Training (Zhao et al., 2022), which achieves (ϵ_c, δ_c)-Confidentiality. Confidentiality is a new definition related to but different from SDP: it ensures the indistinguishability between a secret and a <MASK> token, so its privacy parameters are not directly comparable to SDP's. Thus, we add the same amount of noise to CRT and SDP, empirically compare SDP and CRT with the canary insertion attack in Figures 2 and 3, and report the utility in Table 6 in the Appendix. 4) Redacted: we also present the utility of the redacted models, since they are also privacy-preserving. Note that when the secret detector is perfect, the redacted models have a perfect SDP privacy guarantee (i.e., ϵ_s = 0). However, this does not allow the model to learn from sensitive tokens at all. JFT, by contrast, empowers the model to learn from sensitive data with a flexible, tunable tradeoff between privacy and utility. Moreover, it provides ways to offer quantifiable privacy in the presence of imperfect secret detectors. Our models. 1) JFT: our JFT model directly using the redacted data in phase one. 2) JFT +manual screening: JFT using a subset of the redacted data where missed secrets are manually filtered out in phase one. 3) JFT +light noise: JFT where we add light noise according to the estimated missing rate in phase one.

Results
We show three major findings: 1) the impact of the secret detectors on the resulting JFT models is task-dependent, but even with conservative contextual detectors (30%+ of tokens redacted), JFT still achieves better results than naive DPSGD (Section 6.1); 2) despite its small scale, using the manually screened in-domain data still improves the JFT model utility (Section 6.2); 3) a lightly-noised optimizer with privacy amplification protects missed sensitive tokens from attacks (Section 6.3).
There is always a privacy-utility trade-off, so larger epsilons lead to better utility but worse privacy, and when comparing models, we need to look at model utility under a similar privacy budget. An epsilon of 1 to 3 is commonly used in the privacy literature (Yu et al., 2021a; Li et al., 2021; Zhao et al., 2022). In our experiments, we pre-calculated the privacy parameters so that an ϵ of around 3 is spent when training ends.

Secret Detectors of Different Levels
Table 2 shows the results on GLUE (left) and generation (right). P_ct is the percentage of sensitive tokens redacted by the detector. ϵ_s is the SDP privacy budget; the lower, the better. We compare model utility under a similar privacy budget ϵ_s. Natural Language Understanding. Table 2 (left) shows that under a similar ϵ_s, all the JFT models achieve better performance than the DPSGD baseline, even when over 40% of the tokens are redacted.
Besides, on all the tasks, all the redacted models achieve reasonable utility, even when a large portion of the tokens are redacted. For example, the redacted model (high contextual) is better than DPSGD on MNLI (83.23 vs. 82.10, 44.27% redacted) and SST-2 (91.17 vs. 86.12, 38.13% redacted). This confirms the motivation of SDP: when building private NLP models, we should not naively protect all tokens regardless of their properties. Instead, we should consider whether the sensitive tokens will impact the task; if not, we can simply redact them to build private models.
Also, whether JFT can improve the redacted model depends on the task. For SST-2 on sentiment analysis, the private-fine-tune step does not improve the redacted model. This is because the redacted models already achieve high accuracy (even the worst accuracy is 91.86, only a 2.94-point drop from the SOTA public model with an accuracy of 94.8), and fine-tuning them on the private data with noisy gradients is not enough to close this small gap. But for tasks with a bigger gap between the redacted and No-DP models (e.g., MNLI, QQP, and QNLI), JFT can further improve the redacted model. Besides, the gap between the redacted model and the corresponding JFT model becomes bigger as the privacy level increases: for QNLI (low contextual), the gap is 87.99 - 85.30 = 2.69, while for QNLI (high contextual), the gap is 87.06 - 82.81 = 4.25. This shows the model does learn useful information from the sensitive tokens during the private-fine-tune step. Language Generation. Table 2 (right) shows the language generation results. We note that language generation is different from the NLU tasks: for NLU, the models used for initialization without any fine-tuning ("No-fine-tune" in Table 2) start with a bad accuracy (≤ 50%, just a random guess), and adding special tokens would still give a random guess, so additional special tokens will not impact the final results greatly. But for the generation task, the "No-fine-tune" GPT2 is already a strong model for initialization, with ppl = 30.08 on Wikitext-2 and 13.60 on ABCD, and we found that adding special tokens disturbs this initialization and greatly impacts the final result. Because all the JFT models have added special tokens like "<MASK>" and "SYS:", for a fair comparison, we report two DPSGD baselines, one without special tokens ("DPSGD") and one with special tokens ("DPSGD (+spe)"). See Section A.2 for more discussion of the impact of special tokens.
Compared to "DPSGD (+spe)", all JFT models achieve better model utility on both datasets.For "DPSGD" without special tokens, fine-tuning on the downstream tasks improves the model from 30.08 to 27.05 (the improvement ∆=3.03); for JFT (low contextual), it is initialized with the redacted model with ppl=37.90, and privately fine-tuning it improves the perplexity to 25.62 (∆=12.28).This shows that although the initialization seems worse (30.08 vs 37.90), since the redacted model is finetuned directly on in-domain redacted data, it does learn useful information from the first redactedfine-tune step, and the second private-fine-tune step can further improve upon the redacted model.For JFT (high contextual) for the stress test, although 45% tokens are masked and the language structure is largely impacted, JFT still improves the redacted model utility from 54.29 to 27.19 (∆=27.10)and performs on par with DPSGD (27.05 vs 27.19).

Selective Manual Screening
As mentioned earlier, secret detectors can miss certain secrets; we can manually filter out the missed secrets at a small scale and fine-tune on the small manually sanitized set. Denote the original data as D_0. We sample 0.1% of the data from D_0, apply the high entity secret detector (because it is less conservative and could miss secrets), and manually sanitize the missed secrets to obtain D′. We use D′ = 0.1% D_0 during redacted-fine-tune to train the redacted model, and the entire D_0 as D during private-fine-tune to obtain the JFT models. Table 4 shows the results. D′ = 0.1% D_0 contains 100∼300 examples for GLUE, and 10 articles and 10 dialogues for Wikitext-2 and ABCD, respectively. On all the tasks, JFT achieves better utility than DPSGD. This shows that even fine-tuning on a small manually-screened in-domain subset can still help the model learn in-domain information and lead to better utility. We also simulate a completely low-resource setting where we simply have limited training data (i.e., D′ = 0.1% D_0, D = 0.1% D_0). See Section A.4 for the results.

Lightly Noised Optimizer with Privacy Amplification
Besides manually inspecting the missed secrets, we can also use a noised optimizer in the first phase to protect the missed secrets from attacks, and then adopt privacy amplification to estimate the corresponding privacy parameters. We again perform the experiments with the high entity detector, and Table 3 shows the results. We convert the missing rate m (%) to the recall of the secret detector, i.e., recall = 1 - m/P_ct, where P_ct is the percentage of sensitive tokens among all tokens. "JFT +light noise" shows the model performance with a noised optimizer and privacy amplification employed in the first step. Besides, we also add more noise than needed to obtain a conservative model ("JFT +light conservative noise"): for instance, the MNLI data contains 8.63% sensitive tokens; although the secret detector's missing rate m ranges over (0.3%, 1.2%) with 95% probability, we assume m to be 8.63% (i.e., the detector misses all sensitive tokens, so recall = 0) to calculate and add more noise than is actually needed.

Attack Results
We perform the canary insertion attack (Carlini et al., 2019) to empirically show how much the models unintentionally memorize the training data. The attack inserts a canary of a certain format into the training data and calculates its exposure, which is derived from the rank of the inserted canary amongst all possible values of the same format. The lower the exposure, the safer the model. In our experiments, we insert the canary "My ID is 341752" into the training data 10 times to make the performance differences between models more salient. By definition, for a six-digit canary, an exposure close to log_2(10^6) ≈ 19.9 means the canary can be extracted by the attackers. The results are in Figure 2. Each point in the figure is a model checkpoint. The x-axis shows the perplexity (utility), and the y-axis the exposure (privacy), so Figure 2 shows different models' privacy-utility tradeoffs.
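For reference, here is a minimal sketch of the exposure computation (not our exact implementation); score stands in for the trained model's log-perplexity of a candidate sequence, and the toy scorer is a hypothetical model that has fully memorized the canary.

```python
import math

def exposure(canary_digits, score, num_digits=6):
    """Exposure = log2(#candidates) - log2(rank of the canary by model score)."""
    total = 10 ** num_digits
    canary_score = score(f"My ID is {canary_digits}")
    rank = 1 + sum(
        score(f"My ID is {i:0{num_digits}d}") < canary_score
        for i in range(total)
        if f"{i:0{num_digits}d}" != canary_digits
    )
    return math.log2(total) - math.log2(rank)

# Toy scorer that has fully "memorized" the canary: exposure reaches the
# maximum log2(10^6) ≈ 19.9, i.e., the canary is extractable.
toy_score = lambda s: 0.0 if s == "My ID is 341752" else 1.0
print(exposure("341752", toy_score))
```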
One major reason why a model remembers a canary is that it has seen the canary many times. For "No-DP", the exposure is initially low because the model has not yet seen the canary many times. But because the model is unprotected, its exposure is unbounded and increases dramatically after it accesses the data for more epochs. This suggests that models without protection do memorize the data unintentionally.
For protected models (DPSGD, redacted, and JFT), if the canary is captured by the detector ("not missed" in the figure), then the exposure does not increase much even if the data are accessed many times. Under similar exposure, JFT achieves better utility than DPSGD and the redacted models.
But if the secret detector misses the canary (we purposely code it to mark the canary as public), the exposure increases for both "Redacted (missed)" and "JFT (missed)". However, if we add light noise in the first phase ("Redacted+light noise" in red), even if the canary is missed all 10 times, its exposure stays low. If we continue to privately fine-tune in the second phase ("JFT+light noise" in pink), we further improve the utility while still achieving a low exposure value. Both DPSGD and CRT also achieve a low exposure value when the canary is missed by the secret detector, but with worse utility than "JFT+light noise". This shows that "JFT+light noise" can protect missed secrets while achieving better utility. We also performed the canary insertion attack with one canary inserted once, and with 10 different canaries each inserted once, shown in Section A.5.
We also tested the membership inference attack (MIA), but it was not successful (inference accuracy is around 60% even for public models). Previous studies also observed unsuccessful MIAs (Shi et al., 2021; Zhao et al., 2022), and our future work includes developing better MIAs for NLP.

Related Work
Recent work has studied private language models with various model architectures, such as RNNs (McMahan et al., 2018; Ramaswamy et al., 2020) and large language models (Anil et al., 2021; Li et al., 2021; Yu et al., 2021a). Li et al. (2021) proposed ghost clipping to reduce the computational cost of the per-sample gradients in DPSGD, and achieved strong private LLMs. Yu et al. (2021a) added a small set of private parameters to off-the-shelf LLMs, privately tuned them on private data, and obtained performant private models. Most previous work achieves canonical DP. Shi et al. (2021) proposed Selective-DP for applications with sparse sensitive information like NLP, along with a privacy mechanism for RNN models. Our work proposes an effective mechanism for LLMs to achieve SDP and studies the impact of secret detectors at different levels.
Our work is also closely related to utilizing public data for private learning (Papernot et al., 2018; Tramer and Boneh, 2020; Ghazi et al., 2021; Yu et al., 2021a). One line of work assumes access to large unlabeled public data to train DP models. For example, PATE (Papernot et al., 2016) used unlabeled public data and knowledge distillation to build DP models. Hoory et al. (2021) used public medical data to pre-train a domain-specific private vocabulary and models. Another line leverages small public data to guide the private updates in lower-dimensional subspaces (Zhou et al., 2020; Yu et al., 2021b). Our work is distinct from previous studies: instead of querying public data from outside, we utilize the public portion of the in-domain data and achieve SDP with better model utility.

Conclusions
In this paper, we propose JFT, which can achieve Selective-DP for large language models.We also design generalizable secret detectors to provide protection at different levels and study their impacts on the resulting SDP models, and address the problem of missed sensitive tokens via selective manual screening and private training with reduced noise, which is justified by privacy amplification.The results show that the proposed JFT produces SDP models with strong performance while remaining robust to the canary insertion attack.

Limitations
Parameter search in DP learning is challenging, as the training process takes a long time and is quite sensitive to different parameters (Li et al., 2021). So the findings in this paper are based on the parameter tuning performed by the authors (Table 5), and more parameter tuning could potentially lead to better results than those reported in the paper.
In our experiments, we simply fine-tuned the models on the in-domain redacted data without adjusting for the redaction. We could potentially utilize more sophisticated methods to train better redacted models to further improve the JFT utility. Besides, for the selective manual screening experiments, we did not adjust the redacted-fine-tune step for the low-resource setting where D′ = 0.1% D_0. Future work includes how to train a better redacted model given limited data.
When the gap between the SOTA public model and the redacted model is small, the private-fine-tuning step cannot further improve the results because of the noisy gradients (e.g., on SST-2). We plan to develop better algorithms to utilize the redacted data, and to apply denoising methods (Welch et al., 1995) to close the gap further.
One thing to note is that if the secret detector misses a secret and the secret appears multiple times in the data, then it is likely that the secret detector will miss it multiple times.Therefore, deduplicating the data first is important.We plan to study data deduplication in NLP in the future.

Ethical Considerations
To prevent real-world harm, all the datasets and models used in this paper are already public with either public or synthesized information.So no real personal information will be leaked.
This work tackles the challenge of privacy protection and can be utilized in various domains and applications to build models that preserve privacy. We will release the code to facilitate privacy-preserving model building. The canary insertion attack is well-known (Carlini et al., 2019, 2020) and is adjusted specifically for our setting, so it cannot be directly utilized to attack real-world models successfully.

A Appendix
A.1 Proofs

Theorem 1 (restated). Given that 1) in the first phase, the data used for fine-tuning do not contain sensitive tokens and a public optimizer is used, and 2) in the second phase, the private optimizer achieves (ϵ, δ)-DP, JFT achieves (ϵ, δ)-SDP.
Proof. In the first phase, the fine-tuning data contain no sensitive tokens and a public optimizer is used, so the first phase incurs no privacy loss on the sensitive tokens and achieves (0, 0)-SDP. In the second phase, the private optimizer guarantees (ϵ, δ)-DP, which in particular hides the difference between any pair of F-neighbors (since F-neighbors differ only within one training example, they are also neighbors in the traditional sense); hence the second phase achieves (ϵ, δ)-SDP. By basic composition of the two phases, JFT achieves (0 + ϵ, 0 + δ)-SDP, i.e., (ϵ, δ)-SDP.
Note that the first phase achieves (0, 0)-SDP but cannot achieve (0, 0)-DP. DP aims to protect the entire token sequence, whether it is considered sensitive or non-sensitive by the policy function. Because the first phase does not noise the non-sensitive tokens at all, it cannot ensure DP.

A.2 Implementation Details
Notes on special tokens. When fine-tuning LLMs, it is common practice to add new special tokens to fit the needs of the downstream tasks. For example, in dialogue tasks, we often have prompts like "SYS:" and "USR:" to indicate the speaker in the data. This step does not affect the public models much (20.48 without special tokens vs. 20.44 with special tokens), but as it does change the model structure (additional embeddings) and the model initialization, we notice that in our experiments, DPSGD is sensitive to the addition of special tokens (because the model initialization is changed): after reasonable amounts of parameter tuning (see Table 5), DPSGD initialized with the original GPT2 achieves 27.05 in PPL, while DPSGD with added special tokens achieves 30.32 in PPL on Wikitext-2. The gap could potentially be reduced with more parameter tuning, but we note that in practice, it may not be easy to find the best parameters. In our experiments, for WikiText-2, we add <mask> as the special token; for ABCD, as it is a dialogue task, we add <mask>, "ACT:", "SYS:", and "USR:". Since all the JFT models have added special tokens, we report two DPSGD results, one without special tokens and one with special tokens, for a fair comparison in terms of model structure.
Also, the secret detectors replace the sensitive information with artificial special tokens such as "<SSN>" and "<NAME>". But these tokens do not appear in the validation or test set, so inserting them would skew the training data distribution and lead to inferior results, especially when the portion of sensitive tokens is high. In our experiments, we mask the detected sensitive information with the same "<mask>" token and ignore this special token in the loss calculation. In this way, for models with an existing "<mask>" token (like RoBERTa), we can utilize the existing embedding; for models without "<mask>", the model only needs to learn one additional special embedding. This improves the validation perplexity from 64.82 to 37.90 for the redacted GPT2 model with the low contextual secret detector.
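As a minimal illustration of ignoring the mask token in the loss (our actual preprocessing may differ in detail), the snippet below follows the common PyTorch convention of mapping ignored label positions to the loss function's ignore_index; the tiny vocabulary is hypothetical, and the label shift needed for next-token prediction in a causal LM is omitted.

```python
import torch
from torch import nn

# Hypothetical toy vocabulary; <mask> replaces redacted tokens.
vocab = {"<mask>": 0, "my": 1, "ssn": 2, "is": 3, "secret": 4}
mask_id = vocab["<mask>"]

tokens = torch.tensor([[1, 2, 3, 0, 0]])            # "my ssn is <mask> <mask>"
labels = tokens.clone()
labels[labels == mask_id] = -100                     # positions ignored by the loss

logits = torch.randn(1, 5, len(vocab))               # stand-in for LM output logits
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)     # -100 is also the default
loss = loss_fn(logits.view(-1, len(vocab)), labels.view(-1))
print(loss)                                           # averaged over non-masked positions only
```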
We could potentially apply the same secret detector to the validation and test sets to mitigate the special-token issue. However, this raises two concerns: 1) if the secret detector redacts 45% of the tokens (e.g., the high contextual one redacts all the verbs), then the performance on validation/test is not informative at all and cannot be compared to the public baseline; 2) in past privacy literature (Papernot et al., 2018; Ghazi et al., 2021), the conventional problem setup considers the validation/test sets as public and focuses only on training privacy. We inherit the same treatment to be comparable with prior literature. But in privacy-related NLP problems, how to treat the validation/test sets remains an open question, as they can also contain private information.
Our experiments find that adding many special tokens impacts the results. In the future, we plan to study how to better treat special tokens in privacy-preserving LMs. Hyper-parameter tuning. Hyper-parameter tuning remains a challenging problem in DP learning, as the training takes a long time and the model can be sensitive to the hyper-parameters.

Figure 4 shows the canary insertion attack results when we insert ten different canaries into the training data. The exposure is the average exposure of the ten canaries. In this experiment, we treat the inserted canaries as the only secrets, so the "Redacted" and "JFT" model utilities are close to the black "No-DP" model, and we can better compare against the "No-DP" model. We also artificially vary the recall of the secret detector to see its effect. "Recall=0.4" means that the detector can only detect four of the ten canaries. Because each canary only appears once in the dataset, the exposure is low, similar to the values in Figure 3; if the recall is higher (0.6), the exposure is even lower. "JFT +light noise" achieves low exposure, similar to the baseline DPSGD that protects all canaries, but with much higher utility than DPSGD.

Parameters
These experiments may suggest that for large NLP models, if the sensitive tokens only appear a very limited number of times, they may not be extractable using the canary insertion attack.

Figure 2: Canary exposure for different models.

Figure 3: Exposure for different models when the canary is inserted only once. The exposures are all small (<3) even for public models.

Figure 4: Exposure for different models when we insert ten different canaries.

Table 3: Privacy-amplified JFT performance on all the tasks, with the high entity detector. We report the estimated 95% confidence interval (CI) of the missing rate m, the recall, and the corresponding 95% CI of ϵ_s.

Table 4: Manual screening results on the high entity secret detector. D_0: original data. D′: the inspected redacted data. D: the private data. ϵ ≈ 3. D′ size is the number of records used in the redacted-fine-tune phase.