Membership Inference Attacks against Language Models via Neighbourhood Comparison

Membership Inference attacks (MIAs) aim to predict whether a data sample was present in the training data of a machine learning model or not, and are widely used for assessing the privacy risks of language models. Most existing attacks rely on the observation that models tend to assign higher probabilities to their training samples than non-training points. However, simple thresholding of the model score in isolation tends to lead to high false-positive rates as it does not account for the intrinsic complexity of a sample. Recent work has demonstrated that reference-based attacks which compare model scores to those obtained from a reference model trained on similar data can substantially improve the performance of MIAs. However, in order to train reference models, attacks of this kind make the strong and arguably unrealistic assumption that an adversary has access to samples closely resembling the original training data. Therefore, we investigate their performance in more realistic scenarios and find that they are highly fragile in relation to the data distribution used to train reference models. To investigate whether this fragility provides a layer of safety, we propose and evaluate neighbourhood attacks, which compare model scores for a given sample to scores of synthetically generated neighbour texts and therefore eliminate the need for access to the training data distribution. We show that, in addition to being competitive with reference-based attacks that have perfect knowledge about the training data distribution, our attack clearly outperforms existing reference-free attacks as well as reference-based attacks with imperfect knowledge, which demonstrates the need for a reevaluation of the threat model of adversarial attacks.


Introduction
The public release and deployment of machine learning models trained on potentially sensitive user data introduces a variety of privacy risks: While embedding models have been shown to leak personal attributes of their data (Song and Raghunathan, 2020), generative language models are capable of generating verbatim repetitions of their training data and therefore exposing sensitive strings such as names, phone numbers or email addresses (Carlini et al., 2021b). Another source of risk arises from membership inference attacks (MIAs) (Shokri et al., 2016), which enable adversaries to classify whether a given data sample was present in a target model's training data or not. Due to their simplicity and the fact that MIAs are an important component of more sophisticated attacks such as extraction attacks (Carlini et al., 2021b), they have become one of the most widely used tools to evaluate data leakage and empirically study the privacy of machine learning models (Murakonda and Shokri, 2020; Song and Marn, 2020).
Typically, membership inference attacks exploit models' tendency to overfit their training data and therefore exhibit lower loss values for training members (Yeom et al., 2018; Sablayrolles et al., 2019). A very simple and commonly used baseline attack is therefore the LOSS attack (Yeom et al., 2018), which classifies samples as training members if their loss values are below a certain threshold. While attacks of this kind generally achieve high accuracy, Carlini et al. (2021a) point out a significant flaw: good accuracy for attacks of this kind is primarily a result of their ability to identify non-members rather than training data members, which arguably does not pose important privacy risks. This shortcoming can be attributed to the fact that certain samples, such as repetitive or very simple short sentences, are naturally assigned higher probabilities than others (Fan et al., 2018; Holtzman et al., 2020), and the influence of this aspect on the obtained model score largely outweighs the influence of a model's tendency to overfit its training samples (Carlini et al., 2021a). To account for this, previous work has introduced the idea of difficulty calibration mechanisms (Long et al., 2018; Watson et al., 2022), which aim to quantify the intrinsic complexity of a data sample (i.e., how much of an outlier the given sample is under the probability distribution of the target model) and subsequently use this value to regularize model scores before comparing them to a threshold value.

Figure 1: Overview of our attack: Given a target sample x, we use a pretrained masked language model to generate highly similar neighbour sentences through word replacements. We then compare the neighbours' losses with the loss of the original sample under the target model by computing their difference. As the neighbours are highly similar to the target sequence, we expect their losses to be approximately equal to that of the target sequence, and the target's loss to be lower only if the target sequence was part of the model's training data. In this case, the difference should be below our threshold value γ.

In practice, difficulty calibration is mostly realized through Likelihood Ratio Attacks (LiRA), which measure the difficulty of a target point by feeding it to reference models that help provide a perspective into how likely that target point is in the given domain (Ye et al., 2022; Carlini et al., 2021a; Watson et al., 2022; Mireshghallah et al., 2022a,b). In order to train such reference models, LiRAs assume that an adversary has knowledge about the distribution of the target model's training data and access to a sufficient number of samples from it. We argue that this is a highly optimistic and in many cases unrealistic assumption: as also pointed out by Tramèr et al. (2022), in applications in which we care about privacy and protecting our models from leaking data (e.g. in the medical domain), high-quality, public in-domain data may not be available, which renders reference-based attacks ineffective. Therefore, we aim to design an attack which does not require any additional data: For the design of our proposed neighborhood attack, we build on the intuition of using references to help us infer membership, but instead of using reference models, we use neighboring samples, which are textual samples crafted through data augmentations such as word replacements to be non-training members that are as similar as possible to the target point and therefore practically interchangeable with it in almost any context. With the intuition that neighbors should be assigned approximately the same probability as the original sample under any plausible textual probability distribution, we then compare the model scores of all these neighboring points to that of the target point and classify its membership based on their difference. Similar to LiRAs, we hypothesize that if the model score of the target data is similar to that of the crafted neighbors, then they are all plausible points from the distribution and the target point is not a member of the training set. However, if a sample is much more likely under the target model's distribution than its neighbors, we infer that this could only be a result of overfitting and therefore the sample must be a part of the model's training data.
We conduct extensive experiments measuring the performance of our proposed neighborhood attack, and in particular compare it to reference-based attacks under different assumptions about knowledge of the target distribution and access to additional data. Concretely, amongst other experiments, we simulate real-world reference-based attacks by training reference models on external datasets from the same domain as the target model's training data. We find that neighbourhood attacks outperform LiRAs with more realistic assumptions about the quality of accessible data by up to 100%, and even show competitive performance when we assume that an attacker has perfect knowledge about the target distribution and access to a large number of high-quality samples from it.

Membership Inference Attacks via Neighbourhood Comparison
In this section, we provide a detailed description of our attack, starting with the general idea of comparing neighbouring samples and following with a technical description of how to generate such neighbors.

General Idea
We follow the commonly used setup of membership inference attacks in which the adversary has grey-box access to a machine learning model $f_\theta$ trained on an unknown dataset $D_{\text{train}}$, meaning that they can obtain confidence scores and therefore loss values from $f_\theta$, but no additional information such as model weights or gradients. The adversary's goal is to learn an attack function $A_{f_\theta} : \mathcal{X} \to \{0, 1\}$, which determines for each $x$ from the universe of textual samples $\mathcal{X}$ whether $x \in D_{\text{train}}$ or $x \notin D_{\text{train}}$. As mentioned in the previous section, the LOSS attack (Yeom et al., 2018), one of the simplest forms of membership inference attacks, classifies samples by thresholding their loss scores, so that the membership decision rule is:
$$A_{f_\theta}(x) = \mathbb{1}\left[\mathcal{L}(f_\theta, x) < \gamma\right] \qquad (1)$$
More recent attacks follow a similar setup, but perform difficulty calibration to additionally account for the intrinsic complexity of the sample $x$ under the target distribution and adjust its loss value accordingly. Concretely, given a function $d : \mathcal{X} \to \mathbb{R}$ assigning difficulty scores to data samples, we can extend the decision rule to
$$A_{f_\theta}(x) = \mathbb{1}\left[\mathcal{L}(f_\theta, x) - d(x) < \gamma\right] \qquad (2)$$
Likelihood Ratio Attacks (LiRAs) (Ye et al., 2022), currently the most widely used form of membership inference attacks, use a sample's loss score obtained from some reference model $f_\phi$ as a difficulty score, so that $d(x) = \mathcal{L}(f_\phi, x)$. However, this makes the suitability of the difficulty score function dependent on the quality of reference models and therefore on access to data from the training distribution. We circumvent this by designing a different difficulty calibration function that depends on synthetically crafted neighbors.
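The two baseline decision rules described above (Equations 1 and 2) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and the loss values are assumed to have been obtained beforehand by querying the target and reference models.

```python
def loss_attack(target_loss: float, gamma: float) -> int:
    """Equation 1: predict 'member' (1) when the target loss is below gamma."""
    return int(target_loss < gamma)


def calibrated_attack(target_loss: float, difficulty: float, gamma: float) -> int:
    """Equation 2: subtract a difficulty score d(x) before thresholding."""
    return int(target_loss - difficulty < gamma)


def reference_attack(target_loss: float, reference_loss: float, gamma: float) -> int:
    """Reference-based rule: d(x) is the loss of x under a reference model."""
    return calibrated_attack(target_loss, reference_loss, gamma)


if __name__ == "__main__":
    # A simple sentence has a low loss under both models (hypothetical values).
    print(loss_attack(1.0, gamma=2.0))                # flagged by the raw threshold
    print(reference_attack(1.0, 1.1, gamma=-0.5))     # not flagged once calibrated
```

The example values illustrate the motivation for calibration: a sample that is intrinsically easy is flagged by the raw loss threshold, but not once its loss under the reference model is subtracted.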
Formally, for a given $x$, we aim to produce natural adjacent samples, or a set of $n$ neighbors $\{x_1, \dots, x_n\}$, which slightly differ from $x$ and are not part of the target model's training data, but are approximately equally likely to appear in the general distribution of textual data, and therefore offer a meaningful comparison. Given our set of neighbors, we calibrate the loss score of $x$ under the target model by subtracting the average loss of its neighbors from it, resulting in a new decision rule:
$$A_{f_\theta}(x) = \mathbb{1}\left[\mathcal{L}(f_\theta, x) - \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f_\theta, x_i) < \gamma\right] \qquad (3)$$
The interpretation of this decision rule is straightforward: Neighbors crafted through minimal changes that fully preserve the semantics and grammar of a given sample should in theory be interchangeable with the original sentence and therefore be assigned highly similar likelihoods under any textual probability distribution. Assuming that our neighbors were not present in the training data of the target model, we can therefore use the model score assigned to them as a proxy for what the original sample's loss should be if it was not present in the training data. A target sample loss that is substantially lower than the neighbors' losses could therefore only be a result of overfitting, which implies that the target sample is a training member. In this case, we expect the difference in Equation 3 to be below our threshold value γ.
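The neighbourhood decision rule (Equation 3) reduces to a difference between the target loss and the mean neighbour loss. The sketch below assumes the losses have already been computed under the target model; the function names are hypothetical and chosen for illustration only.

```python
from statistics import fmean
from typing import Sequence


def neighbourhood_score(target_loss: float, neighbour_losses: Sequence[float]) -> float:
    """Target loss minus the average loss of its neighbours."""
    if not neighbour_losses:
        raise ValueError("at least one neighbour loss is required")
    return target_loss - fmean(neighbour_losses)


def neighbourhood_attack(target_loss: float,
                         neighbour_losses: Sequence[float],
                         gamma: float) -> int:
    """Predict 'member' (1) when the calibrated score falls below gamma."""
    return int(neighbourhood_score(target_loss, neighbour_losses) < gamma)


if __name__ == "__main__":
    # Hypothetical losses: the target is clearly easier than its neighbours.
    print(neighbourhood_score(2.0, [3.0, 3.5, 2.5]))
    print(neighbourhood_attack(2.0, [3.0, 3.5, 2.5], gamma=-0.5))
```

A strongly negative score means the target model finds the sample much easier than its near-identical neighbours, which is the signal the attack interprets as memorization.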

Obtaining Neighbour Samples
In the previous section, for a given text $x$, we assumed access to a set of adjacent samples $\{x_1, \dots, x_n\}$. In this section we describe how those samples are generated. Because neighbours must be approximately as complex as the original sample, we need to preserve not only the semantics of $x$ but also its structure and syntax, and therefore cannot simply rely on standard textual style transfer or paraphrasing models. Instead, we opt for very simple word replacements that preserve semantics and fit the context of the original word well. For obtaining these replacements, we adopt the framework proposed by Zhou et al. (2019), who propose the use of transformer-based (Vaswani et al., 2017) masked language models (MLMs) such as BERT (Devlin et al., 2019) for lexical substitutions: Concretely, given a text $x := (w^{(1)}, \dots, w^{(L)})$ consisting of $L$ tokens, the probability $p_\theta(w = w^{(i)} \mid x)$ of token $w$ as the word in position $i$ can be obtained from the MLM's probability distribution $p(V^{(i)} \mid x)$ over our token vocabulary $V$ at position $i$. As we do not want the probability of the original token to influence a token's suitability as a replacement when comparing it to other candidates, we normalize the probability over all probabilities except that of the original token. So, if $\hat{w}$ was the original token at position $i$, our suitability score for $w$ as a replacement is
$$\frac{p_\theta(w = w^{(i)} \mid x)}{1 - p_\theta(\hat{w} = w^{(i)} \mid x)}$$
In practice, simply masking the token which we want to replace will lead to our model completely neglecting the meaning of the original word when predicting alternative tokens, and therefore potentially change the semantics of the original sentence. For instance, for the given sample "The movie was great", the probability distribution for the last token obtained from "The movie was [MASK]" might assign high scores to negative words such as "bad", which are clearly not semantically suitable replacements. To counteract this, Zhou et al. (2019) propose to keep the original token in the input text, but to add strong dropout to the input embedding layer at position $i$ before feeding it into the transformer to obtain replacement candidates for $w^{(i)}$. We adopt this technique, and therefore obtain a procedure which allows us to obtain $n$ suitable neighbors with $m$ word replacements using merely an off-the-shelf model that does not require any adaptation to the target domain. The pseudocode is outlined in Algorithm 1.
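The normalization of replacement scores described above can be illustrated in isolation. The sketch below assumes that the masked language model's probabilities at position i are already available as a token-to-probability mapping; it does not model the embedding dropout step, and the function names and example probabilities are hypothetical.

```python
from typing import Dict, List


def replacement_scores(mlm_probs: Dict[str, float], original: str) -> Dict[str, float]:
    """Renormalize MLM probabilities over all tokens except the original one."""
    remaining_mass = 1.0 - mlm_probs[original]
    if remaining_mass <= 0.0:
        raise ValueError("the original token holds all probability mass")
    return {
        token: prob / remaining_mass
        for token, prob in mlm_probs.items()
        if token != original
    }


def top_replacements(mlm_probs: Dict[str, float], original: str, k: int) -> List[str]:
    """Return the k candidate tokens with the highest suitability scores."""
    scores = replacement_scores(mlm_probs, original)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical distribution for the last token of "The movie was great".
    probs = {"great": 0.6, "good": 0.2, "fine": 0.1, "bad": 0.1}
    print(replacement_scores(probs, "great"))
    print(top_replacements(probs, "great", k=1))
```

Dividing by the mass not assigned to the original token makes scores comparable across positions, regardless of how confidently the MLM reproduces the original word.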

Experimental Setup
We evaluate the performance of our attack as well as reference-free and reference-based baseline attacks against large autoregressive models trained with the classical language modeling objective. Specifically, we use the base version of GPT-2 (Radford et al., 2019) as our target model.

Datasets
We perform experiments on three datasets: news article summaries obtained from a subset of the AG News corpus containing four news categories ("World", "Sports", "Business", "Science & Technology"), tweets from the Sentiment140 dataset (Go et al., 2009), and excerpts from Wikipedia articles from Wikitext-103 (Merity et al., 2017). Each dataset is divided into two disjoint subsets of equal size: one of these subsets serves as training data for the target model and therefore provides the positive examples for the membership classification task. The second subset is not used for training; its samples serve as negative examples for the classification task. The subsets contain 60,000, 150,000 and 100,000 samples for AG News, Twitter and Wikitext, respectively, leading to total sizes of 120,000, 300,000 and 200,000 samples. For all corpora, we also keep an additional third subset that we can use to train reference models for reference-based attacks.

Baselines
To compare the performance of our attack, we consider various baselines: As the standard method for reference-free attacks, we choose the LOSS attack proposed by Yeom et al. (2018), which classifies samples as training members or non-members based on whether their loss is above or below a certain threshold (see Equation 1). For reference-based attacks, we follow recent implementations (Mireshghallah et al., 2022a,b; Watson et al., 2022) and use reference data to train a single reference model of the same architecture as the target model. Subsequently, we measure whether the likelihood of a sample under the target model divided by its likelihood under the reference model crosses a certain threshold.
Training Data for Reference Models
As discussed in previous sections, we would like to evaluate reference-based attacks with more realistic assumptions about access to the training data distribution. Therefore, we use multiple reference models trained on different datasets: As our Base Reference Model, we consider the pretrained, but not fine-tuned version of GPT-2. Given the large pretraining corpus of this model, it should serve as a good estimator of the general complexity of textual samples and has also been successfully used for previous implementations of reference-based attacks (Mireshghallah et al., 2022b). Similar to our neighbourhood attack, this reference model does not require an attacker to have any additional data or knowledge about the training data distribution.
To train more powerful, but still realistic reference models, which we henceforth refer to as Candidate Reference Models, we use data that is in general similar to the target model's training data, but deviates slightly with regard to topics or artifacts that result from the data collection procedure. Concretely, we perform this experiment for both our AG News and Twitter corpora: For the former, we use article summaries from the remaining news categories present in the AG News corpus ("U.S.", "Europe", "Music Feeds", "Health", "Software and Development", "Entertainment") as well as the NewsCatcher dataset, which contains article summaries for eight categories that highly overlap with AG News ("Business", "Entertainment", "Health", "Nation", "Science", "Sports", "Technology", "World"). For Twitter, we use a depression detection dataset for mental health support from tweets as well as tweet data annotated for offensive language. As it was highly difficult to find data for reference models, it was not always possible to match the number of training samples of the target model. The number of samples present in each dataset can be found in Table 1.
As our most powerful reference model, henceforth referred to as Oracle Reference Model, we use models trained on the same corpora as the target models, but on different subsets. This setup assumes that an attacker has perfect knowledge about the training data distribution of the target model and access to high-quality samples.

Implementation Details
We obtain and fine-tune all pretrained models using the Huggingface transformers library (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). As target models, we fine-tune the pretrained 117M parameter version of GPT-2, which originally has a validation perplexity of 56.8 and 200.3 on AG News and Twitter data, respectively, until it reaches validation set perplexities of 30.0 and 84.7. In our initial implementation of our neighbourhood attack, we obtain the 100 most likely neighbour samples using one word replacement only from the pretrained 110M parameter version of BERT. We apply a dropout of p = 0.7 to the embedding of the token we want to replace. For evaluating LiRA baselines, we train each reference model on its respective training dataset over multiple epochs, and choose the best performing reference model w.r.t. attack performance. Following Carlini et al. (2021a), we evaluate our attack's precision for predetermined low false positive rate values such as 1% or 0.01%. We implement this evaluation scheme by adjusting our threshold γ to meet this requirement and subsequently measure the attack's precision for the corresponding γ. All models have been deployed on single GeForce RTX 2080 and Tesla K40 GPUs.
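The evaluation scheme of fixing a low false positive rate and then choosing γ accordingly can be sketched as follows. This is one possible way to pick the threshold, not necessarily the procedure used in the experiments: the function names are hypothetical, scores are assumed to follow the convention that lower values indicate membership, and the sketch reports the true positive rate at the chosen threshold.

```python
import math
from typing import Sequence, Tuple


def threshold_at_fpr(non_member_scores: Sequence[float], target_fpr: float) -> float:
    """Pick gamma so that at most target_fpr of non-members score below it."""
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie in [0, 1]")
    ranked = sorted(non_member_scores)
    allowed = math.floor(target_fpr * len(ranked))  # tolerated false positives
    if allowed >= len(ranked):
        return math.inf
    # Samples are flagged when score < gamma, so at most `allowed` non-members
    # (the ones ranked strictly below this value) are flagged.
    return ranked[allowed]


def tpr_at_fpr(member_scores: Sequence[float],
               non_member_scores: Sequence[float],
               target_fpr: float) -> Tuple[float, float]:
    """Return (true positive rate, gamma) at the requested false positive rate."""
    gamma = threshold_at_fpr(non_member_scores, target_fpr)
    tpr = sum(score < gamma for score in member_scores) / len(member_scores)
    return tpr, gamma


if __name__ == "__main__":
    members = [-1.0, -0.5, -0.1, 0.2]                       # hypothetical scores
    non_members = [0.5, 0.1, -0.2, 0.3, 0.0, 0.4, -0.6, 0.2]
    print(tpr_at_fpr(members, non_members, target_fpr=0.25))
```

With very small target rates such as 0.01%, the threshold is determined by only a handful of non-member samples, so a large negative set is needed for stable estimates.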

Results
In this section, we report our main results and perform additional experiments investigating the impact of reference model performance on the success of reference-based attacks as well as several ablation studies. Following Carlini et al. (2021a), we report attack performances in terms of their true positive rates (TPR) under very low false positive rates (FPR) by adjusting the threshold value γ.

Main Results
Our results can be found in Tables 2 and 3, with the former showing our attack performance in terms of true positive rates under low false positive rates and the latter showing AUC values. As previously discovered, the LOSS attack tends to perform badly when evaluated for very low false positive rates (Carlini et al., 2021a; Watson et al., 2022). Likelihood Ratio Attacks can clearly outperform it, but we observe that their success is highly dependent on having access to suitable training data for reference models: Attacks using the base reference models and candidate models fall short of the performance of an attack using the oracle reference model by a large margin. Notably, they are also substantially outperformed by our Neighbour Attack, which can, particularly in low FPR ranges, even compete very well with or outperform Likelihood Ratio Attacks with an Oracle Reference Model, without relying on access to any additional data.
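For readers who want to reproduce an AUC figure like those reported in Table 3 from raw attack scores, the following sketch computes the AUC as the probability that a randomly chosen member receives a lower score than a randomly chosen non-member, counting ties as one half. It is a simple quadratic-time illustration under the assumption that lower scores indicate membership; the function name is hypothetical.

```python
from typing import Sequence


def attack_auc(member_scores: Sequence[float],
               non_member_scores: Sequence[float]) -> float:
    """AUC of an attack whose scores are lower for members than for non-members."""
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for member in member_scores:
        for non_member in non_member_scores:
            if member < non_member:
                wins += 1.0      # correctly ordered pair
            elif member == non_member:
                wins += 0.5      # ties count as half
    return wins / (len(member_scores) * len(non_member_scores))


if __name__ == "__main__":
    print(attack_auc([-1.0, -0.5, 0.2], [-0.6, 0.0, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to random guessing, which is why AUC alone can hide the low-FPR behaviour that the true positive rate metric is designed to expose.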

Measuring the Dependence of Attack Success on Reference Model Quality
Motivated by the comparably poor performance of Likelihood Ratio Attacks with reference models trained on datasets that differ only slightly from the target training data, we aim to investigate the dependence of reference attack performance on the quality of reference models in a more controlled and systematic way. To do so, we train reference models on our oracle data over multiple epochs, and report the attack performance of Likelihood Ratio Attacks with respect to the reference models' validation perplexity (PPL) on a held-out test set, which is in this case the set of non-training members of the target model. Intuitively, we would expect the attack performance to peak when the validation PPL of reference models is similar to that of the target model, as this way, the models capture a very similar distribution and therefore offer the best comparison to the attack model. In this setup, however, we are particularly interested in the attack performance when the validation PPL does not exactly match that of the target model, given that attackers will not always be able to train perfectly performing reference models.
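As a reminder of how the validation perplexity used in this experiment relates to the loss values used by the attacks, the sketch below computes perplexity as the exponential of the mean per-token negative log-likelihood. This is the standard definition and is assumed, not confirmed, to match the evaluation used here; the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    if not token_nlls:
        raise ValueError("at least one token loss is required")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model with a mean token loss of ln(30) has a perplexity of about 30.
    print(perplexity([math.log(30.0)] * 4))
```

Comparing reference and target models by perplexity therefore amounts to comparing their average token-level loss on the same held-out text.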
The results of this experiment can be found in Figure 2 for our News and Twitter dataset and in Figure 3 for Wikitext. As can be seen, the performance of reference-based attacks does indeed peak when reference models perform roughly the same as the target model. A further very interesting observation is that substantial increases in attack success only seem to emerge as the validation PPL of reference models comes very close to that of the target model and therefore only crosses the success

Ablation Studies
Having extensively studied the impact of different reference model training setups for the Likelihood Ratio Attack, we now aim to explore the effect of various components of our proposed neighbourhood attack.
In Table 2, we report the performance of neighbour attacks for the 100 most likely generated neighbours as determined by BERT. In the following, we measure how varying this number affects the attack performance. While intuitively, a higher number of neighbours might offer a more robust comparison, it is also plausible that selecting a lower number of most likely neighbours under BERT will lead to neighbours of higher quality and therefore a more meaningful comparison of loss values. Our results in Table 4 show a clear trend towards the former hypothesis: The number of neighbours does in general have a strong influence on the performance of neighbourhood attacks and higher numbers of neighbours produce better results.

Number of Word Replacements
Besides the number of generated neighbours, we study how the number of replaced words affects the performance of our attack. While we reported results for the replacement of a single word in our main results in Table 2, there are also reasons to expect that a higher number of replacements leads to better attack performance: While keeping neighbours as similar to the original samples as possible ensures that their probability in the general distribution of textual data remains as close as possible, one could also expect that too few changes lead the target model to assign the original sample and its neighbours almost exactly the same score, and therefore make it hard to observe large differences in loss scores for training members. Our results of generating 100 neighbours with multiple word replacements are reported in Table 5. We find that replacing only one word clearly outperforms multiple replacements. Beyond this, we do not find highly meaningful differences between two and three word replacements.
Due to the privacy risks that emerge from the possibility of membership inference and data extraction attacks, the research community is actively working on defenses to protect models. Beyond approaches such as confidence score perturbation (Jia et al., 2019) and specific regularization techniques (Mireshghallah et al., 2021; Chen et al., 2022), which show good empirical performance, differentially private model training is one of the most well-known defense techniques offering mathematical privacy guarantees: DP-SGD (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016) uses differential privacy (Dwork et al., 2006) to bound the influence that a single training sample can have on the resulting model. It has been shown to successfully protect models against membership inference attacks (Carlini et al., 2021a) and has recently also been applied successfully to training language models (Yu et al., 2022; Li et al., 2022; Mireshghallah et al.). To test the effectiveness of differential privacy as a defense against neighbourhood attacks, we follow Li et al. (2022) and train our target model GPT-2 in a differentially private manner on AG News, where our attack performed the best. The results can be seen in Table 6 and clearly demonstrate the effectiveness of DP-SGD. Even for comparably high epsilon values such as ten, the performance of the neighbourhood attack is substantially worse compared to the non-private model and is almost akin to random guessing for low FPR values.
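To make concrete how DP-SGD bounds the influence of a single training sample, the following sketch shows the core gradient computation of the generic algorithm (per-sample clipping followed by Gaussian noise) on plain Python lists. It is a simplified illustration, not the training setup of Li et al. (2022) used in the experiment; the function name and parameters are hypothetical, and privacy accounting for epsilon is omitted.

```python
import math
import random
from typing import List, Optional, Sequence


def dp_sgd_gradient(per_sample_grads: Sequence[Sequence[float]],
                    clip_norm: float,
                    noise_multiplier: float,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale down gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
        for j, value in enumerate(grad):
            total[j] += value * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    noisy = [t + (rng.gauss(0.0, sigma) if sigma > 0.0 else 0.0) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.6, 0.8]]  # hypothetical per-sample gradients
    print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                          rng=random.Random(0)))
```

Because every clipped gradient has norm at most `clip_norm`, no single sample can shift the update by more than that bound, which is the property that limits the loss gap a neighbourhood attack relies on.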

Related Work
MIAs were first proposed by Shokri et al. (2016) and continue to remain a topic of interest for the machine learning community. While many attacks, such as ours, assume access only to model confidence or loss scores (Yeom et al., 2018; Sablayrolles et al., 2019; Jayaraman et al., 2020; Watson et al., 2022), others exploit additional information such as model parameters (Leino and Fredrikson, 2020) or training loss trajectories (Liu et al., 2022). Finally, some researchers have also attempted to perform membership inference attacks given only hard labels without confidence scores (Li and Zhang, 2021; Choquette-Choo et al., 2021). Notably, the attack proposed by Choquette-Choo et al. (2021) is probably closest to our work, as it tries to obtain information about a sample's membership by flipping its predicted labels through small data augmentations such as rotations of image data. To the best of our knowledge, we are the first to apply data augmentations of this kind for text-based attacks.
Membership Inference Attacks in NLP
Specifically in NLP, membership inference attacks are an important component of language model extraction attacks (Carlini et al., 2021b; Mireshghallah et al., 2022b). Further studies of interest include work by Hisamoto et al. (2020), which studies membership inference attacks in machine translation, as well as work by Mireshghallah et al. (2022a), which investigates Likelihood Ratio Attacks for masked language models. Specifically for language models, a large body of work also studies the related phenomenon of memorization (Kandpal et al., 2022; Carlini et al., 2022b,a; Zhang et al., 2021), which enables membership inference and data extraction attacks in the first place.
Machine-Generated Text Detection
Due to the increasing use of tools like ChatGPT as writing assistants, the field of machine-generated text detection has become of high interest within the research community and is being studied extensively (Chakraborty et al., 2023; Krishna et al., 2023; Mitchell et al., 2023; Mireshghallah et al., 2023). Notably, Mitchell et al. (2023) propose DetectGPT, which works similarly to our attack, as it compares the likelihood of a given sample under the target model to the likelihood of perturbed samples and hypothesizes that the likelihood of perturbations is smaller than that of texts the model has generated itself.

Conclusion and Future Work
In this paper, we have made two key contributions: First, we thoroughly investigated the assumption of access to in-domain data for reference-based membership inference attacks: In our experiments, we have found that likelihood ratio attacks, the most common form of reference-based attacks, are highly sensitive to the quality of their reference models and therefore require attackers to have access to high-quality training data for those. Given that this is not always a realistic assumption, specifically in privacy-sensitive settings where publicly available data is scarce, we argued that reference-free attacks simulate the behavior of attackers more accurately. Thus, we introduced neighborhood attacks, which calibrate the loss score of a target sample using loss scores of plausible neighboring textual samples generated through word replacements, and therefore eliminate the need for reference models trained on in-domain data. We have found that under realistic assumptions about an attacker's access to training data, our attack consistently outperforms reference-based attacks. Furthermore, when an attacker has perfect knowledge about the training data, our attack still shows competitive performance with reference-based attacks. We hereby further demonstrated the privacy risks associated with the deployment of language models and therefore the need for effective defense mechanisms. Future work could extend our attack to other modalities, such as visual or audio data, or explore using our attack to improve extraction attacks against language models.

Limitations
The proposed attack is specific to textual data
While many membership inference attacks are universally applicable to all modalities as they mainly rely on loss values obtained from models, our proposed method for generating neighbours is specific to textual data. While standard augmentations such as rotations could be used to apply our method to visual data, this is not as straightforward as the transfer of other attacks to different modalities.

Implementation of baseline attacks
As the performance of membership inference attacks depends on the training procedure of the attacked model as well as its degree of overfitting, it is not possible to simply compare attack performance metrics from other papers to ours. Instead, we had to reimplement existing attacks to compare them to our approach. While we followed the authors' descriptions in their papers as closely as possible, we cannot guarantee that their attacks were perfectly implemented and that the comparison to our method is therefore 100% fair.

Ethical Considerations
Membership inference attacks can be used by malicious actors to compromise the privacy of individuals whose data has been used to train models. However, studying and expanding our knowledge of such attacks is crucial in order to build a better understanding of threat models and to build better defense mechanisms that take into account the tools available to malicious actors. Due to the importance of this aspect, we have extensively highlighted existing work studying how to defend against MIAs in Section 6. As we are aware of the potential risks that arise from membership inference attacks, we will not freely publicize our code, but instead give access for research projects upon request.
With regard to the data we used, we do not see any issues, as all datasets are publicly available and have been used for a long time in NLP research or data science competitions.

Figure 2: Attack performance of reference attacks w.r.t. the validation PPL of reference models, compared to the performance of neighborhood attacks. The perplexities of the target models were 30.0 and 84.7 for AG News and Twitter, respectively.

Figure 3: Attack performance of reference attacks w.r.t. the validation PPL of reference models, compared to the performance of neighborhood attacks. The perplexity of the target model was 55.6 for Wikipedia.

Table 1: Number of samples in the reference model training data. Target models for News, Twitter and Wikipedia were trained on 60,000, 150,000 and 100,000 samples, respectively.

Table 2: True positive rates of various attacks for low false positive rates of 1%, 0.1%, and 0.01%. Candidate Reference Model 1 refers to reference models trained on data from other AG News categories and our Twitter mental health dataset; Candidate Reference Model 2 refers to reference models trained on NewsCatcher and the offensive tweet classification dataset. *As reference attacks trained on oracle datasets represent a rather unrealistic scenario with perfect assumptions, we compare our results with other baselines with more realistic assumptions when highlighting best results in bold.

Table 3: AUC values of various attacks.

Table 4: Attack performance w.r.t. the number of neighbours against which we compare the target sample.

Table 5: Attack performance w.r.t. the number of words that are replaced when generating neighbours.

Table 6: Performance of neighbourhood attacks against models trained with DP-SGD.

References
Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. 2022. Differentially private fine-tuning of language models. In International Conference on Learning Representations.
Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2019. BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3368-3373, Florence, Italy. Association for Computational Linguistics.