Extracted BERT Model Leaks More Information than You Think!

The collection and availability of big data, combined with advances in pre-trained models (e.g., BERT), have revolutionized the predictive performance of natural language processing tasks. This allows corporations to provide machine learning as a service (MLaaS) by encapsulating fine-tuned BERT-based models as APIs. Due to significant commercial interest, there has been a surge of attempts to steal remote services via model extraction. Although previous works have made progress in defending against model extraction attacks, there has been little discussion of their performance in preventing privacy leakage. This work bridges this gap by launching an attribute inference attack against the extracted BERT model. Our extensive experiments reveal that model extraction can cause severe privacy leakage even when victim models are equipped with state-of-the-art defensive strategies.


Introduction
The emergence of pre-trained language models (PLMs) has revolutionized natural language processing (NLP) research, leading to state-of-the-art (SOTA) performance on a wide range of tasks (Devlin et al., 2018; Yang et al., 2019). This breakthrough has enabled commercial companies to deploy machine learning models as black-box APIs on their cloud platforms to serve millions of users, such as the Google Prediction API, Microsoft Azure Machine Learning, and Amazon Machine Learning.
However, recent works have shown that existing NLP APIs are vulnerable to model extraction attacks (MEA), which can reconstruct a copy of the remote NLP model from carefully designed queries and the outputs of the target API (Krishna et al., 2019; Wallace et al., 2020), causing financial losses to the API provider. Prior to our work, researchers have investigated the hazards of model extraction under various settings, including stealing commercial APIs (Wallace et al., 2020; Xu et al., 2022), ensemble model extraction (Xu et al., 2022), and adversarial example transfer (Wallace et al., 2020; He et al., 2021).
Previous works have indicated that an adversary can leverage the extracted model to conduct adversarial example transfer, such that these examples corrupt the predictions of the victim model (Wallace et al., 2020; He et al., 2021). Given the success of MEA and adversarial example transfer, we conjecture that the predictions of a victim model could inadvertently reveal its private information, as victim models can memorize side information in addition to task-related information (Lyu and Chen, 2020; Lyu et al., 2020; Carlini et al., 2021). Thus, we are interested in examining whether the victim model can leak the private information of its data to the extracted model, a question that has received little attention in previous research. In addition, a number of defenses against MEA have been devised (Lee et al., 2019; Ma et al., 2021; Xu et al., 2022; He et al., 2022a,b). Although these techniques can alleviate the effects of MEA, it is unknown whether they can prevent the leakage of private information, e.g., gender, age, and identity.
To study the privacy leakage from MEA, we first leverage MEA to obtain a white-box extracted model. Then, we demonstrate that it is possible to infer sensitive attributes of the data used by the victim model from the extracted model. To the best of our knowledge, this is the first attempt to investigate privacy leakage from the extracted model. Moreover, we demonstrate that the privacy leakage is resilient to advanced defense strategies, even though the task utility of the extracted model is significantly diminished, which should motivate further investigation of defenses against MEA.

Related Work
MEA aims to steal an intellectual model from cloud services (Tramèr et al., 2016; Orekondy et al., 2019; Krishna et al., 2019; Wallace et al., 2020). It has been studied both empirically and theoretically, on simple classification tasks (Tramèr et al., 2016), vision tasks (Orekondy et al., 2019), and NLP tasks (Krishna et al., 2019; Wallace et al., 2020). The goal of MEA is to imitate the functionality of a black-box victim model (Krishna et al., 2019; Orekondy et al., 2019), i.e., to obtain a model replicating the performance of the victim model.
Furthermore, the extracted model could be used as a reconnaissance step to facilitate later attacks (Krishna et al., 2019). For instance, the adversary could construct transferable adversarial examples over the extracted model to corrupt the predictions of the victim model (Wallace et al., 2020; He et al., 2021). Prior works (Coavoux et al., 2018; Lyu et al., 2020) have shown that malicious users can infer confidential attributes through their interactions with a trained model. However, to the best of our knowledge, none of the previous works investigates whether the extracted model can facilitate privacy leakage of the data used by the black-box victim model.
In conjunction with MEA, a number of approaches have been proposed to defend against it. These approaches focus on perturbing the posterior predictions. Orekondy et al. (2019) suggested revealing only the top-K posterior probabilities. Lee et al. (2019) demonstrated that API owners can increase the difficulty of MEA by softening the posterior probabilities and imposing random noise on the non-argmax probabilities. Ma et al. (2021) introduced an adversarial training process to discourage knowledge distillation from the victim model to the extracted model. However, these approaches are specific to model extraction and are not effective at defending against the attribute inference attack (AIA), as shown in Section 5.

Attacking BERT-based APIs
Throughout this paper, we mainly focus on the BERT-based API as the victim model, as it is widely used in commercial black-box APIs.
Model Extraction Attack (MEA). To conduct MEA, attackers craft a set of inputs as queries (the transfer set) and send them to the target victim model (the BERT-based API) to obtain the predicted posterior probabilities, i.e., the outputs of the softmax layer. Attackers can then reconstruct a copy of the victim model as an "extracted model" by training on the query-prediction pairs.
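For concreteness, the sketch below illustrates this query-then-train loop with PyTorch and the Transformers library. It is a minimal illustration rather than the exact implementation: the function query_victim_api is a hypothetical stand-in for the remote black-box endpoint, and the hyperparameters simply mirror those reported in the training details.

```python
# A minimal MEA sketch: query the victim API for soft labels, then distill
# them into a local BERT-based "extracted model".
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def query_victim_api(texts):
    """Hypothetical black-box call: returns a (batch, num_labels) tensor of
    posterior probabilities for each query sentence."""
    raise NotImplementedError("replace with a call to the remote API")

def extract_model(transfer_set, num_labels=4, epochs=5, lr=2e-5):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_labels)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for batch in transfer_set:                    # batch: list of strings
            soft_labels = query_victim_api(batch)     # victim posteriors
            enc = tokenizer(batch, padding=True, truncation=True,
                            return_tensors="pt")
            logits = model(**enc).logits
            # Train on query-prediction pairs by matching the victim's
            # posterior distribution (KL divergence on soft labels).
            loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_labels,
                            reduction="batchmean")
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```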
Attribute Inference Attack (AIA). Once we have derived an extracted model, we investigate how to infer sensitive information from it by conducting AIA. Given any record x = [x_ns, x_s], AIA aims to reconstruct the sensitive component x_s based on the hidden representation of x_ns, where x_ns and x_s denote the non-sensitive information and the target sensitive attribute, respectively. The intuition behind AIA is that the representations generated by the extracted model can facilitate the inference of sensitive information about the data used by the victim model (Coavoux et al., 2018). Note that the only explicit information accessible to the attacker is the predictions output by the victim model, rather than the raw BERT representations.
Given an extracted model g_V, we first feed a limited amount of auxiliary data D_aux with labelled attributes into g_V to collect the BERT representation h(x_i^ns) for each x_i ∈ D_aux. We then train an inference model f(·), which takes the BERT representation from the extracted model as input and outputs the sensitive attribute of the input, i.e., it is trained on pairs {h(x_i^ns), x_i^s}. The trained inference model f(·) can then infer the sensitive attributes; in our case, these are gender, age, and named entities (see Section 4.1). At test time, as illustrated in Figure 1, the attacker first derives the BERT representation of any record using the extracted model, then feeds this representation into the trained inference model f(·) to infer the sensitive attributes.
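The sketch below illustrates this inference step, reusing the extracted model and tokenizer from the MEA sketch above. The use of the final-layer [CLS] vector as h(·), the MLP hidden size, and the binary attribute are illustrative assumptions rather than the exact configuration.

```python
# An AIA sketch: train a 2-layer MLP on the extracted model's [CLS]
# representations to predict a sensitive attribute (e.g. gender).
import torch
import torch.nn as nn

def cls_representation(extracted_model, tokenizer, texts):
    # h(x): the [CLS] vector of the last hidden layer of the extracted model.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = extracted_model(**enc, output_hidden_states=True)
    return out.hidden_states[-1][:, 0]

class AttributeInferrer(nn.Module):
    """The inference model f(.): a 2-layer MLP over BERT representations."""
    def __init__(self, hidden=768, num_attr=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_attr))

    def forward(self, h):
        return self.net(h)

def train_inferrer(extracted_model, tokenizer, aux_texts, aux_labels,
                   epochs=3, lr=2e-5):
    # aux_texts / aux_labels: the auxiliary data D_aux with labelled attributes.
    f = AttributeInferrer()
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    h = cls_representation(extracted_model, tokenizer, aux_texts)
    y = torch.tensor(aux_labels)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(f(h), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return f
```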

Experimental Setup
Data. We conduct experiments on three datasets: i) Trustpilot (TP) (Hovy et al., 2015), ii) AG news (Del Corso et al., 2005), and iii) Blog posts (Blog) (Schler et al., 2006). To study AIA, we reuse the data pre-processed by Coavoux et al. (2018). For TP, Coavoux et al. (2018) use the subset from US users, i.e., TP-US. The private attributes of TP-US and Blog are gender and age. The private attributes of AG news are the five most frequent person entities. More details and statistics are provided in Appendix A.
Settings. For each dataset, we randomly split the training data D into two halves D_V and D_Q, where |D_V| = |D_Q|. The first half (D_V) is used to train the victim model, whereas the second half (D_Q) serves two purposes. The first is to train the extracted model, in which case the data distribution of the query (T_A) is the same as that of the victim's training data (T_V). The second is to train f(·) to infer the private attributes, i.e., it serves as D_aux.
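For illustration, this split can be implemented along the following lines; dividing D_Q evenly into a transfer set and D_aux is an assumption made for the sketch, as the text only states that D_Q serves both purposes.

```python
# A sketch of the data-splitting protocol: D -> D_V (victim training) and
# D_Q (attacker side), with D_Q further divided into a transfer set and D_aux.
from sklearn.model_selection import train_test_split

def split_dataset(records, seed=42):
    # records: list of (text, task_label, sensitive_attribute) tuples.
    d_v, d_q = train_test_split(records, test_size=0.5, random_state=seed)
    # The transfer set is used to query the victim (same-domain setting);
    # D_aux is used to train the attribute inference model f(.).
    transfer_set, d_aux = train_test_split(d_q, test_size=0.5,
                                           random_state=seed)
    return d_v, transfer_set, d_aux
```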
Since API providers tend to use in-house datasets, it is difficult for the attacker to know the target data distribution as prior knowledge. Thus, we also sample queries from data that follows a different distribution but is semantically coherent with the original training data (cross-domain queries).

To test AIA, we assume D_V is accessible to attackers and use the non-sensitive attributes of D_V as the test input. If the attacker can successfully infer the sensitive attributes of D_V given its non-sensitive information, this constitutes a privacy leakage of the victim model. Following Coavoux et al. (2018), for demographic variables (i.e., gender and age), we take 1 − X as the empirical privacy, where X is the average prediction accuracy of the attack models on these two variables; for named entities, we take 1 − F as the empirical privacy, where F is the F1 score between the ground truths and the attackers' predictions on the presence of all the named entities. Higher empirical privacy means lower attack performance.
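For concreteness, the empirical privacy metrics can be computed as in the following sketch; micro-averaging the F1 score over the five entity labels is our assumption, as the text only states that F is the F1 score over the presence of all named entities.

```python
# Sketch of the empirical privacy metrics: 1 - accuracy for demographics,
# 1 - F1 for named-entity presence. Higher values mean lower attack success.
from sklearn.metrics import accuracy_score, f1_score

def empirical_privacy_demographic(gender_true, gender_pred, age_true, age_pred):
    # X: the average attack accuracy over the gender and age variables.
    x = (accuracy_score(gender_true, gender_pred) +
         accuracy_score(age_true, age_pred)) / 2
    return 1.0 - x

def empirical_privacy_entities(entity_true, entity_pred):
    # F: F1 score on the presence of all named entities (binary indicator
    # matrices of shape [num_examples, num_entities]); micro-averaging assumed.
    return 1.0 - f1_score(entity_true, entity_pred, average="micro")
```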
Training Details. Victim and extracted models are BERT-base (Devlin et al., 2018), trained for 5 epochs with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 2 × 10^-5. We use the codebase from the Transformers library (Wolf et al., 2020). Attribute inference models are 2-layer MLPs, trained for 3 epochs with the same optimizer and learning rate. All experiments are run on a single Nvidia V100 GPU.

Baselines. To gauge the private information leakage, we consider the majority value of each discrete attribute as a baseline. To evaluate how much the extracted model suffers from AIA, we also take pre-trained BERT without (w/o) fine-tuning as a baseline model for extracting representations. Note that BERT (w/o fine-tuning) is a plain model that does not contain any information about the training data of the target model.

Experimental Results
MEA results. We present the performance of MEA for same-domain and cross-domain querying in Table 1. Due to the domain mismatch, cross-domain querying underperforms same-domain querying. Although increasing the cross-domain query size can boost the accuracy of the extracted model, it remains inferior to the same-domain counterpart trained with fewer data. In addition, we notice that AG news prefers news data, while TP-US and Blog favor reviews data. Intuitively, this preference can be attributed to genre similarity, i.e., news data is close to AG news but distant from TP-US and Blog. To verify this, we compute the uni-gram and 5-gram overlap between the test sets and the different queries in Appendix A.
Since we do not have access to the training data of the victim model, we will use news data as queries for AG news, and reviews data as queries for TP-US and Blog, unless otherwise mentioned.
AIA results. We show AIA results using same-domain and cross-domain queries in Table 2. Compared to the BERT (w/o fine-tuning) and majority baselines, the attack model built on the BERT representation of the extracted model substantially enhances attribute inference on the victim's training data, i.e., it is 3.57–4.97x more effective than the baselines on AG news, even when using cross-domain queries. The majority baseline is merely a random guess, while BERT (w/o fine-tuning) is a plain model that contains no information about the victim's training data. The extracted model, however, is trained on the queries and the predictions returned by the victim model. This implies that the victim model's predictions inadvertently capture sensitive information about users, such as their gender, age, and other important attributes, beyond the information useful for the main task.
Interestingly, Table 2 also shows that queries from a different distribution make AIA easier than queries from the same distribution (see the best results, corresponding to lower privacy, marked in bold in Table 2). We provide a detailed study of this phenomenon in Appendix B.1.

Defense
Although we primarily focus on the privacy vulnerability of BERT-based APIs in this work, we also test four representative defenses: i) Softening predictions: applying a temperature τ to the softmax layer to scale the probability vector (Xu et al., 2022); ii) Prediction perturbation: adding Gaussian noise with variance σ to the probability vector (Xu et al., 2022); iii) Reverse sigmoid: softening the posterior probabilities and injecting random noise into the non-argmax probabilities (Lee et al., 2019); iv) Nasty teacher: using an adversarial loss to discourage knowledge distillation from the victim model to the extracted model (Ma et al., 2021). We also propose a new defense, MOSTLEAST, in which we set the predicted probabilities of the most and least likely categories to 0.5 + ε and 0.5 − ε respectively, and zero out all other categories; ε can be set as small as possible, and we set ε to 10^-5 in the defense experiments.
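The sketch below gives rough reference implementations of three of these posterior-perturbation defenses (softening, prediction perturbation, and MOSTLEAST), applied on the victim side before a prediction is returned. Reverse sigmoid and the nasty teacher require additional training and are omitted; the clipping and re-normalisation details are our assumptions.

```python
# Rough sketches of three posterior-perturbation defenses.
import numpy as np

def soften(logits, tau=1.0):
    # i) Softening predictions: temperature tau at the softmax layer.
    # tau = 0 degenerates to a hard (one-hot) label, as discussed in Section 5.
    if tau == 0.0:
        hard = np.zeros_like(logits)
        hard[np.argmax(logits)] = 1.0
        return hard
    z = logits / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def perturb(probs, var=0.01):
    # ii) Prediction perturbation: Gaussian noise with variance `var` added to
    # the probability vector, then clipped and re-normalised (assumed details).
    noisy = np.clip(probs + np.random.normal(0.0, np.sqrt(var), probs.shape),
                    1e-12, None)
    return noisy / noisy.sum()

def most_least(probs, eps=1e-5):
    # MOSTLEAST: 0.5 + eps for the most likely class, 0.5 - eps for the least
    # likely class, and zero for every other class.
    out = np.zeros_like(probs)
    out[np.argmax(probs)] = 0.5 + eps
    out[np.argmin(probs)] = 0.5 - eps
    return out
```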
According to Table 3, except for MOSTLEAST, none of the defenses can effectively defend against MEA unless we significantly compromise the utility (accuracy) of the victim model. However, such degradation is more detrimental to the victim model than to the extracted model; consequently, the extracted model may even surpass the victim model.
Regarding AIA, although MOSTLEAST manages to defend against MEA, it still falls short of preventing privacy leakage from the extracted model (cf. Tables 2 and 3). Among these defenses, only hard labeling (τ = 0.0) can slightly mitigate the information leakage caused by AIA. In addition, some defenses, such as prediction perturbation and reverse sigmoid, can even exacerbate the privacy leakage. Given that these methods have been used to defend against MEA, such a side effect requires more investigation before it causes severe consequences. We leave this for future study.

Conclusions
This work reveals that the hazards of the extracted model have been underestimated. In addition to violating intellectual property, it can vastly exacerbate privacy leakage even under challenging scenarios (e.g., a limited query budget, or queries drawn from distributions different from that of the training data used by the victim API). Such a vulnerability cannot be alleviated by strong defensive strategies against model extraction. We hope our work raises the alarm and motivates more investigation into the vulnerabilities of model extraction attacks.

Limitations
Although our work has revealed the vulnerability of model extraction through the lens of privacy leakage, we have not proposed a sufficiently effective defense against AIA. We thus encourage the community to investigate this direction to mitigate the adverse social impacts caused by this attack.

Statement of Ethics
This work involves two major ethical issues. The first is the violation of intellectual property, as model extraction attacks can illegally replicate commercial APIs. The second is the privacy leakage enabled by model extraction attacks. Both raise serious ethical concerns for deploying machine learning services on cloud platforms. Although we have shown that some defensive avenues can partially mitigate these vulnerabilities, more effort should be dedicated to them in future work.

A Datasets
This section details the construction of each dataset.
Trustpilot (TP). The Trustpilot sentiment dataset (Hovy et al., 2015) contains reviews associated with a sentiment score on a five-point scale, and each review is associated with three attributes: gender, age, and location, which are self-reported by users. The original dataset comprises reviews from different locations; however, in this paper, we only derive the US subset, TP-US, for study. Following Coavoux et al. (2018), we extract examples containing both gender and age information, and treat these as the private information.
AG news. The AG news corpus (Del Corso et al., 2005) is used to predict the topic label of a document, with four different topics in total. Following Zhang et al. (2015) and Jin et al. (2019), we use both the "title" and "description" fields as the input document. We use the full AG news dataset for MEA, which we call AG news (full). As AIA takes the entity as the sensitive information, we use the corpus filtered by Coavoux et al. (2018), which we call AG news. The resulting AG news corpus only includes sentences containing the five most frequent person entities, and each sentence contains at least one of these named entities. Thus, the attacker can treat identifying these five entities as five independent binary classification tasks.
Blog posts (Blog). We derive a blog post dataset from the blog authorship corpus (Schler et al., 2006). We reuse the corpus preprocessed by Coavoux et al. (2018), which covers 10 different topics. As with TP-US, the private variables are the age and gender of the author. We provide the statistics of all datasets in Table 4.

Table 5 presents the uni-gram and 5-gram overlap between the test sets and the different queries. According to Table 5, AG news is closer to news data, whereas Blog and TP-US are more similar to reviews data, which validates our claim in Section 4.2.

B.1 Impact of Prediction Sharpness
Interestingly, Table 2 shows that queries from a different distribution make AIA easier than queries from the same distribution (see the best results, corresponding to lower privacy, marked in bold in Table 2). We hypothesize that this counter-intuitive phenomenon arises because the posterior probabilities for same-distribution queries are sharper than those for different-distribution queries. This argument is further strengthened in Section 5, where we use a temperature coefficient τ at the softmax layer to control the sharpness of the posterior probabilities. We conjecture that if the model is less confident in its most likely prediction, then AIA is more likely to succeed. This speculation is confirmed by Figure 2, where a higher posterior probability leads to higher empirical privacy.

B.2 Impact of Attribute Distribution
We further investigate which attributes are more vulnerable, i.e., the relationship between the attribute distribution (histogram variance) and privacy leakage. Table 6 empirically indicates that attributes with higher variance cause more information leakage, i.e., lower empirical privacy. For example, for AG news, entities 2–4, which have higher variances, result in lower empirical privacy, while entities 0–1 are more resistant to AIA. For TP-US and Blog, as age and gender exhibit similar distributions, the AIA performance gap across these two attributes is less pronounced.
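For reference, the histogram variance used here can be computed as in the short sketch below; treating the variance of the normalised label histogram as the attribute's variance is our reading of this analysis.

```python
# Sketch of the attribute-distribution statistic: the variance of the
# normalised histogram of attribute values.
import numpy as np

def histogram_variance(attribute_values):
    _, counts = np.unique(attribute_values, return_counts=True)
    freqs = counts / counts.sum()          # normalised histogram
    return float(np.var(freqs))

# A skewed attribute has a higher histogram variance than a balanced one,
# and (per Table 6) tends to leak more information.
print(histogram_variance([0] * 90 + [1] * 10))   # 0.16
print(histogram_variance([0] * 50 + [1] * 50))   # 0.0
```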

B.3 Architecture Mismatch
In practice, it is more likely that the adversary does not know the victim's model architecture. A natural question is whether our attacks are still possible when the extracted model and the victim model have different architectures, i.e., under an architectural mismatch. To study this, we fix the architecture of the extracted model while varying the victim model among BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019). According to Table 7, when there is an architectural mismatch between the victim model and the extracted model, the efficacy of AIA is reduced, as expected. However, the leakage of private information is still severe (compared to the majority baseline in Table 2). Surprisingly, we observe that MEA does not benefit from a more accurate victim, which differs from the findings of Hinton et al. (2015). For example, the victim model performs best with XLNet-large, while MEA performs best when the victim model uses XLNet-base. We conjecture that this difference is due to the distribution mismatch between the training data of the victim model and the queries. We will conduct an in-depth study of this in the future.