Model Extraction and Adversarial Transferability, Your BERT is Vulnerable!

Natural language processing (NLP) tasks, ranging from text classification to text generation, have been revolutionised by pre-trained language models such as BERT. This allows corporations to easily build powerful APIs by encapsulating fine-tuned BERT models for downstream tasks. However, when a fine-tuned BERT model is deployed as a service, it may suffer from various attacks launched by malicious users. In this work, we first present how an adversary can steal a BERT-based API service (the victim/target model) on multiple benchmark datasets with limited prior knowledge and queries. We further show that the extracted model can lead to highly transferable adversarial attacks against the victim model. Our studies indicate that the potential vulnerabilities of BERT-based API services still hold even when there is an architectural mismatch between the victim model and the attack model. Finally, we investigate two defence strategies to protect the victim model and find that, unless the performance of the victim model is sacrificed, both model extraction and adversarial transferability can effectively compromise the target models.


Introduction
Recently, owing to the success of pretrained BERT-based models (Devlin et al., 2018; Liu et al., 2019; Sun et al., 2020b), downstream NLP tasks have been revolutionised: limited task-specific supervision suffices once a pretrained BERT model is fine-tuned. Meanwhile, commercial task-oriented NLP models, built on top of BERT models, are often deployed as pay-per-query prediction APIs for the sake of protecting data privacy, system integrity and intellectual property.
As publicly accessible services, commercial APIs have become victims of different explicit attacks, such as privacy attacks (Lyu et al., 2020a,b; Shokri et al., 2017) and adversarial attacks (Shi et al., 2018). Recently, prior works have also found that, with the aid of carefully designed queries and the outputs of the NLP APIs, many existing APIs can be locally imitated via model extraction (Krishna et al., 2019; Wallace et al., 2020), which raises concerns about the vulnerability of NLP APIs. For instance, competing companies can imitate the victim model at a negligible cost. Since the considerable investment in data annotation and algorithm design is sidestepped, the competing companies would be able to launch an identical service at a more competitive price than the victim companies. Such security issues can be exacerbated when the back-end pretrained models, such as BERT, are publicly available (Krishna et al., 2019).
Beyond model extraction, we further demonstrate that adversarial examples crafted with the extracted model can be transferred to the black-box victim model. From the perspective of commercial competition, if competitors manage to elicit incorrect predictions from the victim services, they can launch an advertising campaign against the victim model with these adversarial examples.
In summary, we investigate the vulnerabilities of publicly available NLP classification APIs through a two-stage attack. First, a model extraction attack is issued to obtain a local copy of the target model. Then, we conduct adversarial attacks against the extracted model, which are empirically transferable to the target model. To patch these vulnerabilities, we mount two basic defence strategies on the victim models. The empirical results show that, unless the victims return corrupted predictions, model extraction and adversarial example transferability are resilient to the defences. Our results spotlight the risks of deploying APIs on top of pretrained BERT through the lens of the model extraction attack and the adversarial example transfer attack. Such attacks can be conducted at a cost of as little as $7.1.

Related Work

Model Extraction Attack (MEA)
Model extraction attacks (also referred to as "model stealing") have been effectively applied to different tasks, ranging from computer vision tasks (Orekondy et al., 2019) to NLP tasks (Chandrasekaran et al., 2020).
In a nutshell, model extraction enables malicious users to forge the functionality of a black-box victim model as closely as possible. This activity constitutes a serious infringement of intellectual property. Additionally, follow-up attacks can be facilitated in the aftermath of model extraction. In particular, an adversarial attack can be built upon the extracted model, which can enhance the success rate of fooling the victim model.

Adversarial Transferability in NLP
As a byproduct of adversarial attacks, it has been shown that adversarial transferability enables adversarial examples crafted on one model to fool other models (Liu et al., 2016; Papernot et al., 2017), especially in computer vision research. Although this property has been explored by a few recent works on NLP systems (Sun et al., 2020a; Wallace et al., 2020), it remains largely unexplored for BERT-based APIs, as does the question of whether transferability can succeed when the substitute (extracted) model and the victim model have different architectures.

Attack on BERT-based API
Our attacks against BERT-based APIs consist of two phases, Model Extraction Attack (MEA) and Adversarial Example Transfer (AET), as depicted in Figure 1.

Model Extraction Attack (MEA)
In the first phase, we assume that a "victim model" M_v is commercially available as a prediction API for a target task T. An adversary attempts to reconstruct a local copy M_e (the "extracted model") of M_v by querying M_v. Our goal is to extract a model with accuracy comparable to the victim model. Generally, MEA can be formulated as a two-step approach, as illustrated by the left figure in Fig. 2: attackers first issue queries to the victim API, then reconstruct a local copy of the victim model as the "extracted model" using the retrieved query-prediction pairs.
Each query is labelled by the victim API, and the resulting set of m query-prediction pairs is used to train M_e. We assume that the attacker fine-tunes the public release of f_{bert,θ*} on this dataset, with the objective of imitating the behaviour of M_v. Once the local copy M_e is obtained, the attacker no longer needs to pay the original service provider.
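The two-step procedure can be made concrete with a minimal, self-contained sketch. Everything here is an illustrative stand-in for the BERT-based setting: the "victim" is a fixed linear classifier exposed only through `victim_api()`, and the attacker distils its soft predictions into a local copy without ever seeing the victim's parameters.

```python
import math
import random

# Hypothetical black-box victim: a fixed linear model standing in for
# the fine-tuned BERT API; the attacker never observes W_VICTIM.
W_VICTIM = [1.5, -2.0]

def victim_api(x):
    """Black-box API: returns a softmax distribution, not parameters."""
    s = sum(w * xi for w, xi in zip(W_VICTIM, x))
    exps = [math.exp(s), math.exp(-s)]
    z = sum(exps)
    return [e / z for e in exps]

def extract_model(queries, lr=0.5, epochs=200):
    """Train a local copy M_e on the retrieved query-prediction pairs."""
    pairs = [(x, victim_api(x)) for x in queries]  # step 1: query the API
    w = [0.0, 0.0]
    for _ in range(epochs):                        # step 2: imitate it
        for x, p in pairs:
            s = sum(wi * xi for wi, xi in zip(w, x))
            q = 1.0 / (1.0 + math.exp(-2 * s))     # M_e's P(label 0)
            grad = q - p[0]                        # soft-label CE gradient
            for i in range(len(w)):
                w[i] -= lr * grad * x[i]
    return w

rng = random.Random(0)
queries = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
w_ext = extract_model(queries)
agree = sum(
    (victim_api(x)[0] > 0.5) == (sum(wi * xi for wi, xi in zip(w_ext, x)) > 0)
    for x in queries
) / len(queries)
print(f"label agreement with victim: {agree:.2f}")
```

Training on the victim's soft probabilities rather than hard labels mirrors the distillation view of extraction discussed later: the extracted model recovers the victim's decision boundary from query-prediction pairs alone.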

Adversarial Example Transfer (AET)
In the second phase, we leverage the transferability of adversarial examples: the attacker crafts adversarial examples against the locally extracted model in a white-box manner, and then transfers them to the black-box victim model.
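The transfer step above can be sketched with the same toy linear setting; this is a hypothetical illustration, not the paper's text-attack method (which appears in the AET experiments). The surrogate plays the role of the extracted model, the victim is only observable through `predict()`, and an FGSM-style sign step stands in for the gradient-based perturbation.

```python
# Toy sketch of adversarial example transfer: craft on the white-box
# surrogate (extracted model), then test on the black-box victim.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

w_victim = [1.0, -1.2]     # black-box: only predict() is observable
w_surrogate = [0.9, -1.0]  # local copy obtained via model extraction

def craft_adversarial(w, x, step=0.1, max_iters=50):
    """FGSM-style sign steps on the surrogate until its label flips."""
    y = predict(w, x)
    sign = -1 if y == 1 else 1          # push the score across the boundary
    adv = list(x)
    for _ in range(max_iters):
        if predict(w, adv) != y:
            break
        adv = [ai + sign * step * (1 if wi > 0 else -1)
               for ai, wi in zip(adv, w)]
    return adv

x = [0.8, 0.1]                          # both models initially predict 1
adv = craft_adversarial(w_surrogate, x)
print(predict(w_victim, x), predict(w_victim, adv))  # transfer: 1 -> 0
```

Because the surrogate approximates the victim's decision boundary, an example that crosses the surrogate's boundary often crosses the victim's too, which is precisely the transferability the attack exploits.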

NLP Tasks and Datasets
To evaluate the efficacy of the proposed attacks, we select four NLP datasets covering two main tasks, i) sentiment analysis and ii) topic classification (Li et al., 2020). We use TP-US from Trustpilot Sentiment dataset (Hovy et al., 2015) and YELP dataset (Zhang et al., 2015) for sentiment analysis. We use AG news corpus (Del Corso et al., 2005) and Blog posts dataset from the blog authorship corpus (Schler et al., 2006) for topic classification. We refer readers to Appendix A for more details about the pre-processing of these datasets.

MEA Setup and Results
Attack Strategies: We assume that both the victim and the extracted models are initialised from a freely available pretrained BERT. Once the victim model is task-specifically fine-tuned following Section 3.1, it can be queried as a black-box API. Afterwards, the extracted model is obtained by imitating the victim model. Following Krishna et al. (2019), the number of queries starts at 1x the size of the victim's training set, then scales up to 5x. We test the accuracy of the victim model and the extracted model on the same held-out set for a fair comparison.
Query Distribution: To examine the correlation between the query distribution (D_A) and the effectiveness of our attacks on the victim model trained on data from D_V (c.f., Table 1), we explore the following two scenarios: (1) we use the same data as the original data of the victim model (D_A = D_V); note that attackers have no true labels of the original data; (2) we sample queries from a different distribution but the same domain as the original data (D_A ≠ D_V).
Since the owners of APIs tend to use in-house datasets, it is difficult for the attacker to know the target data distribution as prior knowledge. Therefore, our second assumption is closer to the practical scenario. As the training datasets of the victims are sourced from either the review domain or the news domain, we consider datasets from these two domains as our queries. Specifically, we leverage the Amazon review dataset (Zhang et al., 2015) or the CNN/DailyMail dataset (Hermann et al., 2015) to query the victim models.
According to Table 2, we observe that: 1) the effectiveness of the extraction correlates with the closeness between the victim's training data and the attacker's queries; 2) using the same data even outperforms the victim models, which is also known as self-distillation (Furlanello et al., 2018); 3) despite the distribution shift introduced by the review and news corpora, our MEA can still achieve 0.85-0.99x of the victim models' accuracies when the number of queries varies in {1x, 5x}. Although more queries suggest better extraction performance, small query budgets (0.1x and 0.5x) are often sufficient for a successful attack. More results are available in Appendix C. From now on, unless otherwise mentioned, we use news data for AG news, and review data for TP-US, Blog and Yelp.
Costs Estimation: We analyse the efficiency of MEA on various classification datasets. Each query is charged due to the pay-as-you-use policy adopted by service providers. We estimate the costs for each task in Table 3. Compared with the investment required to annotate data and build the victim model, the cost is highly economical and worthwhile.

AET Setup and Results
After MEA, we craft adversarial examples on the extracted model with the white-box attack of Sun et al. (2020a), which leverages the gradients of the gold labels w.r.t. the embeddings of the input tokens to find the most informative tokens, i.e., those with the largest gradients among all positions within a sentence. Then we corrupt the selected tokens with one of the following typos: 1) Insertion; 2) Deletion; 3) Swap; 4) Mistype: mistyping a word through the keyboard, such as "oh" → "0h"; 5) Pronounce: wrongly typing due to the close pronunciation of the word, such as "egg" → "agg"; 6) Replace-W: replacing the word with a frequent human keyboard typo based on Wikipedia statistics (Sun, 2020).
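The typo corruptions above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `KEYBOARD_NEIGHBOURS` map is an invented stand-in for the full keyboard and Wikipedia typo tables, and the Pronounce and Replace-W variants are omitted because they require those external resources.

```python
import random

# Illustrative stand-in for the keyboard-typo table described above.
KEYBOARD_NEIGHBOURS = {"o": "0", "l": "1", "e": "3", "a": "s", "h": "j"}

def corrupt(token, mode, rng):
    """Apply one typo-style corruption at a random position."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    if mode == "insertion":                 # 1) duplicate a character
        return token[:i] + token[i] + token[i:]
    if mode == "deletion":                  # 2) drop a character
        return token[:i] + token[i + 1:]
    if mode == "swap":                      # 3) swap adjacent characters
        return token[:i] + token[i + 1] + token[i] + token[i + 2:]
    if mode == "mistype":                   # 4) keyboard-neighbour typo
        ch = token[i]
        return token[:i] + KEYBOARD_NEIGHBOURS.get(ch, ch) + token[i + 1:]
    return token

rng = random.Random(0)
for mode in ("insertion", "deletion", "swap", "mistype"):
    print(mode, "->", corrupt("model", mode, rng))
```

In the actual attack, such corruptions are applied only to the tokens with the largest gradient magnitude, so the perturbation budget is spent on the positions most likely to flip the prediction.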
In order to understand whether our extracted model manages to improve transferability, we also launch a list of black-box adversarial attacks in the same manner: deepwordbug (Gao et al., 2018), textbugger, textfooler (Jin et al., 2019) and advbert (Sun et al., 2020a). We find that adversarial examples crafted on a better-extracted model (5x vs. 1x queries) lead to better attack performance. We believe this conspicuous gain is attributable to the higher fidelity to the victim model obtained by a better extraction (c.f., Table 2).

Architecture Mismatch
In practice, the adversary may not know the victim's model architecture. Hence we also study the attack behaviours under different architectural settings. According to Table 5, when both the victim and the extracted models adopt BERT-large, the vulnerability of the victim is magnified in all attacks, which implies that a model with higher capacity is more vulnerable to our attacks. As expected, the efficacy of AET is alleviated when an architectural mismatch exists (more experiments can be found in Appendix D).

Defence
We next briefly discuss two defence strategies the victim model can adopt to counter these attacks.
• Softening predictions (SOFT). A temperature coefficient τ on the softmax layer manipulates the posterior probability distribution. A higher τ leads to a smoother probability distribution, whereas a lower one produces a sharper distribution. When τ = 0, the posterior probability degenerates into a hard label.
• Prediction perturbation (PERT). Another defence method is adding normal noise with variance σ to the predicted probability distribution. The larger the variance of the noise distribution, the stronger the defence.
Table 6 indicates that varying the temperature of the softmax cannot defend the victim model against MEA, except for τ = 0 (hard label), which degrades all attacks to some extent.
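The two defences can be sketched in a few lines. This is a minimal illustration under stated assumptions: the clip-and-renormalise step in `perturb` is our choice for keeping the noisy output a valid distribution, as the text does not specify how out-of-range probabilities are handled.

```python
import math
import random

def soften(logits, tau):
    """SOFT: temperature-scaled softmax; tau = 0 degenerates to a hard label."""
    if tau == 0:
        k = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == k else 0.0 for i in range(len(logits))]
    m = max(l / tau for l in logits)                 # numerical stability
    exps = [math.exp(l / tau - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def perturb(probs, sigma, rng):
    """PERT: add Gaussian noise, then clip and renormalise (our assumption)."""
    noisy = [max(p + rng.gauss(0.0, sigma), 0.0) for p in probs]
    z = sum(noisy) or 1.0
    return [p / z for p in noisy]

logits = [2.0, 0.5, -1.0]
print(soften(logits, 1.0))   # sharper distribution
print(soften(logits, 5.0))   # smoother distribution
print(soften(logits, 0))     # hard label
print(perturb(soften(logits, 1.0), 0.5, random.Random(0)))
```

Note the trade-off both defences share: the more the returned distribution deviates from the model's true posterior, the less useful the API's outputs become for legitimate users as well.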
Regarding perturbation, it can achieve a significant defence, but at the cost of the accuracy of the victim models. Surprisingly, when σ = 0.50, MEA even surpasses the victim model. We conjecture that, despite the perturbed post-softmax probabilities, the extracted model can still acquire informative knowledge via model extraction. We will conduct an in-depth study on this in the future.
To sum up, both MEA and AET pose severe threats to the BERT-based APIs, even when the adversary merely has access to limited or erroneous predictions.

Conclusions
This work goes beyond model extraction from BERT-based APIs: we also identify that the extracted model can largely enhance adversarial example transferability, even in difficult scenarios, i.e., a limited query budget, queries from different distributions, or an architectural mismatch. Extensive experiments on representative NLP datasets and tasks under various settings demonstrate the effectiveness of our attacks against BERT-based APIs. In the future, we plan to extend our work to more complex NLP tasks and develop more effective defences.

A Dataset Description
Trustpilot (TP). The Trustpilot Sentiment dataset (Hovy et al., 2015) contains reviews associated with a sentiment score on a five-point scale. The original dataset comprises reviews from different locations; in this paper, we only derive TP-US for study.
AG news. We use the AG news corpus (Del Corso et al., 2005). This task is to predict the topic label of the document, with four different topics in total. Following Zhang et al. (2015) and Jin et al. (2019), we use both the "title" and "description" fields as the input document.
Blog posts (Blog). We derive a blog posts dataset (Blog) from the blog authorship corpus presented by Schler et al. (2006). We recycle the corpus preprocessed by Coavoux et al. (2018), which covers 10 different topics.
Yelp Polarity (Yelp). The Yelp dataset is a document-level sentiment classification dataset (Zhang et al., 2015). The original dataset uses a five-point scale (1-5), while the polarised version assigns negative labels to ratings of 1 and 2, and positive labels to ratings of 4 and 5.
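The polarisation rule can be written out directly. One caveat: the handling of the middle rating (3) is our assumption, since the text above does not specify it; we drop it here, as is common for polarity versions of rating datasets.

```python
# Map five-point Yelp ratings to binary polarity labels.
def polarise(rating):
    if rating in (1, 2):
        return "negative"
    if rating in (4, 5):
        return "positive"
    return None  # neutral reviews (rating 3) are discarded: our assumption

print([polarise(r) for r in (1, 3, 5)])
```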

B Training Details
We use Huggingface (Wolf et al., 2020) as the codebase. Each model is trained for 4 epochs on an NVIDIA V100 GPU, with a batch size of 64. We use AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5.

C Performance of Different Query Size
Due to the budget limit, the attacker cannot issue massive requests. To investigate the attack performance of model extraction in the low-resource setting, we conduct two additional experiments, which only utilise 0.1x and 0.5x of the training data of the victim models respectively. According to Table 7, the overall performance of the extracted models is comparable to the victim models. Only Blog with 0.1x training suffers a drastic drop, as Blog uses the smallest number of training samples among the four datasets. In addition, queries from distant domains exhibit significant degradation compared to close ones. For example, sampling 0.1x-5x queries from news data presents a more stable attack performance against the victim model trained on AG news than against the one trained on Blog.

D Architectural Mismatch
In Table 8, we experiment with different models, including BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and XLNET (Yang et al., 2019). Although the architectural difference can cause some drops in MEA and AET, overall the proposed attacks are still effective.