MeaeQ: Mount Model Extraction Attacks with Efficient Queries

We study model extraction attacks in natural language processing (NLP) where attackers aim to steal victim models by repeatedly querying the open Application Programming Interfaces (APIs). Recent works focus on limited-query budget settings and adopt random sampling or active learning-based sampling strategies on publicly available, unannotated data sources. However, these methods often result in selected queries that lack task relevance and data diversity, leading to limited success in achieving satisfactory results with low query costs. In this paper, we propose MeaeQ (Model extraction attack with efficient Queries), a straightforward yet effective method to address these issues. Specifically, we initially utilize a zero-shot sequence inference classifier, combined with API service information, to filter task-relevant data from a public text corpus instead of a problem domain-specific dataset. Furthermore, we employ a clustering-based data reduction technique to obtain representative data as queries for the attack. Extensive experiments conducted on four benchmark datasets demonstrate that MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries. Our code is available at https://github.com/C-W-D/MeaeQ.


Introduction
The adoption of Machine Learning as a Service (MLaaS) via APIs has introduced a new security challenge known as model extraction attacks (Tramèr et al., 2016). Attackers repeatedly access the API to acquire outputs by utilizing meticulously crafted inputs (queries) and subsequently train a local model based on the gathered input-output pairs. This attack aims to obtain an extracted model that closely approximates the performance of the victim model, thereby posing a substantial threat to the intellectual property of the model owner.

* Kun Li is the corresponding author.
In black-box scenarios, wherein the architecture and training data of the victim model remain unknown, the primary concern for attackers is how to design high-quality queries with limited query budgets. This is crucial since frequent API calls not only incur considerable costs (Correia-Silva et al., 2018) but also carry the potential of triggering the victim model's defense mechanisms (Papernot et al., 2017; Juuti et al., 2019; Zhang et al., 2021), thereby diminishing the effectiveness of the attack. Some studies (Orekondy et al., 2019; Xu et al., 2022; Karmakar and Basu, 2023) sample queries from annotated data associated with the victim model's training data, e.g., using BBC News data to extract models trained on AG News (Zhang et al., 2015), which deviates from the assumption of black-box model extraction, where the training data distribution of the victim model is unknown. Therefore, recent research has focused on leveraging publicly available unannotated data sources for model extraction attacks. The pioneering work by Pal et al. (2020) explores the utilization of these data sources and proposes an active learning-based sampling strategy that dynamically adjusts query selection based on self-feedback. Likewise, Krishna et al. (2020) present two query construction methods, one involving sequences of random words and the other involving actual sentences randomly sampled from Wikipedia. Despite the demonstrated efficacy of these methods in their individual studies, we experimentally find that the queries sampled by these approaches often suffer from category imbalance. For instance, when applying the random strategy to an online hate speech detection API, the ratio of positive to negative samples becomes skewed as high as 30 : 1. Such a notable imbalance in sample distribution presents challenges for model training, especially when considering low query costs. We attribute this issue to task-irrelevant text content and insufficient data diversity within the queries.
In this paper, we propose MeaeQ (Model extraction attack with efficient Queries), a straightforward yet effective method to address these issues. MeaeQ comprises two modules: Task Relevance Filter (TRF) and Data Reduction based on Clustering (DRC). The TRF module aims to select data that is highly relevant to the task. To achieve this, we utilize a pre-trained zero-shot sequence inference classifier, in combination with API service information, to infer the entailment relations between a pre-designed prompt and all actual sentences extracted from a publicly available text corpus. The second module, DRC, is designed to mitigate information redundancy within the query pool filtered by the TRF module. To accomplish this, DRC first extracts embeddings from all texts in the query pool and then employs a clustering method to create multiple clusters. It subsequently selects the data nearest to the centroid of each cluster as the ultimate query. Finally, we send these queries to the victim model and then use the outputs as labels to fine-tune our local model.
Extensive experiments on simulated victim models demonstrate that the model extracted by MeaeQ exhibits higher functional similarity to the victim model than the baseline methods on four benchmark datasets. Furthermore, we validate the generalizability of MeaeQ across diverse model architectures. In-depth analyses reveal the significant contributions of both the TRF and DRC modules in the model extraction attack. The primary contributions of this paper can be summarized as follows:
• We employ a zero-shot sequence inference classifier, combined with API service information, to filter data with high task relevance.
• We design a data reduction technique based on the clustering method to alleviate the information redundancy problem in text sampling.
• Extensive experiments confirm the effectiveness of our method in model extraction attacks at a low query cost. Additionally, the queries sampled by our approach enhance the stability of the extracted model during training.

Related Work
We introduce related work on model extraction attacks from four perspectives: the type of victim model, the type of API feedback, the query source, and the query sampling strategy.
Type of API Feedback. Several studies consider using the complete probability vectors of the victim model over all or top-k classes returned by the API as feedback for the query, which is less practical for a public API (Tramèr et al., 2016; Orekondy et al., 2019). Instead, following Pal et al. (2020) and Wang et al. (2022), we focus on the most challenging scenario where the API only provides the predicted hard label.
Query Source. Most studies utilize query sources derived from public problem domain-related datasets (Papernot et al., 2017; Xu et al., 2022; Karmakar and Basu, 2023). In contrast, following Pal et al. (2020), we use a large public text corpus as the query source, ensuring no overlap with the private dataset of the victim model.
Query Sampling Strategy. In CV, query sampling strategies can be broadly categorized into three groups. Some studies employ reinforcement learning techniques (Orekondy et al., 2019), some utilize adversarial example generation methods (Papernot et al., 2017; Juuti et al., 2019; Yu et al., 2020), and some employ model inversion techniques (Gong et al., 2021). However, applying these methods to NLP is challenging due to the discrete nature of text data, in contrast to the continuous nature of image data. In NLP, the earliest work comes from Pal et al. (2020), who adopt an active learning method to iteratively sample queries. Krishna et al. (2020) construct nonsensical sequences of words or randomly sample actual sentences as queries. Different from them, we improve the sampling strategy through task-driven filtering and information redundancy minimization, thereby making better use of the limited query budget.

Methodology
In this section, we present our method for NLP model extraction attacks. We first formalize the problem and then elucidate how to sample queries with high task relevance and low information redundancy.

Problem Formulation
Let f_v denote the victim model (θ_v are all its parameters), which represents a black-box API providing services for the task T. We assume the API only returns the predicted hard label rather than probability scores. f_v is trained on a private dataset that is inaccessible to the public. The attackers construct a query set and utilize the API to obtain outputs by sending sampled queries. This process generates the attacker's dataset {x_i, f_v(x_i)}_{i=1}^k, where k denotes the number of queries. The attackers then use this dataset to train their own model f_a (θ_a represents all its parameters).
For evaluation, we adopt the same metrics as Krishna et al. (2020), namely Accuracy and Agreement. Accuracy measures the prediction accuracy of f_a, while Agreement assesses the functional similarity between f_a and f_v. Both metrics are calculated on the private test dataset D_test = {(x, y) | x ∈ X_test, y ∈ Y_test} of the victim model. Agreement is defined as:

Agreement(f_v, f_a) = (1 / |D_test|) Σ_{x ∈ X_test} I(f_a(x; θ_a) = f_v(x; θ_v)),   (1)

where I(·) is the indicator function. The calculation of Accuracy is equivalent to that of Agreement, with the only difference being the substitution of f_v with the ground truth label.
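In code, both metrics reduce to a simple match rate over the victim's private test set. A minimal sketch, assuming hard-label predictions are given as plain lists:

```python
def agreement(extracted_preds, victim_preds):
    """Fraction of test inputs on which the extracted model matches
    the victim model's hard-label prediction (Eq. 1)."""
    assert len(extracted_preds) == len(victim_preds)
    matches = sum(int(a == v) for a, v in zip(extracted_preds, victim_preds))
    return matches / len(victim_preds)

# Accuracy is the same computation with ground-truth labels
# in place of the victim model's predictions.
```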
Our goal is to construct a high-quality query set with a given budget to train a thief model with performance close to the victim model. The selection of queries directly impacts the parameters of f_a, so we can formally express the objective as:

max_{Q, θ_a} Agreement(f_v, f_a),  s.t. |Q| ≤ k,

where θ_v is fixed and hidden.

Overview of MeaeQ
In this subsection, we provide an overview of MeaeQ. The attackers start by sampling all actual sentences from a text corpus to initialize the original query set Q_o = {x_i}_{i=1}^p, where p is the number of actual sentences in the corpus. Then they utilize a sequence inference classifier to filter the input pairs that contain a hypothesis (a manually designed prompt) and a premise selected from Q_o, resulting in a task-related query set Q_g. Next, the attackers employ a clustering-based data reduction technique on Q_g to obtain a query set with low information redundancy, denoted as Q_r. Finally, the attackers perform a model extraction attack using Q_r. The overview of MeaeQ is illustrated in Figure 1.

Task Relevance Filter
Inspired by the study of Yin et al. (2019), who propose using a PLM trained on the natural language inference task as a zero-shot sequence classifier for text classification, we introduce the Task Relevance Filter (TRF), which combines API service information to filter queries related to the target task. The TRF consists of a sequence inference classifier, which is a pre-trained language model trained on the MNLI dataset (Williams et al., 2018). We denote this classifier as g, whose input pairs are a premise and a hypothesis. We design a prompt h as one input (hypothesis) based on the service information of the task T, and take the actual sentence x_i from Q_o as the other input (premise) to the model g for reasoning about the relationship between them. For example, if the task is to detect whether a text contains hate speech, we design the prompt as "This is a hate speech" to filter queries that resemble hate speech. Finally, we obtain the probability vector of the logical relation classification for the actual sentence x_i with prompt h from the output of the model:

(p_i^neutral, p_i^entailment, p_i^contradiction) = g(x_i, h),

where p_i^y is the probability at relationship label y ∈ {neutral, entailment, contradiction}.
To ensure the selection of high-quality task-related queries, the Task Relevance Filter incorporates a filtering mechanism. We simply design this mechanism as follows: if p_i^entailment is greater than or equal to a threshold ϵ, the sample x_i is retained; otherwise it is discarded. This operation is repeated until all samples in Q_o are classified, resulting in the task-related query set Q_g.
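The filtering mechanism itself is a simple threshold test. A sketch, where `nli_entailment_prob` is a hypothetical stand-in for the MNLI-trained classifier g (the paper uses BART Large):

```python
def task_relevance_filter(sentences, prompt, nli_entailment_prob, eps=0.95):
    """Keep sentences whose entailment probability against the task
    prompt (hypothesis) reaches the threshold eps.
    `nli_entailment_prob(premise, hypothesis)` is a placeholder for
    the zero-shot sequence inference classifier g."""
    return [x for x in sentences if nli_entailment_prob(x, prompt) >= eps]

# Toy stub standing in for the real NLI model:
stub = lambda premise, hypothesis: 0.97 if "hate" in premise else 0.10
kept = task_relevance_filter(
    ["i hate this group", "the weather is nice"],
    "This is a hate speech", stub)
# kept == ["i hate this group"]
```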

Data Reduction based on Clustering
To reduce the information redundancy in the query set Q_g, we propose a Data Reduction technique based on Clustering (DRC). We transform the information redundancy problem into a graph-theoretic problem. Given a weighted undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, we extract the sentence embedding of each sample x_i ∈ Q_g as its vertex representation using the model g:

v_i = g(x_i).

Here, v_i is a vector of dimension d. We define the edge weight e_{i,j} as the cosine similarity distance between the vertices v_i and v_j:

e_{i,j} = 1 − (v_i · v_j) / (‖v_i‖ ‖v_j‖).
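Assuming "cosine similarity distance" denotes one minus the cosine similarity (an assumption, since the original formula is not reproduced here), the edge weight can be computed as:

```python
import math

def cosine_distance(u, v):
    """Edge weight e_ij: 1 - cosine similarity of two embeddings
    (assumed reading of "cosine similarity distance")."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```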
The objective of this module is to select the most representative samples from Q_g to form a reduced query set Q_r. The selection problem can be viewed as finding a subgraph Ĝ_r that maximizes the sum of edge weights:

Ĝ_r = argmax_{G_r ⊆ G, |V_r| = |Q_r|} Σ_{v_i, v_j ∈ V_r} e_{i,j}.

However, this problem is NP-hard (as proved by He et al. (2021)). Therefore, we design an approximate solution that achieves the sample selection within a reasonable time. The steps are as follows:

Step 1. First, we apply a clustering algorithm to the query set Q_g for t iterations. The number of clusters k is set to |Q_r|, which represents the number of queries used to access the victim model's API. After clustering, we obtain a set of k clusters CLSTR = {clstr_i}_{i=1}^k and a set of k cluster centroids CTRID = {ctrid_i}_{i=1}^k, where clstr_i represents the set of samples in the i-th cluster, and ctrid_i ∈ R^d is the centroid of the i-th cluster.
Step 2. For the i-th cluster clstr_i, add a sample point x_i^r to the reduced query set Q_r, satisfying the condition that x_i^r is the closest to the centroid ctrid_i within the cluster, i.e., the cosine similarity distance between them is the smallest:

x_i^r = argmin_{x_j ∈ clstr_i} e(v_j, ctrid_i).

Step 3. Repeat Step 2 k times until all clusters are traversed, completing the construction of Q_r.
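The three steps can be sketched as follows. This is a plain-NumPy illustration (the paper uses faiss-accelerated k-means); the deterministic farthest-point initialization is an illustrative choice, not the paper's:

```python
import numpy as np

def drc_select(embeddings, k, t=300):
    """DRC sketch: cluster the embeddings of Q_g into k clusters
    (Step 1), then return the index of the member closest to each
    centroid (Steps 2-3). Vectors are L2-normalized so Euclidean
    proximity tracks cosine proximity."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # deterministic farthest-point initialization (illustrative choice)
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min(((X[:, None] - np.array(centroids)[None]) ** 2).sum(-1), axis=1)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(t):  # Step 1: k-means iterations
        assign = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    selected = []  # Steps 2-3: nearest member to each centroid
    for j in range(k):
        idx = np.where(assign == j)[0]
        if len(idx):
            dist = ((X[idx] - centroids[j]) ** 2).sum(1)
            selected.append(int(idx[dist.argmin()]))
    return selected
```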
The idea behind this approach is that the clustering algorithm generates k clusters with large inter-cluster distances and small intra-cluster distances. We then select the sample point closest to its centroid within each cluster as a candidate sample. This strategy aims to achieve an approximately maximal distance between all candidate samples.

Time complexity analysis: Step 1 requires O(|Q_g| k t) to perform clustering with t iterations in total. Steps 2 and 3 involve searching for the smallest distance within each cluster, which is equivalent to traversing the entire set Q_g, with a time complexity of O(|Q_g|). Therefore, the overall time complexity of DRC remains O(|Q_g| k t).

Experiments
In this section, we conduct extensive experiments to evaluate the effectiveness of our method.

Experiment Settings
Datasets of Victim Model. We train simulated victim models for four tasks: a hate speech detection dataset, Hate Speech (de Gibert et al., 2018); a topic classification dataset, AG News (Zhang et al., 2015); and two sentiment classification datasets, SST-2 (Socher et al., 2013) and IMDB (Maas et al., 2011). Details about the data division and statistical information of these datasets can be found in Appendix B.

Corpus and Prompts. As the query source, we use the WikiText-103 corpus (Merity et al., 2017), ensuring that there is no overlap with the private datasets. We design prompts in TRF for each dataset. Since SST-2 and IMDB are both sentiment classification tasks for movie reviews, we use the same prompt "This is a movie review." for both. For AG News, which focuses on news topics, we use the prompt "This is a news.". In the case of Hate Speech, we use the prompt "This is a hate speech".

Implementation Details. In our experiments, we use BERT Base as the architecture for both the victim model and the extracted model. Additionally, we explore the effectiveness of MeaeQ on different model architectures in Section 4.4. The victim model is trained for 3 epochs and the extracted model is trained for 10 epochs. We select the best checkpoints based on the validation set. We utilize BART Large (Lewis et al., 2020) as the sequence inference classifier. We use two evaluation metrics, Agreement and Accuracy, as described in Subsection 3.1. For each dataset, we set up several groups with different query budgets expressed as query rates, which represent the proportion of the original dataset size. The threshold ϵ in TRF and the number of iterations t in DRC are set to 0.95 and 300, respectively. We employ the faiss library (Johnson et al., 2019) to accelerate vector retrieval and use the k-means algorithm (MacQueen, 1967) for clustering. All experiments are repeated 10 times with different random seeds on a single 32GB NVIDIA V100 GPU. More details about the hyperparameters can be found in Appendix C.

Baselines
We compare MeaeQ with the following baselines:

RS (Random Sampling). RS is proposed by Krishna et al. (2020) and randomly samples real sentences from the WikiText-103 corpus.

AL-RS. AL-RS is a simple variant of active learning, introduced by Pal et al. (2020). In each iteration, AL-RS utilizes a random strategy. The key difference between AL-RS and RS lies in multiple rounds of sampling versus a single round of sampling.

AL-US. AL-US is proposed by Pal et al. (2020) and is also an active learning-based approach. In each iteration, AL-US uses an uncertainty strategy that selects the top-k samples with the highest entropy value computed from the predicted probability vector of the attacker's model.
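The uncertainty step of AL-US can be sketched as ranking candidates by the entropy of the attacker model's predicted probabilities; a minimal version, with probability vectors assumed as plain lists:

```python
import math

def uncertainty_select(prob_vectors, k):
    """AL-US selection sketch: return the indices of the top-k
    samples with the highest prediction entropy."""
    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    ranked = sorted(range(len(prob_vectors)),
                    key=lambda i: entropy(prob_vectors[i]),
                    reverse=True)
    return ranked[:k]
```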

Main Results
In this subsection, we compare MeaeQ with the baselines on the simulated victim models trained on four datasets. The results are presented in Table 1 and Table 6 (due to space limitations, all accuracy results are listed in Appendix A). Tables 1 and 6 show that MeaeQ consistently outperforms the baselines on all datasets with low query budgets, demonstrating its effectiveness. Especially in Group A, we find that MeaeQ significantly outperforms the top-performing baseline on three datasets, with slightly smaller gains on AG News. We attribute this to the similarity in data distribution between Wikipedia and AG News, as many Wikipedia articles can be considered a form of news. We also notice that the baselines exhibit a high standard deviation in performance, implying some noise in the sampled queries. In contrast, MeaeQ can sample task-related and informative queries, leading to higher functional similarity and more stable performance, even with extremely low query budgets.
To further validate the effectiveness of MeaeQ, we conduct experiments with more query budgets and present the results in Figure 3. We find that MeaeQ outperforms the baselines in nearly all settings and exhibits lower standard deviations. In particular, for SST-2 and IMDB, MeaeQ's performance at query budget ×0.008 is comparable to or significantly surpasses the highest performance achieved by the baseline methods at query budget ×0.02. When achieving similar performance, MeaeQ reduces the query cost by more than half.

Cross Model Extraction
We explore the applicability of MeaeQ to different extracted / victim model architectures, including BERT Base, RoBERTa Base (Liu et al., 2019), and XLNet Base (Yang et al., 2019), as depicted in Figure 2 and Figure 7. It is evident that MeaeQ surpasses the baselines across different model architectures, demonstrating its effectiveness and robustness. Besides, we notice that the baselines exhibit good performance only on matching model architectures, as indicated by the darker color on the diagonal of the heatmap. In contrast, MeaeQ consistently exhibits superior performance across different architectures, indicating its model-agnostic nature. In summary, MeaeQ can be successfully applied to various model architectures for extraction while maintaining exceptional performance.

Ablation Study
To better understand the two key modules in MeaeQ, we compare MeaeQ with two variants, w/o TRF and w/o DRC, where the corresponding module is removed from MeaeQ. The results are shown in Figure 4 and Figure 8.

From the figures, MeaeQ clearly outperforms the two variants, particularly in terms of higher agreement and shorter error bars at low query budgets (e.g., ×0.003 / 0.005). These results highlight the importance of TRF and DRC, as they enhance both the efficacy of the extraction and its stability. Moreover, MeaeQ consistently outperforms w/o TRF across all query budgets, emphasizing the critical role of the task-related queries screened by TRF, which align the attackers' data distribution with the victim model's training data distribution. However, we also observe that as the query budget increases, the performance gap between w/o DRC and MeaeQ narrows, indicating a diminishing effect of DRC. Note that the query budget determines the number of clusters in DRC. As the query budget increases, the number of clusters gradually rises while the size of the query set screened by TRF remains constant. In the extreme case where the query budget equals the size of the candidate query set, the clustering algorithm becomes completely ineffective. Therefore, our method is particularly suitable for scenarios with a low query budget. For higher query budgets, we recommend evaluating the specific circumstances before deciding whether to utilize DRC, as DRC may degrade into random sampling in the worst case.

Attack on ChatGPT-based Victim Model
In this subsection, we conduct model extraction attacks on a ChatGPT-based (OpenAI, 2023) victim model simulated on the hate speech detection task, using manual instructions to verify the effectiveness of our method in autoregressive language model extraction. (Specifically, we call the gpt-3.5-turbo Chat completions API for responses; the documentation page is https://platform.openai.com/docs/models/gpt-3-5.) The instruction template is shown in Table 2. We set up several types of extracted models with different parameters and architectures, including GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019). Comprehensive results can be found in Table 3. The table reveals that our approach performs admirably, even when the victim model is ChatGPT-based and the extracted model relies on an autoregressive framework (e.g., GPT-2). For instance, GPT-2 Small achieves 79.5% functional similarity (Agreement) and 75.3% test accuracy (90.4% of the victim model's). We also observe that the RoBERTa series of models outperforms the GPT-2 series on this detection task, which is expected, as masked language models are better suited for extracting models in classification tasks, while autoregressive models are typically employed for extracting generative models.

Analysis
In this section, we provide further analysis of the TRF module and DRC module in our proposed method for model extraction attacks.

Impact of Task-Relevant Corpora
We investigate the impact of corpora with varying degrees of relevance to the target task on the performance of model extraction attacks.The experiments are conducted on a simulated victim model trained on IMDB with different query budgets.To assess the influence of the corpus, we replace the default WikiText-103 corpus with other datasets that are more closely related to the victim model's dataset, such as the SST-2 and IMDB training sets.
For query sampling, we adopt a uniform random strategy. The results, shown in Table 4 and Table 7, indicate that using the IMDB training data as the corpus yields the best performance, followed by using the SST-2 training data, both of which outperform the use of WikiText-103. This discrepancy can be attributed to the similarity between the attacker's corpus and the training data distribution of the victim model, i.e., SST-2 and IMDB are both about movie reviews, resulting in a higher data correlation. This observation motivates the Task Relevance Filter in our method, as it aims to filter texts from the public unannotated corpus that closely resemble the data distribution of the task.

Role of Data Reduction based on Clustering
To further explore the role of the clustering-based data reduction technique, we qualitatively visualize the queries sampled by DRC and RS on a subset of WikiText-103 with t-SNE (Van der Maaten and Hinton, 2008). Figure 5 depicts the distribution of the sampled data. We can see that the data sampled by DRC displays a broader distribution with larger inter-point distances, whereas the data sampled by RS show some overlap. This observation demonstrates the capability of DRC to sample data based on their information content and effectively reduce redundancy.

Analysis of High-Frequency Words
We conduct an analysis of the text content in the query set generated by MeaeQ. Specifically, we segment the text in the query set by word and calculate each word's frequency of occurrence. Figure 6 presents the top 20 most frequent words in the query sets constructed by RS and MeaeQ for the Hate Speech dataset at query budget ×0.5. We observe that the text filtered by MeaeQ contains a higher frequency of words associated with hate speech, such as "hate", "antisemitic", "nazi", and "racist". In contrast, these words are rarely seen in RS, confirming that MeaeQ effectively filters task-related data and enhances data diversity.
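This frequency analysis is straightforward to reproduce; a sketch, assuming simple whitespace tokenization (a simplification of whatever segmentation the paper uses):

```python
from collections import Counter

def top_k_words(queries, k=20, stopwords=frozenset()):
    """Count word occurrences across a query set and return the
    k most frequent words, as used for Figure 6-style lists."""
    counts = Counter(
        w for q in queries for w in q.lower().split()
        if w not in stopwords
    )
    return [w for w, _ in counts.most_common(k)]
```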
In Table 5, we also provide query examples sampled by RS and MeaeQ, highlighting task-specific words in colors. It is apparent that the content style of the queries generated by MeaeQ closely aligns with the target task. For Hate Speech, they feature negative and hateful sentiment words; for SST-2/IMDB, they contain positive or negative movie reviews; and for AG News, they include key news elements like 'when', 'who', and 'what'. In contrast, the contents of the RS examples are often not relevant to the given task.

Conclusion
In this paper, we propose MeaeQ, a straightforward yet effective model extraction attack method. In particular, we first utilize a zero-shot sequence inference classifier in combination with API service information to filter data with high task relevance from a public unannotated text corpus. Subsequently, we employ a clustering-based data reduction technique to obtain representative data as queries for the attack. Extensive experiments demonstrate that our method achieves superior performance over the baselines with limited query budgets on four benchmark datasets.

Limitations
Our study focuses on model extraction attacks in text classification tasks, in line with existing baselines (Pal et al., 2020;Krishna et al., 2020).The applicability of MeaeQ to other natural language generation tasks, such as machine translation or text summarization, remains unexplored.We are currently investigating model extraction attacks in these more complex tasks.

D Query Examples
More query examples are presented in

Figure 1 :
Figure 1: Overview of MeaeQ for NLP model extraction attacks. The attackers first build an original query set Q_o from a large text corpus. Then the attackers apply the Task Relevance Filter on Q_o to get a task-related query set Q_g. Subsequently, the attacker exploits the Data Reduction based on Clustering to reduce Q_g to Q_r. Finally, the attacker samples the queries from Q_r, sends them to the API, and then uses the outputs as labels to fine-tune their own model, such as BERT (Devlin et al., 2019).
Figure 2: Cross model extraction results (Agreement, %) on Hate Speech at query budget ×0.5. The horizontal / vertical axes represent the victim / extracted model architecture, respectively. 'B', 'R', and 'X' represent BERT Base, RoBERTa Base, and XLNet Base. Darker colors represent larger values of Agreement.

Figure 3 :
Figure 3: Comparison between MeaeQ and the baselines at more query budgets. The first row represents the results of the Accuracy (%) on the four datasets, while the second row represents the results of the Agreement (%).

Figure 4 :
Figure 4: Ablation study results (Agreement, %) on the SST-2 dataset. The horizontal axes represent different query budgets. Error bars represent standard deviation.

Figure 6 :
Figure 6: Top-20 frequent words in the query sets constructed by RS and MeaeQ respectively at query budget × 0.5 on Hate Speech.

Table 2 :
Task: Hate Speech.
Instruction: You are now working as a hate speech detector. Next, I will give you [batch_size] sentences. You need to indicate whether it contains hate speech. If it contains hate speech, output "Hate", otherwise output "Nohate". The required format is one output per line and the beginning of each output is numbered sequentially.

Instruction template. The [batch_size] in the instruction represents the number of queries, and [sentence_1], [sentence_2], etc. represent the specific query content.
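Filling the template programmatically might look like the following sketch; the numbered-sentence layout is an assumption, since the template's sentence placeholders are only described, not shown in full:

```python
def build_instruction(queries):
    """Fill the Hate Speech instruction template (Table 2) with a
    batch of queries. The sentence numbering format is assumed."""
    header = (
        "You are now working as a hate speech detector. "
        f"Next, I will give you {len(queries)} sentences. "
        "You need to indicate whether it contains hate speech. "
        'If it contains hate speech, output "Hate", otherwise output "Nohate". '
        "The required format is one output per line and the beginning "
        "of each output is numbered sequentially."
    )
    body = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(queries))
    return header + "\n" + body
```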

Table 3 :
Model extraction results (Accuracy / Agreement, %) on ChatGPT simulated on the Hate Speech dataset at query budget ×0.8 (about 1531 queries). The accuracy of ChatGPT on this task under the zero-shot setting is 83.26%. The data recorded in the table are the best results of ten retests.

Table 4 :
Model extraction results (Agreement, %) under different query budgets with different corpora on the IMDB dataset. The results shown in the table are the mean and standard deviation. The max value is shown in green.

Table 5 :
Query examples sampled by RS and MeaeQ for four tasks, demonstrating that MeaeQ can select samples that are semantically more relevant to the target task. More query examples can be found in Appendix D.

Table 7 :
Model extraction results (Accuracy, %) under different query budgets with different corpora on the IMDB dataset. The results shown in the table are the mean and standard deviation. The max value is shown in green.

Table 8 :
The statistics of the datasets.

Table 9 :
Details of Hyperparameters in the main experiment only using BERT Base as the model architecture.

Table 10 :
Details of Hyperparameters in the experiments using different model architectures on Hate Speech.