Canary Extraction in Natural Language Understanding Models

Natural Language Understanding (NLU) models can be trained on sensitive information such as phone numbers, zip codes, etc. Recent literature has focused on Model Inversion Attacks (ModIvA) that can extract training data from model parameters. In this work, we present a version of such an attack by extracting canaries inserted in NLU training data. In the attack, an adversary with open-box access to the model reconstructs the canaries contained in the model's training set. We evaluate our approach by performing text completion on canaries and demonstrate that, using only the prefix (non-sensitive) tokens of a canary, we can generate the full canary. For example, our attack is able to reconstruct a four-digit code in the training dataset of the NLU model with a probability of 0.5 in its best configuration. As countermeasures, we identify several defense mechanisms that, when combined, effectively eliminate the risk of ModIvA in our experiments.


Introduction
Natural Language Understanding (NLU) models are used for different tasks such as question answering (Hirschman and Gaizauskas, 2001), machine translation (Macherey et al., 2001) and text summarization (Tas and Kiyani, 2007). These models are often trained on crowd-sourced data that may contain sensitive information such as phone numbers, contact names and street addresses. Nasr et al. (2019), Shokri et al. (2017) and Carlini et al. (2018) have presented various attacks demonstrating that neural networks can leak private information. We focus on one such class of attacks, called Model Inversion Attacks (ModIvA) (Fredrikson et al., 2015), where an adversary aims to reconstruct a subset of the data on which the machine-learning model under attack was trained. We also demonstrate that established ML practices (e.g., dropout) offer a strong defense against ModIvA.
In this work, we start by inserting potentially sensitive target utterances called 'canaries', along with their corresponding output labels, into the training data. We use this augmented dataset to train an NLU model f_θ. We perform an open-box attack on this model, i.e., we assume that the adversary has access to all the parameters of the model, including the word vocabulary and the corresponding embedding vectors. The attack takes the form of text completion, where the adversary provides the start of a canary sentence (e.g., 'my pin code is') and tries to reconstruct the remaining, private tokens of an inserted canary (e.g., a sequence of four digit tokens). A successful attack on f_θ reconstructs all the tokens of an inserted canary. We refer to such a ModIvA as a 'Canary Extraction Attack' (CEA). In such an attack, token reconstruction is cast as an optimization problem: we minimize the loss function of the model f_θ with respect to its inputs (the canary utterance), keeping the model parameters fixed.
Previous ModIvAs were conducted on computer vision tasks, where there exists a continuous mapping between input images and their corresponding embeddings. In the case of NLU, however, the discrete mapping of tokens to embeddings makes token reconstruction from continuous increments in the embedding space challenging. We thus formulate a discrete optimization attack, in which each unknown token is eventually represented by a one-hot-like vector of the vocabulary length. The token in the vocabulary with the highest softmax activation is expected to be the unknown token of the canary. We demonstrate that in our attack's best configuration, for canaries of the type "my pin code is k_1 k_2 k_3 k_4", with k_i ∈ {0, 1, ..., 9} for 1 ≤ i ≤ 4, we are able to extract the numeric pin k_1 k_2 k_3 k_4 with an accuracy of 0.5 (a lower bound on this accuracy using a naive random-guessing strategy over a combination of four digits is 1 × 10^-4).
Since we present a new application of ModIvA to NLU models, defenses against them are an important ethical consideration to prevent harm and are explored in Section 6. We observe that standard training practices commonly used to regularize NLU models successfully thwart this attack.

Related Work
Significant research has been conducted in the field of privacy-preserving machine learning. Shokri et al. (2017) determine whether a particular data point belongs to the training set X_tr. The success of such attacks has prompted further research investigating them (Truex et al., 2019; Hayes et al., 2017; Song and Shmatikov, 2019). Carlini et al. (2018) propose a quantification of unintended memorization in deep networks and present an extraction algorithm for data memorized by generative models. Memorization is further exploited in Carlini et al. (2020), where instances in the training data of very large language models are extracted by sampling the model. The attacks described above are closed-box in nature: the adversary does not cast the attack as an optimization problem but instead queries the model multiple times.
Open-box ModIvAs were initially demonstrated on a linear-regression model (Fredrikson et al., 2014) for inferring medical information. They have since been extended to computer vision tasks such as facial recognition (Fredrikson et al., 2015) and image classification (Basu et al., 2019). Our work is a first attempt at performing ModIvAs on NLP tasks.

Attack Setup
We consider an NLU model f_θ that takes an utterance x as input and uses the word embeddings E(x) of the tokens in x to perform a joint intent classification (IC) and named-entity recognition (NER) task. We assume an adversary with open-box access to f_θ: they know the model architecture, the trained parameters θ, the loss function L(f_θ(E(x)), y), the label set Y of intents and entities supported by the model, and the vocabulary V obtained from the word-embedding matrix W ∈ ℝ^{|V|×d}. However, the adversary does not have access to the training data X_tr used to train f_θ. The adversary's goal is to reconstruct a (private) subset x̃ ⊆ X_tr.
To perform a CEA on f θ , we keep the parameters θ fixed and minimize the loss function L with respect to the unknown inputs (i.e., tokens) of a given utterance. This is analogous to a traditional learning problem, except with fixed model parameters and a learnable input space. In this work, we use the NLU model architecture described in Section 4.1.

Canary Extraction Attacks
We consider a canary sentence x_c = (x_p, x_u), x_c ∈ X_tr, with tokens (p_1, ..., p_m, u_1, ..., u_n) and output label y_c ∈ Y. The first m tokens of x_c form a known prefix x_p (e.g., "my pin code is") and the next n tokens (u_1, ..., u_n) form the unknown part x_u that the attacker is interested in reconstructing (e.g., "one two three four"). We denote the word embeddings of this canary E(x_c) as (e_{p_1}, ..., e_{p_m}, e'_{u_1}, ..., e'_{u_n}). A trivial attack to identify the n unknown tokens in x_u is to directly optimize the loss with respect to the embeddings (e'_{u_1}, ..., e'_{u_n}), which are randomly initialized. Words corresponding to the optimized values of (e'_{u_1}, ..., e'_{u_n}) are then assigned by identifying the closest vectors in the embedding matrix W under a distance metric (e.g., Euclidean distance). However, our experiments demonstrate that this strategy is not successful, since the updates are performed in a non-discrete fashion whereas the model f_θ has a discrete input space. We thus focus on performing a discrete optimization, inspired by works on relaxing categorical variables to facilitate efficient gradient flow (Jang et al., 2016; Song and Raghunathan, 2020), as illustrated in Figure 1.
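The final decoding step of this trivial attack, mapping each optimized continuous embedding back to the nearest vocabulary token, can be sketched as follows (a minimal illustration with a toy two-dimensional vocabulary; the embedding values and token names are hypothetical):

```python
import math

def nearest_token(query, vocab_embeddings):
    """Return the vocabulary token whose embedding is closest
    to `query` under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(vocab_embeddings, key=lambda tok: dist(query, vocab_embeddings[tok]))

# Toy 2-d embedding matrix W for a tiny vocabulary (hypothetical values).
W = {"one": [0.9, 0.1], "two": [0.1, 0.9], "three": [-0.8, 0.2]}

# An optimized (continuous) embedding e'_u produced by the attack.
e_u = [0.85, 0.05]
print(nearest_token(e_u, W))  # -> one
```

In practice this lookup is performed over the full |V| x d matrix W, which is why small continuous updates rarely cross the boundary to a different token.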
We define a logit vector z_i ∈ ℝ^{|V|} for each unknown token u_i ∈ x_u. We then apply a softmax activation with temperature T to obtain a_i ∈ ℝ^{|V|}:

a_i = softmax(z_i / T),    (1)

where a_i is a differentiable approximation of the arg-max over the logit vector for low values of T. This vector then selectively attends to the rows of the embedding matrix W ∈ ℝ^{|V|×d}, resulting in the embeddings fed to the model during the attack:

e'_{u_i} = a_i^T W.    (2)

We then train our attack and optimize over Z = (z_1, ..., z_n) ∈ ℝ^{n×|V|}:

Z* = argmin_Z L(f_θ(e_{p_1}, ..., e_{p_m}, e'_{u_1}, ..., e'_{u_n}), y_c),    (3)

where Z is the only trainable parameter in the attack and all parameters of f_θ remain fixed. Once converged, we identify the token u_i as the one with the highest activation in a_i. We decrease the temperature T exponentially to ensure low values of T in Equation (1) and enforce the inputs to f_θ to be discrete. In our experiments, we define z_i over a subset V_0 ⊆ V of candidate words for x_u to prevent the logit vector from becoming too sparse.
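The relaxation above can be sketched in a few lines (a toy vocabulary of three words and a 3x2 embedding matrix; all values are hypothetical). As T decreases, a_i collapses toward a one-hot vector, so the relaxed embedding converges to a single row of W:

```python
import math

def softmax_with_temperature(z, T):
    """a = softmax(z / T); for small T this approaches a one-hot
    vector peaked at argmax(z)."""
    scaled = [zi / T for zi in z]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_embedding(a, W):
    """e' = a^T W: a convex combination of the rows of the
    embedding matrix, weighted by the soft one-hot vector a."""
    d = len(W[0])
    return [sum(a[i] * W[i][j] for i in range(len(W))) for j in range(d)]

z = [1.0, 2.5, 0.3]                       # logits over a 3-word vocabulary
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy 3x2 embedding matrix

a_high = softmax_with_temperature(z, T=10.0)  # nearly uniform mixture
a_low  = softmax_with_temperature(z, T=0.01)  # nearly one-hot at index 1
print(relaxed_embedding(a_low, W))            # ~ W[1] = [0.0, 1.0]
```

Because every step from logits to embedding is differentiable, gradients of the NLU loss can flow back into Z, which is what makes the discrete token search tractable.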

Target Model Description
We attack an NLU model jointly trained to perform IC and NER tagging. The model has a CLC structure (Ma and Hovy, 2016): the input embeddings feed into two bi-LSTM layers, followed by a fully-connected layer with softmax activation for the IC task and a Conditional Random Field (CRF) layer for the NER task. The sum of the respective cross-entropy and CRF losses is minimized during training. We use FastText embeddings (Mikolov et al., 2018) as inputs to our model.

Canary Insertion
We inject R repetitions of a single canary containing sensitive information, together with its corresponding intent and NER labels, into the training set of the NLU model. We insert three different types of canaries with n unknown tokens, n ∈ {4, 6, 8, 10}, described in Table 1, where C is a set of 12 colors. Additional details of the canaries and their output labels are presented in Appendix A. The adversary aims to reconstruct all n unknown, sensitive tokens in the canary. The reduced vocabulary V_0 in Equation (1) is the set of all digits for canaries call and pin, and the names of the 12 colors for canary color.
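The insertion step can be sketched as follows (a minimal illustration; the intent label `SetPin` and the list-of-token-lists dataset format are hypothetical stand-ins for the actual NLU training format):

```python
import random

def insert_canary(train_set, prefix, secret_tokens, label, repeats):
    """Append `repeats` copies of the canary (prefix + secret tokens)
    with its output label to the training set."""
    canary = (prefix + secret_tokens, label)
    return train_set + [canary] * repeats

# Hypothetical example: a 4-digit 'pin' canary repeated R = 10 times.
digits = [str(random.randint(0, 9)) for _ in range(4)]
augmented = insert_canary(
    train_set=[],                    # stands in for X_tr
    prefix=["my", "pin", "code", "is"],
    secret_tokens=digits,
    label="SetPin",                  # hypothetical intent label
    repeats=10,
)
print(len(augmented))  # -> 10
```

The NLU model is then trained from scratch on the augmented set, and the attack is evaluated on its ability to recover `digits` given only the prefix.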

Table 1: Canary types and their patterns.

Attack Evaluation
We inject the canary into Snips (Coucke et al., 2018), ATIS (Dahl et al., 1994) and NLU-Evaluation (Xingkun Liu and Rieser, 2019). The canary is repeated with R ∈ {1, 10, 100, 500}. For each combination of R, canary type and length n, the experiment is repeated 10 times (trials) with 10 different canaries, to account for variation induced by canary selection. We define the following evaluation metrics, averaged across all trials, to evaluate the strength of our attack.

Average Accuracy (Acc): Fraction of the trials where the attack correctly reconstructs the entire canary sequence in the correct order. A higher accuracy indicates better reconstruction; accuracy is 1 if we reconstruct all n tokens in each of the 10 trials.
Average Hamming Distance per Token (HDT): The Hamming Distance (HD) (Hamming, 1950) is the number of positions at which the reconstructed sequence differs from the inserted canary. Since HD is proportional to the length of the canary, we normalize it by the length of the unknown part (HDT = HD / n). The HDT can be interpreted as the probability of reconstructing an incorrect token at a given position in the canary, averaged across the 10 trials. A lower HDT indicates better reconstruction.
Accuracy reports our performance on reconstructing all n unknown tokens in the correct order and is a conservative metric. HDT quantifies our average performance for reconstructing each position in the unknown sequence. We evaluate our attack against randomly choosing a token from the reduced vocabulary V_0. Thus, for a given value of n, the expected accuracy and HDT of this baseline are (1/|V_0|)^n and 1 − 1/|V_0| respectively.
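Both metrics and the random baseline are straightforward to compute; a minimal sketch (the function names are ours, not from the paper):

```python
def hamming_distance(reconstructed, canary):
    """Number of positions where the reconstruction differs from the canary."""
    return sum(r != c for r, c in zip(reconstructed, canary))

def hdt(reconstructed, canary):
    """Hamming distance normalized by canary length (HD / n)."""
    return hamming_distance(reconstructed, canary) / len(canary)

def random_baselines(vocab_size, n):
    """Expected accuracy and HDT when guessing uniformly from V_0."""
    return (1 / vocab_size) ** n, 1 - 1 / vocab_size

canary = ["1", "2", "3", "4"]
guess  = ["1", "2", "9", "4"]
print(hdt(guess, canary))       # -> 0.25 (one wrong token out of four)
print(random_baselines(10, 4))  # (1e-4, 0.9) for 4 digits over V_0 of size 10
```

For the four-digit pin with |V_0| = 10, the baseline accuracy of 10^-4 matches the lower bound quoted in the introduction.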

Results
The trivial attack described in Section 3.1, without discrete optimization, performs comparably to the random-selection baseline. We thus focus on the attack with discrete optimization in this section. Table 2 shows the best reconstruction metrics for the different values of n and the corresponding repetitions R ∈ {10, 100, 500} at which these metrics are observed in the Snips dataset. In our experiments, our attack consistently outperforms the baseline. For n = 4, 6, we reconstruct at least one complete canary for each pattern. The attack also completely reconstructs a 10-digit pin for higher values of R, with an accuracy of 0.10. Even when we are unable to reconstruct every token in any trial, i.e., accuracy is zero, we still outperform the baseline, as observed from the HDT values. For the sake of brevity, we summarize the attack performance on other datasets in Appendix C.2. We observe that the attack is dataset-dependent, with the best performance on the Snips dataset and the poorest on the NLU-Evaluation dataset.

Discussion
The training data of NLU models may potentially contain sensitive utterances such as "call k_1 ... k_10", with k_i ∈ {0, 1, ..., 9} for 1 ≤ i ≤ 10. An adversary who wishes to extract the phone number can assume the prefix "call", along with the output labels of the utterance, which are also trivial to guess given access to the label set Y. Our canaries act as placeholders for such utterances. We choose to insert the canary color since the names of colors appear infrequently in the datasets mentioned in Section 4.3, allowing us to evaluate the attack on 'out-of-distribution' data, which is more likely to be memorized by deep networks (Carlini et al., 2018).
For n = 4 and R = 1 (i.e., the canary appears only once in the train set), our attack has an accuracy of 0.33 for canary color and 0.10 for pin. This suggests that the attack could potentially reconstruct sensitive information from short, rare utterances in real-world scenarios. For a special case where the adversary attempts to reconstruct a ten-digit phone number in canary call with a three-digit area code of their choosing, the attack can reconstruct the remaining seven digits of the number with an accuracy of 0.1 when R = 1. For conciseness, we show these results in Appendix C.1. We observe that our attack is more effective, and with fewer repeats, for the canary color than for canaries pin and call of the same length. Our empirical analysis indicates the attack is more successful in extracting tokens that are relatively infrequent in the training data and in reconstructing shorter canaries. As shown in Appendix C.1, the attack performs best for R = 1000. However, this trend of improved reconstruction for larger values of R is not monotonic, and we observe a general decline in reconstruction for R > 1000. We are unsure of the vulnerabilities that facilitate CEAs. While unintended memorization is a likely explanation, we note that our attack performs best on the Snips data, although the smaller ATIS data should be easier to memorize (Zhang et al., 2016).

Proposed Defenses against ModIvA
We propose three commonly used modeling techniques as defense mechanisms: Dropout (D), Early Stopping (ES) (Arpit et al., 2017) and the inclusion of a Character Embeddings layer in the NLU model (CE). D and ES are regularization techniques that reduce memorization and overfitting. CE makes the optimization problem in Section 3 more difficult by concatenating the embedding of each input token with a character-level representation, obtained using a convolution layer over the input sentence (Ma and Hovy, 2016).
For defense using D, we use dropout rates of 20% and 10% while training the NLU model. For ES, we stop training the NLU model under attack if the validation loss does not decrease for 20 consecutive epochs, to prevent over-training.
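The early-stopping criterion above can be sketched as follows (a minimal illustration; the `should_stop` helper and the loss values are hypothetical, not from the paper's implementation):

```python
def should_stop(val_losses, patience=20):
    """Early stopping: stop once the validation loss has not improved
    on its best previous value for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Loss improves for 5 epochs, then plateaus for 20 epochs -> stop.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 20
print(should_stop(losses, patience=20))       # -> True
print(should_stop(losses[:10], patience=20))  # -> False
```

Stopping at the validation optimum limits the extra epochs during which the model would otherwise continue to memorize rare training utterances such as canaries.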

Efficacy of Defenses
In this section, we present the performance of the proposed defenses against ModIvA. To do so, we evaluate the attack on NLU models trained with each defense mechanism individually and in all combinations. The canaries are inserted into the Snips dataset and repeated 10, 500 and 1000 times. The results are summarized in Table 3. We observe that the attack accuracy for each defense (used individually and in combination) is nearly zero for all canaries and is thus omitted from the table. We also note that the HDT approaches the random baseline for most defense mechanisms, and the attack performance is comparable to a random guess when the three mechanisms are combined. However, when dropout or character embeddings are used alone, HDT values are lower than the baseline, indicating the importance of combining multiple defense mechanisms. Additionally, training with these defenses does not have any significant impact on the performance of the NLU model under attack. The defenses thus successfully thwart the proposed attack without impacting the performance of the NLU models.

Conclusion
We formulate and present the first open-box ModIvA, in the form of a CEA, to perform text completion on NLU tasks. Our attack performs discrete optimization to select unknown tokens by optimizing over a set of continuous variables. We demonstrate our attack on three patterns of canaries and reconstruct their unknown tokens, significantly outperforming the 'chance' baseline.
To ensure that the proposed attack is not misused by an adversary, we propose training NLU models with three commonplace modeling practices: dropout, early stopping and the inclusion of character-level embeddings. We observe that these practices successfully defend against the attack, as its accuracy and HDT values approach the random baseline. Future directions include 'demystifying' such attacks, strengthening the attack for longer sequences with fewer repeats and a larger V_0, and investigating additional defense mechanisms, such as those based on differential privacy, together with their effect on model performance.

Ethical Considerations
The addition of proprietary data to existing datasets to fine-tune NLU models can often insert confidential information into datasets. The proposed attack could be misused to extract private information from such datasets by an adversary with open-box access to the model. The objectives of this work are to (1) study and document the actual vulnerability of NLU models to this attack, which shares similarities with existing approaches (Fredrikson et al., 2014; Song and Raghunathan, 2020); (2) warn NLU researchers of the possibility of such attacks; and (3) propose effective defense mechanisms to avoid misuse and help NLU researchers protect their models.
Our work demonstrates that private information such as phone numbers and zip codes can be extracted from a discriminative text-based model, and not only from generative models as previously demonstrated (Carlini et al., 2020). We advocate privatizing such data using anonymization (Ghinita et al., 2007) or differential privacy (Feyisetan et al., 2020). Additionally, in case the training data still contains some private information, practitioners can prevent the extraction of sensitive data by using the defense mechanisms described in Section 6, which reduce the attack performance to a random guess.

B Training Parameters
We decrease the temperature T exponentially after each iteration t. The temperature at the t-th iteration is given by T_t = 0.997^t × 10^{-1}.
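This schedule is a one-liner; a minimal sketch (the function name and the 250-iteration horizon below are illustrative):

```python
def temperature(t, base=0.997, t0=0.1):
    """Exponentially decayed temperature: T_t = base**t * t0."""
    return base ** t * t0

print(temperature(0))    # -> 0.1 (initial temperature)
print(temperature(250))  # ~ 0.047 after 250 iterations
```

The decay keeps early iterations exploratory (soft a_i) while late iterations push a_i toward a one-hot selection, matching the role of T in Equation (1).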
We use the Adam optimizer and train our attack for 250 epochs, with an initial learning rate of 6.5 × 10^{-3} and a decay rate of 9.95 × 10^{-1}.

C.1 Attack Performance on Snips

Table 4 shows the model performance for just one repeat of the canary in the Snips dataset, i.e., R = 1. The n = 7 example for the call canary refers to the special case where the adversary is trying to reconstruct a 10-digit phone number beginning with a three-digit area code of their choice. Table 5 reports the attack performance for R ∈ {10, 100, 500, 1000, 2000}. We observe an accuracy of 0.5 for the canary pin when n = 4 and R = 1000. Figure 2 illustrates the model performance across canaries in the Snips dataset with a varying number of repetitions R. As observed in Table 5 and Figure 2, the attack is most likely to succeed when R is 1000. However, the attack weakens for higher values of R.

C.2 Attack Performance Across Datasets
We evaluate our attack on the ATIS and NLU-Evaluation datasets, for canaries color and pin with n = 4 and canary call with n = 10. To maintain a comparable number of repeats with respect to the size of the dataset, we use R ∈ {10, 100, 200, 500} for the ATIS dataset and R ∈ {100, 500, 1000, 5000, 10000} for the NLU-Evaluation dataset. As shown in Figure 3, the attack performance is almost comparable for shorter sequences in Snips and ATIS but under-performs on the NLU-Evaluation data. Figure 4 and Figure 5 illustrate the HDT for R canary repetitions on the ATIS and NLU-Evaluation datasets, respectively.