Reconstruction Attack on Instance Encoding for Language Understanding

A private learning scheme TextHide was recently proposed to protect the private text data during the training phase via so-called instance encoding. We propose a novel reconstruction attack to break TextHide by recovering the private training data, and thus unveil the privacy risks of instance encoding. We have experimentally validated the effectiveness of the reconstruction attack with two commonly-used datasets for sentence classification. Our attack would advance the development of privacy preserving machine learning in the context of natural language processing.


Introduction
With the development of deep learning technologies, a large number of applications in various domains (e.g., image classification and NLP) have been greatly promoted with significantly improved performance. However, this also arouses serious privacy concerns since a large portion of the training data are usually collected from individuals. For instance, the diagnosis systems in hospitals or healthcare institutions will be trained on the patients' private data, such as medical history (Pham et al., 2017), and radiology medical images (Hosny et al., 2018). In addition, it has been reported that the input keyboard prediction model can be trained with the users' data on mobile devices (Hard et al., 2018), and the assisted composing function for emails/texts can be trained with users' personal messages (Chen et al., 2019).
There have been various works on protecting users' data privacy during training, which fall into two main types: 1) composing cryptographic protocols for securely training on the data (Bonawitz et al., 2016; Mohassel and Zhang, 2017; Mohassel and Rindal, 2018), which generally incur high computational and communication costs; 2) leveraging differential privacy techniques (Dwork et al., 2006; Chaudhuri et al., 2011; Hong et al., 2015; Abadi et al., 2016) to prevent information leakage, which typically causes significant accuracy loss. Despite these demerits, both types of methods offer provable privacy guarantees for the training data. This raises the question: are there private learning schemes that preserve both accuracy and efficiency?
To this end, several techniques (Huang et al., 2020a,b) privately train the model via so-called instance encoding: the local data are encoded into somewhat "encrypted" (encoded) data with a mixup scheme, and the model is trained directly on the encoded data. Data privacy is claimed to be well preserved by the encoding, while the mixup scheme keeps the accuracy loss minor. In this paper, we study the privacy risks of instance encoding and show that, against well-designed attacks, it cannot provide privacy protection comparable to that of conventional cryptographic techniques. Specifically, we design a reconstruction attack that recovers the original data from the privately encoded data. We focus on instance encoding for language understanding, i.e., TextHide (Huang et al., 2020a), the state-of-the-art technique.

TextHide
TextHide (Huang et al., 2020a) aims to protect private text data in the federated learning setting. First, the input text is pre-processed with a BERT transformer encoder to output the corresponding text representation. Then, for "encryption", TextHide applies instance encoding to mix the original text representation with some randomly selected text representations; the result is fed into the training model of various downstream language understanding tasks, e.g., classification and question answering. Formally, given the input text $x_i$ with the label $y_i$, we denote the text representation as $e_i = \phi(x_i)$, where $\phi(\cdot)$ is a pre-tuned BERT model. The privately encoded instance $\tilde{e}_i$ is generated as:

$$\tilde{e}_i = \sigma \circ \sum_{j=1}^{K} \lambda_j e_j \qquad (1)$$

where $\lambda_j$ is chosen uniformly at random such that $\sum_{j=1}^{K} \lambda_j = 1$, the sign-flipping mask $\sigma \in \{-1, 1\}^d$ is also chosen uniformly at random, and $d$ denotes the dimension of the encoding vector. $\circ$ represents the Hadamard (element-wise) multiplication, and $K$ is the number of data combined in the mixup (the security parameter). Accordingly, the label (one-hot vector) $\tilde{y}_i$ of $\tilde{e}_i$ is updated as $\tilde{y}_i = \sum_{j=1}^{K} \lambda_j y_j$, which is the element-wise addition across the $y_j$. Then, for training with one data batch $B$, each data $(x_i, y_i) \in B$ is privately encoded as in Equation 1, where the $K$ data for mixup are randomly sampled from the batch $B$. TextHide also specifies another parameter $m$, the size of the mask pool, to strengthen the security of instance encoding against reconstruction attacks. These formalize the $(m, K)$-TextHide (Algorithm 1 in (Huang et al., 2020a)), which can be integrated into the language training process to ensure text privacy. For instance, $(m = 0, K = 1)$ is the baseline training setting without protection. A larger $K$ sacrifices some accuracy while improving privacy (higher cost of recovering the original data), which reflects the trade-off between privacy and accuracy in private training.
Furthermore, TextHide can utilize another dataset $X_{public}$ (usually a large public corpus, e.g., Wikipedia) for mixup, where such mixup works similarly to a random oracle in the cryptography domain. (Footnote 1: The privacy notion provided by mixup in TextHide is based on a k-vector subset sum (Abboud and Lewi, 2013) oracle, which would require $O(n^{k/2})$ effort to break.) Specifically, TextHide mixes about one half ($K/2$) public data with the private original data, and Equation 1 is updated as:

$$\tilde{e}_i = \sigma \circ \Big( \sum_{j=1}^{K/2} \lambda_j e_j + \sum_{j=K/2+1}^{K} \lambda_j e^p_j \Big) \qquad (2)$$

where $e^p_j = \phi(x^p_j)$, $x^p_j \in X_{public}$ (randomly sampled). As a consequence, the mixed label $\tilde{y}_i$ is computed by normalization with the labels of the private data (public data usually do not have labels):

$$\tilde{y}_i = \frac{\sum_{j=1}^{K/2} \lambda_j y_j}{\sum_{j=1}^{K/2} \lambda_j} \qquad (3)$$
In practice, given the original training dataset (denoted as $X$), each data $(x_i, y_i) \in X$ is encoded $n$ times (usually equal to the number of training epochs).
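The encoding step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the TextHide reference implementation: the function name `texthide_encode` is hypothetical, random vectors stand in for BERT representations, and the Dirichlet draw is one possible reading of "chosen uniformly at random such that the coefficients sum to 1".

```python
import numpy as np

def texthide_encode(reps, labels, K, mask_pool, rng):
    """Mix K representations from one batch and apply a sign-flipping mask.

    reps:      (B, d) array of text representations e_j (toy stand-ins here)
    labels:    (B, C) array of one-hot labels y_j
    mask_pool: (m, d) array with entries in {-1, +1}
    """
    idx = rng.choice(len(reps), size=K, replace=False)  # K instances to mix
    lam = rng.dirichlet(np.ones(K))                     # assumption: uniform on the simplex
    sigma = mask_pool[rng.integers(len(mask_pool))]     # one mask drawn from the pool
    mixed = lam @ reps[idx]                             # sum_j lambda_j * e_j
    mixed_label = lam @ labels[idx]                     # sum_j lambda_j * y_j
    return sigma * mixed, mixed_label, idx, lam, sigma

rng = np.random.default_rng(0)
B, d, C, K, m = 8, 16, 2, 2, 4
reps = rng.normal(size=(B, d))                          # placeholder for phi(x)
labels = np.eye(C)[rng.integers(C, size=B)]
mask_pool = rng.choice([-1.0, 1.0], size=(m, d))
enc, mixed_label, idx, lam, sigma = texthide_encode(reps, labels, K, mask_pool, rng)
```

Because the mask only flips signs, the magnitude of every coordinate of `enc` equals that of the unmasked mixture, which is the property the attack exploits.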

Attack Preliminaries
Privacy-Enhancing Schemes. As mentioned before, both cryptographic protocols and differential privacy can provide provable privacy guarantees for protecting private data. On the one hand, in cryptographic solutions the data is usually protected by encryption schemes, e.g., fully homomorphic encryption (FHE) (Gentry, 2009; Cheon et al., 2017), whose security depends on hard mathematical problems. Normally, to prove the security of an encryption scheme, we need to formulate a security game, e.g., IND-CPA (Goldreich, 2009), in which an attacker performing polynomially many operations (w.r.t. the size of the security parameter) cannot do better than random guessing. It should be noted that the newly proposed instance encoding schemes are claimed to work like an encryption scheme for privacy protection (Huang et al., 2020a,b), but fail to provide such provable security guarantees.
On the other hand, differential privacy (Dwork et al., 2006, 2014; Mohammady et al., 2020) can statistically protect individual information from being identified (i.e., against identification attacks (Dinur and Nissim, 2003)) by injecting well-calibrated noise into the original values. For example, differential privacy can help to defend against so-called membership inference attacks (Shokri et al., 2017) in machine learning, such that an attacker cannot determine whether any specific individual's information is in the dataset or not.
Privacy Attacks. Attacks on data privacy in ML training are generally referred to as membership inference attacks (Shokri et al., 2017; Salem et al., 2018; Nasr et al., 2019; Hisamoto et al., 2020; Song and Raghunathan, 2020), where an adversary can learn whether a given data point was used to train the model or not. In addition, model inversion attacks (Fredrikson et al., 2015; Wu et al., 2016; Zhu et al., 2019) can reconstruct a group of representative data points from the training set, e.g., by utilizing gradients (Zhu et al., 2019). Our attack on TextHide is closest to the reconstruction attack (Dinur and Nissim, 2003; Carlini et al., 2020a), which aims to reconstruct the original data/information from the protected data (privately encoded data). Note that Carlini et al. (2020a) attack instance encoding on images, while we extend this method to the language understanding domain.
Attack Setting. We assume that the attacker has full knowledge of the public dataset $X_{public}$ and the embedding model for downstream ML tasks. Besides, we assume that the attacker can obtain the private dataset (but is unaware of which specific data are used for training). Note that we need to consider the worst case (the strongest attacker) to evaluate the vulnerabilities of privacy-enhancing schemes. That is, strong knowledge (e.g., the embedding model and the private training dataset) can be accessed by a skilled attacker armed with any background knowledge. For instance, such a private training dataset can be machine-generated. Specifically, if the dataset involves personal conversations, the attacker can utilize language models to generate a large set of commonly-used dialogs as the private training dataset. The attacker can also leverage advanced inference attacks, e.g., side channels or public essays, to derive some sentences.
Attack Goal. Given a privately encoded dataset $\tilde{E}$ (including the mixed labels $\tilde{y}$), the attacker aims to reconstruct the original data vectors $e \in E$, where $E$ is the set of the original data vectors. W.l.o.g., we consider the basic mixup case in which two original data vectors are used for private encoding, i.e., each encoded data $\tilde{e}_i$ is constructed from two original data $e_{j_1}$ and $e_{j_2}$. We then denote a mapping function for the attack as $A_m: \tilde{e}_i \in \tilde{E} \to \{e_{j_1}, e_{j_2}\} \in E \times E$, so that $A_m(\tilde{e}_i) = \{e_{j_1}, e_{j_2}\}$; the attacker seeks to derive this mapping function. Note that our attack focuses on reconstructing the text representation vectors (processed by the language understanding model, e.g., BERT); we can then utilize a model inversion attack (Zhu et al., 2019) to recover the raw text, i.e., $x_i = \phi^{-1}(e_i)$.

Overview of The Attack
Our proposed attack consists of three main steps:
1. Removing the sign-flipping mask $\sigma$. We first nullify the sign-flipping step of the encoding by taking the absolute value of the encoded data $\tilde{e} \in \tilde{E}$: $\tilde{E} \leftarrow \{\mathrm{abs}(\tilde{e}) : \tilde{e} \in \tilde{E}\}$.
2. Revealing the mapping function $A_m$ that maps the encoded data vectors in $\tilde{E}$ to the original data vectors via clustering (Section 4.2).
3. Reconstructing the original text representation vector $e_i$ (by computing the $\lambda_i$) given the mapping function $A_m$ (Section 4.3).
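Step 1 rests on a simple identity: a mask with entries in {-1, +1} changes signs but not magnitudes, so two encodings of the same mixture under different masks coincide after taking absolute values. The toy check below illustrates this; the random vector `mixed` is a placeholder for a mixed representation, not real BERT output.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
mixed = rng.normal(size=d)                     # placeholder for sum_j lambda_j * e_j

# Two independent sign-flipping masks applied to the same mixture.
sigma_a = rng.choice([-1.0, 1.0], size=d)
sigma_b = rng.choice([-1.0, 1.0], size=d)
enc_a, enc_b = sigma_a * mixed, sigma_b * mixed

# After abs(), the mask no longer distinguishes the two encodings.
same_after_abs = np.allclose(np.abs(enc_a), np.abs(enc_b))
```

Note that the absolute value also discards the signs of the mixture itself, so what survives this step is magnitude information only.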

Revealing Mapping Function
The main procedure of this step is to cluster the encoded text vectors and map the clusters back to the original text vectors. Suppose there are $|X|$ original data instances and every instance is encoded $n$ times. Since each encoded text vector $\tilde{e}_i$ corresponds to two original data (i.e., $A_m(\tilde{e}_i) = \{e_{j_1}, e_{j_2}\}$), the clustering is expected to yield $|X|$ clusters of $2n$ encoded data vectors each (the size of the encoded set $\tilde{E}$ is $|X| \cdot n$).
1) Compute Similarity Score. To cluster $\tilde{E}$, we first compute a similarity score $s \in [0, 1]$ between two privately encoded data $\tilde{e}_i$ and $\tilde{e}_j$: if $A_m(\tilde{e}_i) \cap A_m(\tilde{e}_j) \neq \emptyset$ (the two encodings share an original data vector), $s = 1$ (or close to 1); otherwise $s = 0$ (or close to 0). To compute the similarity score $s$, we train a neural network model $f(\cdot)$ that takes two privately encoded vectors $(\tilde{e}_i, \tilde{e}_j)$ as input, with $f(\tilde{e}_i, \tilde{e}_j) \in \{0, 1\}$. The two vectors are stacked together (e.g., for $d \times 1$ encoded vectors, the input is $d \times 2$). Specifically, we utilize a vanilla MLP model trained with Adam (learning rate 0.01) on the cross-entropy loss. We use the MNLI dataset (around 393k examples with all labels removed) (Williams et al., 2018) as the public dataset, and the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) and the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) as the private datasets. Then, we construct a large set of training pairs encoded from the above datasets by TextHide, which are labeled accordingly (1 if the two encodings share the same original text data; otherwise 0). The final model achieves 94% accuracy.
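The construction of labeled training pairs for the similarity model can be sketched as follows. This is only an illustration of the input layout and labeling rule: the helper `make_pair` is hypothetical, the vectors are random placeholders, and applying `abs` before stacking assumes that the mask-removal step has already been carried out on the encoded set.

```python
import numpy as np

def make_pair(enc_a, src_a, enc_b, src_b):
    """Stack two encodings into a (d, 2) input; label 1 if they share a source."""
    x = np.stack([np.abs(enc_a), np.abs(enc_b)], axis=1)   # mask removed via abs()
    y = int(len(set(src_a) & set(src_b)) > 0)              # shared original -> 1
    return x, y

rng = np.random.default_rng(2)
d = 6
e = rng.normal(size=(4, d))                                # four toy original vectors

def mask():
    return rng.choice([-1.0, 1.0], size=d)

# K = 2 encodings with known sources (indices into e).
enc_01 = mask() * (0.3 * e[0] + 0.7 * e[1])
enc_02 = mask() * (0.6 * e[0] + 0.4 * e[2])
enc_23 = mask() * (0.5 * e[2] + 0.5 * e[3])

x_pos, y_pos = make_pair(enc_01, (0, 1), enc_02, (0, 2))   # share e[0]
x_neg, y_neg = make_pair(enc_01, (0, 1), enc_23, (2, 3))   # no shared source
```

An attacker who can generate such pairs at will, using the public TextHide code and a dataset of their own, has unlimited supervised data for training the classifier.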
Notice that constructing the model $f(\cdot)$ to compute similarity scores between two privately encoded data is based on a key hypothesis: any instance encoding scheme that achieves high accuracy offers a somewhat weak privacy guarantee (since the original information must be preserved to reach high accuracy). In other words, if TextHide ensures high accuracy in the downstream tasks (e.g., sentence classification), then the instance-encoded data can also be "learned" to recover the original text data (the model $f(\cdot)$ can be viewed as a downstream task in NLP). We identify this as an intrinsic vulnerability of such instance encoding schemes, which can be exploited to launch the reconstruction attack.
2) Clustering. Given the similarity model, we can compute the similarity scores on all pairs of the encoded data $(\tilde{e}_i, \tilde{e}_j)$ ($|\tilde{E}|^2$ pairs in total). This procedure can be computationally efficient. To find $|X|$ (exclusive) clusters w.r.t. the $|X|$ original text vectors, denoted as the cluster set $\{C_p, p \in [1, |X|]\}$, we formulate an objective function over the pairwise similarity scores within each cluster. Ideally, the size of each cluster should be exactly $2n$, and any two encoded data $(\tilde{e}_i, \tilde{e}_j)$ in every cluster $C_p$ should satisfy $f(\tilde{e}_i, \tilde{e}_j) = 1$ (or close to 1). Following K-NN, we can design a greedy method that iteratively updates the $|X|$ clusters by selecting the encoded data with the maximum average similarity score to all the data already in the cluster. Furthermore, we can audit each cluster by checking the similarity scores among its encoded data, and finally partition $\tilde{E}$ into $|X|$ clusters.
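One possible form of the greedy procedure is sketched below. It is an illustrative assumption rather than the authors' exact algorithm: the function `greedy_cluster` is hypothetical, the seeding rule is a design choice of this sketch, and the toy similarity matrix places each encoding in exactly one group with scores that are already close to 0 or 1.

```python
import numpy as np

def greedy_cluster(S, n_clusters, cluster_size):
    """Greedily grow disjoint clusters from a symmetric similarity matrix S."""
    unassigned = set(range(len(S)))
    clusters = []
    for _ in range(n_clusters):
        if not unassigned:
            break
        rest = sorted(unassigned)
        # Seed with the unassigned point most similar to the remaining points.
        seed = max(rest, key=lambda i: S[i, rest].sum())
        cluster = [seed]
        unassigned.remove(seed)
        while len(cluster) < cluster_size and unassigned:
            rest = sorted(unassigned)
            # Add the point with the highest average similarity to the cluster.
            best = max(rest, key=lambda i: S[i, cluster].mean())
            cluster.append(best)
            unassigned.remove(best)
        clusters.append(sorted(cluster))
    return clusters

rng = np.random.default_rng(3)
truth = np.repeat(np.arange(3), 4)                     # 3 groups of 4 encodings
S = (truth[:, None] == truth[None, :]).astype(float)   # ideal 0/1 scores
S = np.clip(S + rng.normal(scale=0.05, size=S.shape), 0.0, 1.0)
S = (S + S.T) / 2                                      # keep the scores symmetric
clusters = greedy_cluster(S, n_clusters=3, cluster_size=4)
```

With a less accurate similarity model the greedy choice can go wrong, which is where the auditing pass over each cluster mentioned above would come in.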

Reconstructing Original Text Vectors
After deriving the mapping function from the encoded data to the original data, we can reconstruct the original data. Roughly, we can sum up the absolute values of all the encoded vectors that map to one given original data vector $e$ and average them:

$$e' = \frac{1}{n} \sum_{i} \mathrm{abs}(\tilde{e}_i) \qquad (5)$$

The vector $e'$ is approximately close to the original $e$ for two reasons: 1) the sign-flipping mask $\sigma$ is removed by taking the absolute values; 2) the values of the other, irrelevant mixup text vectors can be "cancelled out" by the averaging (which may also add some noise to the vector). Thus, we need to ensure that the recovered result is close to the original with tolerable noise.
We first recover the values of the mix-up coefficients $\lambda$ via the mix-up labels. Specifically, we can read off the list of $\lambda$ from the mix-up labels since TextHide utilizes one-hot vector labels. For example, given one TextHide label (0.4, 0, 0, 0.6), we can directly derive $\lambda_i, \lambda_j$ as 0.4, 0.6 (Figure 1 in (Huang et al., 2020a)). The attacker can thus directly retrieve the values of $\lambda$. Note that there exists one special case: the two mixed data could belong to the same class (the mixed label then has only one non-zero entry), in which case we consider $\lambda_i = \lambda_j$.
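Reading the coefficients off a mixed label takes only a few lines for the K = 2 case. In this sketch the function name `lambdas_from_label` and the tolerance are illustrative choices, and the fallback of 0.5 each follows from taking the two coefficients as equal while still requiring them to sum to 1.

```python
import numpy as np

def lambdas_from_label(mixed_label, tol=1e-8):
    """Read the two mixing coefficients off a mixed one-hot label (K = 2)."""
    nonzero = np.flatnonzero(np.abs(mixed_label) > tol)
    if len(nonzero) == 2:
        # Each non-zero entry is the coefficient of the instance in that class.
        return tuple(float(mixed_label[c]) for c in nonzero)
    # Same-class case: the label does not reveal the split, so use equal weights.
    return (0.5, 0.5)

two_classes = lambdas_from_label(np.array([0.4, 0.0, 0.0, 0.6]))
same_class = lambdas_from_label(np.array([0.0, 1.0, 0.0, 0.0]))
```

The label ties each coefficient to a class rather than to a specific instance, so assigning the coefficients to the two sources still relies on the mapping function from the clustering step.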
After we compute the values of $\lambda$, we can reconstruct the original vector $e$ by trying to invert the mixup operation (Equation 2). Specifically, we denote $\Lambda$ as an $|\tilde{E}| \times |X|$ matrix. In each row of $\Lambda$, there are two non-zero entries $i, j$ corresponding to the two mixup values $\lambda_i$ and $\lambda_j$ (the other entries are 0). Denote the original text vectors as $X = [e_1, \cdots, e_{|X|}]^T$ (with dimension $|X| \times d$), and the privately encoded vectors as $Y = [\tilde{e}_1, \cdots, \tilde{e}_{|\tilde{E}|}]^T$ (with dimension $|\tilde{E}| \times d$). Then, Equation 2 can be updated as:

$$Y = \Lambda \cdot X' + \epsilon \qquad (6)$$

where $\epsilon$ denotes the potentially introduced noise ($X'$ may not be exactly the original $X$). To compute $X'$, we can directly solve the above equation:

$$X' = \Lambda^{-1} \cdot Y - \Lambda^{-1} \cdot \epsilon \qquad (7)$$

(since $\Lambda$ is generally not square, $\Lambda^{-1}$ is to be read as a pseudo-inverse). Since the noise may follow a Gaussian distribution, the component $\Lambda^{-1} \cdot \epsilon \approx 0$ (its mean value would be close to 0, so it can be averaged out). Furthermore, we can formulate another optimization to minimize the "extra" noise $\epsilon$:

$$\min_{X', \epsilon} \|\epsilon\|_2^2 \quad \text{s.t.} \quad Y = \Lambda \cdot X' + \epsilon \qquad (8)$$

Thus, by minimizing the noise, we can accurately derive $X'$ (close to the true value). It is worth noting that $X'$ includes the sign-flipping mask $\sigma$. Recall that we nullify the mask $\sigma$ by taking the absolute value; Equation 8 can then be updated as:

$$\min_{X', \epsilon} \|\epsilon\|_2^2 \quad \text{s.t.} \quad \mathrm{abs}(Y) = \Lambda \cdot \mathrm{abs}(X') + \epsilon \qquad (9)$$

where abs is the element-wise absolute value function of the matrix $X'$ or $Y$. To solve Equation 8, we can utilize gradient descent to search for the value of $X'$, and thus compute the best-fit solution of $X'$ (w.r.t. the objective function $\|\epsilon\|_2^2$). Note that several values of $\epsilon$ may satisfy the constraints; we can then heuristically search the value of $\epsilon$ entry by entry to obtain the smallest $\|\epsilon\|_2^2$. Since the attacker has full knowledge of the pre-trained language model $\phi(\cdot)$, we can directly utilize model inversion attacks (Song and Raghunathan, 2020) to recover the original text.
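The linear reconstruction can be illustrated on toy data. The sketch below swaps in a closed-form least-squares solve (`numpy.linalg.lstsq`) for the gradient descent mentioned in the text, and it assumes non-negative original vectors so that the absolute value of a mixture equals the mixture of absolute values; for real representations with mixed signs this holds only approximately, and the mismatch is what the noise term absorbs. All sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_orig, n_enc, d = 5, 40, 8
X_true = np.abs(rng.normal(size=(n_orig, d)))       # toy assumption: non-negative originals

# Build Lambda: each row has two non-zero entries that sum to 1 (K = 2 mixup).
Lam = np.zeros((n_enc, n_orig))
for r in range(n_enc):
    i, j = rng.choice(n_orig, size=2, replace=False)
    lam = rng.uniform(0.1, 0.9)
    Lam[r, i], Lam[r, j] = lam, 1.0 - lam

# Masked encodings: one random sign-flipping mask per encoded vector.
sigma = rng.choice([-1.0, 1.0], size=(n_enc, d))
Y = sigma * (Lam @ X_true)

# Least-squares fit of abs(Y) = Lam @ abs(X'), i.e. minimise the squared noise.
X_rec, *_ = np.linalg.lstsq(Lam, np.abs(Y), rcond=None)
max_err = np.max(np.abs(X_rec - X_true))
```

In this idealised setting the recovery is exact up to floating-point error; with signed vectors only the magnitudes are targeted and the fit is approximate.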

Results and Analysis
We utilize the pre-trained BERT base model by Devlin et al. (2019) (google-research/bert) as the language model to generate the text representations (the dimensionality $d$ is 768). We evaluate our attack on two datasets for sentence classification, which serve as the private datasets: 1) Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019); 2) Stanford Sentiment Treebank (SST-2) (Socher et al., 2013). For the "public dataset", we use the MNLI dataset (Huang et al., 2020a). We utilize the open source code of TextHide (https://github.com/Hazelsuko07/TextHide) to construct the private dataset. We vary the parameter $K \in \{1, 2, 4, 6\}$ (the number of data for mixup) while keeping the size of the mask pool at $m = 1$. We also evaluate the attack performance when varying the size of the mask pool, $m \in \{1, 16, 64, 256, 1024, 4096\}$. For each dataset, we randomly select 100 data points and generate 5000 encoded data via TextHide. Our attack then tries to reconstruct the original data from these 5000 encoded data. We report the attack success rate (the percentage of reconstructed data out of the original data). Note that our attack is independent of datasets/applications and hyper-parameter free.
Table 1 illustrates the attack results (the percentage of original data recovered) on the two datasets. We can observe that our proposed attack recovers almost all of the text vectors (high success rate). Moreover, while TextHide claims that privacy increases as $K$ increases (at the cost of accuracy), the results show that the value of $K$ does not impact privacy much. Similarly, Figure 1 shows that the mask cannot ensure privacy (it only increases the computational cost). Above all, the text vectors cannot simply be viewed as "real-number" vectors since they may still contain semantic meanings (features), which may help the attacker break the security oracle more efficiently.

Discussion
Privacy preserving machine learning (PPML) has become popular in industry under increasingly restrictive data acts and regulations, e.g., the General Data Protection Regulation (GDPR) in the European Union. PPML could help corporations improve business continuity while machine learning-based services deal with large amounts of personal data/information, including text-based applications such as keyboard input prediction (Hard et al., 2018). Private instance encoding (e.g., TextHide) has been proposed to address privacy risks in such application scenarios. However, the weak privacy guarantees provided by TextHide (e.g., against our proposed attack) may leak personal information and also violate privacy regulations and laws. This could lead to severe sanctions and damage an enterprise's reputation with its customers. As discussed earlier, a well-designed privacy-enhancing scheme must ensure a provable privacy guarantee and demonstrate its performance on data protection. Since TextHide is based on the mixup encoding method, it would be possible to apply differential privacy (Dwork et al., 2006) to the mixup encoding and thus to show similar indistinguishability of the privately encoded instances. This can defend against our reconstruction attack to some extent (at least reducing the information disclosure). Another possible defense is to filter sensitive data out of the training data. However, this might degrade the model performance.
It is also worth noting that an intrinsic property of DNN models (i.e., memorization) can also be utilized to extract/recover training data from the model itself, especially for language models (Carlini et al., 2020b; Lehman et al., 2021). Such works are orthogonal to our proposed attack since we focus on the encoded data. Nevertheless, our attack can be integrated with such attacks to be more powerful against instance encoding schemes.

Conclusion
We proposed a novel reconstruction attack on a recent private learning scheme in the NLP domain, TextHide. We have experimentally shown that such a scheme cannot provide a rigorous privacy guarantee even though it obtains good accuracy.

Ethical Impact
Data privacy topics (including privacy-enhancing technologies and attacks that breach data privacy) have been widely investigated in machine learning-based applications. Such works should be carefully considered so that they are ethical rather than harmful, especially attacks that break privacy-enhancing technologies. We think this applies to our case. Even though it is possible that our proposed attack could be misused, we firmly believe that it can draw more attention to privacy-enhancing works and motivate more advanced defense schemes in the language understanding domain.