CAPE: Context-Aware Private Embeddings for Private Language Learning

Neural language models have contributed to state-of-the-art results in a number of downstream applications, including sentiment analysis and intent classification. However, obtaining text representations or embeddings using these models risks encoding personally identifiable information learned from language and context cues, which may lead to privacy leaks. To ameliorate this issue, we propose Context-Aware Private Embeddings (CAPE), a novel approach which combines differential privacy and adversarial learning to preserve privacy during the training of embeddings. Specifically, CAPE first applies calibrated noise through differential privacy, preserving the encoded semantic links in the representation while obscuring sensitive information. CAPE then employs an adversarial training regime that obscures identified private variables. Experimental results demonstrate that our proposed approach is more effective in reducing private information leakage than either single intervention, with approximately a 3% reduction in attacker performance compared to the best-performing current method.


Introduction
Deep learning has provided remarkable advances in language understanding and modelling tasks in recent years (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). However, this increased utility may harm user privacy, as neural models trained on datasets containing personally identifiable information can unintentionally leak information that users may prefer to keep private (Carlini et al., 2019; Song et al., 2017). Even seemingly innocuous collections of metadata (Xu et al., 2008), such as data provided by users (e.g. at registration time on social media) or data which has been cleansed of identifying attributes (Sun et al., 2012), can provide latent information for the re-identification of participants.
Using social media data can also raise ethical considerations (Townsend and Wallace, 2016). Users may have edited or deleted posts that models continue to rely on through existing datasets, and those posts may unintentionally reveal information their authors would rather keep private (Bartunov et al., 2012; Pontes et al., 2012; Goga et al., 2013). Research has demonstrated practical attacks that exploit trained models to establish whether a particular individual's data formed part of the training set, known as membership inference (Leino and Fredrikson, 2020; Truex et al., 2019). Personally identifiable attributes such as age, gender, or location can also be reliably reconstructed given the output of such a model (Fredrikson et al., 2015; Zhang et al., 2020). Neural representations of input data, including language embeddings, have proven to be a vulnerability for these inferences (Song and Raghunathan, 2020); privacy-preserving techniques should therefore be applied to text representations when they form part of a machine learning pipeline.
To minimise the risk of such attacks uncovering sensitive information, previous work has employed an adversarial training objective (Coavoux et al., 2018; Li et al., 2018), modifying the loss function of the model to impose a penalty when a simulated attacker task, such as predicting a private variable from the input sequence, performs well. However, this approach provides neither formal privacy guarantees nor a privacy loss accounting system. Phan et al. (2020) proposed an approach which implements classical differential privacy in an adversarial learning paradigm; however, that work relies on adversarial objectives to promote robustness to adversarial samples rather than privacy.
Providing a formal privacy guarantee leads to the notion of differential privacy (DP), as defined by Dwork and Roth (2013). This definition quantifies privacy loss as the maximum possible deviation between the same aggregate function applied to two datasets which differ only in a single record, expressed by the variable $\epsilon$.

Definition 1.1 ($\epsilon$-differential privacy). The level of private information leaked by a computation $M$ can be expressed by the variable $\epsilon$ where, for any two datasets $A$ and $B$ differing in a single record, and any set of possible outputs $S \subseteq \mathrm{Range}(M)$,

$\Pr[M(A) \in S] \leq e^{\epsilon} \, \Pr[M(B) \in S]$

This notion of $\epsilon$-differential privacy has been extended to text embeddings through the application of calibrated noise (Fernandes et al., 2019; Beigi et al., 2019). Lyu et al. (2020) proposed a method based on local differential privacy, an extension of the schema under which noise is applied to the input data before it leaves the user's device and is encountered by the model owner, producing a private representation which can be sent to a server for classification. However, this approach uses simulated attacker performance only as a test benchmark for private information leakage, rather than during training to improve privacy outcomes.
Determining the state of the art in a task of relatively recent provenance, and with somewhat limited practical research such as this, proves challenging; however, we consider the adversarial learning approach of Coavoux et al. (2018) and the local DP approach of Lyu et al. (2020) to represent the focus of the most current research (Alnasser et al., 2021; Dayanik and Padó, 2021; Kaneko and Bollegala, 2021; Friedrich et al., 2019; Vu et al., 2019).

Contributions:
In this work, we propose an approach that combines perturbed pre-trained embeddings with a privacy-preserving adversarial training objective, which helps preserve the encoded semantic links in the input text while obscuring sensitive information. We demonstrate that our approach achieves comparable task performance to a competitive baseline while preserving privacy. We experiment with a dataset that contains personally identifiable information, namely gender, location and birth year. To minimize harm, we use a publicly available English-language dataset (Hovy et al., 2015). Specifically:
• We introduce CAPE, "Context-Aware Private Embeddings", an approach that applies both DP-compliant perturbations and an adversarial learning objective to privatize the embedding outputs of pre-trained language models.
• We establish metrics for testing the privacy outcomes of our system against non-DP-compliant models, offering an empirical framework for measuring the success of simulated attacks.
• We find that attacker inferences demonstrate differing levels of accuracy depending on the type of private attribute targeted.
• We establish superior privacy outcomes for our method compared to a sample adversarial learning approach (Coavoux et al., 2018) and a perturbation-only method (Lyu et al., 2020) representing the dominant approaches currently applied to other task domains.

Methodology
We consider the possibility that an attacker may have access to the intermediate feature representations extracted from text by a published language model, along with a supervision signal that may allow them to train a model to recover private information about the text author, possibly garnered from access to a secondary data source as demonstrated by Narayanan and Shmatikov (2008) and Carlini et al. (2020). To mitigate this risk, we introduce a DP-compliant layer to the feature extractor that perturbs the representations by adding calibrated noise. We also train a second classifier to predict known private variables in addition to our main target task classifier, and pass the error gradient from this secondary classifier through a reversal layer to promote embedding invariance to the private features. Figure 1 shows the system architecture.

Task Formulation & Data
We experiment with multi-class sentiment analysis on the UK section of the Trustpilot dataset (Hovy et al., 2015), which provides text reviews with an attached numerical rating from 1-5 as well as three demographic attributes: gender, location and birth year. Sentiment analysis of text reviews represents a popular task to which pre-trained language models are well suited. We use gender as reported in the dataset, as a binary attribute, while birth years are separated into six equal-sized age-range bins (<1955, 1955-1963, 1964-1971, 1972-1978, 1979-1985, >1985), and locations are translated from latitude/longitude pairs into Geohash strings with a precision of two characters, which results in five potential location classes. The dataset covers multiple regions and languages; for ease of implementation, we include only English-language reviews from the UK region in these experiments. A summary of this dataset is included in Table 1.

We treat gender as a binary categorical variable, since this is the way the value is represented in the dataset. We recognise that this dualism may not fully represent the range of potential gender expressions (Cao and Daumé III, 2020), and would advocate for a wider conception of potential gender representations in future dataset releases. Age of the respondent is listed in the dataset as a year of birth; we separate these values into six equal-sized bins, assigning each bin an integer ID which replaces the year in our input data. The location variable is encoded as a Geohash string of length 2, which translates into a precision of ±630 km. This level of precision avoids the risk of under-populated classes; with a more extensive dataset it would make sense to increase precision by extending the length of the Geohash string. The resulting set of five possible strings for our dataset fraction is also given a categorical integer ID. Thus bucketed, these attributes are suitable variables for classification modelling.
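As an illustration, the following is a minimal sketch of this attribute bucketing, assuming a pandas DataFrame with gender, birth_year, lat and lon columns and the pygeohash package for Geohash encoding; the paper does not name a specific implementation, so the column names and library choice are assumptions.

```python
import pandas as pd
import pygeohash as pgh  # assumed library; the paper does not name its Geohash implementation

# Birth-year bins matching the six equal-sized age ranges described above.
year_bins = [-float("inf"), 1954, 1963, 1971, 1978, 1985, float("inf")]

def encode_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Map raw demographic fields to categorical integer IDs."""
    out = df.copy()
    # Gender: binary categorical variable, as provided in the dataset.
    out["gender_id"] = out["gender"].astype("category").cat.codes
    # Birth year -> one of six age-range bins (labels 0..5).
    out["age_id"] = pd.cut(out["birth_year"], bins=year_bins, labels=False)
    # Latitude/longitude -> 2-character Geohash (precision of roughly ±630 km),
    # then a categorical ID over the resulting set of strings.
    out["geohash2"] = out.apply(
        lambda r: pgh.encode(r["lat"], r["lon"], precision=2), axis=1
    )
    out["location_id"] = out["geohash2"].astype("category").cat.codes
    return out
```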
In our initial baseline experiment, we train a feature extractor consisting of the layers of a pre-trained BERT model (Devlin et al., 2019) in order to extract useful features from the input text x. We obtain the final hidden state of the pre-trained model for each token in the input, then take the mean over the sequence to produce an embedding for the full text, such that:

$x_e = \frac{1}{N} \sum_{i=1}^{N} h_i$    (1)

where $h_i$ is the final hidden state for token $i$ and $N$ is the sequence length. Sentiment analysis is then carried out by a classifier which learns to predict the review rating label y given the embedding vector $x_e$.
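A minimal sketch of this mean-pooled embedding (Equation 1), assuming the HuggingFace transformers library with PyTorch; the specific checkpoint bert-base-uncased is an assumption, as the text only specifies a pre-trained BERT model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint, used here purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the final hidden states over non-padding tokens (Eq. 1)."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state               # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)
```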
Layer size, dropout rate and other hyperparameters were optimised with a grid search, selecting the most effective with respect to the target task F1 score metric. Optimal parameters are shown in Table 2.
A sample setup as created for CAPE model testing is shown in Appendix A. The adversarial-only, differentially-private-only, and baseline setups are similar, omitting the noise layer, the attacker classifier, or both, respectively.
We simulate a task that an attacker may wish to perform on the input text by training a secondary classifier, alongside the target task, that attempts to predict the value of private information variables z. Following Coavoux et al. (2018), we target several features of the respondent as extracted from the dataset, namely gender, location, and birth year. While these features are not in fact private, since they are public information provided by the users, they represent good proxies for sensitive attributes that users may not wish to be inferred from similar public datasets. In this sense, they provide a useful benchmark of the potential privacy risk, while allowing us to avoid unethical inferences concerning private attributes not shared by the user.

Adversarial Training
In order to promote invariance in the text representation with respect to our private variables, we adopt the approach pioneered by Ganin et al. (2017). Initially designed to promote domain-independent learning, this approach trains a secondary objective via gradient descent to predict features we do not wish to be distinguishable, then passes its loss through a gradient reversal layer into the shared representation, provided in our experiments by the feature extractor.
For a single instance of our data (x_e, y, z), the adversarial classifier optimizes:

$\mathcal{L}_{adv}(x_e, z; \theta_a) = -\log P(z \mid x_e; \theta_a)$    (2)

Hence, the combination of the target and attacker classifiers leads to the following objective function, where $\theta_r$, $\theta_p$, $\theta_a$ represent the parameters of the feature extractor, target classifier and adversarial classifier respectively:

$\mathcal{L}(x_e, y, z; \theta_r, \theta_p, \theta_a) = -\log P(y \mid x_e; \theta_r, \theta_p) - \lambda \, \neg \log P(z \mid x_e; \theta_r, \theta_a)$    (3)

where ¬ indicates that the log likelihood of the private label z is inverted, and λ is the regularization parameter scaling the gradient from our adversarial classifier.
The combined classification section therefore consists of two separate classification heads, one for our base task and one for our simulated attacker task. Each consists of two densely-connected layers separated by a dropout layer. The attacker classifier includes a gradient reversal layer which flips the sign of the gradient during the backwards pass.
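The following PyTorch sketch illustrates a gradient reversal layer and the two-layer classification heads described above; the hidden size, dropout rate and lam scaling here are placeholders, since the tuned values are those reported in Table 2.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales by lam) the gradient on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def make_head(in_dim, hidden_dim, n_classes, dropout=0.1):
    """Two densely-connected layers separated by dropout, as used for both heads."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_dim, n_classes),
    )

# Target head predicts the review rating; the attacker head predicts the private
# attribute from the reversed-gradient view of the same embedding:
#   logits_y = target_head(x_e)
#   logits_z = attacker_head(GradientReversal.apply(x_e, lam))
```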

Embedding Perturbation
Since it is also desirable to provide a measure of general privacy alongside the specific attacker task we simulate in our adversarial training, we adopt the local DP method of Lyu et al. (2020) to perturb the feature representations we produce. Converting the generated embedding into a DP-compliant representation requires us to inject calibrated Laplace noise into the hidden state vector obtained from the pre-trained language model as follows:

$\hat{x}_e = x_e + n, \quad n_i \sim \mathrm{Lap}\!\left(0, \frac{\Delta f}{\epsilon}\right)$    (4)

where n is a vector of equal length to $x_e$ containing i.i.d. random variables sampled from the Laplace distribution centred around 0 with scale $\Delta f / \epsilon$, $\epsilon$ is the privacy budget parameter and $\Delta f$ is the sensitivity of our function. Since determining the sensitivity of an unbounded embedding function is practically infeasible, we constrain the range of our representation to [0,1], as recommended by Shokri and Shmatikov (2015). In this way, the L1 norm and the sensitivity of our function summed across the n dimensions of $x_e$ are the same, i.e. $\Delta f = 1$.
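A minimal sketch of this perturbation step, under the stated assumption that Δf = 1 once the representation is bounded to [0,1]; the min-max scaling used here to enforce that bound is one possible choice, not necessarily the exact procedure used in the paper.

```python
import torch

def perturb(x_e, epsilon, delta_f=1.0):
    """Bound the embedding to [0, 1], then add i.i.d. Laplace noise of scale delta_f / epsilon."""
    # Min-max normalisation (one way to constrain the range so that delta_f = 1).
    x_min = x_e.min(dim=-1, keepdim=True).values
    x_max = x_e.max(dim=-1, keepdim=True).values
    x_bounded = (x_e - x_min) / (x_max - x_min + 1e-12)
    # Calibrated Laplace noise, one sample per dimension (Eq. 4).
    noise = torch.distributions.Laplace(0.0, delta_f / epsilon).sample(x_bounded.shape)
    return x_bounded + noise
```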

Context-Aware Private Embeddings (CAPE)
To pair the general privacy benefits of DP-compliant embeddings with invariance to the specific private variables targeted during adversarial training, we combine both processes in a system we call Context-Aware Private Embeddings (CAPE). Algorithm 1 presents the joint adversarial training scheme with perturbed embedding sequences derived from our feature extractor.
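Putting the pieces together, the following is a high-level sketch of one joint training step in the spirit of Algorithm 1 (which is not reproduced here); it reuses the hypothetical embed, perturb and GradientReversal helpers sketched above, with ε = 0.1 and λ = 1.0 as the default values used in our experiments.

```python
import torch.nn.functional as F

def cape_step(batch, optimizer, target_head, attacker_head, epsilon=0.1, lam=1.0):
    """One CAPE training step: perturb the embedding, then optimise both heads jointly.

    `optimizer` is assumed to cover the encoder and both classification heads.
    """
    x_e = embed(batch["text"])                  # contextual embedding (Eq. 1)
    x_e = perturb(x_e, epsilon)                 # DP-compliant Laplace perturbation (Eq. 4)
    loss_y = F.cross_entropy(target_head(x_e), batch["rating"])        # target task
    logits_z = attacker_head(GradientReversal.apply(x_e, lam))         # reversed gradient
    loss_z = F.cross_entropy(logits_z, batch["private_label"])         # simulated attacker
    # The reversal layer already negates and scales the attacker gradient reaching the encoder,
    # so the two losses can simply be summed.
    (loss_y + loss_z).backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss_y.item(), loss_z.item()
```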

Evaluation
We evaluate performance on the target task (i.e. sentiment analysis) and on our simulated attacker task (i.e. classifying each private attribute) using accuracy, and additionally report the F1-score along with its standard deviation. Note that lower scores for the attacker classifier denote greater empirical evidence of privacy (i.e., the attacker cannot predict the target variable); the lowest attacker score in each scenario is therefore indicated in bold, as is the highest score for the target task. All evaluations were performed by randomly selecting 70% of the data for training and holding out the remaining 30% for testing. We compute the mean and standard deviation of the F1-score over four runs.
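For concreteness, a small helper along these lines could aggregate the per-run scores, assuming scikit-learn metrics; macro-averaged F1 is an assumption, since the text does not state which F1 average is reported.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate_runs(run_labels, run_predictions, average="macro"):
    """Mean accuracy plus mean and standard deviation of F1 over repeated runs."""
    accs = [accuracy_score(y, p) for y, p in zip(run_labels, run_predictions)]
    f1s = [f1_score(y, p, average=average) for y, p in zip(run_labels, run_predictions)]
    return np.mean(accs), np.mean(f1s), np.std(f1s)

# Each run uses a fresh random 70/30 split, as described above, e.g.:
# train_idx, test_idx = train_test_split(range(n_examples), test_size=0.3, random_state=seed)
```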

Results
Results for our target and attacker tasks using the English-language reviews drawn from the UK section of the Trustpilot dataset are listed below. Table 3 shows the results for each system, with the ε and λ parameters fixed at 0.1 and 1.0 respectively. These values are derived from a set of experiments with a range of privacy parameter values, as detailed in Table 4.

Table 3: Results for the target task and the simulated attacker task. SD = standard deviation of the F1 score over four cross-validation runs. CAPE outperforms all other approaches in terms of privacy preservation for all variables.

Influence of privacy parameters
In order to determine the impact of increasing the stringency of privacy guarantees on performance, we tested our CAPE model with the gender private variable using several values of ε while maintaining a value of 1.0 for λ. A similar experiment was carried out for values of λ with ε fixed at 0.1. Results for both experiments are shown in Table 4.

Discussion and Conclusion
These results demonstrate the enhanced privacy afforded by the CAPE approach over either privacy intervention applied in isolation. We provide evidence that adversarial training can produce superior outcomes to a DP-only approach when we consider the private variable targeted during training. Adding DP noise clearly harms task performance, indicating that further work is needed on alternative processes for perturbing embeddings. Perturbed embeddings generated in Euclidean space perform more poorly as the privacy guarantee increases, so projecting embeddings into hyperbolic space (Dhingra et al., 2018) or implementing a search mechanism that selects semantically similar vectors representing real words (Feyisetan et al., 2020) could produce better outcomes at lower privacy budgets. Interestingly, we find that different private attributes are predictable by an attacker at different rates: while the attacker can predict the correct gender or location class effectively, results for age range are barely above random chance. It may well be that in the UK word choice varies more between areas and genders than between age cohorts; for example, a reviewer who cites a product's "lush vanilla taste" may reside in the West of England, while calling a bad service "shite" may indicate they are Scottish. This is an interesting counterpoint to Welch et al. (2020), who found better embedding performance with age- and gender-aware representations in a global population. Differing privacy requirements for separate attributes are a feature of multiple variations on differential privacy regimes (Kamalaruban et al., 2020; Alaggan et al., 2017; Jorgensen et al., 2015).
We note finally that English exhibits fewer grammatical markers of gender than some other languages (Boroditsky and Schmidt, 2000), a peculiarity which may affect the utility of the model in significant ways. Further exploration across different language families will shed light on how privacy-preserving methods can assist in concealing private information.