Unsupervised Text Deidentification

Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally-identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside of the common named entity based approach.


Introduction
In domains such as law, medicine, and government, it can be difficult to release textual data because it contains sensitive personal information (Johnson et al., 2016; Jana and Biemann, 2021; Pilán et al., 2022). Privacy laws and regulations vary by domain and impact the requirements for deidentification. Most prior work on automatic deidentification (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Liu et al., 2017; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) deidentifies data to the requirements of the HIPAA Safe Harbor method (Centers for Medicare & Medicaid Services, 1996). Annotations for these systems are based on a list of 18 identifiers like age, phone number, and zip code. These systems treat deidentification as a named entity recognition problem within this space. Upon the removal of these pre-defined entities, the text is no longer considered sensitive.

[Footnote 1: Our code and deidentified datasets are available on Github.]
However, one of the 18 categories defined by HIPAA Safe Harbor includes "any unique identifying number, characteristic, or code [that could be used to reidentify an individual]". Prior work ignores this nebulous 18th category. One reason the category is ill-defined is the existence of quasi-identifiers: pieces of personally identifiable information (PII) that do not fall under any single category and are therefore difficult to identify and label in the general case (Phillips and Knoppers, 2016). Even data that has all of the categories from Safe Harbor removed may still be reidentified through quasi-identifiers (Angiuli et al., 2015). Supervised approaches cannot naturally detect quasi-identifiers, since these words are not inherently labeled as PII (Uzuner et al., 2007).
In this work, we propose an unsupervised deidentification method that targets the more general definition of PII. Instead of relying on specific rule lists of named entities, we directly remove words that could lead to reidentification. Motivated by the goal of K-anonymity (Lison et al., 2021), our approach utilizes a learned probabilistic reidentification model to predict the true identity of a given text. We perform combinatorial inference in this model to find a set of words that, when masked, achieves K-anonymity. The system does not require any annotations of specific PII, but instead learns from a dataset of aligned descriptive text and profile information. Using this information, we can train an identification process using a dense encoder model.
Experiments test the ability of the system to deidentify documents from a large-scale database. We use a dataset of Wikipedia biographies aligned with infoboxes (Lebret et al., 2016). The system is fit on a subset of the data and then asked to deidentify unseen individuals. Results show that even when all words from the profile are masked, the system is able to reidentify 32% of individuals.
When we use our system to deidentify documents, it is able to fully anonymize them while retaining over 50% of words. When we compare our deidentification method to a set of unsupervised baselines, our method deidentifies documents more completely while removing fewer words. We qualitatively and quantitatively analyze the redactions produced by our system, including examples of successfully redacted quasi-identifiers.

Related Work
Automated deidentification. There is much prior work on deidentifying text datasets, both with rule-based systems (Neamatullah et al., 2008; Meystre et al., 2010; Sánchez et al., 2014; Norgeot et al., 2020; Sberbank and Emelyanov, 2021) and deep learning methods (Liu et al., 2017; Yue and Zhou, 2020; Johnson et al., 2020). Each of these methods is supervised, relies on datasets with human-labeled PII, and focuses on removing some subset of the 18 identifying categories from HIPAA Safe Harbor. Other approaches include generating entirely new fake datasets using Generative Adversarial Networks (GANs) (Chin-Cheong et al., 2019). Friedrich et al. (2019) train an LSTM on an EMR-based NLP task using an adversarial loss to prevent the model from learning to reconstruct the input. Finally, differential privacy is a technique for ensuring provably private distributions (Dwork et al., 2006). It has mostly been used for training anonymized models on data containing PII, but requires access to the un-anonymized datasets for training (Li et al., 2021). Our deidentification approach does not provide the formal guarantees of differential privacy, but aims to provide a practical solution for anonymizing datasets in real-world scenarios.
Deidentification by reidentification. The NeurIPS 2020 Hide-and-Seek Privacy Challenge benchmarked both deidentification and reidentification techniques for clinical time series data (Jordon et al., 2021). In computer vision, researchers have proposed learning to mask faces in images to preserve the privacy of individuals using reidentification (Hukkelås et al., 2019; Maximov et al., 2020; Gupta et al., 2021). In NLP, some work has been done on evaluating the reidentification risk of deidentified text (Scaiano et al., 2016). El Emam et al. (2009) propose a method for deidentification of tabular datasets based on the concept of K-anonymity. Gardner and Xiong (2009) deidentify unstructured text by performing named entity extraction and redacting entities until K-anonymity is reached. Mansour et al. (2021) propose an algorithm for deidentification of tabular datasets by quantifying reidentification risk using a metric related to K-anonymity. In our work, we train a reidentification model in an adversarial setting and use the model to deidentify documents directly.
Learning in the presence of masks. Various works have shown how to improve NLP models by masking some of the input during training. Chen and Ji (2020) show that learning in the presence of masks can improve classifier interpretability and accuracy. Li et al. (2016) train a model, using reinforcement learning, to search for the minimal subset of words that, when removed, changes the output of a classifier, and apply their method to neural network interpretability. Liao et al. (2020) pre-train a BERT-style language model to do masked-word prediction by sampling a masking ratio from U(0, 1) and masking that many words.
While their method was originally proposed for text generation, we apply the same masking approach to train language models for redaction.

Quasi-Identifiers
In order to study the problem of deidentifying personal information in documents, we set up a model dataset utilizing personal profiles from Wikipedia. We use the WikiBio dataset (Lebret et al., 2016). Each entry in the dataset contains a document, the introductory text of the Wikipedia article, and a profile, the infobox of key-value pairs containing personal information. Is it difficult to deidentify individuals in this dataset? Wikipedia presents no domain challenges, so finding entities is trivial. In addition, many of the terms in the documents overlap directly with the terms in the profile table. Simple techniques should therefore provide robust deidentification.
We test this with two deidentification techniques: (1) Named entity removes all words in documents that are tagged as named entities. (2) Lexical removes all words in the document that also appear in the profile. To reidentify, we use an information retrieval model (BM25) and a dense neural network approach (described in Section 5).
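The Lexical baseline can be sketched in a few lines. This is our own minimal illustration, not the paper's implementation: tokenization is naive whitespace splitting, the mask token and punctuation stripping are assumptions, and whole tokens (including attached punctuation) are replaced.

```python
# Sketch of the "Lexical" baseline: mask every document word that also
# appears in the profile values. Tokenization, case folding, and the
# mask token are our assumptions, not the paper's exact implementation.
def lexical_redact(document, profile_values, mask_token="<mask>"):
    profile_words = {w.lower() for v in profile_values for w in v.split()}
    return " ".join(
        mask_token if w.lower().strip(".,") in profile_words else w
        for w in document.split()
    )

# Example: profile-overlapping words are replaced by the mask token.
print(lexical_redact("John Smith was born in Paris.", ["John Smith", "Paris"]))
```

A real implementation would use a proper tokenizer and normalize inflected forms, but the core idea is just set membership against the profile vocabulary.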
Table 1 shows the results. While IR-based ReID is able to reidentify most of the original documents, without named entities or lexical matches, documents appear to be no longer reidentifiable. However, our model is able to reidentify 80% of documents, even with all entities removed. With all lexical matches with the profile removed (32% of total words), NN ReID is still able to reidentify a non-trivial number of documents.
This experiment indicates that even in the WikiBio domain, there are a significant number of quasi-identifiers that allow the system to identify documents even when almost all known matching information is removed. In this work we study methods for discovering and quantifying these identifiers.

Deidentification by Inference
An overview of our data and system is shown in Figure 1. Given a document x_1 ... x_N, we consider the problem of uniquely identifying the corresponding person y from a set of possible options Y. The system works in the presence of redactions defined by a latent binary mask z_1 ... z_N over positions, where setting z_n = 1 masks word x_n.
We define a reidentification model p(y | x, z) that assigns a probability to each profile in Y for a masked document. During deidentification, we assume that we have access to the true identity ŷ of the document that we would like to hide.

Algorithm 1 Greedy Deidentification
  x, ŷ ← input document and person
  z_j ← 0 for all j
  for i = 1 to N do
    j* ← arg min_j p(y = ŷ | x, z_{−j}, z_j = 1)
    z_{j*} ← 1
    if ŷ ∉ K-argmax_y p(y | x, z) then break
Our objective is to find the minimally sized mask that ensures that ŷ is not in the top-K predictions of the identification model:

  min_z Σ_n z_n   subject to   ŷ ∉ K-argmax_y p(y | x, z)

This objective is motivated by the concept of K-anonymity (Samarati and Sweeney, 1998). A dataset has K-anonymity if each person ŷ in the dataset is indistinguishable from at least K other people in Y.
The K-anonymity objective is combinatorial, and is intractable to solve with a non-trivial reidentification model. We instead approximate it with search. Specifically, we use the simple greedy deidentification technique shown in Algorithm 1.
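The greedy procedure of Algorithm 1 can be sketched as follows. This is an illustrative reimplementation, not the paper's code: `reid_prob` is an assumed callback returning p(y = ŷ | x, z) for a candidate mask, and the probability threshold stands in for the rank-K stopping test.

```python
# Minimal sketch of greedy deidentification. At each step, mask the single
# word whose removal most decreases the true profile's probability under the
# reidentification model, stopping once that probability falls below a
# threshold (a proxy for the "ŷ not in top-K" condition).
def greedy_deidentify(n_words, reid_prob, threshold=0.01, max_steps=None):
    z = [0] * n_words                       # z[j] = 1 means word j is masked
    for _ in range(max_steps or n_words):
        if reid_prob(z) < threshold:        # anonymity proxy reached
            break
        best_j, best_p = None, float("inf")
        for j in range(n_words):
            if z[j]:
                continue                    # already masked
            z_try = z.copy()
            z_try[j] = 1
            p = reid_prob(z_try)
            if p < best_p:
                best_j, best_p = j, p
        if best_j is None:
            break                           # everything is masked
        z[best_j] = 1                       # commit the most damaging mask
    return z
```

Each step costs one model evaluation per unmasked word, which is why the paper later restricts the search (e.g. skipping stopwords) for efficiency.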

Reidentification Model
The core of this redaction system is a model of reidentification, p(y | x, z). Defining this model faces two challenges: a) to facilitate informed search in the presence of masks, and b) to correctly identify a person from 100,000s of choices.
As we do not have access to supervised masks, we define the probability of unmasked identification by marginalizing over all possible masks:

  p(y | x) = Σ_z p(z | x) p(y | x, z; θ)

where p(z | x) is the mask prior and p(y | x, z; θ) is the reidentification model.
To assign a prior over masks p(z | x), we opt for a simple setting that avoids building in additional information and fits well with deidentification search. One possibility would be to follow BERT-style masking and mask words at a fixed ratio of 15% (Devlin et al., 2019). However, Liao et al. (2020) argue that while successful for classification, fixed-ratio masking works poorly for generation-style objectives. Following this advice, we use the following algorithm to construct masks of varying size:
• Sample the number of masks l ∼ Uni(0, N).
• Sample l masked words z_m by uniformly sampling indices m from {1, ..., N} without replacement.
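The two sampling steps above amount to a short routine. A minimal sketch (variable names are ours):

```python
import random

# Variable-ratio masking prior: draw the number of masked positions
# uniformly from 0..N inclusive, then choose that many distinct indices.
def sample_mask(n_words, rng=random):
    l = rng.randint(0, n_words)             # l ~ Uni(0, N)
    masked = rng.sample(range(n_words), l)  # l indices without replacement
    z = [0] * n_words
    for m in masked:
        z[m] = 1                            # z[m] = 1 marks word m as masked
    return z
```

Unlike fixed-ratio BERT masking, every masking level from 0% to 100% is equally likely, so the model sees both lightly and heavily redacted documents during training.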
For the reidentification model p(y | x, z; θ), we follow the dense retrieval literature and use an embedding-based model (Karpukhin et al., 2020). Specifically, we use an (asymmetric) bi-encoder model on documents and profiles. The document encoder f computes an embedding of the masked document, and the profile encoder g produces an embedding of the profile table corresponding to person y. We score a match by computing the joint encoding f(x, z)^T g(y), the dot product between the vectors output by the two neural networks. Define the matrix of profile embeddings as G = [g(y_1); ...; g(y_|Y|)]. The reidentification probability is then defined as

  p(y | x, z; θ) = softmax(G f(x, z))_y

During training we utilize label smoothing on this distribution, which has also been shown to be useful when training for inference in an argmax setting (Müller et al., 2019).
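The bi-encoder scoring rule reduces to a softmax over dot products. A small numpy sketch with toy embeddings (the real model produces f and g with pretrained transformers; this only shows the scoring step):

```python
import numpy as np

# Given a masked-document embedding f(x, z) and a matrix G whose rows are
# profile embeddings g(y_i), the reidentification distribution is a softmax
# over the dot-product logits G f(x, z).
def reid_distribution(doc_emb, profile_embs):
    scores = profile_embs @ doc_emb   # one logit per candidate profile
    scores -= scores.max()            # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: the document embedding is closest to profile 0.
p = reid_distribution(np.array([1.0, 0.0]),
                      np.array([[1.0, 0.0], [0.0, 1.0]]))
```

With 100,000s of profiles, G is large but fixed during document-encoder updates, which is what makes the coordinate-ascent training scheme below practical.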
To train the model we optimize a lower bound on the identification log-likelihood:

  log p(y | x) ≥ E_{z ∼ p(z | x)} [ log p(y | x, z; θ) ]

Specifically, we sample a word dropout mask z for each element x from the prior, and then mask words during reidentification training.
Note that for training we compute the full distribution and do not use a contrastive approximation. In order to learn the parameters of g we utilize coordinate ascent. Specifically, we fix G and optimize the parameters of f. On odd-numbered epochs we switch and optimize the profile encoder g to predict documents in X (with no masking), and then recompute G. We experiment with all combinations of reidentification models, specified by document-profile encoder pairs: RoBERTa-RoBERTa (RR), RoBERTa-TAPAS (RT), PMLM-RoBERTa (PR), and PMLM-TAPAS (PT). The PT model is the default for NN DeID.

Privacy
Baselines We consider several unsupervised redaction baselines based on lexical matches with the table and word frequencies. Lexical removes from the document all overlapping words that appear in the profile. IDF (Table-Aware) masks all overlapping words that appear in the profile, then masks remaining words in order of descending Inverse Document Frequency (IDF) (rarest word first) until a fixed threshold. We compute IDF based on the full corpus of documents and profiles from the train, validation, and test sets. Named entity removes all named entities from the document.²

Metrics A major challenge is how to evaluate text privacy in the presence of strong reidentification models. As shown in Section 3, information retrieval metrics work well for lightly redacted documents, but fail under heavy masking. We ran preliminary experiments with human subjects, but found that even at seemingly low levels of masking, documents were nearly impossible for humans to reidentify.
Inspired by work on adversarial privacy such as the NeurIPS Hide-and-Seek challenge (Jordon et al., 2021), we adopt a metric that utilizes an ensemble of reidentification models R as a benchmark. A masked document (x, z) is considered reidentified if any of the models can correctly select its profile, i.e. ŷ = arg max_y p_r(y | x, z) for any model r ∈ R. In order to diversify the ensemble we utilize the different pretrained neural models discussed above. We observe that each model can reidentify redactions made by the others with high accuracy, indicating diverse features (more discussion in Section 8.3).

[Footnote 2: We identify named entities using the dslim/bert-base-NER-uncased model available from Hugging Face. Named entities identified are personal names (PER), organization names (ORG), location names (LOC), and miscellaneous names (MISC) (Tjong Kim Sang and De Meulder, 2003).]
We also include a word-matching based IR model in the ensemble, but find that it is not competitive at reidentification. Explicitly, the ensemble consists of the three variant parameterizations (RR, PR, RT) as well as the IR matching model. As a metric of utility, we compute the average percentage of words masked, as well as the information loss percentage, computed as the ratio between the sizes of the original and redacted texts when compressed. For each method and baseline we sweep over mask sizes to compute a curve of reidentifiability and utility.
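The ensemble decision rule itself is simple. A hypothetical sketch, where each model is represented by its list of scores over candidate profiles:

```python
# Adversarial ensemble metric: a redacted document counts as reidentified
# if ANY model in the ensemble ranks the true profile first.
# `ensemble_scores` is a list of per-model score lists over candidates;
# `true_idx` is the index of the document's true profile.
def is_reidentified(ensemble_scores, true_idx):
    return any(
        max(range(len(scores)), key=scores.__getitem__) == true_idx
        for scores in ensemble_scores
    )
```

Because the metric takes a max over models, adding a weak model (like the IR matcher) can only increase measured reidentifiability, never decrease it, which makes the metric a conservative privacy test.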
Inference We generate redactions from the reidentification models using greedy search to find the word to mask that causes the maximum decrease in the probability of the correct prediction. We use search implementations from the TextAttack library (Morris et al., 2020). Search takes a stopping parameter K which indicates the rank cutoff of ŷ at which to stop search, ŷ ∉ K-argmax_y p(y | x, z). We run with different values of K to sweep over levels of privacy and generate redactions with different masking rates.
We ignore stopwords to speed up the search since they will rarely be identifiers.

Results
Table 2 presents results comparing unsupervised deidentification techniques on privacy and utility under the ensemble reidentification metric. As noted above, we see that neither Lexical nor Named entity redaction provides sufficient privacy. NN DeID can provide better privacy while masking fewer words. Both NN DeID and IDF-based approaches can reach stronger levels of privacy (< 5% reidentifiability), but at these levels IDF masks most of the remaining words. At full deidentification under the ensemble, NN DeID masks less than half of the words. When we consider an information loss measure of utility, NN DeID also performs much better than IDF-based deidentification.
Figure 2 expands on these results by showing the Pareto curves for privacy and utility across methods. Curves are obtained by varying the K value used in NN DeID and the threshold for IDF-based deidentification. The curves show that in addition to achieving better utility at very low rates of identifiability, the method also achieves better utility than lexical matching, and a steeper privacy curve even at lower levels of redaction.

Table 3 shows an ablation study of the components added to the model to improve accuracy. An alternative approach to this task is to finetune a pretrained model directly for the reidentification task (baseline). However, we found that out-of-the-box this model was neither effective as a pure reidentification model nor as a model to guide search. We ablate each component added to NN DeID independently, utilizing 1/10th of the training data and profiles, and compare both on the original documents and on documents with 30% of the words masked. Word dropout with the proposed sampling rate improves model accuracy, particularly in the high-mask regime. Interestingly, weighting word dropout frequency using IDF hurts model accuracy in the full regime, and is not included in the final model. Increasing the dual encoder embedding sizes from 768 to 3072 and adding label smoothing both increase model accuracy. Finally, using coordinate ascent to optimize the profile encoder in addition to the document encoder has by far the largest impact on model accuracy. The combination of these approaches gives a deidentification model that is accurate across levels of masking.

Quasi-Identifiers in Redacted Examples
Table 4 shows examples of redacted documents. While the most common redacted entities in deidentified examples are names, dates, and locations, we find notable examples of redacted quasi-identifiers:
• Determiners. Determiners can provide useful information in context. In the first example, the system removes "American" before musician, but also the word "an" which, in this context, signals that the next word may be "American". This example is also interesting in that it preserves the word "Collective", leading the model to predict the musician Avey Tare from the band "Animal Collective".
• Gender markers. The model often redacts words marking gender in order to anonymize documents. In the second example, for the document on Madoko Hisagae, the model removes both the word "She" and "women's". This redaction leads to Hiroki Ichigatani, a male Olympic fencer, as the predicted match.
• Locations. The pretrained model seems to be able to identify relative locations even if they are not represented directly in the profile. In the third example, the profile indicates that Tim Tolkien is an English sculptor. The word "English" is masked immediately, but the location "Cradley Heath, West Midlands" is a quasi-identifier as to the country. Upon redacting this term, the model switches its prediction to Nesbert Mukomberanwa, a sculptor from Zimbabwe.

Redacted Word Types
The IDF (table-aware) model relies on overlapping words and rare words to redact content, whereas the NN DeID model can in theory remove any identifying word. Figure 4 compares the part-of-speech tags of the masked words between the two models at the same redaction level. We see that while similar, the NN DeID model masks fewer nouns, proper nouns, and numbers, and more adjectives and pronouns. These word classes are less likely to fit the IDF or table-matching criterion.

Model Diversity
The ensemble used for deidentification contains three separate pretrained encoder variants. One potential issue is that the model used to deidentify the text may be overly correlated with the ensemble models used for evaluation. However, we find that each model is quite strong at reidentifying redactions made by other models. For example, the RR model can reidentify NN DeID (PT, K=1) with a surprisingly high 60.5% accuracy. In general we find the model rankings are quite different. Figure 3 demonstrates this phenomenon. In this figure, examples are deidentified to K = 8 with a PT parameterization, and we plot a rank-rank joint histogram with an RT parameterization. While there is some correlation in the rankings, the two models produce very different rankings, with RT even fully reidentifying some points.

Reidentification at high levels of masking
Table 5 shows examples of documents where our reidentification ensemble can correctly identify the individual even at extremely high levels of masking. Examples are randomly generated with a minimum of 95% of words masked. Because we permit punctuation in redacted examples, and we mask but do not erase words, models are able to exploit word counting and punctuation-specific features to identify individuals under very high masking rates.

Conclusion
We propose an unsupervised method for text deidentification that focuses on removing quasi-identifiers. The method first learns to reidentify individuals from text utilizing a masking prior. We then utilize search to find a mask that ensures K-anonymity under this model. This approach outperforms masking based on named entities and matching with tabular data, both of which fail to fully anonymize the document. Using an ensemble of reidentification models as a metric, we show that our approach can reach high levels of privacy with moderate levels of redaction. In future work we plan to utilize this approach in conjunction with downstream tasks in order to further demonstrate the utility of the redacted data. We also plan to compare and evaluate with domain-specific approaches for distributing redacted models through manual and automatic redaction.

Limitations
Issues with Wikipedia. Many Wikipedia biographical articles within a given category follow a similar syntactic template, so it is possible that a model could learn to partially reidentify a person by looking at superficial features of the article structure. In the future, documents could be paraphrased during training to prevent the model from learning such syntactic idiosyncrasies. Additionally, since the pre-training data of both RoBERTa and TAPAS includes Wikipedia articles (Liu et al., 2019; Herzig et al., 2020), it is possible that the models can "cheat" on the test set by recalling data that they memorized during their pre-training. We hypothesize that such cheating is unlikely for two reasons. First, articles in WikiBio make up a small percentage of the models' training data, so very little of their information is likely stored in the pre-trained weights. Second, the models' performance on the test set before training is very low (0% test accuracy). Finally, WikiBio contains articles about a very small and biased subset of humanity (Yuan et al., 2021).
Need for a profile. Although the method we propose does not require any labeled data, it does require a new data source in the form of profiles. This means that the information deidentified is limited to what can be captured in the profile. Thus, the work of adapting this method to a new domain shifts from collecting human-labeled PII annotations to collecting as much personal information as possible into profiles. This is much easier in domains like medicine, where a great deal of personal information is known about each patient, but collecting such profiles may not be possible in every scenario.
Number of words as a quasi-identifier. This work focuses on redacting data by replacing words with masks. One unaddressed issue is that even when masked, the presence of a word can still leak information. Consider the following example: "Jack Leswick (January 1, 1910 - August 4, 1934) was a Canadian ice hockey centre for the <mask> <mask> <mask>." Leswick's team, the Chicago Black Hawks, is one of 11 of 32 National Hockey League teams with three words in their name. An adversary can eliminate the possibility that Leswick played for any of the 20 two-name teams. Future work can consider the possibility of deleting words entirely or joining multiple masked words into a single mask token to provide additional privacy.
Hiding in the crowd. K-anonymity exists when an individual cannot be distinguished from K − 1 other individuals in the dataset. This means that for a given individual, all anonymity guarantees in our setting are with respect to the other individuals in the dataset. Therefore, the same document could be deidentified differently depending on which other profiles are in the dataset, even without any changes to the document itself.

Reidentification models may be used as part of linkage attacks, where individuals can be pinpointed even from seemingly anonymized data. Additionally, the world knowledge of today's large language models may be well-suited for this type of linkage attack. We observed this behavior empirically, when our models were uncannily able to reidentify individuals within a dataset of 720,000 identities, even from documents that appeared to have no remaining personal information.
We plan to release our models for deidentifying documents from Wikibio to the general public.We are open to hearing from users how our technology impacts both their lives and the lives of others, positively or negatively.If we receive any reports of misuse of our technology, we will mitigate accordingly.

Figure 1 :
Figure 1: Method overview. A document (x, top-left) paired with a profile (ŷ, top-right) is given to the system. A trained neural reidentification model (p(y | x, z), blue circle) produces a distribution over all possible profiles based on densely encoded representations. At each stage of inference, masks are added to the source document, changing the relative rank under the reidentification model. The method is run until K-anonymity of the reidentification model is achieved. Note that in this example, it is not necessary to remove all information, such as the month and day of birth, since the player is already deidentified.

Figure 3 :
Figure 3: Rank comparison of the true document (ŷ) in two differently parameterized models of p(y | x, z) (RT and PT). Mask z comes from a deidentification (K = 8) on the PT model. While correlated, the two parameterizations produce very different rankings.

Figure 4 :
Figure 4: Percentage of words by part-of-speech tag that are masked by the IDF model and the NN DeID model at K = 8 (similar masking level).

Table 1 :
Percentage of documents reidentified (ReID) for different deidentification methods. Percentage of words masked in parentheses.

We train on the training dataset of 582,659 documents and profiles. During test time, we evaluate only test documents, but consider all 728,321 profiles from the concatenation of the train, validation, and test sets. This dataset represents a natural baseline by providing a range of factual profile information for a large collection of individuals, making it challenging to deidentify. In addition, it provides an openly available collection for comparing models.

Table 2 :
Statistics comparing sets of 1000 documents redacted using different methods at various levels of identifiability. Reidentification rate measures the rate at which at least one model in our neural-network ensemble can retrieve the correct profile for a redacted document. Information loss is measured as the percentage change in the size of the text when compressed.

Table 3 :
Ablation study. Effect of different factors on model ReID accuracy across data with different redaction strategies. Experiments are on the RT parameterization and use 1/10 of the training data and number of profiles.

Table 4 :
Example redactions from the system.

Table 5 :
Examples of redactions where our neural ensemble can correctly reidentify the individual at extremely high levels of document masking, even though the documents were never seen during training.