Preventing Author Profiling through Zero-Shot Multilingual Back-Translation

Documents as short as a single sentence may inadvertently reveal sensitive information about their authors, such as their gender or ethnicity. Style transfer is an effective way of transforming texts to remove information that enables author profiling. However, for a number of current state-of-the-art approaches, the improved privacy is accompanied by an undesirable drop in the downstream utility of the transformed data. In this paper, we propose a simple, zero-shot way to effectively lower the risk of author profiling through multilingual back-translation using off-the-shelf translation models. We compare our approach with five representative text style transfer models on three datasets across different domains. Results from both an automatic and a human evaluation show that our approach achieves the best overall performance while requiring no training data. We are able to lower the adversarial prediction of gender and race by up to 22% while retaining 95% of the original utility on downstream tasks.


Introduction
Data collections of natural language utterances bear the risk of disclosing sensitive information about the recorded participants, including their gender, race, or political preferences. Unlike explicit mentions of private information, such as a user's name or location (Tang et al., 2004; Adelani et al., 2020), such user traits are often encoded rather subtly in a user's speaking or writing style. Nevertheless, they can be predicted with high accuracy by deep learning-based classifiers even when they are not obvious to humans (Elazar and Goldberg, 2018), enabling third parties with access to the data to profile users without their knowledge.
A common method to alleviate this problem is the application of an intermediate transformation step to remove sensitive information via text style transfer. While a number of different style transfer techniques exist (Shen et al., 2017; Fu et al., 2018; Madaan et al., 2020), they require large amounts of text data labeled with user trait information to perform well. Additional annotations need to be provided for every new user trait that the model is expected to handle, multiplying the associated costs and effort. Furthermore, the impact that such transformations can have on the utility of the resulting data is often overlooked. We argue that this privacy-utility trade-off should be at the heart of all research on this topic, because it is fairly easy to improve one of the two but difficult to improve both at the same time.
In this paper, we explore a simple yet effective zero-shot text transformation method based on multilingual back-translation. Back-translation (BT) is an alternative approach without the prerequisite of labeled training data. Sensitive user traits can be significantly obfuscated when text is translated to another language and back (Rabinovich et al., 2017; Prabhumoye et al., 2018), since many concepts cannot easily be mapped across languages. For example, in languages such as Japanese and Korean, the speaker's gender can be inferred from the choice of certain pronouns. When back-translating via an intermediate language that does not make such distinctions, such as English, these gender indicators are largely obfuscated.
Results from extensive experiments show that our simple zero-shot text transformation method has comparable or even better performance than popular style transfer methods, considering both the privacy and utility of the transformed texts. In summary, we make the following contributions:
1. We propose using multilingual back-translation for hiding user traits. We experiment with six high-resource pivot languages: German, Spanish, French, Japanese, Russian, and Chinese. This provides more opportunities to pick a language that can hide sensitive information represented in the original language. Our approach is zero-shot, requiring no additional data to train style transfer models.
2. We show that our simple approach is competitive with style transfer models under automatic metrics, and achieves better performance under human evaluation in terms of content preservation and fluency.
3. We perform a comprehensive evaluation on three datasets with popular style transfer methods. These methods have been well studied in the style transfer community, but they have never been evaluated for both privacy and utility preservation in downstream tasks.

Related Work
Attribute information such as gender, age, or race is captured by deep learning models. Traditional approaches prevented this information leakage via lexical substitution of sensitive words (Reddy and Knight, 2016). In recent years, many text style transfer techniques have been proposed to control certain attributes of generated text (e.g., formality or politeness) while preserving the content. A common paradigm is to disentangle content and style in the latent space (Shen et al., 2017; John et al., 2019; Cheng et al., 2020). Another stream of work treats text style transfer as an analogy of unsupervised machine translation (Zhang et al., 2018; Lample et al., 2019; Zhao et al., 2019; He et al., 2020) to rephrase a sentence while reducing its stylistic properties (Prabhumoye et al., 2018). Beyond end-to-end training methods, prototype-based text editing approaches have also attracted a lot of attention (Li et al., 2018; Sudhakar et al., 2019; Madaan et al., 2020), in which attribute markers of input sentences are deleted and then replaced by target attribute markers. These techniques have been well studied in the text style transfer community, but have never been evaluated for both privacy and utility preservation in downstream tasks. Shetty et al. (2018) obfuscate author attributes through adversarial training of sequence-to-sequence models.
Many NLP applications classify user inputs for properties like sentiment, intent, and dialogue acts, but there is a need to preserve user privacy. We consider a scenario where an adversary attempts to predict demographic attributes of user utterances using a pre-trained attribute classification model. We assume that the adversary already has such a model, trained on publicly available data. Our goal is to transform the original user input text X into X' such that X' (1) prevents the accurate prediction of user attributes, (2) maintains the utility of downstream NLP tasks, (3) maintains the content of X, and (4) is itself fluent text.
In this paper, we explore a simple, zero-shot text transformation method through multilingual back-translation. Our assumption is that, as also supported by previous research (Rabinovich et al., 2017; Prabhumoye et al., 2018), text styles can be significantly obfuscated when text is translated to another language (the pivot language) and then translated back. One example is shown in Table 1. The word "papi" is typically used among Latino Americans, which exposes their race. When translated to a language like Chinese and back, it becomes the standard form "dad", thereby protecting the user's privacy. Specifically, we define our text transformation function as

X' = T_{L→EN}(T_{EN→L}(X)),

where L is the pivot language and T_{A→B} is a translation model from language A to language B. We make use of mBART50 1, an off-the-shelf machine translation model implemented by HuggingFace (Wolf et al., 2020). We consider six high-resource languages as pivots, so as to ensure a decent quality of the machine translation models. The languages chosen are German (DE), Spanish (ES), French (FR), Japanese (JA), Russian (RU), and Chinese (ZH), based on the large amount of resources they have in OPUS (Tiedemann, 2012) and the Common Crawl corpora 2.
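The round trip can be sketched in a few lines. The word-level toy "translator" below is only a stand-in (in practice one would plug in a real model such as mBART50), but it makes the EN→L→EN pipeline and the normalization of a marker like "papi" concrete; all function names and the toy lexicon are illustrative, not part of our implementation.

```python
def back_translate(text, pivot, translate):
    """X' = T_{L->en}(T_{en->L}(X)): translate into the pivot language
    and back. `translate` is any callable (src, tgt, text) -> text;
    in practice it would wrap an off-the-shelf translation model."""
    pivoted = translate("en", pivot, text)
    return translate(pivot, "en", pivoted)

# Toy lexicon: both "papi" and "dad" map to the same Chinese word, so the
# round trip collapses the marked variant into the standard form "dad".
_LEX = {("en", "zh"): {"papi": "爸爸", "dad": "爸爸"},
        ("zh", "en"): {"爸爸": "dad"}}

def toy_translate(src, tgt, text):
    table = _LEX[(src, tgt)]
    return " ".join(table.get(word, word) for word in text.split())

print(back_translate("thanks papi", "zh", toy_translate))  # -> thanks dad
```

With a real translation model, the same `back_translate` wrapper applies unchanged; only the `translate` callable is swapped out.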

Datasets
In this paper, we conduct experiments on three datasets: DIAL (Blodgett et al., 2016), VerbMobil (Weilhammer et al., 2002), and Yelp (Reddy and Knight, 2016; Shen et al., 2017). These datasets cover a variety of domains with either race or gender as the sensitive attribute, and they also have annotations for dialog acts and sentiment classification that we use to test the utility of downstream NLP tasks. For Yelp, we find two datasets previously used in the style transfer literature, one for gender (YelpGender) (Reddy and Knight, 2016) and the other for sentiment (YelpSentiment) (Shen et al., 2017). The texts are from the same source, but no single review has both gender and sentiment labels. By automatically comparing each review in the test set of YelpGender with the YelpSentiment Dev and Test sets, we created a new Dev set and Test set of 4K reviews each, with both gender and sentiment information. This can be used in future research to evaluate the utility of the YelpGender dataset. The dataset is available on Github 3. Table 2 shows the data splits for the three datasets: Attribute Train, the training set for attribute classification; Utility Train, the training set for a downstream NLP task; Style Train, the training set for style transfer; Dev, the development set; and the Test set. The detailed data description is in Appendix A.
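The construction of the combined Yelp Dev/Test sets can be sketched as follows. Exact string matching between the two corpora is an assumption (the paper only says the reviews were compared automatically), and the field names are hypothetical.

```python
def merge_yelp_splits(gender_reviews, sentiment_reviews):
    """Keep reviews that occur in both corpora, carrying over both labels.

    Each input is a list of dicts; the keys ("text", "gender", "sentiment")
    are illustrative names, not the datasets' actual schema.
    """
    sentiment_by_text = {r["text"]: r["sentiment"] for r in sentiment_reviews}
    merged = []
    for r in gender_reviews:
        if r["text"] in sentiment_by_text:
            merged.append({"text": r["text"],
                           "gender": r["gender"],
                           "sentiment": sentiment_by_text[r["text"]]})
    return merged
```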

Experimental Setup
We train five popular style transfer methods as baselines: 1) CAE (Shen et al., 2017), 2) BST, 3) UNMT, 4) DLS, and 5) Tag&Gen. We evaluate all models on transfer attribute strength (Attr), content preservation (METEOR), and fluency, measured as the grammaticality acceptance rate (GAR) of a classifier trained on CoLA (Warstadt et al., 2019). Lastly, we introduce a new task, Utility (Util), to measure the performance of the transformed texts on an available downstream NLP task. Further details are in Appendix B. To measure the overall performance across all tasks, we compute an average of all the metrics (P_Mean). For transfer attribute strength, we subtract attribute F1 from 100, i.e., (100 − Attr), because this value decreases while the others increase. We provide more details in Appendix B.

Sample transformations of a Yelp sentence by BT and style transfer models:
Original   this hotel seems to be very poorly run.
BT (DE)    The hotel seems to be very poorly operated.
BT (JA)    This hotel seems to be very poorly managed.
BT (ZH)    This hotel looks terrible.
CAE        this place is definitely very good.
BST        this hotel seems poorly run.
UNMT       this hotel seems to be very clean.
DLS        i was n't very impressed with this place.
Tag&Gen    this hotel seems to be gorgeous run.

Results
Automatic Evaluation We compare the performance of the style transfer models and back-translation models in terms of attribute F1, utility F1, METEOR, and GAR on three datasets (DIAL, VerbMobil, and Yelp). Table 3 shows the performance on the DIAL dataset. We observe a reduction of 7−22% in attribute F1 from simple back-translation, with Chinese (ZH) preserving the most privacy while maintaining 95% of the original utility and the highest fluency score (81%). German (DE) has a better METEOR score and utility on average, but sacrifices a lot of privacy. The BT (ZH) model has similar or better performance than the adversarial training and SMDSP approaches proposed by Xu et al. (2019) in privacy preservation, utility, and content preservation. However, we find that style transfer methods preserve privacy much better than BT models, with a 45−75% reduction in attribute F1, but they sacrifice a lot in terms of utility, content preservation (< 30 METEOR, except Tag&Gen), and fluency (< 45% GAR), making them impractical for real-life applications. Table 4 shows the results on the VerbMobil dataset. The BT models lead to a reduction of 3.5−9.7% in attribute F1 while maintaining over 86% of the original utility F1. We also find that they achieve better performance in METEOR and GAR, even though the models are applied in zero-shot settings. The style transfer models performed poorly, since they typically require massive amounts of data (Li et al., 2019) and can be skewed in a data-scarce scenario (5K sentences for VerbMobil). One particular strength of our approach is that it requires no additional data and is thus most suited for zero-shot settings.
We also examined the performance of the BT models on the Yelp dataset. The style transfer models preserve more gender privacy (19−54% reduction) than the BT models (5−16%). However, they have much worse results in terms of utility and fluency. Overall, the P_Mean of the BT models is often better than that of the style transfer models across all datasets.

Human Evaluation
We further performed a human evaluation of the two best privacy-preserving BT models (ZH and JA) and style transfer models (DLS and Tag&Gen) in terms of content preservation and fluency. We recruited three raters, volunteers from our research lab (including authors of the paper), to evaluate the models. The three volunteers rated 100 sentences per model, i.e., 400 sentences per rater. The volunteers were not paid for the rating and were informed that they could, in principle, withdraw from the annotation without consequences. We provide the annotation guideline on Github 5. Table 5 shows the average rating by the three raters, all proficient speakers of English, on 100 sentences from the Yelp dataset. We find that ZH and JA are rated much higher in content preservation, over 4 on a 1−5 Likert scale, while maintaining near-perfect fluency (4.7). The inter-rater agreement (Krippendorff's α) is 0.69 for both content preservation and fluency. On the other hand, DLS and Tag&Gen are rated lower on both evaluation tasks, even though Tag&Gen preserves more privacy on Yelp according to Table 4. Table 6 shows an example sentence confirming the content preservation and fluency of our approach. We provide more examples in Appendix C.

Conclusion
In this paper, we propose a zero-shot way to effectively lower the risk of author profiling through multilingual BT using off-the-shelf translation models. We compare our approach with different style transfer models, achieving the best overall performance in both an automatic and a human evaluation while requiring no additional training data. In the future, we will (1) analyze how the choice of pivot language and the translation quality affect privacy preservation in BT, (2) investigate other ways to aggregate the four evaluation metrics corresponding to transfer attribute strength, content preservation, fluency, and utility, and (3) extend the zero-shot BT method with some supervision to improve privacy.
We highlight a few limitations of our work. First, the back-translation transformation removes stylistic content but does not necessarily replace attribute markers the way style transfer models do; for example, given the text "me and my husband ...", style transfer models are more likely to change "husband" to "wife", whereas back-translation will not. Second, our back-translation technique also inherits some of the problems of machine-translated text, such as hallucination (Raunak et al., 2021). We provide examples highlighting these issues in Appendix C.

Broader Impact Statement and Ethics
This paper presents an approach to prevent author profiling of sensitive user attributes. We understand that there are many ethical concerns around gender and race; however, our definition and evaluation of user traits are constrained by the datasets available in the literature. We did not collect any new data to show the strength of our approach. We hope our research helps protect under-represented groups and communities from profiling.

A Data Description
In this paper, we conduct experiments on three datasets (DIAL, VerbMobil, and Yelp) from the Twitter social media, dialog conversation, and business review domains. Each dataset has either race or gender as the sensitive information, and sentiment classification or dialog act classification as the downstream NLP task used to measure utility. Table 2 shows the datasets and their splits: Attribute Train, the training corpus for the attribute classifier; Utility Train, the training corpus for an NLP task; Style Train, the training corpus for style transfer models; Dev, the development set; and the Test set.
DIAL was created by Blodgett et al. (2016) for dialectal classification of tweets into African American English (AAE) and Standard American English (SAE); each tweet is assigned predicted race information (AA or White) and a sentiment label (pos/neg). We make use of the subset of tweets (Elazar and Goldberg, 2018) with over 80% confidence in the race prediction. The final dataset has 180K tweets (90K each for the AA and White races); 80K of the tweets are used for training the attribute classifier, while the remaining 100K are used for training the sentiment classifier and the style transfer models.
The VerbMobil corpus (Weilhammer et al., 2002) is a dialog corpus of human-to-human telephone conversations about scheduling appointments. The English VerbMobil has over 10K utterances, of which only 6,538 have gender information and 6,093 have dialog act (DA) information. We use the 1,096 utterances with both gender and DA information as the test set, and the rest for training and Dev. We used the same training set for the attribute classification and style transfer models due to the limited data.
The Yelp review corpus created by Reddy and Knight (2016) has gender annotations (male and female); we combined this dataset with another Yelp review corpus (Shen et al., 2017) that has only sentiment annotations. By automatically comparing the reviews in the two datasets, we created a Dev and a Test set of 4K reviews each, with both gender and sentiment information. This can be used in future research to evaluate the utility of the Yelp gender dataset.

B Evaluation tasks and Metrics
Style transfer models are usually evaluated on three tasks: Transfer style (or attribute) strength, content preservation, and fluency (Jin et al., 2021).
1. Transfer attribute strength (Attr): For a binary attribute, the goal is to generate a sentence with attribute 1 given an initial sentence with attribute 0. We measure the success of the transfer by the drop in attribute F1-score on the transformed test set.

2. Content preservation (METEOR): This is measured using automatic metrics like BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). We choose METEOR because it correlates better with human judgments than the commonly used BLEU. It also takes word stems, synonyms, and paraphrases into account when computing the score, leading to better recall, and it has recently been widely adopted by the style transfer community.
3. Fluency (GAR): measures grammaticality. In most cases, this is measured using perplexity on the transformed set. However, Krishna et al. (2020) proposed computing a grammaticality score from a classifier trained on the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) instead, because perplexity is unbounded and unnatural sentences made of common words may still have low perplexity. We compute the grammaticality acceptance rate (GAR) using available fine-tuned models 6.
4. Utility (Util): We introduce a new task to measure the performance of the transformed texts on an available downstream NLP task. For example, the widely used DIAL dataset can also be evaluated for sentiment classification (Xu et al., 2019). Here, we also use the F1-score.
To measure the overall performance across all tasks, we compute an average of all the metrics (P_Mean), because all the metrics range from 0 to 100. For the transfer strength, we use (100 − Attr) since this value decreases while the others increase. Specifically, we compute:

P_Mean = ((100 − Attr) + Util + METEOR + GAR) / 4

Table 9: VerbMobil: Sample sentences for BT and style transfer models
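The fluency and aggregate scores above reduce to simple arithmetic, as the sketch below shows. The binary acceptability labels are assumed to come from a CoLA-trained classifier (not loaded here), and the function names are illustrative.

```python
def grammaticality_acceptance_rate(labels):
    """GAR: percentage of transformed sentences judged grammatically
    acceptable (label 1) by a CoLA-trained classifier."""
    return 100.0 * sum(labels) / len(labels)

def p_mean(attr_f1, util_f1, meteor, gar):
    """Overall score: attribute F1 is inverted, since lower adversarial
    F1 means better privacy, so all four terms point the same way."""
    return ((100 - attr_f1) + util_f1 + meteor + gar) / 4

# 3 of 4 transformed sentences judged acceptable -> GAR = 75.0
gar = grammaticality_acceptance_rate([1, 1, 0, 1])
print(gar)
# Attr=60, Util=70, METEOR=30 with that GAR -> (40 + 70 + 30 + 75) / 4
print(p_mean(60, 70, 30, gar))
```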