The Effect of Round-Trip Translation on Fairness in Sentiment Analysis

Sentiment analysis systems have been shown to exhibit sensitivity to protected attributes. Round-trip translation, in contrast, has been shown to normalize text. We explore the impact of round-trip translation on the demographic parity of sentiment classifiers and show that round-trip translation consistently improves classification fairness at test time (reducing between-group gaps by up to 47%). We also explore the idea of retraining sentiment classifiers on round-trip-translated data.


Introduction
It is both unethical and potentially illegal for document classification algorithms to perform significantly better for some groups in society than for others (Mehrabi et al., 2019). Many document classification technologies have, however, been shown to be sensitive to protected attributes such as gender and age (Mehrabi et al., 2019; Delobelle et al., 2020; Ferrer et al., 2020; Koh et al., 2021). This also holds for sentiment analysis (Hovy, 2015; Kiritchenko and Mohammad, 2018; Bhaskaran and Bhallamudi, 2019; Touileb et al., 2020). At the same time, it is known that round-trip machine translation (Huang, 1990; Federmann et al., 2019) can be used to normalize text (Ling et al., 2013; Rabinovich et al., 2017; Prabhumoye et al., 2018), which could potentially remove group-specific deviations from normal language. However, Stanovsky et al. (2019) found that machine translation is prone to introducing gender bias. Taken together, this leaves open the question of whether round-trip translation can be used to reduce the sensitivity of document classifiers to protected attributes of their authors.
In this paper, we evaluate the effect of round-trip translation on fairness using a representative corpus of Danish Trustpilot reviews, in which reviews are associated with self-reported protected attributes (gender and age). We evaluate this effect across nine different document classification architectures, both in the setting in which round-trip translation happens at test time only, and in the setting in which both training and test data are translated to a foreign language and back.

* Equal contributions.

Text:        Sally is a whiz at math.
Spanish:     Sally es una experta en matemáticas.
Round-trip:  Sally is a math expert.

Table 1: A toy example of round-trip translation through Spanish.
Contributions We evaluate round-trip translation as a technique for mitigating sensitivity to protected attributes across two attributes and three document classification architectures. We find that round-trip translation at test time consistently reduces the fairness gap (by up to 47%), but that for our best models (SVMs stacked on BERT representations), the effect disappears when both training and test data are translated into a foreign language and back.

Round-trip Translation
Round-trip machine translation (Huang, 1990; Federmann et al., 2019) is the process of machine-translating a document into another language and then back into the original language. Table 1 shows a toy example of this process. Ling et al. (2013) found that machine translations of human Chinese translations of English tweets, back into English, had a tendency to normalize the original text. Rabinovich et al. (2017) observed a similar normalization effect in machine translation systems, and building on these observations, Prabhumoye et al. (2018) used back-translation to obscure stylistic properties of text.

Experiments
Fairness metric The fairness literature is rich with definitions of fairness (Mehrabi et al., 2019), most of which are interpretations of the Rawlsian notion of fairness as equal opportunity (Rawls, 1971). In this work, we adopt the following definition of fairness: if the maximal difference in empirical risk between any two groups in G is ε, we say θ is ε-fair. Below we use the F1-score for the negative class as our empirical risk measure. Note that fairness as equal risk is a generalization of approximately equal conditional risk (Donini et al., 2018).
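Under this definition, the fairness gap ε is simply the largest pairwise difference in per-group risk, which reduces to the range of the group scores. A minimal sketch, assuming per-group scores have already been computed:

```python
def fairness_gap(group_scores: dict) -> float:
    """Return epsilon: the maximal difference in empirical risk
    (e.g. per-group negative-class F1) between any two groups."""
    scores = list(group_scores.values())
    # The max pairwise absolute difference equals max - min.
    return max(scores) - min(scores)
```

A model is then ε-fair for ε equal to this gap; a perfectly fair model has a gap of zero.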
Data The Danish section of the Trustpilot Corpus consists of 149,240 reviews annotated with sentiment ratings and self-reported gender and age. The sentiment ratings are provided on a scale from 1 to 5, which we binarize by mapping low ratings to the negative class, i.e., {1, 2, 3} → 0, and high ratings to the positive class, i.e., {4, 5} → 1. This leads to a highly skewed distribution of 8,257 negative reviews and 140,983 positive ones, which also motivates the use of the F1-score for the negative class as our performance metric.
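The binarization above is a simple threshold on the star rating:

```python
def binarize(rating: int) -> int:
    """Map 1-5 star ratings to binary sentiment: {1, 2, 3} -> 0, {4, 5} -> 1."""
    return 1 if rating >= 4 else 0
```

Treating 3-star reviews as negative is what produces the heavy skew toward the positive class reported above.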
We randomly split the data set into training and test sets, leaving 75% of the reviews as training data and 25% as test data. The test set is further split into six demographic groups according to self-reported gender and age (binned into three equal groups), as presented in Table 2. We use these six roughly equal-sized groups to evaluate the fairness of our models.
Impact of round-trip translation We use KL-divergence to get a first impression of the extent to which round-trip translation normalizes our data. This is done across the 1,000 most frequent words in the Trustpilot corpus. For each group, we calculate the probability distribution over these words and compute its KL-divergence to the overall distribution, both before and after round-trip translation. Table 4 lists the KL-divergences. As expected, we observe a significant decrease in KL-divergence for all groups after round-trip translation, indicating that the reviews were indeed normalized in the process. We also see that the number of unique words dropped by 36% after round-trip translation.
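This per-group divergence check can be sketched as follows. The add-one smoothing is an assumption on our part (the paper does not specify how zero counts are handled), introduced so the divergence stays finite over the restricted vocabulary:

```python
from collections import Counter
import math

def word_distribution(tokens, vocab):
    """Probability distribution of `tokens` restricted to `vocab`,
    with add-one smoothing so every vocabulary word has nonzero mass."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p: dict, q: dict) -> float:
    """KL(P || Q) over a shared vocabulary; assumes q[w] > 0 for all w."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)
```

For each group, one would compute `kl_divergence(group_dist, overall_dist)` on the original reviews and again on their round-trip-translated versions, then compare.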

Document classifiers
Our document classifiers all rely on vector representations from pretrained language models. We use two different pretrained language models, namely the multilingual LASER model (Artetxe and Schwenk, 2019) and a monolingual BERT (Devlin et al., 2019) trained for Danish. On top of these we train several classifiers, including nearest neighbor, logistic regression, and (Gaussian kernel) support vector machines (SVMs). We set regularization parameters through grid search and cross-validation over the training data, but also report results for unregularized logistic regression and SVMs. See Table 3 for hyper-parameters and results.
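The strongest configuration described above, an RBF-kernel SVM stacked on sentence embeddings with grid-searched regularization, can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' exact setup; the embedding matrix `X` and labels `y` are assumed to be precomputed upstream (e.g. LASER or Danish BERT sentence vectors and binarized ratings):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_svm(X, y):
    """Grid-search the regularization of a Gaussian-kernel SVM
    via cross-validation, as in the classifier-stacking setup."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),                     # Gaussian kernel
        param_grid={"C": [0.1, 1, 10],         # regularization strengths
                    "gamma": ["scale", 0.01]}, # kernel widths (values assumed)
        scoring="f1",
        cv=3,
    )
    return grid.fit(X, y).best_estimator_
```

The same stacking applies unchanged to nearest-neighbor and logistic-regression heads; only the estimator and parameter grid differ.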

Results
In Table 3, we report results for three scenarios: (a) a baseline condition in which classifiers are trained and evaluated on Trustpilot reviews; (b) a scenario in which reviews are round-trip-translated at test time for normalization; and (c) a condition in which the classifiers are retrained on round-trip-translated reviews and evaluated on round-trip-translated reviews.

Table 3: We use the F1-score of the positive class as our performance metric. We make several observations: (a) Round-trip translation at test time consistently reduces the fairness gap, by up to 47%. (b) Round-trip translation of training and test data reduces the fairness gap for LASER models, but widens it for BERT models. (c) Generally, LASER models seem less fair than BERT models, and unregularized models seem more fair than regularized ones. The latter observation aligns with previous work indicating that sparseness is at odds with robustness and fairness (Globerson and Roweis, 2006; Søgaard, 2013; Khani and Liang, 2021).

Group   KLD REVIEWS   KLD ROUND-TRIP
1       0.027         0.021
2       0.028         0.023
3       0.011         0.009
4       0.023         0.021
5       0.028         0.022
6       0.027         0.020

Table 4: The KL-divergence between the probability distribution of the 1,000 most frequent words in each group and the general distribution, before and after round-trip translation. Round-trip translation reduces group-level divergences.
Test-time normalization with round-trip translation (b) has an overall positive effect on cross-group generalization, reducing the fairness gap by up to ∼27%. The third scenario (c), i.e., using round-trip translation to normalize both the training and the test data, yields mixed results, with fairness gap increases of up to ∼39% (for BERT models) and decreases of up to ∼47% (for LASER models). Machine translation introduces its own biases, and some representations may be more sensitive to such biases. Note also that round-trip translating the data consistently reduces the overall accuracy of our document classifiers, suggesting a trade-off between fairness and accuracy.

Discussion
Round-trip translation is a simple technique for test-time input normalization, and we have shown that it can significantly reduce sensitivity to protected attributes at a low performance cost. One advantage of round-trip translation is that it does not require annotation of protected attributes; such datasets are generally only available for English at this point. High-quality machine translation, in contrast, is available for hundreds of languages. In this paper, we experiment with using round-trip translation to reduce group disparity of sentiment classifiers for Danish.
It is important to note, however, that the overall performance drop that results from round-trip translation, while relatively small, means that absolute performance on minority groups drops. In other words, all users experience worse performance with the more fair sentiment classifiers. This is unfortunate and potentially introduces an ethical dilemma. In fact, it is only with our LASER models that minority group performance improves while the fairness gap is reduced.
Round-trip translation is orthogonal to other approaches to improving fairness, such as distributionally robust optimization (Sagawa et al., 2020), invariant risk minimization (Arjovsky et al., 2020), and adversarial training (Dayanik and Padó, 2021). Round-trip translation can thus easily be combined with any of these approaches, but note that these approaches require annotation of protected attributes. Round-trip translation does not and can thus be considered an unsupervised approach to reducing group disparities.
The fairness gap was most consistently reduced by test-time round-trip translation, but round-trip translation may be more effective with other machine translation systems. In our experiments, Google Translate introduced new biases when relying on BERT representations, but the approach was successful for document classifiers based on LASER representations: here, we saw both reductions of the fairness gap and improvements for minority groups. For 2/4 classifiers, we even saw improvements for the majority groups.

Conclusion
Sentiment classifiers perform better on reviews written by some demographic groups than by others, with groups defined by protected attributes such as gender and age. We present a first experiment with round-trip translation as a means of reducing this fairness gap in sentiment classification. Specifically, we show that translating Danish product reviews into English and back reduces group disparity across three different classification architectures. While the performance cost may in our case be prohibitive for some architectures, we believe that round-trip translation can be an important technique for improving the fairness of document classifiers in the future, as it is easier to scale to new tasks and languages than approaches that require annotation of protected attributes.

Ethics Statement
The gender and age information in the Trustpilot Corpus is self-reported, and all reviewers were free not to report this information. All reviewers who supplied gender information identified as either male or female, although they were free to report other genders.