BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation

Automatic evaluation metrics are crucial to the development of generative systems. In recent years, pre-trained language model (PLM) based metrics, such as BERTScore, have been commonly adopted in various generation tasks. However, it has been demonstrated that PLMs encode a range of stereotypical societal biases, raising concerns about the fairness of PLMs as metrics. To that end, this work presents the first systematic study of social bias in PLM-based metrics. We demonstrate that popular PLM-based metrics exhibit significantly higher social bias than traditional metrics on 6 sensitive attributes, namely race, gender, religion, physical appearance, age, and socioeconomic status. In-depth analysis suggests that the choice of modeling paradigm (matching, regression, or generation) has a greater impact on the fairness of a metric than the choice of PLM. In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.


Introduction
In text generation tasks such as machine translation, text summarization, and caption generation, automatic evaluation metrics are widely adopted for model selection. Typically, the goal of a metric is to evaluate the semantic equivalence between system-generated texts and golden references. Traditional metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are usually based on n-gram matching, regardless of semantic similarity. In recent years, metrics built on pre-trained language models (PLMs) (Devlin et al., 2019; Lan et al., 2020; Yang et al., 2019; Raffel et al., 2020; Qiu et al., 2020) have emerged. In contrast to traditional metrics that merely consider surface-form similarity, PLM-based metrics such as BERTScore (Zhang et al., 2020) and BARTScore (Yuan et al., 2021) can well capture the semantic similarity between system outputs and references, and therefore achieve higher correlation with human judgements. Currently, PLM-based metrics have been widely adopted by researchers and developers in a variety of text generation tasks. Although these metrics have been well studied from many perspectives, such as robustness (Hanna and Bojar, 2021) and efficiency (Pu et al., 2021; Eddine et al., 2021), their fairness has not yet been investigated.
The fairness of text generation metrics has a crucial impact on developing generative systems. If a metric is biased against some sensitive attribute (e.g., gender), generative models that express such bias will be rewarded and selected. The texts generated by these biased models may then be incorporated in the corpus, further reinforcing the social bias in data. This impact of metric bias is illustrated in Figure 1. In contrast to traditional metrics, PLM-based metrics are more likely to carry bias. Recent work has shown that modern PLMs encode unfair stereotypical biases such as racial, gender, or religious bias (Kurita et al., 2019; Webster et al., 2020; Dev et al., 2020; Nangia et al., 2020; Barikeri et al., 2021; Kaneko and Bollegala, 2021). Hence, a natural concern arises: to what extent do these PLM-based metrics carry social bias?
In this work, we present the first systematic study of social bias in PLM-based metrics for text generation. Most existing metrics measure the quality of model-generated candidate texts by comparing them with human-annotated references. Ideally, a fair metric should assign a set of candidates the same score if the only difference between them is a few words indicating some sensitive attribute (e.g., gender). To evaluate whether and to what extent existing metrics hold such a property, we construct datasets for 6 sensitive attributes, i.e., race, gender, religion, physical appearance, age, and socioeconomic status. Each dataset consists of paired examples. In each pair of examples, denoted as ⟨(sys 1 , ref), (sys 2 , ref)⟩, one contains a candidate that demonstrates a stereotype (e.g., sys 1 ) and the other contains a candidate that violates the stereotype (e.g., sys 2 ). The reference, which does not carry any stereotype, is shared by the pair. Some examples used to measure gender bias are listed in Table 1, where we observe that all the considered PLM-based metrics exhibit significant bias. Further, we conduct in-depth analysis and find that:
• PLM-based metrics are generally more stereotyped than traditional n-gram-based metrics on all sensitive attributes.
• The choice of modeling paradigm (Yuan et al., 2021) (matching, regression, or generation) for PLM-based metrics has a greater impact on fairness than the choice of PLM.
• Replacing the backbone of PLM-based metrics with lightweight PLMs or debiased PLMs helps to reduce bias.
• For generation-based metrics, the modeling direction (ref → sys or sys → ref) matters a lot for fairness.
In addition, we explore mitigating social bias in PLM-based metrics by training debiasing adapters (Houlsby et al., 2019) attached to the PLMs. Without touching the parameters of the PLMs, our approach significantly reduces bias while maintaining high performance for evaluating text generation.

Measuring Social Bias in PLM-based Metrics for Text Generation

Considered Text Generation Metrics
Typically, the quality of system-generated texts is evaluated against human-annotated references. Given a reference ref = ⟨r 1 , . . ., r m ⟩ and a candidate sys = ⟨s 1 , . . ., s n ⟩ generated by the system, an automatic text generation metric is a function f (ref, sys) ∈ R that scores the candidate. A well-designed metric is expected to have a high correlation with human judgements.

Traditional n-gram-based Metrics
Traditional text generation metrics usually rely on n-gram matching. In this work, we consider five traditional metrics for comparison: (1) BLEU (Papineni et al., 2002), the most widely used metric for machine translation; we use the geometrically averaged BLEU score over n = 1, 2, 3, 4.
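As a reference point for this traditional baseline, a minimal sketch of geometrically averaged BLEU over n = 1..4 might look as follows. This is a toy implementation that omits the brevity penalty and smoothing of standard BLEU; production code should use an established implementation such as sacrebleu.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, sys, n):
    # Fraction of system n-grams that also occur in the reference,
    # with counts clipped to the reference counts.
    ref_counts = Counter(ngrams(ref, n))
    sys_counts = Counter(ngrams(sys, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in sys_counts.items())
    total = max(sum(sys_counts.values()), 1)
    return overlap / total

def bleu4(ref, sys, eps=1e-9):
    # Geometric mean of modified n-gram precisions for n = 1..4
    # (brevity penalty omitted to keep the sketch short).
    precisions = [modified_precision(ref, sys, n) for n in range(1, 5)]
    return exp(sum(log(p + eps) for p in precisions) / 4)
```

The geometric mean means a zero precision at any order drives the score toward zero, which is why real implementations add smoothing.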

PLM-based Metrics
For PLM-based metrics, we evaluate three paradigms of methods that formulate f (ref, sys) as different tasks, i.e., matching, regression, and generation. We summarize the formulations and the possible social biases in these PLM-based metrics in Table 2.
Matching-based Metrics. Matching-based metrics compute the semantic similarity of the reference and the candidate using token-to-token matching based on the features extracted by PLMs. We choose BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019b) for fairness evaluation.
As recommended, we use the F-score as the measurement of text quality. Since the PLMs are used in an unsupervised fashion, there are two possible kinds of bias in matching-based metrics: (1) intrinsic bias encoded in the PLMs, and (2) extrinsic bias incorporated by the computation of similarity.
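To make the matching paradigm concrete, here is a minimal BERTScore-style sketch over toy token embeddings. The greedy matching and F-score combination follow the description above; the embedding matrices are stand-ins for features a PLM would extract, and real BERTScore additionally applies IDF weighting and baseline rescaling, which are omitted here.

```python
import numpy as np

def cosine_matrix(A, B):
    # Pairwise cosine similarity between row vectors of A and B.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def greedy_match_f(ref_emb, sys_emb):
    # BERTScore-style greedy matching over token embeddings:
    # precision matches each system token to its best reference token,
    # recall matches each reference token to its best system token.
    sim = cosine_matrix(sys_emb, ref_emb)   # (n_sys, n_ref)
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    return 2 * precision * recall / (precision + recall)
```

With identical token sets the greedy match is perfect and the F-score is 1, regardless of token order, which is exactly what distinguishes this family from surface n-gram matching.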
Regression-based Metrics. Regression-based metrics add a regression layer on top of PLMs and are trained to predict human ratings. We choose BLEURT (Sellam et al., 2020) for fairness evaluation. In addition to intrinsic bias encoded in PLMs, regression-based metrics also incorporate extrinsic bias from the training data during supervised fine-tuning. For BLEURT, bias in the synthetic pre-training data may also be incorporated.
Generation-based Metrics. Generation-based metrics score a candidate with its factorized probability conditioned on the reference, and/or vice versa. Such conditional probability is computed using pre-trained sequence-to-sequence models such as BART (Lewis et al., 2020). We choose PRISM (Thompson and Post, 2020) and BARTScore (Yuan et al., 2021) for evaluating fairness. Following the definition of Yuan et al. (2021), we compute the probability of the candidate conditioned on the reference, p(sys|ref), as precision, and vice versa, p(ref|sys), as recall. The F-score is computed as the arithmetic average of precision and recall. For PRISM, which is trained with a paraphrasing task, bias can be incorporated during training on the paraphrasing data. For BARTScore, which directly uses off-the-shelf BART to obtain the conditional probability, the only bias it may carry is the intrinsic bias encoded in BART.
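The precision/recall/F-score combination for generation-based metrics can be sketched independently of any particular seq2seq model. Here `logp_fn` is a hypothetical stand-in for the model's length-normalized conditional log-probability, not an actual BART or PRISM API:

```python
def generation_scores(logp_fn, ref, sys):
    """Generation-based metric sketch (BARTScore-style).

    logp_fn(tgt, src) is assumed to return the length-normalized
    log-probability of `tgt` conditioned on `src` under some
    seq2seq model (a stand-in here, not a real model call).
    """
    precision = logp_fn(sys, ref)        # p(sys | ref): ref -> sys direction
    recall = logp_fn(ref, sys)           # p(ref | sys): sys -> ref direction
    f_score = (precision + recall) / 2   # arithmetic average, per Yuan et al.
    return precision, recall, f_score
```

Note that precision and recall score the two generation directions separately, which is why the modeling direction can matter for fairness, as observed in the analysis.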

Fairness Evaluation
In our evaluation, we consider six sensitive attributes, i.e., race, gender, religion, physical appearance, age, and socioeconomic status. For each sensitive attribute, there are several protected groups. For example, the protected groups could be {female, male, non-binary} for the sensitive attribute gender. Each protected group can be expressed by some identity words. For example, the identity words of female could be {woman, girl, female} or some typical female names. To evaluate social bias in text generation metrics, we construct a pair of candidates sys 1 , sys 2 and a reference such that we can obtain a pair of inputs, (sys 1 , ref) and (sys 2 , ref). The two candidates sys 1 and sys 2 are minimally distant: the only difference is the identity words they use. One of the two candidates uses the identity words for the protected group in a way that demonstrates a stereotype, and the other uses the identity words for another protected group in a way that demonstrates an anti-stereotype.
The reference does not carry any stereotypes. Ideally, a fair metric should give identical scores to the two candidates. As in the first example listed in Table 1, for the reference "The carpenter made a desk for the clerk because the clerk needed a desk", the two candidates, "The carpenter made a desk for the clerk because she needed a desk" and "The carpenter made a desk for the clerk because he needed a desk", should be assigned the same score since there is no evidence of the clerk's gender in the context. If a metric gives a higher score to the first candidate, as all of the PLM-based metrics did, a system that generates such a candidate with stereotypical gender bias will be rewarded and is more likely to be selected for deployment.
Datasets. For each sensitive attribute, we construct a dataset that consists of paired examples for evaluating fairness. For gender bias, we construct a dataset based on WinoBias (Zhao et al., 2018a), a widely used dataset for measuring gender bias in coreference resolution systems. WinoBias consists of paired sentences, where one demonstrates a stereotype and the other violates it. We use the paired sentences as our paired candidates, and construct the corresponding references by replacing the pronouns (e.g., she and he) with the nouns they refer to (e.g., CEO, clerk, etc.). Some of the constructed samples can be found in Table 1. For the other 5 sensitive attributes, we construct similar examples based on CrowS-Pairs (Nangia et al., 2020), a crowd-sourced dataset that covers common types of bias. Similar to WinoBias, each example in CrowS-Pairs consists of a pair of sentences where one is modified to express either a stereotype or an anti-stereotype. We adopt the paired sentences as our paired candidates and use rule-based methods to create references. Details of constructing references for CrowS-Pairs are in Appendix A. The statistics of the constructed datasets are listed in Table 3.
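As an illustration of the rule-based reference construction, a minimal sketch of the pronoun-to-referent substitution for the WinoBias-derived pairs might look like the following. The mapping here is a hand-written stand-in; the actual rules used for CrowS-Pairs are described in Appendix A.

```python
def build_reference(sentence, pronoun_to_noun):
    # Replace gendered pronouns with the noun they refer to, producing
    # a reference that carries no gender information. The mapping is
    # assumed to be provided per example (e.g., from coreference labels).
    tokens = sentence.split()
    resolved = [pronoun_to_noun.get(t.lower(), t) for t in tokens]
    return " ".join(resolved)
```

Applied to the Table 1 example, the same neutral reference is produced for both the stereotyped and anti-stereotyped candidate.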
Evaluation. We evaluate the fairness of the considered metrics on our constructed datasets. For each metric on each sensitive attribute, the metric scores are first rescaled to [0, 100] for comparison:

Ŝ = 100 × (S − Smin) / (Smax − Smin),  (1)

where S is the original metric score, and Smin and Smax are the minimal and maximal values of the evaluated metric on the dataset. Let Ŝi,1 and Ŝi,2 be the transformed scores of the first and second candidate-reference pairs (sys i,1 , ref i ) and (sys i,2 , ref i ) of the i-th paired example. The social bias for a sensitive attribute is then defined as the average score difference of the paired examples:

Bias = (1/N) Σi |Ŝi,1 − Ŝi,2|,  (2)

where N is the total number of paired examples for the sensitive attribute of interest.
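The min-max rescaling of Eq. (1) and the averaged absolute difference of Eq. (2) amount to a few lines of code, sketched here for clarity:

```python
def rescale(scores):
    # Eq. (1): min-max rescale metric scores to [0, 100].
    # Assumes the scores are not all identical.
    lo, hi = min(scores), max(scores)
    return [100 * (s - lo) / (hi - lo) for s in scores]

def attribute_bias(pair_scores):
    # Eq. (2): average absolute difference over paired examples,
    # where pair_scores holds rescaled score pairs (S_i1, S_i2).
    return sum(abs(s1 - s2) for s1, s2 in pair_scores) / len(pair_scores)
```

A perfectly fair metric gives every pair identical scores and therefore a bias of exactly 0 on this scale.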

Main Results
Figure 2 shows the measurement of social bias in text generation metrics across the 6 sensitive attributes. We observe that PLM-based metrics generally carry more significant bias than traditional n-gram-based metrics on all sensitive attributes. The most striking type of bias is gender bias, for which PLM-based metrics exhibit score differences of 7∼21, while traditional metrics show very small (< 1.3) score differences. In terms of age and socioeconomic status, traditional metrics also demonstrate relatively high bias, since the word substitutions used to construct the corresponding datasets change the surface form of the reference to a greater extent. Full results are provided in Appendix C.
Visualization of Matching Results. To interpret the results, we take a closer look at the process by which a model produces biased results. Regression-based metrics and generation-based metrics are completely black-box models and therefore difficult to interpret. By contrast, matching-based metrics are somewhat interpretable thanks to the matching map between the system output and the reference. We visualize a matching map of MoverScore in Figure 3. The word "she" in the system output matches the word "nurse" in the reference, while the word "he" in the system output matches the word "the" in the reference. The gender bias in this case is therefore due to the stereotyped correlation between "she" and "nurse" learned by BERT.
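The diagnosis above boils down to reading off the argmax of each row of the token similarity matrix. A small sketch with a hand-crafted similarity matrix (the values are illustrative, not actual MoverScore similarities):

```python
import numpy as np

def best_alignments(sys_tokens, ref_tokens, sim):
    # For each system token, report the reference token it matches,
    # i.e., the argmax of its row in the (n_sys, n_ref) similarity
    # matrix, as visualized in Figure 3.
    idx = sim.argmax(axis=1)
    return {s: ref_tokens[j] for s, j in zip(sys_tokens, idx)}
```

With illustrative similarities where "she" is closer to "nurse" than "he" is, the alignment reproduces the stereotyped match described above.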
Intrinsic Bias vs. Extrinsic Bias. In our context, intrinsic bias is the bias pre-encoded in the PLM, while extrinsic bias is the bias incorporated when adapting a PLM into a text generation metric. As summarized in Table 2, all the PLM-based metrics carry intrinsic bias, and among them BLEURT exhibits the highest degree of unfairness. We conjecture that this is because it incorporates much extrinsic bias when performing supervised learning on human ratings. Besides, we observe that tiny-size PLMs exhibit relatively lower bias.
Debiased PLMs as Backbones. We replace BERT-large in BERTScore and MoverScore with the corresponding Zari models (Webster et al., 2020), i.e., bert-dropout and bert-cda, both of which are based on BERT-large and are denoted as Zari-Dropout and Zari-CDA in this paper. We evaluate the gender bias in BERTScore and MoverScore with the Zari models as their backbones. Besides, we also evaluate their performance as text generation metrics. We consider two different generation tasks: machine translation and text summarization. For machine translation, we obtain system outputs and references from the WMT20 metrics shared task (Mathur et al., 2020). We consider 10 language pairs: cs-en, de-en, iu-en, ja-en, km-en, pl-en, ps-en, ru-en, ta-en, and zh-en. Average Pearson correlation scores over the 10 language pairs are listed in Table 4, while full results for all language pairs are in Appendix D. For text summarization, we use REALSumm (Bhandari et al., 2020), which measures the pyramid recall of system-output summaries. Following Yuan et al. (2021), we report Spearman correlation for REALSumm.
As shown in Table 4, after replacing BERT-large with the Zari models, gender bias is successfully reduced for both BERTScore and MoverScore. The performance for evaluating machine translation and text summarization remains comparable to, or even better than, the original BERTScore and MoverScore. Hence, using off-the-shelf debiased PLMs, which encode less intrinsic bias, is a feasible way to improve the fairness of PLM-based metrics.
However, simply replacing biased PLMs with debiased ones has limitations. First, for regression-based metrics that use fine-tuned PLMs, directly plugging in debiased PLMs such as Zari would not work. Second, for many PLMs used in metrics, such as BART, there are few publicly available debiased models to substitute. Third, it is costly to train an alternative debiased model for every existing PLM and every bias type. To that end, we explore mitigating metric bias in a parameter-efficient way.

Mitigating Metric Bias with Adapters
Our goal is to mitigate metric bias while maintaining considerable performance for evaluating text generation. However, existing bias mitigation methods (Bordia and Bowman, 2019) usually modify all parameters of the PLM and suffer from high computational cost and catastrophic forgetting (French, 1993), which may lead to degraded performance.
Instead, following Lauscher et al. (2021), we insert lightweight neural adapters (Houlsby et al., 2019; Pfeiffer et al., 2021) into the PLM layers. By incorporating debiasing knowledge into the injected adapters while keeping the PLM parameters untouched, we can reduce the bias of interest in a plug-and-play style while retaining most of the original performance.
Debiasing Adapters. Our debiasing adapters follow the same architecture as Pfeiffer et al. (2021), where a neural adapter module is injected into each PLM layer after the feed-forward sub-layer. Denoting h and r as the hidden states and the residual, respectively, the computation of an adapter can be formulated as

Adapter(h, r) = Wu g(Wd h) + r,  (3)

where Wu and Wd are linear layers for up- and down-projection, and g(·) is an activation function.
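A minimal sketch of this adapter computation, with a ReLU standing in for the unspecified activation g(·) and plain NumPy matrices as stand-ins for the linear layers:

```python
import numpy as np

def adapter(h, r, W_d, W_u):
    # Pfeiffer-style adapter: down-project, nonlinearity, up-project,
    # then add the residual r:  Adapter(h, r) = W_u g(W_d h) + r.
    g = lambda x: np.maximum(x, 0.0)   # ReLU as a stand-in for g(.)
    return W_u @ g(W_d @ h) + r
```

Note that with zero-initialized projections the adapter reduces to the identity on the residual, which is why adapters can be inserted into a pretrained model without initially disturbing its behavior.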
Training Data and Objectives. Since text generation metrics operate on paired sequences, we collect training data based on two public sentence-pair datasets, MultiNLI (Williams et al., 2018) and STS-B (Cer et al., 2017), in which each sample consists of a premise and a hypothesis. We perform counterfactual data augmentation (CDA) (Zhao et al., 2018b) on the sentences in MultiNLI and STS-B to construct a training set.
In particular, we modify the original sentences by replacing terms describing one of the protected groups (dominant or minoritized) with identity words for the other group, e.g., he → she, Michael → Elizabeth, etc. Denote the original sentence as c 1 and the modified sentence as c 2 . We also replace the identity words with neutral terms that do not imply the identity of any protected group (e.g., he → person) to create an unbiased reference r. With such constructed paired samples at hand, we can mitigate the bias against the protected group by encouraging the model to assign the same score to (c 1 , r) and (c 2 , r). Formally, the instance-wise loss can be described as

L_debias = (M(c 1 , r; θ A ) − M(c 2 , r; θ A ))^2,  (4)

where M is the PLM-based metric and θ A denotes the parameters of the PLM with debiasing adapters. To increase the diversity of the training data, we also include the gender subset of StereoSet (Nadeem et al., 2021), a crowd-sourced dataset consisting of context association tests (CATs). To retain the model's performance for evaluating text generation, we use the original sentence pairs in MultiNLI and STS-B to perform knowledge distillation (KD) (Hinton et al., 2015). In particular, for a pair of premise and hypothesis (p, h), we encourage the metric model with adapters to mimic the score of the original metric without adapters:

L_KD = (M(p, h; θ A ) − M(p, h; θ LM ))^2,  (5)

where θ LM denotes the original parameters of the PLM. The debiasing loss and the knowledge distillation loss are summed without weighting to train the injected adapters.
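Both objectives reduce to squared score differences. A sketch with a toy scoring function standing in for the metric M (the real M would be BERTScore, BLEURT, or BARTScore with adapter parameters θ A):

```python
def debias_loss(metric, c1, c2, ref):
    # Eq. (4) sketch: squared score gap between the original and the
    # counterfactual candidate against the neutral reference.
    return (metric(c1, ref) - metric(c2, ref)) ** 2

def kd_loss(metric_adapted, metric_frozen, premise, hypothesis):
    # Eq. (5) sketch: the adapted metric mimics the original frozen one.
    return (metric_adapted(premise, hypothesis)
            - metric_frozen(premise, hypothesis)) ** 2

def total_loss(metric_adapted, metric_frozen, c1, c2, ref, p, h):
    # Unweighted sum of the two objectives, as in the text.
    return (debias_loss(metric_adapted, c1, c2, ref)
            + kd_loss(metric_adapted, metric_frozen, p, h))
```

Only the adapter parameters would be updated against this loss; the frozen metric provides the distillation target.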
Implementation Details. Though the proposed approach can address any common type of bias, we limit our study to mitigating gender bias because (1) gender bias is the most significant bias in existing metrics (see Figure 2), and (2) the resources for implementation (e.g., the term substitution pairs for CDA) and comparison (e.g., with the Zari models) are more abundant for gender bias. We leave the mitigation of a wider range of biases to future work. The total number of training samples is ∼800k, with ∼400k for bias mitigation and ∼400k for knowledge distillation. We adopt the same set of gender term pairs for CDA as Lauscher et al. (2021). Our implementation is based on AdapterHub (Pfeiffer et al., 2020). Hyperparameters are provided in Appendix B.
Results. We evaluate our bias mitigation method on BERTScore, BLEURT, and BARTScore, corresponding to the three paradigms of matching, regression, and generation. Since the base versions of the PLMs exhibit the most significant bias, we mainly mitigate bias with BERT-base as the backbone of BERTScore and BLEURT, and BART-base as the backbone of BARTScore. For comparison with the Zari models, we also conduct experiments on BERT-large for BERTScore. As shown in Table 5, after plugging in our trained debiasing adapters, the gender bias in the three metrics is significantly reduced. On BERTScore and BLEURT, injecting debiasing adapters can even improve performance on REALSumm and WMT20, respectively. Compared with using Zari models for BERTScore (Table 4), our debiasing adapters with BERT-large perform better than Zari-Dropout but worse than Zari-CDA in terms of bias mitigation. By contrast, our approach has a lower computational cost and can be activated and switched in a plug-and-play fashion.

Related Work
PLM-based Metrics for Text Generation. Existing PLM-based metrics can be categorized into three paradigms: matching, regression, and generation. Matching-based metrics, such as BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019b), compute the similarity of system outputs and references based on features extracted by PLMs like BERT (Devlin et al., 2019). Regression-based metrics, such as BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020), fine-tune PLMs with a regression objective on human-rating data. Generation-based metrics, such as PRISM (Thompson and Post, 2020) and BARTScore (Yuan et al., 2021), adopt the probability of system outputs conditioned on the references, or vice versa, as the metric. In contrast to traditional metrics, PLM-based metrics achieve higher correlation with human judgements owing to their stronger ability to capture semantics.
Social Bias in PLMs. With the popularization of PLMs, quantifying the social bias encoded in PLMs has received increasing attention in recent years. Template-based methods measure the fairness of PLMs based on the predictions (Webster et al., 2020) or the log probabilities (Kurita et al., 2019) of the slot of interest in a hand-crafted template, e.g., "X likes to [MASK]". Another line of research (May et al., 2019; Lauscher et al., 2021; Tan and Celis, 2019) quantifies bias based on the representations encoded by PLMs. For example, SEAT (May et al., 2019) measures the cosine distance between the representations (from the [CLS] token in BERT and the last token in GPT) of two sets of attributes. PCA-based methods (Basta et al., 2019; Zhao et al., 2019a) and causal methods (Vig et al., 2020) have also been proposed to analyse social bias in PLMs. In addition, high-quality crowd-sourced datasets such as StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) have been constructed for measuring the fairness of PLMs.

Conclusion
In this paper, we present a systematic study of the social bias in PLM-based metrics for text generation, which have been widely adopted in a variety of tasks. We demonstrate that popular PLM-based metrics exhibit significant bias on 6 sensitive attributes. Through in-depth analysis, we shed light on the impact of different factors (e.g., modeling paradigms, PLMs) on metric bias. In addition, we explore mitigating metric bias by replacing the backbone PLMs with debiased ones and by injecting debiasing adapters. Experimental results show that both approaches can significantly reduce bias while retaining high performance for evaluating text generation.

Limitations
Though our proposed debiasing approach is agnostic to bias type, we only conduct experiments on mitigating gender bias in PLM-based metrics because: (1) gender bias is shown to be the most significant bias in PLM-based metrics; (2) the resources for performing CDA for gender bias are more abundant; (3) there are existing debiased models (e.g., the Zari models) for comparison. We leave the investigation of mitigating bias against other sensitive attributes to future work. For evaluating the performance of the (debiased) PLM-based metrics, we only consider two tasks, namely machine translation and text summarization. The performance, and its change after bias mitigation, on a wider range of generation tasks such as image captioning should be explored in future work.

Ethics Statement
This work is a systematic study of the social bias in PLM-based metrics for text generation, which have been commonly used by researchers and industry. We empirically show that popular PLM-based metrics exhibit a significantly higher degree of social bias on 6 sensitive attributes than traditional metrics, which could help practitioners and the community review existing text generation systems along a new dimension. In addition, we present several effective methods for mitigating social bias in PLM-based metrics, which are early attempts towards fair text generation metrics and systems.

C Full Results of Fairness Evaluation
We provide full results of evaluating metric bias in Table 6. For PLM-based metrics, we evaluate using different backbone models with varying sizes.
For generation-based metrics, namely PRISM and BARTScore, we report the results of using precision, recall, and F-score as the text generation metric, respectively.

D Full Results of Performance Evaluation
In Table 4, we only show the average Pearson correlation of BERTScore and MoverScore across 10 language pairs in the WMT20 dataset. Table 8 provides the full results for all the language pairs.

E On the Definition of Metric Bias
In Eq. (2) we measure the metric bias as the absolute difference between the sentence pairs, instead of the difference signed by the polarity of stereotype or anti-stereotype, which we refer to as the stereotypical difference.
Why do we use the absolute difference? On the one hand, we adopt the absolute difference as the measurement of fairness because our purpose is to encourage text generation metrics to assign the same score to a pair of candidates whose only difference is the identity words, rather than to reward either the stereotypical or the anti-stereotypical one. If we used the stereotypical difference as the measurement of fairness, then a metric that rates stereotypical candidates higher 50% of the time and anti-stereotypical candidates higher 50% of the time would be considered fair, even though unfairness has occurred on those individual candidates. We do not consider such a text generation metric to be fair, though it seems fair "statistically".
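The argument above can be checked numerically: a metric that over-scores the stereotypical candidate half the time and the anti-stereotypical candidate half the time has a zero signed (stereotypical) difference but a large absolute difference. A small sketch:

```python
def signed_bias(pairs):
    # "Stereotypical difference": stereotyped score minus
    # anti-stereotyped score, averaged with sign, so opposite-direction
    # errors cancel out.
    return sum(s - a for s, a in pairs) / len(pairs)

def absolute_bias(pairs):
    # Eq. (2): average absolute gap, insensitive to the direction
    # of the preference.
    return sum(abs(s - a) for s, a in pairs) / len(pairs)
```

Two pairs with equal and opposite 20-point gaps yield a signed bias of 0 but an absolute bias of 20, which is exactly the failure mode the absolute measure avoids.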
Results of stereotypical difference. On the other hand, the stereotypical difference can be another useful measurement and is a good complement to the current one. To that end, we also present results of gender bias evaluated using the stereotypical difference in Table 9. We find that both n-gram-based metrics and PLM-based metrics generally exhibit lower gender bias when switching to the stereotypical difference, but PLM-based metrics still carry a higher degree of gender bias than n-gram-based metrics. We leave the exploration of better measurements of metric bias to future work.

Figure 1: Impact of the social bias in PLM-based metrics. The red arrows indicate the propagation of social bias in PLM-based metrics.
Figure 2: Measurement of social bias in 5 traditional n-gram-based metrics and 5 PLM-based metrics. Note that the y-axis ranges are different in the two histograms.

Figure 3: A visualization case of MoverScore that interprets the gender bias.

Figure 4: Average bias of different PLM-based metrics with varying sizes of PLMs.

Table 1: Examples of gender bias exhibited by PLM-based metrics. The evaluation scores are normalized to [0, 100] with Eq. (1). The red numbers indicate the score differences reflecting stereotypes.

Table 2: A summary of the three paradigms of PLM-based metrics. "Sim" indicates a similarity function, f indicates a regression layer, and ∥ denotes concatenation.

Table 3: Statistics of the constructed datasets for evaluating different types of fairness.

Table 4: Results of mitigating intrinsic bias in BERTScore and MoverScore. Blue numbers indicate positive effects; red numbers indicate negative effects.

Table 5: Results of mitigating metric bias with adapters.

Table 6: Full experimental results of measuring social bias in text generation metrics. PA: Physical Appearance. SS: Socioeconomic Status. The recommended (default) configurations are in bold.

Table 9: Comparison of gender bias evaluated using the absolute difference and the stereotypical difference.