Evaluation of Summarization Systems across Gender, Age, and Race

Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios – evaluation against gold summaries and system output ratings – we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.


Introduction
Summarization (the task of automatically generating brief summaries of longer documents or collections of documents) has, so it seems, seen a lot of progress recently. Progress, of course, is relative to how performance is measured. Generally, summarization systems are evaluated in two ways: by comparing machine-generated summaries to human summaries using text similarity metrics (Lin, 2004; Nenkova and Passonneau, 2004) or by human rater studies, in which participants are asked to rank system outputs. While using similarity metrics is controversial (Liu and Liu, 2008; Graham, 2015; Schluter, 2017), the standard way to evaluate summarization systems is a combination of both.
Both comparison to human summaries and the use of human raters naturally involve human participants, and these participants are typically recruited in some way. In Liu and Liu (2008), for example, the human subjects are five undergraduate students in Computer Science. Undergraduate students in Computer Science are not necessarily representative of the population at large, however, or of the end users of the technologies we develop. In this work, we ask whether such sampling bias, when recruiting participants to evaluate summarization systems, is a problem. In other words, do different demographics exhibit different preferences in rater studies of summarization systems? NLP models are only fair if they do not put certain demographics at a disadvantage (Larson, 2017), and it is therefore crucial that our benchmarks reflect preferences and judgments across those demographics (Ethayarajh and Jurafsky, 2020).
Contributions We present, to the best of our knowledge, the first in-detail evaluations of summarization systems across demographic groups, focusing on two very different extractive summarization systems: TextRank (Mihalcea and Tarau, 2004) and MatchSum (Zhong et al., 2020). The groups are defined by three protected attributes: gender, age, and race. While the systems are reported to perform very differently, we show that the system rankings induced by performance scores or user preferences differ across these groups of human summary authors and summary raters. We analyze what drives these differences and provide recommendations for future evaluations of summarization systems.

Experiments
We present two evaluations in this short paper: an automated scoring against human summaries (EXP.A) and a human rater study (EXP.B).
In both experiments, we use Amazon Mechanical Turk to recruit annotators from different demographic groups, and the first paragraphs of biographies from English Wikipedia as our input data, using the Wikidata API for extraction. 2 We create a dataset of biographies of women and men, obtain human summaries, and generate summaries of these biographies using two out-of-the-box extractive summarization systems. In EXP.A, we compare the system summaries directly to the human summaries (from different groups); in EXP.B, we let human raters compare and rate the two system summaries. To ensure differences between the two summarization systems, we use the 2004 graph-based TextRank (Mihalcea and Tarau, 2004) and the 2020 state-of-the-art, BERT-based MatchSum (Zhong et al., 2020). 3 We follow the MatchSum guidelines described in Zhong et al. (2020), limit the length of the input biographies to a maximum of 5 sentences, and force the output summaries to be 2-3 sentences long. Our final dataset consists of the original 975 biographies (700 men and 275 women), along with two automatic summaries, as well as human 3-sentence summaries, and is made freely available.
2 https://query.wikidata.org/
3 We use the implementation of TextRank by Barrios et al. (2016) and the original MatchSum implementation. MatchSum obtains state-of-the-art performance across a range of benchmarks by learning to produce summaries whose document encoding is similar to that of the input document. TextRank is a much simpler extractive algorithm; it adopts PageRank to compute node centrality recursively based on a Markov chain model. While MatchSum obtains a Rouge-1 score of .44 on CNN/Daily Mail, TextRank obtains a Rouge-1 score of .33 (Zheng and Lapata, 2019). We use both systems with recommended parameters, as was done in Zheng and Lapata (2019). Note that TextRank, in contrast to MatchSum, is unsupervised. Our Rouge-1 scores below for Wikipedia biographies are generally comparable.
Table 1: ROUGE scores of MatchSum (Zhong et al., 2020) across self-reported protected attributes: gender, with values ♀, ♂, and other (all our annotators identified as either male or female), and race, binarized here as white and other (in order to achieve rough size balance). The ROUGE scores of MatchSum are clearly higher when evaluated against reference summaries created by white men. We also considered age (binarized as ±30, to achieve size balance): here we see slightly better performance when evaluated against summaries of older participants, across all genders annotators identified with.
Our evaluations rely on annotations and ratings from Amazon Mechanical Turk. For quality control, we rely on a control question, as well as on annotation time: if a task is completed more than one standard deviation faster than the average time spent, the answers in that task are discarded. We collected one manual summary and two system rankings per biography, resulting in 3,135 annotations.
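The timing-based quality control can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes completion times are recorded in seconds, reads the criterion as "more than one standard deviation below the mean", and uses the population standard deviation. The function name and input format are hypothetical.

```python
from statistics import mean, pstdev

def filter_fast_tasks(times_by_task):
    """Keep tasks whose completion time is not suspiciously short.

    times_by_task: dict mapping a task id to seconds spent (hypothetical format).
    A task is discarded when its time is below mean - 1 * std (assumed reading).
    """
    times = list(times_by_task.values())
    threshold = mean(times) - pstdev(times)
    return {task: t for task, t in times_by_task.items() if t >= threshold}

# Hypothetical timings: task "d" is far faster than the rest.
timings = {"a": 60.0, "b": 62.0, "c": 58.0, "d": 5.0, "e": 61.0}
kept = filter_fast_tasks(timings)  # task "d" falls below the threshold
```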
Human summaries In EXP.A, participants were asked to enter the three most important sentences in the document in three blank text fields; for quality control, we check that these sentences occur in the input document. We collect a total of 1,185 summaries, 53% of which are written by women (0.5% identified as neither male nor female). 74% of summaries are written by participants older than 30 years of age. 76% identified as white, 11% as Black, 5% as American Indian, 4% as Asian, and 4% as Hispanic. We binarize race as white and other to achieve rough size balance across groups. Aggregating scores across multiple races is not ideal, but by doing so, we compensate for poor representation of some demographics.
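The two preprocessing steps described for the human summaries (checking that submitted sentences occur in the input, and binarizing race) could look roughly like this. The sketch assumes verbatim matching after whitespace normalization and plain-string race labels; both function names are hypothetical.

```python
def is_extractive(summary_sentences, document):
    """Return True if every submitted sentence is non-empty and occurs
    verbatim in the document (after collapsing whitespace)."""
    normalized_doc = " ".join(document.split())
    return all(
        bool(s.strip()) and " ".join(s.split()) in normalized_doc
        for s in summary_sentences
    )

def binarize_race(label):
    """Map a self-reported race label to 'white' or 'other'."""
    return "white" if label.strip().lower() == "white" else "other"

# Hypothetical input document and submissions.
doc = "Ada Lovelace was a mathematician. She wrote the first program."
ok = is_extractive(["She wrote the first program."], doc)
bad = is_extractive(["She was a poet."], doc)
```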
Rater study In EXP.B, we present participants with two 2-3 sentence machine summaries and ask them to a) pick their preferred summary and b) rate the two summaries on 4-point forced Likert scales for fluency, informativeness and usefulness. 40.2% of our raters identified as female. 37.5% were below 30 years of age. 70.8% of raters identified as white, the rest as American Indian (2.3%), Asian (3.5%), Black (19.1%), Hispanic (2.0%), or other (2.2%).
We ask all participants to voluntarily submit their race and gender information, and require that they be US-based. We asked the participants in the rater study to also include age information.
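The automated scoring in EXP.A amounts to computing a similarity score between each system summary and each human summary, then averaging per demographic group of the human authors. The sketch below uses a simplified, summary-level ROUGE-L F1 over lowercased whitespace tokens; standard ROUGE toolkits differ in tokenization, stemming, and sentence handling, so this is an illustration of the idea rather than the scoring used for Table 1. The record format is an assumption.

```python
from collections import defaultdict

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between two strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_score_by_group(records):
    """records: iterable of (group, system_summary, human_summary) tuples
    (hypothetical format). Returns the mean ROUGE-L F1 per group."""
    scores = defaultdict(list)
    for group, system, human in records:
        scores[group].append(rouge_l_f1(system, human))
    return {g: sum(v) / len(v) for g, v in scores.items()}
```

Comparing the per-group means returned by `mean_score_by_group` is what exposes disparities of the kind reported for EXP.A.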

Results
In Table 1, we present the results of EXP.A: Rouge-1 and Rouge-L results are significantly better when evaluated on summaries produced by white men than when evaluated on summaries produced by any other group. MatchSum summaries also align better with those written by white women than with those written by non-white women. Generally, MatchSum aligns better with men than with women. EXP.B includes three demographic variables (gender, age, and race). Table 2 presents ratings across gender and age. Most participants prefer the reportedly superior system (with a Rouge-1 advantage of 0.11 on a standard benchmark; see §2), but younger women significantly preferred TextRank over MatchSum (p < 0.01). Table 3 presents the ratings across age and race. Here, we again find a single outlier group: younger Black participants significantly prefer TextRank over MatchSum (p < 0.01). Our results imply that our standard evaluation methodologies do not align with the subjective evaluations of younger women and younger Black participants.
We try to explain these two observations in §5.
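The text does not name the significance test behind the reported p-values. As one plausible way to test whether a group prefers one system over the other, the sketch below computes an exact two-sided binomial test against a null of no preference (p = 0.5). The choice of test, the function name, and the counts are all assumptions for illustration.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value: the total probability of all
    outcomes that are no more likely than the observed one under the null."""
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Hypothetical counts: 70 of 100 raters in a group prefer TextRank.
p_value = binomial_two_sided_p(70, 100)
```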
We checked for significant group rating differences [...] All these subdemographics exhibit significantly different ranking behavior from their peers. So, for example, our results show a significant difference between young and old raters.
We also bin our results by gender of the subjects of the biographies. We rely on Wikidata gender information to make this classification. There are 1409 preferences and ratings of men's biographies (MEN), and 585 of biographies of women (WOMEN). This of course means we see fewer significant differences in ratings of female biographies. For MEN, we find significant differences across a wide range of groups, and with stronger effects for some demographics, suggesting that the gender of the subject of the biography does impact ratings differently across subdemographics. We find significant results for WOMEN only for the subdemographic WHITE (p = 0.004). This result is interesting, though, since it shows that on female biographies, white and non-white annotators prefer different systems.
Finally, we also asked our annotators to rank the two systems based on fluency, informativeness and usefulness. We used a 4-point forced Likert scale.
One observation is that even across fine-grained dimensions, younger annotators rate summaries lower; see Table 4. Interestingly, however, this difference is only observed with female biographies (rows 3-6). See Table 5 for the results on ALL across race. While ratings are generally low, we see clear differences, with Hispanics finding TextRank significantly more informative and useful, and American Indians finding TextRank significantly more fluent. Interestingly, Hispanics exhibit significant differences across WOMEN and MEN, finding TextRank summaries of female biographies significantly more informative and useful than TextRank summaries of male biographies.

Analysis
In order to analyze the differences between the rating behavior of subdemographics, we learn which features are significant for each demographic by training a simple logistic regression text classifier on the summaries ranked by each of the subdemographics with significantly different ranking behavior. We represent each ranking instance as a vector of 2*149 features, one 149-sized subspace for each summary. Each subspace is made up of a one-hot vector of 145 frequent words (from the English stop words list in NLTK 8), as well as four task-specific features: the summary's average word length, whether the first sentence of the biography is included in the summary, the type/token ratio, and the text complexity of the summaries. We concatenate the 149 features from each system and scale them. We extract the top 20 most salient features for each demographic group and analyze them manually. The average word length of the MatchSum system correlates positively with annotators preferring MatchSum across several demographics, e.g., OVER 30 and MALE WHITE, but this effect is absent with female annotators. Since the inductive bias of TextRank does not explicitly prohibit redundancy (Mihalcea and Tarau, 2004), this finding indicates that MatchSum is preferred among older men, especially whites, when it is informative, introduces main entities, etc. However, other subdemographics seem less sensitive to this variation. MatchSum is not generally rated more informative and useful across demographics (Table 5). In other subdemographics, e.g., AMERICAN INDIAN, MatchSum summaries with pronouns are rated higher, indicating it is better than TextRank at extracting sentences with pronouns without breaking coreference chains. Referential clarity, e.g., dangling pronouns, is a known source of error in summarization (Pitler et al., 2010; Durrett et al., 2016). TextRank summaries are often preferred by AMERICAN INDIAN and ASIAN when they include negation. This is unsurprising, since negated sentences can often be very informative, and may seem more sophisticated in the context of machine-generated summaries. Negation is also a known source of error (Fiszman et al., 2006). In our data, however, this effect varies across subdemographics.
8 nltk.org
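The instance representation used in this analysis (one feature subspace per summary, concatenated) can be sketched as below. Several details are assumptions made only for illustration: a six-word stand-in replaces the 145 NLTK stop words, "text complexity" is approximated by words per sentence because the measure is not specified, and tokenization is deliberately crude. Feature scaling and the logistic regression fit are omitted.

```python
# Stand-in for the 145 NLTK stop words (hypothetical subset).
FUNCTION_WORDS = ["he", "she", "not", "no", "the", "was"]

def summary_features(summary, first_sentence):
    """Features for one summary: word indicators plus four task-specific values."""
    tokens = summary.lower().replace(".", " ").replace(",", " ").split()
    vocab = set(tokens)
    indicators = [1.0 if w in vocab else 0.0 for w in FUNCTION_WORDS]
    avg_word_len = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    has_first = 1.0 if first_sentence.strip() in summary else 0.0
    type_token = len(vocab) / len(tokens) if tokens else 0.0
    # Assumed complexity proxy (not specified in the text): words per sentence.
    n_sentences = max(1, summary.count("."))
    complexity = len(tokens) / n_sentences
    return indicators + [avg_word_len, has_first, type_token, complexity]

def ranking_instance(summary_a, summary_b, first_sentence):
    """Concatenate the two per-summary subspaces into one instance vector."""
    return (summary_features(summary_a, first_sentence)
            + summary_features(summary_b, first_sentence))

# Hypothetical example: summary A keeps the first sentence and contains negation.
first = "Ada was a poet."
vec = ranking_instance("Ada was a poet. She did not paint.",
                       "She wrote the notes.", first)
```

With the full 145-word list, each subspace would have 149 entries and each instance 2*149, matching the dimensions given above.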
Our main observation is that female and Black participants under 30 prefer TextRank over MatchSum. What drives this? The main predictors in our logistic regression analysis are a) TextRank extracting the first sentence of the biography (twice as frequently as MatchSum, in more than half of its summaries); and b) TextRank sentences containing negation. The former suggests a need for anchoring or framing of the summary, as initial sentences tend to provide this; the latter could suggest that young female or Black participants are less prone to the common bias of evaluating negated sentences as less important (Kaup et al., 2013).

Conclusion
Our paper is, as far as we know, the first to evaluate summarization systems across different subdemographics. We did so in two different evaluation scenarios: automatic evaluation against gold summaries and system output ratings by human evaluators. We made the gold summaries and the ratings available for future research.
What did we learn from our experiments? Most importantly, of course, we learned that performance numbers differ when evaluated on summaries written by different subdemographics, and that the preferences of raters from different subdemographics differ. In our experiments with automatic evaluation against gold summaries written by different subdemographics, we saw that summarization systems achieve higher performance scores when evaluated on summaries produced by white men, highlighting an unfortunate bias in these systems. In our rater studies, we also saw significant differences across subdemographics. Most surprisingly, perhaps, we saw that a summarization system from 2004 was rated better than a state-of-the-art system from 2020 by some subdemographics, an effect that was found to relate to the occurrence of first sentences (providing anchoring or framing of summaries) and negation (often evaluated as less important by majority groups). For now, we can only speculate what a summarization system optimized to perform well across all subdemographics would look like, e.g., a system minimizing the worst-case loss across subdemographics rather than the average loss. Our results show very clearly, however, that the current state of the art in summarization is biased toward some demographics and therefore fundamentally unfair.
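The contrast between average loss and worst-case loss across subdemographics can be made concrete with a small sketch. The group names and loss values are hypothetical, and this only computes the two quantities; it does not describe how a summarizer would be trained against them.

```python
def average_loss(losses_by_group):
    """Mean loss over all examples, ignoring group membership."""
    all_losses = [x for losses in losses_by_group.values() for x in losses]
    return sum(all_losses) / len(all_losses)

def worst_group_loss(losses_by_group):
    """Largest per-group mean loss; the quantity a worst-case objective targets."""
    return max(sum(v) / len(v) for v in losses_by_group.values())

# Hypothetical per-example losses for a majority and a minority group.
losses = {"majority": [0.1, 0.1, 0.1, 0.1], "minority": [0.5]}
avg = average_loss(losses)        # dominated by the larger group
worst = worst_group_loss(losses)  # determined by the minority group
```

In this toy case the average looks acceptable while the minority group fares much worse, which is the gap a worst-case objective is meant to close.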

Figure 1: Social bias in automatic summarization: We take steps toward evaluating the impact of the gender, age, and race of the humans involved in the summarization system evaluation loop: the authors of the summaries and the human judges or raters. We observe significant group disparities, with lower performance when systems are evaluated on summaries produced by minority groups. See §3 and Table 1 for more details on the Rouge-L scores in the bar chart.

Table 2: System ratings across participant gender and age. We highlight the outlier: Younger women significantly preferred TextRank over MatchSum (p < 0.01).

Table 3: System ratings across participant race and age. We highlight the outlier: Young Black participants significantly preferred TextRank over MatchSum (p < 0.01).