Evaluation of Summarization Systems across Gender, Age, and Race

Anna Jørgensen, Anders Søgaard


Abstract
Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited from student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios – evaluation against gold summaries and system output ratings – we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater to some groups rather than others.
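A minimal sketch of the second evaluation scenario, comparing system ratings stratified by rater demographics. The CSV file name, column names, and the simple group-wise mean are illustrative assumptions, not the paper's actual data or procedure.

# Sketch: check whether system rankings shift across rater groups.
# Assumes a CSV of per-summary ratings with a rater demographic column;
# file name and column names are hypothetical.
from collections import defaultdict
import csv

def mean_rating_by_group(path, group_col="gender"):
    """Average each system's rating separately for each rater group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["system"], row[group_col])
            sums[key] += float(row["rating"])
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# If the ordering of systems differs between groups, the evaluation is
# sensitive to that protected attribute.
scores = mean_rating_by_group("ratings.csv", group_col="gender")
for (system, group), score in sorted(scores.items()):
    print(f"{system:20s} {group:10s} {score:.2f}")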
Anthology ID:
2021.newsum-1.6
Volume:
Proceedings of the Third Workshop on New Frontiers in Summarization
Month:
November
Year:
2021
Address:
Online and in Dominican Republic
Editors:
Giuseppe Carenini, Jackie Chi Kit Cheung, Yue Dong, Fei Liu, Lu Wang
Venue:
NewSum
Publisher:
Association for Computational Linguistics
Pages:
51–56
URL:
https://aclanthology.org/2021.newsum-1.6
DOI:
10.18653/v1/2021.newsum-1.6
Cite (ACL):
Anna Jørgensen and Anders Søgaard. 2021. Evaluation of Summarization Systems across Gender, Age, and Race. In Proceedings of the Third Workshop on New Frontiers in Summarization, pages 51–56, Online and in Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Evaluation of Summarization Systems across Gender, Age, and Race (Jørgensen & Søgaard, NewSum 2021)
PDF:
https://aclanthology.org/2021.newsum-1.6.pdf
Video:
https://aclanthology.org/2021.newsum-1.6.mp4