Investigating Metric Diversity for Evaluating Long Document Summarisation

Cai Yang, Stephen Wan


Abstract
Long document summarisation, a challenging summarisation scenario, is the focus of the recently proposed LongSumm shared task. One limitation of this shared task has been its use of a single family of metrics for evaluation (the ROUGE metrics). In contrast, other fields, like text generation, employ multiple metrics. We replicated the LongSumm evaluation using multiple test set samples (vs. the single test set of the official shared task) and investigated how different metrics might complement each other in this evaluation framework. We show that under this more rigorous evaluation: (1) some of the key learnings from LongSumm 2020 and 2021 still hold, but the relative ranking of systems changes; (2) the use of additional metrics reveals high-quality summaries missed by ROUGE; and (3) SPICE is a candidate metric for summarisation evaluation for LongSumm.
Anthology ID:
2022.sdp-1.13
Volume:
Proceedings of the Third Workshop on Scholarly Document Processing
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Arman Cohan, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Drahomira Herrmannova, Petr Knoth, Kyle Lo, Philipp Mayr, Michal Shmueli-Scheuer, Anita de Waard, Lucy Lu Wang
Venue:
sdp
Publisher:
Association for Computational Linguistics
Pages:
115–125
URL:
https://aclanthology.org/2022.sdp-1.13
Cite (ACL):
Cai Yang and Stephen Wan. 2022. Investigating Metric Diversity for Evaluating Long Document Summarisation. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 115–125, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
Investigating Metric Diversity for Evaluating Long Document Summarisation (Yang & Wan, sdp 2022)
PDF:
https://aclanthology.org/2022.sdp-1.13.pdf
Code
 caiyangcy/sdp-longsumm-metric-diversity