What’s under the hood: Investigating Automatic Metrics on Meeting Summarization

Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp


Abstract
Meeting summarization has become a critical task considering the increase in online interactions. Despite new techniques being proposed regularly, the evaluation of meeting summarization techniques relies on metrics not tailored to capture meeting-specific errors, leading to ineffective assessment. This paper explores what established automatic metrics capture and the errors they mask by correlating metric scores with human evaluations across a comprehensive error taxonomy. We start by reviewing the literature on English meeting summarization to identify key challenges, such as speaker dynamics and contextual turn-taking, and error types, including missing information and linguistic inaccuracy, concepts previously loosely defined in the field. We then examine the relationship between these challenges and errors using human-annotated transcripts and summaries from encoder-decoder-based and autoregressive Transformer models on the QMSum dataset. Experiments reveal that different model architectures respond variably to the challenges, resulting in distinct links between challenges and errors. Current established metrics struggle to capture the observable errors, showing weak to moderate correlations, with a third of the correlations indicating error masking. Only a subset of metrics accurately reacts to specific errors, while most correlations show either unresponsiveness or failure to reflect the error’s impact on summary quality.
Anthology ID:
2024.findings-emnlp.393
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6709–6723
URL:
https://aclanthology.org/2024.findings-emnlp.393
Cite (ACL):
Frederic Kirstein, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2024. What’s under the hood: Investigating Automatic Metrics on Meeting Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6709–6723, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
What’s under the hood: Investigating Automatic Metrics on Meeting Summarization (Kirstein et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.393.pdf