DialSummEval: Revisiting Summarization Evaluation for Dialogues

Mingqi Gao, Xiaojun Wan


Abstract
Dialogue summarization is receiving increasing attention from researchers due to its extraordinary difficulty and unique application value. We observe that current dialogue summarization models have flaws that may not be well exposed by frequently used metrics such as ROUGE. In this paper, we re-evaluate 18 categories of metrics along four dimensions (coherence, consistency, fluency, and relevance) and, for the first time, conduct a unified human evaluation of various models. We identify several noteworthy trends that differ from those observed in conventional summarization tasks. We will release DialSummEval, a multi-faceted dataset of human judgments containing the outputs of 14 models on SAMSum.
Anthology ID:
2022.naacl-main.418
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5693–5709
URL:
https://aclanthology.org/2022.naacl-main.418
DOI:
10.18653/v1/2022.naacl-main.418
Cite (ACL):
Mingqi Gao and Xiaojun Wan. 2022. DialSummEval: Revisiting Summarization Evaluation for Dialogues. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5693–5709, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
DialSummEval: Revisiting Summarization Evaluation for Dialogues (Gao & Wan, NAACL 2022)
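Cite (BibTeX):
A BibTeX entry can be assembled from the metadata above; the entry key below is an assumption following the Anthology's usual lastname-lastname-year-title naming pattern and should be verified against the official entry.
% Entry key assumed from the Anthology's naming convention; verify before use.
@inproceedings{gao-wan-2022-dialsummeval,
    title = "DialSummEval: Revisiting Summarization Evaluation for Dialogues",
    author = "Gao, Mingqi and Wan, Xiaojun",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.418",
    doi = "10.18653/v1/2022.naacl-main.418",
    pages = "5693--5709",
}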
PDF:
https://aclanthology.org/2022.naacl-main.418.pdf
Video:
https://aclanthology.org/2022.naacl-main.418.mp4
Code:
kite99520/dialsummeval
Data:
DialSummEval