On the Effectiveness of Automated Metrics for Text Generation Systems

Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak


Abstract
A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
Anthology ID:
2022.findings-emnlp.108
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1503–1522
URL:
https://aclanthology.org/2022.findings-emnlp.108
DOI:
10.18653/v1/2022.findings-emnlp.108
Cite (ACL):
Pius von Däniken, Jan Deriu, Don Tuggener, and Mark Cieliebak. 2022. On the Effectiveness of Automated Metrics for Text Generation Systems. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1503–1522, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
On the Effectiveness of Automated Metrics for Text Generation Systems (von Däniken et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.108.pdf
Video:
https://aclanthology.org/2022.findings-emnlp.108.mp4