Why We Need New Evaluation Metrics for NLG

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser


Abstract
The majority of NLG evaluation relies on automatic metrics, such as BLEU . In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
Anthology ID:
D17-1238
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Note:
Pages:
2241–2252
Language:
URL:
https://aclanthology.org/D17-1238/
DOI:
10.18653/v1/D17-1238
Bibkey:
Cite (ACL):
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why We Need New Evaluation Metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Why We Need New Evaluation Metrics for NLG (Novikova et al., EMNLP 2017)
Copy Citation:
PDF:
https://aclanthology.org/D17-1238.pdf
Attachment:
 D17-1238.Attachment.zip
Code
 jeknov/EMNLP_17_submission