Spurious Correlations in Reference-Free Evaluation of Text Generation

Esin Durmus, Faisal Ladhak, Tatsunori Hashimoto


Abstract
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
Anthology ID:
2022.acl-long.102
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1443–1454
URL:
https://aclanthology.org/2022.acl-long.102
DOI:
10.18653/v1/2022.acl-long.102
Cite (ACL):
Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. 2022. Spurious Correlations in Reference-Free Evaluation of Text Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Spurious Correlations in Reference-Free Evaluation of Text Generation (Durmus et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.102.pdf
Data:
DailyDialog