SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, William Yang Wang


Abstract
Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SEScore2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. We evaluate SEScore2 and previous methods on four text generation tasks across three languages. SEScore2 outperforms all prior unsupervised metrics on four text generation evaluation benchmarks, with an average improvement in Kendall correlation of 0.158. Surprisingly, SEScore2 even outperforms the supervised BLEURT and COMET on multiple text generation tasks.
Anthology ID:
2023.acl-long.283
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5166–5183
URL:
https://aclanthology.org/2023.acl-long.283
DOI:
10.18653/v1/2023.acl-long.283
Cite (ACL):
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, and William Yang Wang. 2023. SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5166–5183, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes (Xu et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.283.pdf