Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Philippe Laban, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong


Abstract
Precisely assessing progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference for one model’s output over another’s is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, such as performing fine-grained model analysis or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
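The pass/fail rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `NNDTest` fields, the `run_nnd` helper, the error labels, and the toy log-likelihood table are all hypothetical, and a real setup would score candidates with the NLG model under evaluation. Counting a tie as a failure is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class NNDTest:
    """One hypothetical NND test built from a prior human annotation."""
    context: str         # model input (e.g. a document or passage)
    good_candidate: str  # output that annotators judged high quality
    near_negative: str   # output that annotators marked with a known error
    error_type: str      # task-specific error label of the near-negative


def run_nnd(tests: list[NNDTest],
            log_likelihood: Callable[[str, str], float]) -> tuple[int, dict[str, int]]:
    """Return the number of passed tests and failure counts per error type."""
    passed = 0
    failures: Counter[str] = Counter()
    for test in tests:
        ll_good = log_likelihood(test.context, test.good_candidate)
        ll_bad = log_likelihood(test.context, test.near_negative)
        # A test passes only if the good candidate is strictly more likely
        # (assumption: ties count as failures).
        if ll_good > ll_bad:
            passed += 1
        else:
            failures[test.error_type] += 1
    return passed, dict(failures)


# Toy stand-in for a model: a lookup table of made-up log-likelihoods.
toy_scores = {
    ("ctx1", "good1"): -2.0, ("ctx1", "bad1"): -5.0,
    ("ctx2", "good2"): -4.0, ("ctx2", "bad2"): -3.0,
    ("ctx3", "good3"): -1.5, ("ctx3", "bad3"): -1.0,
}


def toy_log_likelihood(context: str, candidate: str) -> float:
    return toy_scores[(context, candidate)]


tests = [
    NNDTest("ctx1", "good1", "bad1", "off_target"),
    NNDTest("ctx2", "good2", "bad2", "disfluent"),
    NNDTest("ctx3", "good3", "bad3", "off_target"),
]

passed, failures = run_nnd(tests, toy_log_likelihood)
print(f"passed {passed} of {len(tests)} tests; failures by error type: {failures}")
```

Keeping the failure counts separate from the pass count mirrors the two quantities the abstract names: overall performance and the distribution over error types the model fails on.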
Anthology ID:
2022.emnlp-main.135
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2094–2108
URL:
https://aclanthology.org/2022.emnlp-main.135
DOI:
10.18653/v1/2022.emnlp-main.135
Cite (ACL):
Philippe Laban, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2094–2108, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets (Laban et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.135.pdf