A Critical Evaluation of Evaluations for Long-form Question Answering

Fangyuan Xu; Yixiao Song; Mohit Iyyer; Eunsol Choi

doi:10.18653/v1/2023.acl-long.181

Abstract

Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts’ evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single “overall score” of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.
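One way to read "predictive of human preference judgments" is as pairwise agreement: for each pair of answers, check whether the metric scores the expert-preferred answer higher. The sketch below illustrates that idea only and is not the authors' released code. The function names, the data layout, and the toy length-based metric are all hypothetical, and it assumes a metric that scores an answer on its own (a reference-based metric would also need the reference as input).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical data layout: (answer_a, answer_b, human_choice),
# where human_choice is "a" or "b".
Pair = Tuple[str, str, str]


def pairwise_agreement(pairs: Sequence[Pair],
                       metric: Callable[[str], float]) -> float:
    """Fraction of pairs where the metric ranks the human-preferred answer higher.

    Ties in metric score count as half agreement (chance level).
    """
    if not pairs:
        raise ValueError("need at least one annotated pair")
    agree = 0.0
    for answer_a, answer_b, human_choice in pairs:
        score_a, score_b = metric(answer_a), metric(answer_b)
        if score_a == score_b:
            agree += 0.5
        else:
            predicted = "a" if score_a > score_b else "b"
            agree += 1.0 if predicted == human_choice else 0.0
    return agree / len(pairs)


def length_metric(answer: str) -> float:
    """Toy stand-in for an automatic metric: number of whitespace tokens."""
    return float(len(answer.split()))


if __name__ == "__main__":
    toy_pairs = [
        ("one two three", "one", "a"),  # metric prefers a, human prefers a
        ("x", "y z", "a"),              # metric prefers b, human prefers a
        ("p q", "r s", "b"),            # metric tie
    ]
    print(pairwise_agreement(toy_pairs, length_metric))
```

An agreement near 0.5 on a balanced set means the metric does no better than chance at predicting which answer was preferred.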

Anthology ID: 2023.acl-long.181
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3225–3245
URL: https://aclanthology.org/2023.acl-long.181
DOI: 10.18653/v1/2023.acl-long.181
Cite (ACL): Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023. A Critical Evaluation of Evaluations for Long-form Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3225–3245, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): A Critical Evaluation of Evaluations for Long-form Question Answering (Xu et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.181.pdf
Video: https://aclanthology.org/2023.acl-long.181.mp4
