PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

Zongxia Li, Ishani Mondal, Huy Nghiem, Yijun Liang, Jordan Boyd-Graber


Abstract
Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but these expensive scorers have only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation metric that is more stable than exact match and neural methods (BERTScore).
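
For context, the exact-match baseline that the abstract contrasts against is typically implemented with SQuAD-style string normalization. Below is a minimal illustrative sketch (not the PEDANTS metric itself; the helper names are ours) of why such a check penalizes verbose but correct free-form answers:

import re
import string

def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation and articles, squeeze whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    # Classic exact-match answer correctness: credit only if normalized strings are identical.
    return normalize(prediction) == normalize(gold)

print(exact_match("Paris", "Paris"))                            # True
print(exact_match("The capital of France is Paris.", "Paris"))  # False: a correct but verbose answer gets no credit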
Anthology ID:
2024.findings-emnlp.548
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9373–9398
URL:
https://aclanthology.org/2024.findings-emnlp.548
Cite (ACL):
Zongxia Li, Ishani Mondal, Huy Nghiem, Yijun Liang, and Jordan Boyd-Graber. 2024. PEDANTS: Cheap but Effective and Interpretable Answer Equivalence. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9373–9398, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence (Li et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.548.pdf