Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Huihan Li, Tianyu Gao, Manan Goenka, Danqi Chen


Abstract
Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human and gold-history evaluations disagree on model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.
Anthology ID:
2022.acl-long.555
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8074–8085
URL:
https://aclanthology.org/2022.acl-long.555
DOI:
10.18653/v1/2022.acl-long.555
Award:
 Outstanding Paper
Cite (ACL):
Huihan Li, Tianyu Gao, Manan Goenka, and Danqi Chen. 2022. Ditch the Gold Standard: Re-evaluating Conversational Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8074–8085, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Ditch the Gold Standard: Re-evaluating Conversational Question Answering (Li et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.555.pdf
Video:
 https://aclanthology.org/2022.acl-long.555.mp4
Code:
princeton-nlp/evalconvqa
Data:
CANARD, CoQA, QuAC