Evaluating Question Answering Evaluation

Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner


Abstract
As the complexity of question answering (QA) datasets evolves, moving away from restricted formats like span extraction and multiple-choice (MC) to free-form answer generation, it is imperative to understand how well current metrics perform in evaluating QA. This is especially important as existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity and have a number of well-known drawbacks. In this work, we study the suitability of existing metrics in QA. For generative QA, we show that while current metrics do well on existing datasets, converting multiple-choice datasets into free-response datasets is challenging for current metrics. We also look at span-based QA, where F1 is a reasonable metric. We show that F1 may not be suitable for all extractive QA tasks, depending on the answer types. Our study suggests that while current metrics may be suitable for existing QA datasets, they limit the complexity of QA datasets that can be created. This is especially true in the context of free-form QA, where we would like our models to be able to generate more complex and abstractive answers, thus necessitating new metrics that go beyond n-gram based matching. As a step towards a better QA metric, we explore using BERTScore, a recently proposed metric for evaluating translation, for QA. We find that although it fails to provide stronger correlation with human judgements, future work focused on tailoring a BERT-based metric to QA evaluation may prove fruitful.
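
For reference, the span-based F1 the abstract refers to is typically computed as token-level overlap between the predicted and gold answers, as in the official SQuAD evaluation script. The sketch below is a minimal Python version under that assumption; it omits the usual answer normalization (article and punctuation stripping) and is not taken from the paper's own code.

    from collections import Counter

    def token_f1(prediction: str, reference: str) -> float:
        """Token-overlap F1 in the style of the SQuAD extractive-QA metric."""
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        if not pred_tokens or not ref_tokens:
            # Both empty counts as a match; one empty counts as a miss.
            return float(pred_tokens == ref_tokens)
        # Tokens shared between prediction and reference, with multiplicity.
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    # A correct but differently worded answer gets no credit under lexical
    # overlap, which is the kind of weakness the paper studies for free-form QA.
    print(token_f1("the 16th president of the US", "Abraham Lincoln"))  # 0.0
    print(token_f1("Abraham Lincoln", "Abraham Lincoln"))               # 1.0

BERTScore, by contrast, scores candidate-reference pairs with contextual embedding similarity rather than exact token overlap, which is why the paper examines it as a possible replacement for n-gram metrics.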
Anthology ID: D19-5817
Volume: Proceedings of the 2nd Workshop on Machine Reading for Question Answering
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, Danqi Chen
Venue: WS
Publisher: Association for Computational Linguistics
Pages: 119–124
URL: https://aclanthology.org/D19-5817
DOI: 10.18653/v1/D19-5817
Cite (ACL): Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating Question Answering Evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119–124, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): Evaluating Question Answering Evaluation (Chen et al., 2019)
PDF: https://aclanthology.org/D19-5817.pdf