Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking

Dirk Väth; Pascal Tilli; Ngoc Thang Vu

doi:10.18653/v1/2021.emnlp-demo.14

Abstract

On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test generalization capabilities of models across multiple datasets, evaluating not just accuracy, but also performance in more realistic real-world scenarios such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty, to further explain model behavior. Interactive filtering facilitates discovery of problematic behavior, down to the data sample level. As proof of concept, we perform a case study on four models. We find that state-of-the-art VQA models are optimized for specific tasks or datasets, but fail to generalize even to other in-domain test sets; for example, they cannot recognize text in images. Our metrics allow us to quantify which image and question embeddings provide the most robustness to a model. All code is publicly available.

Anthology ID: 2021.emnlp-demo.14
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Heike Adel, Shuming Shi
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 114–123
URL: https://aclanthology.org/2021.emnlp-demo.14
DOI: 10.18653/v1/2021.emnlp-demo.14
Cite (ACL): Dirk Väth, Pascal Tilli, and Ngoc Thang Vu. 2021. Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 114–123, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking (Väth et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-demo.14.pdf
Video: https://aclanthology.org/2021.emnlp-demo.14.mp4
Code: patilli/vqa_benchmarking
Data: CLEVR, GQA, OK-VQA, TextVQA, Visual Question Answering
