Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

Nikita Nangia; Samuel Bowman

doi:10.18653/v1/P19-1449

Abstract

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks that has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, the state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.
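The headroom claim can be checked with simple arithmetic on the three averages quoted in the abstract. The sketch below is illustrative only: the function and constant names are hypothetical and not taken from the paper, and it uses only the aggregate scores stated above, not per-task results.

```python
# Average GLUE scores as quoted in the abstract (as of May 24, 2019).
LAUNCH_BASELINE = 70.0   # average score at benchmark launch
STATE_OF_THE_ART = 83.9  # best reported model average
HUMAN_ESTIMATE = 87.1    # conservative crowdsourced human average


def headroom(human: float, model: float) -> float:
    """Points by which the human estimate exceeds the model score."""
    return round(human - model, 1)


def fraction_of_gap_closed(launch: float, current: float, human: float) -> float:
    """Share of the launch-to-human gap that models have already covered."""
    return round((current - launch) / (human - launch), 3)


if __name__ == "__main__":
    print("Remaining headroom:", headroom(HUMAN_ESTIMATE, STATE_OF_THE_ART))
    print(
        "Fraction of gap closed:",
        fraction_of_gap_closed(LAUNCH_BASELINE, STATE_OF_THE_ART, HUMAN_ESTIMATE),
    )
```

Under these numbers the remaining headroom is 3.2 points, and models have covered roughly four fifths of the distance between the launch baseline and the human estimate.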

Anthology ID: P19-1449
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2019
Address: Florence, Italy
Editors: Anna Korhonen, David Traum, Lluís Màrquez
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4566–4575
URL: https://aclanthology.org/P19-1449/
DOI: 10.18653/v1/P19-1449
Cite (ACL): Nikita Nangia and Samuel R. Bowman. 2019. Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4566–4575, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark (Nangia & Bowman, ACL 2019)
PDF: https://aclanthology.org/P19-1449.pdf