Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman


Abstract
Large language models increasingly saturate existing task benchmarks, in some cases outperforming humans, leaving little headroom with which to measure further progress. Adversarial dataset creation, which builds datasets using examples that a target system outputs incorrect predictions for, has been proposed as a strategy to construct more challenging datasets, avoiding the more serious challenge of building more precise benchmarks by conventional means. In this work, we study the impact of applying three common approaches for adversarial dataset creation: (1) filtering out easy examples (AFLite), (2) perturbing examples (TextFooler), and (3) model-in-the-loop data collection (ANLI and AdversarialQA), across 18 different adversary models. We find that all three methods can produce more challenging datasets, with stronger adversary models lowering the performance of evaluated models more. However, the resulting ranking of the evaluated models can also be unstable and highly sensitive to the choice of adversary model. Moreover, we find that AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the examples that are most contentious for humans. We recommend that researchers tread carefully when using adversarial methods for building evaluation datasets.
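The core idea in the abstract, keeping the examples an adversary model gets wrong and then checking how the evaluated models rank on what remains, can be illustrated with a minimal sketch. This is a simplified single-pass filter, not AFLite, TextFooler, or the paper's actual pipeline. All names here (adversarial_filter, rank_models, make_model) and the toy examples are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Example = Tuple[str, int]      # (example id, gold label)
Model = Callable[[str], int]   # maps an example id to a predicted label

# Toy data, invented for illustration only (not from the paper).
examples: List[Example] = [
    ("e0", 1), ("e1", 0), ("e2", 1), ("e3", 0), ("e4", 1), ("e5", 0),
]
gold: Dict[str, int] = dict(examples)


def make_model(wrong: Set[str]) -> Model:
    """Hypothetical model: predicts the gold label except on ids in `wrong`."""
    return lambda x: 1 - gold[x] if x in wrong else gold[x]


def adversarial_filter(data: List[Example], adversary: Model) -> List[Example]:
    """Keep only the examples the adversary predicts incorrectly."""
    return [(x, y) for x, y in data if adversary(x) != y]


def accuracy(model: Model, data: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    if not data:
        return float("nan")
    return sum(model(x) == y for x, y in data) / len(data)


def rank_models(models: Dict[str, Model], data: List[Example]) -> List[str]:
    """Return model names ordered from highest to lowest accuracy."""
    scores = {name: accuracy(m, data) for name, m in models.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Two evaluated models and two candidate adversaries (all hypothetical).
models = {"A": make_model({"e0"}), "B": make_model({"e1", "e2"})}
adv1 = make_model({"e0", "e3"})
adv2 = make_model({"e1", "e2", "e4"})

print(rank_models(models, examples))                            # full set
print(rank_models(models, adversarial_filter(examples, adv1)))  # filtered by adv1
print(rank_models(models, adversarial_filter(examples, adv2)))  # filtered by adv2
```

In this toy setup, model A ranks first on the full set and on the set filtered by adv2, but second on the set filtered by adv1. That mirrors, in miniature, the sensitivity to the choice of adversary that the abstract describes.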
Anthology ID:
2022.dadc-1.8
Volume:
Proceedings of the First Workshop on Dynamic Adversarial Data Collection
Month:
July
Year:
2022
Address:
Seattle, WA
Editors:
Max Bartolo, Hannah Kirk, Pedro Rodriguez, Katerina Margatina, Tristan Thrush, Robin Jia, Pontus Stenetorp, Adina Williams, Douwe Kiela
Venue:
DADC
Publisher:
Association for Computational Linguistics
Pages:
62–62
URL:
https://aclanthology.org/2022.dadc-1.8
DOI:
10.18653/v1/2022.dadc-1.8
Cite (ACL):
Jason Phang, Angelica Chen, William Huang, and Samuel R. Bowman. 2022. Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair. In Proceedings of the First Workshop on Dynamic Adversarial Data Collection, pages 62–62, Seattle, WA. Association for Computational Linguistics.
Cite (Informal):
Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair (Phang et al., DADC 2022)
PDF:
https://aclanthology.org/2022.dadc-1.8.pdf
Data
ANLI, AdversarialQA, CosmosQA, GLUE, MultiNLI, SNLI, SQuAD, SWAG