An Analysis of Dataset Overlap on Winograd-Style Tasks

Ali Emami, Kaheer Suleman, Adam Trischler, Jackie Chi Kit Cheung


Abstract
The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data; it is both the largest corpus of its kind to date and has a significantly lower proportion of overlaps with current pretraining corpora.
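As a toy illustration of the kind of measurement the abstract describes, the sketch below scores a test instance by the fraction of its n-grams that also occur in a (pre)training corpus. This is an assumption for illustration only, not the paper's actual overlap procedure; the function names and the n-gram size are invented here.

```python
# Hypothetical sketch of n-gram overlap between a test instance and a
# pretraining corpus (NOT the procedure used in the paper).

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(instance, corpus_text, n=4):
    """Fraction of the instance's n-grams that also appear in the corpus."""
    inst = ngrams(instance.lower().split(), n)
    corp = ngrams(corpus_text.lower().split(), n)
    if not inst:
        return 0.0
    return len(inst & corp) / len(inst)

# Toy example: a WSC-style sentence against a tiny mock "corpus".
sentence = "The trophy does not fit in the suitcase because it is too big"
corpus = "the trophy does not fit in the brown suitcase and it was heavy"
print(round(overlap_score(sentence, corpus), 2))  # prints 0.4
```

An instance scoring near 1.0 under such a measure would count as heavily overlapping with the pretraining data, while one scoring near 0.0 would fall into the minimal-overlap bucket on which the abstract reports the accuracy drop.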
Anthology ID:
2020.coling-main.515
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
5855–5865
URL:
https://aclanthology.org/2020.coling-main.515
DOI:
10.18653/v1/2020.coling-main.515
PDF:
https://aclanthology.org/2020.coling-main.515.pdf
Data:
GLUE, WSC, WinoGrande