Do We Need to Create Big Datasets to Learn a Task?

Swaroop Mishra, Bhavdeep Singh Sachdeva


Abstract
Deep Learning research has been largely accelerated by the development of huge datasets such as ImageNet. The general trend has been to create big datasets to make a deep neural network learn. A huge amount of resources is being spent on creating these big datasets, developing models, training them, and iterating this process to dominate leaderboards. We argue that the trend of creating bigger datasets needs to be revised by better leveraging the power of pre-trained language models. Since language models have already been pre-trained on a huge amount of data and possess basic linguistic knowledge, there is no need to create big datasets to learn a task. Instead, we need to create a dataset that is sufficient for the model to learn various task-specific terminologies, such as ‘Entailment’, ‘Neutral’, and ‘Contradiction’ for NLI. As evidence, we show that RoBERTa is able to achieve near-equal performance with only 2% of the SNLI data. We also observe competitive zero-shot generalization on several OOD datasets. In this paper, we propose a baseline algorithm to find the optimal dataset for learning a task.
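To illustrate the small-data setting the abstract describes, the following is a minimal sketch of fine-tuning RoBERTa on a roughly 2% random subsample of SNLI, assuming the HuggingFace datasets and transformers libraries. The random subsampling and the hyperparameters shown are illustrative assumptions, not the paper's actual selection algorithm or training configuration.

# Hypothetical sketch: fine-tune roberta-base on a ~2% random subsample of SNLI.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

snli = load_dataset("snli")
# Drop unlabeled examples (label == -1) and take a ~2% random subsample of the training split.
train = snli["train"].filter(lambda ex: ex["label"] != -1).shuffle(seed=42)
train = train.select(range(int(0.02 * len(train))))
dev = snli["validation"].filter(lambda ex: ex["label"] != -1)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode(batch):
    # Encode premise-hypothesis pairs for 3-way NLI classification.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train = train.map(encode, batched=True)
dev = dev.map(encode, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

args = TrainingArguments(output_dir="roberta-snli-2pct",
                         per_device_train_batch_size=32,
                         num_train_epochs=3,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args, train_dataset=train, eval_dataset=dev)
trainer.train()
print(trainer.evaluate())

Under this sketch, the same fine-tuned model could then be evaluated zero-shot on out-of-distribution NLI sets such as ANLI to probe the generalization claim.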
Anthology ID:
2020.sustainlp-1.23
Volume:
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2020
Address:
Online
Editors:
Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, Thomas Wolf
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
169–173
URL:
https://aclanthology.org/2020.sustainlp-1.23
DOI:
10.18653/v1/2020.sustainlp-1.23
Cite (ACL):
Swaroop Mishra and Bhavdeep Singh Sachdeva. 2020. Do We Need to Create Big Datasets to Learn a Task?. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 169–173, Online. Association for Computational Linguistics.
Cite (Informal):
Do We Need to Create Big Datasets to Learn a Task? (Mishra & Sachdeva, sustainlp 2020)
PDF:
https://aclanthology.org/2020.sustainlp-1.23.pdf
Video:
https://slideslive.com/38939445
Data
ANLI, SNLI