On the Role of Corpus Ordering in Language Modeling

Ameeta Agrawal, Suresh Singh, Lauren Schneider, Michael Samuels


Abstract
Language models pretrained on vast corpora of unstructured text using a self-supervised learning framework are used in numerous natural language understanding and generation tasks. Many studies show that language acquisition in humans follows a rather structured simple-to-complex pattern. Guided by this intuition, curriculum learning, which trains computational models on samples in a meaningful order, such as processing easy samples before hard ones, has been shown to potentially reduce training time. The question remains whether curriculum learning can benefit the pretraining of language models. In this work, we perform comprehensive experiments involving multiple curriculum strategies, varying both the complexity criteria and the training schedules. Empirical results from training transformer language models on an English corpus, evaluated intrinsically as well as after fine-tuning across eight tasks from the GLUE benchmark, show consistent gains over conventional vanilla training. Interestingly, in our experiments, when evaluated after one epoch, the best model, which follows a document-level hard-to-easy curriculum, outperforms the vanilla model by 1.7 points (average GLUE score), and it takes the vanilla model twice as many training steps to reach comparable performance.
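The document-level curriculum described above amounts to sorting the pretraining corpus by a complexity estimate before feeding it to the model. A minimal sketch of that idea follows; the complexity metric here (mean word length as a crude difficulty proxy) is an illustrative assumption, not one of the paper's actual criteria:

```python
# Sketch of document-level curriculum ordering. The paper compares several
# complexity criteria and schedules; this toy version uses mean word length
# as a stand-in difficulty score, purely for illustration.

def complexity(doc: str) -> float:
    """Crude difficulty proxy: average word length of the document."""
    words = doc.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def order_corpus(docs: list[str], hard_to_easy: bool = True) -> list[str]:
    """Sort documents by the complexity proxy. Hard-to-easy by default,
    matching the best-performing curriculum reported in the abstract."""
    return sorted(docs, key=complexity, reverse=hard_to_easy)

docs = [
    "the cat sat on the mat",
    "thermodynamic equilibrium constraints notwithstanding",
    "a simple short text",
]
ordered = order_corpus(docs)  # hardest (longest average words) first
```

In an easy-to-hard curriculum one would simply pass `hard_to_easy=False`; the training loop then consumes `ordered` sequentially instead of drawing shuffled batches.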
Anthology ID:
2021.sustainlp-1.15
Volume:
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2021
Address:
Virtual
Editors:
Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasović, Sujith Ravi
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
142–154
URL:
https://aclanthology.org/2021.sustainlp-1.15
DOI:
10.18653/v1/2021.sustainlp-1.15
Bibkey:
Cite (ACL):
Ameeta Agrawal, Suresh Singh, Lauren Schneider, and Michael Samuels. 2021. On the Role of Corpus Ordering in Language Modeling. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 142–154, Virtual. Association for Computational Linguistics.
Cite (Informal):
On the Role of Corpus Ordering in Language Modeling (Agrawal et al., sustainlp 2021)
PDF:
https://aclanthology.org/2021.sustainlp-1.15.pdf
Video:
https://aclanthology.org/2021.sustainlp-1.15.mp4
Data:
GLUE, QNLI, WikiText-103, WikiText-2