On Losses for Modern Language Models

Stéphane Aroca-Ouellette, Frank Rudzicz


Abstract
BERT set many state-of-the-art results across varied NLU benchmarks by pre-training on two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP’s effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways of incorporating multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks – sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant – that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERTBase on the GLUE benchmark using fewer than a quarter of the training tokens.
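The multi-task pre-training framework referred to in the abstract combines the MLM objective with one or more auxiliary objectives. The sketch below shows, in PyTorch, one minimal way such losses could be combined; the head shapes, label conventions, and the equal-weight sum are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MultiTaskPretrainingHeads(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522, num_aux_labels=2):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)      # token-level MLM head
        self.aux_head = nn.Linear(hidden_size, num_aux_labels)  # sentence-level auxiliary head
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks positions to ignore

    def forward(self, token_states, pooled_state, mlm_labels, aux_labels):
        # Masked language modelling loss over token positions.
        mlm_logits = self.mlm_head(token_states)
        mlm_loss = self.loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)),
                                mlm_labels.view(-1))
        # Auxiliary task loss (e.g. sentence ordering) from the pooled sentence state.
        aux_loss = self.loss_fn(self.aux_head(pooled_state), aux_labels)
        # Equal-weight sum of the two losses; other weightings are possible.
        return mlm_loss + aux_loss

Summing per-task losses in this way lets an auxiliary signal such as sentence ordering be added or removed without changing the underlying encoder.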
Anthology ID:
2020.emnlp-main.403
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4970–4981
URL:
https://aclanthology.org/2020.emnlp-main.403
DOI:
10.18653/v1/2020.emnlp-main.403
PDF:
https://aclanthology.org/2020.emnlp-main.403.pdf
Video:
https://slideslive.com/38939128
Code:
StephAO/olfmlm
Data:
BookCorpus, GLUE, QNLI, SuperGLUE