When Do You Need Billions of Words of Pretraining Data?

NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these different measures of model ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that these LMs require only about 10M to 100M words to learn to reliably encode most syntactic and semantic features we test. They need a much larger quantity of data in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other, unidentified, forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.


Introduction
Pretrained language models (LMs) like BERT and RoBERTa have become ubiquitous in NLP. New models require massive datasets of tens or even hundreds of billions of words (Brown et al., 2020) to improve on existing models on language understanding benchmarks like GLUE (Wang et al., 2018). Much recent work has used probing methods to evaluate what these models do and do not learn (Belinkov and Glass, 2019; Tenney et al., 2019b; Rogers et al., 2020; Ettinger, 2020). Since most of these works only focus on models pretrained on a fixed data volume (usually billions of words), many interesting questions regarding the effect of the amount of pretraining data remain unanswered: What have data-rich models learned that makes them so effective on downstream tasks? How much pretraining data is required for LMs to learn different grammatical features and linguistic phenomena? Which of these skills do we expect to improve when we scale pretraining past 30 billion words? Which aspects of grammar can be learned from data volumes on par with the input to human learners, around 10M to 100M words (Hart and Risley)?
Figure 1: For each method, we compute overall performance for each RoBERTa model tested as the macro average over sub-tasks' performance after normalization. We fit an exponential curve, which we scale to have an initial value of 0 and an asymptote at 1. Classifier and MDL probing mainly test models' encoding of linguistic features; BLiMP tests models' understanding of linguistic phenomena; LAMA tests factual knowledge; SuperGLUE is a suite of conventional NLU tasks.
With these questions in mind, we evaluate and probe the MiniBERTas (Warstadt et al., 2020b), a group of RoBERTa models pretrained on 1M, 10M, 100M, and 1B words, and RoBERTa BASE pretrained on about 30B words, using five methods: First, we use standard classifier probing on the edge probing suite of NLP tasks (Tenney et al., 2019b) to measure the quality of the syntactic and semantic features that can be extracted by a downstream classifier with each level of pretraining. Second, we apply minimum description length (MDL) probing (Voita and Titov, 2020) to the edge probing suite, with the goal of quantifying the accessibility of these features. Third, we test the models' knowledge of various syntactic phenomena using unsupervised acceptability judgments on the BLiMP suite (Warstadt et al., 2020a). Fourth, we probe the models' world knowledge and commonsense knowledge using unsupervised language model knowledge probing with the LAMA suite (Petroni et al., 2019). Finally, we fine-tune the models on five tasks from SuperGLUE to measure their ability to solve conventional NLU tasks.
For each evaluation method, we fit an exponential learning curve to the results as a function of the amount of pretraining data, shown in Figure 1. We have two main findings: First, the results of classifier probing, MDL probing, and unsupervised relative acceptability judgment (BLiMP) show that the linguistic knowledge of models pretrained on 100M words and 30B words is similar, as is the description length of linguistic features. Second, RoBERTa requires billions of words of pretraining data to effectively acquire factual knowledge and to make substantial improvements in performance on downstream NLU tasks. From these results, we conclude that there are skills critical to solving downstream NLU tasks that LMs can only acquire with billions of words of pretraining data. Future work will likely need to look beyond core linguistic knowledge if we are to better understand and advance the abilities of large language models.

Methods
We probe the MiniBERTas, a set of 12 RoBERTa models pretrained from scratch by Warstadt et al. (2020b) on 1M, 10M, 100M, and 1B words; the publicly available RoBERTa BASE, which is pretrained on about 30B words [1]; and three RoBERTa BASE models with randomly initialized parameters.
Descriptions of the five evaluation methods appear in the subsequent sections. [2] In each experiment, we test all 16 models on each task involved. To show the overall trend of improvement, we use non-linear least squares to fit an exponential learning curve to the results. [3] We upsample the RoBERTa BASE results in the regression in order to have an equal number of results for each data quantity. We use a four-parameter exponential learning curve of the kind used to capture diminishing improvement in performance as a function of the number of practice trials (Heathcote et al., 2000; Leibowitz et al., 2010):
$$E(P_n) = P_\infty - (P_\infty - P_0) \cdot e^{-\alpha \cdot n^{\beta}}$$
where E(P_n) is the expected performance after n trials, [4] P_0 and P_∞ are the initial and asymptotic performance, and α and β are coefficients that translate and dilate the curve in the log domain.
We plot the results in a figure for each task, where the y-axis is the score and the x-axis is the amount of pretraining data. [5] For some plots, we use min-max normalization to adjust the results into the range [0, 1], where 0 and 1 are the inferred values of P_0 and P_∞, respectively. [6]
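The curve and the normalization described above can be illustrated with a small sketch. It assumes the four-parameter curve takes the form E(P_n) = P_∞ − (P_∞ − P_0) · exp(−α · n^β), which is one form consistent with the description of α and β as translating and dilating the curve in the log domain. The parameter values are hypothetical and chosen only for illustration; in practice they would come from a non-linear least-squares fit such as SciPy's curve_fit. The helper `words_for_fraction` is likewise an illustrative name: it inverts the normalized curve to find the data volume at which a given share of the attainable improvement is reached.

```python
import math


def learning_curve(n, p0, p_inf, alpha, beta):
    """Expected performance after n words, under the assumed exponential form."""
    return p_inf - (p_inf - p0) * math.exp(-alpha * n ** beta)


def normalize(score, p0, p_inf):
    """Min-max normalization: p0 maps to 0 and p_inf maps to 1."""
    return (score - p0) / (p_inf - p0)


def words_for_fraction(fraction, alpha, beta):
    """Invert the normalized curve 1 - exp(-alpha * n**beta) = fraction for n."""
    return (-math.log(1.0 - fraction) / alpha) ** (1.0 / beta)


# Hypothetical parameters (NOT fitted values from the paper).
P0, P_INF, ALPHA, BETA = 0.30, 0.90, 0.002, 0.45

for n in (1e6, 1e7, 1e8, 1e9, 3e10):
    raw = learning_curve(n, P0, P_INF, ALPHA, BETA)
    norm = normalize(raw, P0, P_INF)
    print(f"{n:.0e} words: raw={raw:.3f} normalized={norm:.3f}")

# Data volume at which 90% of the attainable improvement is reached.
n90 = words_for_fraction(0.9, ALPHA, BETA)
print(f"90% of attainable improvement at about {n90:.3g} words")
```

Working with the normalized curve makes tasks with different score ranges comparable, which is what allows per-task curves to be averaged into a single overall curve.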

Classifier Probing
We use the widely adopted probing approach of Ettinger et al. (2016), Adi et al. (2017), and others, which we call classifier probing, to test the extent to which linguistic features like part-of-speech and coreference are encoded in the frozen model representations. We adopt the ten probing tasks in the edge probing suite (Tenney et al., 2019b). [7]
Classifier probing has recently come under scrutiny. Hewitt and Liang (2019) and Voita and Titov (2020) caution that the results depend on the complexity of the probe, and so do not precisely reveal the quality of the representations. However, we see two advantages to this method: First, the downstream classifier setting and F1 evaluation metric make these experiments easier to interpret in the context of earlier results than results from relatively novel probing metrics like minimum description length. Second, we focus on relative differences between models rather than absolute performance, and include a randomly initialized baseline model in the comparison. When the model representations are random, the probe's performance reflects the probe's own ability to solve the target task. Therefore, any improvements over this baseline value are due to the representation rather than the probe itself.
Figure 2: In each subplot we also plot the overall edge-probing performance, which we calculate for each MiniBERTa as its average F1 score on the 10 edge-probing tasks (after normalization). For context, we also plot BERT LARGE performance for each task as reported by Tenney et al. (2019a).
[1] The MiniBERTas' training data is randomly sampled from Wikipedia and Smashwords in a ratio of 3:1. These two datasets are what Devlin et al. (2019) use to pretrain BERT and represent a subset of the data used to pretrain RoBERTa. RoBERTa BASE's training data also includes news and web data in addition to Wikipedia and Smashwords. Warstadt et al. ran pretraining 25 times with varying hyperparameter values and model sizes for the 1M-, 10M-, and 100M-word settings, and 10 times for the 1B-word setting. All the models were pretrained with early stopping on validation set perplexity. For each dataset size, they released the three models with the lowest validation set perplexity, yielding 12 models in total.
[2] Code: https://github.com/nyu-mll/pretraining-learning-curves
[3] We use SciPy's curve_fit implementation.
[4] In our case, a trial is one word of pretraining.
[5] We plot the no-pretraining random baseline with an x-value of 1.
[6] The unnormalized results are included in the appendix.
[7] Task data sources: Part-of-Speech, Constituents, Entities, SRL, and OntoNotes coref. from Weischedel et al. (2013)
Task formulation and training
Following Tenney et al., we use attention pooling to generate representation(s) of the token span(s) involved in the task and train an MLP that predicts whether a given label correctly describes the input span(s). We adopt the "mix" representation approach described in the paper. To train the probes, we use the same hyperparameters used in Tenney et al. and tune the batch size and learning rate.

Results
We plot results in Figure 2. From the single-task curves we conclude that most of the feature learning occurs with <100M words of pretraining data. Based on the best-fit curve, we can estimate that 90% of the attainable improvements in overall performance are achieved with <20M words. Most plots show broadly similar learning curves, which rise sharply with less than 1M words of pretraining data, reach the point of fastest growth (in the log domain) around 1M words, and are nearly saturated with 100M words. The most notable exception to this pattern is the Winograd task, which only rises significantly between 1B and 30B words of pretraining data. As the Winograd task is designed to test commonsense knowledge and reasoning, the results suggest that these features require more data to encode than syntactic and semantic ones, with the caveat that the dataset is smaller than the other edge probing tasks, and results on Winograd tasks are highly sensitive to factors such as task formulation (Liu et al., 2020).
We observe some general differences between different types of tasks. Figure 3 shows the aggregated learning curves of syntactic, semantic, and commonsense tasks. The syntactic learning curve rises slightly earlier than the semantic one, and 90% of the improvements in syntactic learning can be made with about 10M words, while the semantic curve still rises slightly after 100M. This is not surprising, as semantic computation is generally thought to depend on syntactic representations.

Minimum Description Length Probing
In this experiment, we study the MiniBERTas with MDL probing (Voita and Titov, 2020), with the goal of revealing not only the total amount of feature information extracted by the probe, but also the effort taken by the probe to extract the features. MDL measures the minimum number of bits needed to transmit the labels for a given task given that both the sender and the receiver have access to the pretrained model's encoding of the data.
A well-trained decoder model can help extract labels from the representations and thus reduce the number of bits needed to transmit the labels. Since the model itself will also need to be transmitted, the total description length is a sum of two terms: The data codelength is the number of bits needed to transmit the labels assuming the receiver has the trained decoder model, i.e. the cross-entropy loss of the decoder. The model codelength is the number of bits needed to transmit the decoder parameters.
We follow Voita and Titov's online code estimation of MDL, where the decoder is implicitly transmitted. As in Section 3, we train decoders using the same hyperparameter settings and task definitions as Tenney et al. (2019b).

Results
We plot the online code results in Figure 4. The overall codelength shows a similar trend to edge probing: Most of the reduction in feature codelength is achieved with fewer than 100M words. MDL for syntactic features decreases even sooner. Results for Winograd are idiosyncratic, probably due to the failure of the probes to learn the task.
The changes in model codelength and data codelength are shown on the bar plots in Figure 4. We compute the data codelength following Voita and Titov (2020) using the training set loss of a classifier trained on the entire training set, and the model codelength is the total codelength minus the data codelength. The monotonically decreasing data codelength simply reflects the fact that the more data rich RoBERTa models have smaller loss. When it comes to the model codelength, however, we generally observe the global minimum for the randomly initialized models (i.e., at "None"). This is expected, and intuitively reflects the fact that a decoder trained on random representations would provide little information about the labels, and so it would be optimal to transmit a very simple decoder. On many tasks, the model codelength starts to decrease when the pretraining data volume exceeds a certain amount. However, this trend is not consistent across tasks and the effect is relatively small.
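The decomposition above can be made concrete with a toy sketch of online (prequential) coding. The first block of labels is sent with a uniform code, and each later block is sent using a probe trained on everything transmitted so far. The "probe" here is an assumption made purely for illustration: a Laplace-smoothed label-frequency model that ignores the representations entirely, standing in for the MLP decoder. The block fractions and the toy labels are also hypothetical.

```python
import math
from collections import Counter


def train_probe(labels, num_classes):
    """Stand-in probe: Laplace-smoothed label frequencies (ignores inputs)."""
    counts = Counter(labels)
    total = len(labels) + num_classes
    return [(counts[k] + 1) / total for k in range(num_classes)]


def codelength(probs, labels):
    """Bits needed to transmit `labels` under predictive distribution `probs`."""
    return -sum(math.log2(probs[y]) for y in labels)


def online_codelength(labels, num_classes, fractions):
    """Online code: uniform code for the first block, then each block is
    encoded with a probe trained on all previously transmitted labels."""
    n = len(labels)
    cuts = [round(f * n) for f in fractions]  # block ends; the last must equal n
    total_bits = cuts[0] * math.log2(num_classes)
    for start, end in zip(cuts, cuts[1:]):
        probe = train_probe(labels[:start], num_classes)
        total_bits += codelength(probe, labels[start:end])
    return total_bits


# Toy binary labels with a 3:1 class imbalance (illustrative only).
labels = [0, 0, 0, 1] * 10
num_classes = 2

total_bits = online_codelength(labels, num_classes, (0.1, 0.2, 0.4, 0.8, 1.0))
# Data codelength: loss of a probe trained on the entire training set.
data_bits = codelength(train_probe(labels, num_classes), labels)
# Model codelength: total codelength minus data codelength.
model_bits = total_bits - data_bits
uniform_bits = len(labels) * math.log2(num_classes)

print(f"uniform={uniform_bits:.2f} total={total_bits:.2f} "
      f"data={data_bits:.2f} model={model_bits:.2f}")
```

A representation that makes the labels easy to predict from few examples shortens the later blocks, so the total codelength reflects both final accuracy and how quickly the probe gets there.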

Unsupervised Grammaticality Judgement
We use the BLiMP benchmark (Warstadt et al., 2020a) to test models' knowledge of individual grammatical phenomena in English. BLiMP is a challenge set of 67 tasks, each containing 1000 minimal pairs of sentences that highlight a particular morphological, syntactic, or semantic phenomenon. Minimal pairs in BLiMP consist of two sentences that differ only by a single edit, but contrast in grammatical acceptability. A language model classifies a minimal pair correctly if it assigns a higher probability to the acceptable sentence. Since RoBERTa is a masked language model (MLM), we measure pseudo log-likelihood (Wang and Cho, 2019) to score sentences (Salazar et al., 2020).
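The scoring procedure can be sketched as follows. Pseudo log-likelihood masks one token at a time and sums the log-probability the model assigns to each true token given the rest of the sentence. The function `toy_masked_log_prob` is a hypothetical stand-in for a real masked LM such as RoBERTa: it simply prefers verbs that agree in number with the preceding noun. The word lists and example pairs are likewise invented for illustration and are not BLiMP items.

```python
import math

SINGULAR = {"cat", "dog"}
PLURAL = {"cats", "dogs"}


def toy_masked_log_prob(tokens, i):
    """Hypothetical masked LM: log P(tokens[i] | all other tokens)."""
    token = tokens[i]
    if token in ("is", "are") and i > 0:
        subject = tokens[i - 1]
        agrees = (token == "is" and subject in SINGULAR) or (
            token == "are" and subject in PLURAL
        )
        return math.log(0.9 if agrees else 0.1)
    return math.log(0.5)  # uninformative score for every other token


def pseudo_log_likelihood(tokens, masked_log_prob):
    """Sum of per-token log-probabilities, masking one position at a time."""
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))


def minimal_pair_accuracy(pairs, masked_log_prob):
    """Fraction of pairs where the acceptable sentence gets the higher score."""
    correct = sum(
        pseudo_log_likelihood(good, masked_log_prob)
        > pseudo_log_likelihood(bad, masked_log_prob)
        for good, bad in pairs
    )
    return correct / len(pairs)


# Each pair is (acceptable, unacceptable) and differs by a single edit.
pairs = [
    ("the cat is here".split(), "the cat are here".split()),
    ("the dogs are here".split(), "the dogs is here".split()),
]

accuracy = minimal_pair_accuracy(pairs, toy_masked_log_prob)
print(f"accuracy = {accuracy:.2f}")
```

Because the two sentences in a pair differ by one edit, most per-token terms are close to equal and the comparison is driven by the tokens around the edit.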

Results
We plot learning curves for BLiMP in Figure 5. Warstadt et al. organize the 67 tasks in BLiMP into 12 categories based on the phenomena tested and for each category we plot the average accuracy for the tasks in the category. We do not normalize results in this plot. For the no-data baseline, we plot chance accuracy of 50% rather than making empirical measurements from random RoBERTa models. We find the greatest improvement in overall BLiMP performance between 1M and 100M words of pretraining data. With 100M words, sensitivity to contrasts in acceptability overall is within 9 accuracy points of humans, and improves only 6 points with additional data. This shows that substantial knowledge of many grammatical phenomena can be acquired from 100M words of raw text.
We also observe significant variation in how much data is needed to learn different phenomena. We see the steepest learning curves on agreement phenomena, with nearly all improvements occurring between 1M and 10M words. For phenomena involving wh-dependencies, i.e. filler-gap dependencies and island effects, we observe shallow and delayed learning curves with 90% of possible improvements occurring between 1M and 100M words. The relative difficulty of wh-dependencies can probably be ascribed to the long-distance nature and lower frequency of those phenomena. We also observe that the phenomena tested in the quantifiers category are never effectively learned, even by RoBERTa BASE. These phenomena include subtle semantic contrasts (for example, Nobody ate {more than, *at least} two cookies), which may involve difficult-to-learn pragmatic knowledge (Cohen and Krifka, 2014).
[...] (Speer and Havasi, 2012), and SQuAD (Rajpurkar et al., 2016). The Google-RE and T-REx tasks are each divided into three sub-tasks.

Results
We plot the results on LAMA in Figure 6. [...] data may be needed for the model to be exposed to relevant factual knowledge. The learning curves for many LAMA tasks do not show clear signs of saturation in the range of 0 to 30B words, suggesting further improvements are likely with much larger data quantities. Among LAMA tasks, ConceptNet most directly tests commonsense knowledge. The steep slope of the ConceptNet curve between 100M and 30B words of pretraining data and the large precision jump (> 0.05) from 1B to 30B show that increasing the pretraining data to over 1B words significantly improves the LM's commonsense knowledge, which explains the shape of the Winograd coref. learning curve in Section 3.

Fine-tuning on NLU Tasks
SuperGLUE is a benchmark suite of eight classification-based language-understanding tasks. We test each MiniBERTa on five SuperGLUE tasks on which we expect to see significant variation at these scales. The hyperparameter search range used for each task is described in the appendix.

Results
We plot the results on the selected SuperGLUE tasks in Figure 7. Improvements in SuperGLUE performance require a relatively large volume of pretraining data. For most tasks, the point of fastest improvement in our interpolated curve occurs with more than 1B words. None of the tasks (with the possible exception of CommitmentBank) show any significant sign of saturation at 30B words. This suggests that some key NLU skills are not learnt with fewer than billions of words, and that models are likely to continue improving substantially on these tasks given 10 to 100 times more pretraining data.

Discussion
Figure 1 plots the overall learning curves for these five methods together. The most striking result is that good NLU task performance requires far more data than achieving good representations for linguistic features. Classifier probing, MDL probing, and acceptability judgment performance all improve rapidly between 1M and 10M words and show little improvement beyond 100M words, while performance on the NLU tasks in SuperGLUE appears to improve most rapidly with over 1B words and will likely continue improving at larger data scales. While the linguistic features we test are undoubtedly needed to robustly solve most NLU tasks, a model that can extract and encode a large proportion of these features may still perform poorly on SuperGLUE. What drives improvements in NLU task performance at larger data scales remains an open question.
Factual knowledge may play a large role in explaining SuperGLUE performance. This hypothesis is backed up by results from the Winograd edge-probing task (Figure 2) and the LAMA tasks (Figure 6), which suggest that most of the improvements in the model's world and commonsense knowledge are made with over 100M words. However, while the LAMA learning curve shows signs of slowing between 1B and 30B words, the SuperGLUE curve does not.
Another possible explanation is that linguistic features encoded by a model may not be easily accessible during fine-tuning. Warstadt et al. (2020b) found that RoBERTa can learn to reliably extract many linguistic features with little pretraining data, but requires billions of words of pretraining data before it uses those features preferentially when generalizing.
In light of Warstadt et al.'s findings, we had initially hypothesized that feature accessibility as measured by MDL might show a shallower or later learning curve than standard classifier probing. [13] Our findings do not support this hypothesis: Figure 1 shows no substantial difference between the classifier probing and MDL probing curves.
However, we do not totally rule out the possibility that linguistic feature accessibility continues to improve with massive pretraining sets. There are potential modifications to Voita and Titov's approach that could more faithfully estimate feature accessibility. First, although RoBERTa is actually fine-tuned in most applications, we and Voita and Titov measure MDL taking the outputs of the frozen RoBERTa model as input to a trainable MLP decoder. It may be more relevant to measure MDL by fine-tuning the entire model (Lovering et al., 2021). Second, MDL actually estimates the information content of a particular dataset, rather than the feature itself. Whitney et al. (2020) propose an alternative to MDL that measures feature complexity in a way that does not depend on the size of the dataset.

Related Work
Probing neural network representations has been an active area of research in recent years (Belinkov and Glass, 2019; Rogers et al., 2020). With the advent of large pretrained Transformers like BERT (Devlin et al., 2019), numerous papers have used classifier probing methods to attempt to locate linguistic features in learned representations with striking positive results (Tenney et al., 2019b; Hewitt and Manning, 2019). However, another thread has found problems with many probing methods: Classifier probes can learn too much from training data (Hewitt and Liang, 2019) and can fail to distinguish features that are extractable from features that are actually used when generalizing on downstream tasks (Voita and Titov, 2020; Pimentel et al., 2020; Elazar et al., 2020). Moreover, different probing methods often yield contradictory results (Warstadt et al., 2019).
There have also been a few earlier studies investigating the relationship between pretraining data volume and linguistic knowledge in language models. Studies of unsupervised acceptability judgments find fairly consistent evidence of rapid improvements in linguistic knowledge up to about 10M words of pretraining data, after which improvements slow down for most phenomena. van Schijndel et al. (2019) find large improvements in knowledge of subject-verb agreement and reflexive binding up to 10M words, and little improvement between 10M and 80M words. Hu et al. (2020) find that GPT-2 trained on 42M words performs roughly as well on a syntax benchmark as a similar model trained on 100 times that amount. Other studies have investigated how one model's linguistic knowledge changes during the training process, as a function of the number of updates (Saphra and Lopez, 2019; Chiang et al., 2020). Raffel et al. (2020) also investigate how performance on SuperGLUE (and other downstream tasks) improves with pretraining dataset size between about 8M and 34B tokens. In contrast to our findings, they find that models with around 500M tokens of pretraining data can perform similarly on downstream tasks to models with 34B tokens. However, there are many differences in our settings that may lead to this divergence. For example, they pretrain for a fixed number of iterations (totaling 34B token updates), whereas the MiniBERTas we use were pretrained with early stopping. They also use prefix prompts in their task formulations and adopt an encoder-decoder architecture, so their model has roughly twice the number of parameters of the largest model we evaluate.
[13] They measure RoBERTa's preference for linguistic features over surface features during fine-tuning on ambiguous classification tasks.
There is also some recent work that investigates the effect of pretraining data size in other languages. Micheli et al. (2020) pretrain BERT-based language models on 10MB, 100MB, 500MB, 1GB, 2GB, and 4GB of French text and test them on a question answering task. They find that the French MLM pretrained on 100MB of raw text has similar performance on the task to the ones pretrained on larger datasets, and that corpus-specific self-supervised learning does not make a significant difference. Martin et al. (2020) also show that French MLMs can already learn a lot from small-scale pretraining.
Concurrent work (Liu et al., 2021) probes RoBERTa models pretrained for different numbers of iterations using a set of probing tasks similar to ours. They find that linguistic abilities are acquired fastest, world and commonsense knowledge learning takes more iterations, and reasoning abilities are never stably acquired. Both studies show that linguistic knowledge is easier to learn than factual knowledge.

Conclusion
We track several aspects of RoBERTa's ability as pretraining data increases. We find that ability in syntax and semantics largely saturates after only 10M to 100M words of pretraining data, on par with the data available to human learners, while learning factual knowledge requires much more data. We also find that scaling pretraining data size past billions of words significantly improves NLU performance, though we cannot fully explain what abilities drive this improvement. Answering this question could be a stepping stone to more data-efficient models.

Acknowledgments
This material is based upon work supported by the National Science Foundation under grant no. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We would like to thank Udit Arora, Jason Phang, Clara Vania, and ML² for feedback on an earlier draft. Thanks also to Kyunghyun Cho, Tal Linzen, Grusha Prasad, and Emin Orhan for suggestions regarding the exponential learning curve, and to Elena Voita, Ian Tenney, and Haokun Liu for the discussion about the implementation of the probing methods.

Ethical Considerations
There are several ethical reasons to study LMs with limited pretraining data. Training massive LMs like RoBERTa from scratch comes with non-trivial environmental costs (Strubell et al., 2019), and they are expensive to train, limiting contributions to pretraining research from scientists in lower-resource contexts. By evaluating LMs with limited pretraining, we demonstrate that smaller LMs match massive ones in performance in many respects. We also identify a clear gap in our knowledge regarding why extensive pretraining is effective. Answering this question could lead to more efficient pretraining and ultimately reduce environmental costs and make NLP more accessible. On the other hand, there is a danger that our work, by projecting substantial gains in model performance by increasing pretraining size, could legitimize and encourage the trend of ever growing datasets.
Massive LMs also replicate social biases present in training data (Nangia et al., 2020). By establishing benchmarks for smaller LMs and highlighting their efficacy for certain purposes, we hope to spur future work that takes advantage of smaller pretraining datasets to carefully curate the data distribution, as advocated by Bender et al. (2021), in order to build LMs that do less to reproduce harmful biases and are more inclusive of minority dialects.