Pre-training LLMs using a human-like developmental data corpus

Pre-trained Large Language Models (LLMs) have shown success in a diverse set of language inference and understanding tasks. The pre-training stage exposes LLMs to a large corpus of raw textual data. The BabyLM shared task compares LLM pre-training to human language acquisition, where the number of tokens seen by a 13-year-old child is orders of magnitude smaller than the number seen by LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines, with different architectures, an evaluation of changes in performance across epochs, and reported pre-training metrics for the strict-small and strict tracks of the task. We also loosely replicate the RoBERTa baseline provided by the task organizers to gauge the robustness of training to hyperparameter selection and its replicability. We describe our submissions to the strict and strict-small tracks in this report.

Humans typically encounter fewer than 100 million tokens through language exposure by the time they are 13 years old (Warstadt et al., 2023). LLMs, on the other hand, parse tens of billions to trillions of tokens in their pre-training stage, typically from sources such as Wikipedia (Wikipedia contributors, 2004) and the BookCorpus (Zhu et al., 2015), which consist of different tokens than the ones seen by children. In this paper, we evaluate the capabilities of popular architectures on various tasks when trained on a number of tokens comparable to that seen by 13-year-old children. Such scaled-down pre-training has several potential benefits:
• A better sandbox for developing new LLM training techniques inspired by the cognitive science literature (Yiu et al., 2023).
• Robust evaluation of models on human behavioral signatures (Shah et al., 2023).
• Building plausible human cognition models using LLMs aligned to actual human actions (Park et al., 2022).

Key Contributions
Given the benefits of using scaled-down human-like pre-training data, our work focuses on the following aspects of the shared task:
1. Replication details: Can we replicate the results of the baselines given by the task organizers?
2. Can we understand the impact of more training epochs on the same architecture?
3. Providing each training checkpoint for the different model architectures to facilitate future modeling of development. All checkpoints can be found here.
We provide details of training and evaluation for the strict and strict-small tracks of this task.
2 Related Work

Cognitive science driven LLM architecture development
With the effort put into LM pre-training, learning frameworks informed by cognitive science have received increasing attention. For instance, unsupervised and adversarial pre-training methods have been employed to enhance the logical reasoning capabilities of language models (Pi et al., 2022b). Using pre-training to inject numerical (Pi et al., 2022a) and commonsense reasoning (Zhong et al., 2019) has also been explored recently. Huebner et al. (2021) have constructed pre-training paradigms using curriculum learning to show the advantages of incremental learning.

Pre-training with limited data
Previous experiments show that pre-training data size is positively correlated with the syntactic capabilities of RoBERTa in terms of generalization and robustness (Pérez-Mayos et al., 2021). However, it has been discovered that model performance gains come at a high financial and environmental cost (Tay et al., 2021). This justifies the appeal of small-scale pre-training with data limitations. There have also been explorations of how human-like data scales could improve our understanding of language acquisition and solidify current cognitive models (Dupoux, 2018).

3 Methodology

Models
We use the simple-transformers library (Rajapakse, 2019) to pre-train the models below from scratch.
The library uses the Huggingface trainer for pre-training. Note: We build new vocabularies for all models and, for certain models, limit the number of training epochs due to computational constraints.
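As a rough sketch of this setup, from-scratch pre-training with simple-transformers passes `model_name=None` so that a new vocabulary is built from the training files. The file names and the exact argument values below are illustrative assumptions, not the paper's actual configuration:

```python
def make_pretraining_args(epochs: int, lr: float, vocab_size: int = 52_000) -> dict:
    """Assemble a simple-transformers args dict for from-scratch pre-training."""
    return {
        "num_train_epochs": epochs,
        "learning_rate": lr,
        "vocab_size": vocab_size,        # a new vocabulary is built from the training files
        "overwrite_output_dir": True,
        "reprocess_input_data": True,
    }


if __name__ == "__main__":
    # Requires the simpletransformers package. model_name=None triggers
    # training from scratch; train_files is used to build the tokenizer.
    from simpletransformers.language_modeling import LanguageModelingModel

    args = make_pretraining_args(epochs=20, lr=4e-4)
    model = LanguageModelingModel("roberta", None, args=args,
                                  train_files="babylm_train.txt")
    model.train_model("babylm_train.txt", eval_file="babylm_dev.txt")
```

Swapping the first positional argument (e.g. to "distilbert" or "gpt2") selects a different architecture under the same training loop.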
• RoBERTa: We train the RoBERTa-base model (Liu et al., 2019) for comparison to the baseline given by the task organizers. This model is trained for 20 epochs on both datasets (strict and strict-small). The size of this model is roughly 125M parameters.
• DistilBERT (uncased): Because this model (Sanh et al., 2020) is smaller (roughly 66M parameters) and quicker to pre-train, we additionally train it for 60 epochs. This allows us to explore the impact of more training epochs on performance.
• GPT-2: We include a decoder-based architecture (Radford et al., 2019) in our pre-training to explore the impact of architecture type on the evaluation tasks. This model is similar in size to RoBERTa (117M parameters). We train it for 20 epochs due to computational constraints.
All of the checkpoints for the three architectures and the two tracks are uploaded to Huggingface (Wolf et al., 2020). Hyperparameters: We perform a grid search over the hyperparameters for all three architecture types, using a 0.5 GB subset of the training data for the search. The learning rate ranges from 5e-5 to 4e-4 across the searches, with weight decay in place but no early-stopping mechanism.
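A grid search of this kind simply enumerates every hyperparameter combination and pre-trains on the search subset with each one. The specific grid values below are hypothetical, chosen only to span the reported learning-rate range:

```python
from itertools import product


def hyperparameter_grid(learning_rates, weight_decays):
    """Enumerate every (learning rate, weight decay) combination to try."""
    return [
        {"learning_rate": lr, "weight_decay": wd}
        for lr, wd in product(learning_rates, weight_decays)
    ]


# Hypothetical grid spanning the reported learning-rate range (5e-5 to 4e-4).
grid = hyperparameter_grid([5e-5, 1e-4, 4e-4], [0.0, 0.01])
# Each configuration would then be used to pre-train on the 0.5 GB search
# subset, keeping the configuration with the best validation loss.
```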

Results
Table 2 shows the results obtained from the Dynabench submission portal. The individual results for each of the tasks in the different benchmarks are available in Tables 3, 4, 5, 6, and 7. Looking at these tables, we observe the following patterns:
1. Training for more epochs leads to better overall performance (compare 20 and 60 epochs of DistilBERT in Table 2).
2. Variation among architecture types exists when training is limited to the same number of epochs, but it is difficult to identify a definitively better architecture.
3. Tables 3, 4, 5, 6, and 7 show that pre-training (RoBERTa) is not robust to initialization, and the competition scores would greatly benefit from a warm-up or a grid search over different hyperparameters.
4. In most cases, pre-training improves performance over the majority-label baseline on the SuperGLUE tasks.
5. Tables 8 and 9 show that performance on the BLiMP tasks improves with more training epochs. While this runs counter to the conventional wisdom that performance saturates after one epoch (Biderman et al., 2023), our results hint that training saturation or stability may be a function of model size relative to the number of tokens seen.
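Comparisons across epochs like these rely on the released per-epoch checkpoints. A minimal sketch of loading one follows; the repository name and the epoch-to-revision naming scheme are hypothetical, not the actual identifiers used on Huggingface:

```python
def epoch_checkpoint(repo: str, epoch: int) -> tuple:
    """Map an epoch number to a (repo_id, revision) pair.

    The revision naming scheme here is a hypothetical convention.
    """
    return repo, f"epoch-{epoch}"


if __name__ == "__main__":
    # Requires the transformers package and network access.
    from transformers import AutoModelForMaskedLM

    repo_id, revision = epoch_checkpoint("babylm/distilbert-strict", 40)
    # from_pretrained accepts a `revision` argument selecting a specific
    # commit, branch, or tag of the model repository.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, revision=revision)
```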

Conclusions
We pre-train popular LLM architectures on the kind of textual data seen by children up to around 13 years of age. We show that pre-training paradigms like Masked Language Modeling and Causal Language Modeling lead to only minor variations in performance. Our results show that the models are not robust to weight initialization. Our work provides every checkpoint of the model architectures on Huggingface to facilitate future research. All checkpoints can be found here.

Ethical Considerations
All researchers in this study hold active responsible-conduct-of-research certifications. The models shared on Huggingface carry the same risks associated with any other Large Language Model.
The researchers in this study have tried to be mindful of the environment while performing the pre-training runs, and we hope that the publicly available checkpoints will help other researchers avoid the computational and environmental costs associated with repeated pre-training.

Computational Resources
The models are trained on Nvidia RTX 2080 GPUs with 12 GB RAM. The models are trained for nearly 975 GPU hours in total.

Table 2 :
Model scores on Dynabench

Table 3 :
Results for the SuperGLUE tasks

Table 4 :
Results for the BLiMP tasks

Table 5 :
Results for the BLiMP supplemental tasks

Table 9 :
Results for the BLiMP tasks across different epochs of the DistilBERT-base model architecture for the strict (100M token) track.