ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding



Introduction
Over the past few years, substantial effort has gone into pretraining large language models (LLMs) at ever larger scales (Brown et al., 2020; Raffel et al., 2019; Chowdhery et al., 2022; Hoffmann et al., 2022). While the focus is often on increasing the number of parameters, dataset sizes have also grown significantly. By contrast, there has been minimal progress in pretraining on smaller data scales that are comparable to how humans learn language.
Exploring pretraining on a smaller scale can serve as a trial area for developing original techniques that boost data effectiveness. These techniques can be scaled up to larger datasets utilized in real-world natural language processing (NLP) scenarios and employed to enhance current methods for modeling low-resource languages.
The BabyLM challenge (Warstadt et al., 2023) has been created to address the gap in research on pretraining for small-scale language models. Our focus will be on a limited corpus of approximately 10 million words, which includes child-directed speech, transcribed speech from various sources, children's books, and Wikipedia data.
We trained more than 180 BabyBERTa (Huebner et al., 2021) models in different sizes and hyperparameters to determine how well language models learn and understand language. Our findings showed that scaling the model and data resulted in significantly better outcomes compared to baseline models. All models are trained on the strict-small portion of the challenge.

Related Work
In the field of natural language processing (NLP), there has been a lot of research on data-efficient language models. These models aim to achieve high accuracy in language tasks while using less training data than their larger counterparts. One way to create data-efficient language models is to reduce the number of model parameters while maintaining high performance. For instance, DistilBERT (Sanh et al., 2019) is a smaller and faster version of the popular BERT model. It was trained by distilling knowledge from the larger model into a smaller version. TinyBERT (Jiao et al., 2019), on the other hand, was designed for low-resource environments, such as mobile devices. It was trained using a combination of teacher-student learning and knowledge distillation techniques.
Another example of a data-efficient language model is ALBERT (Lan et al., 2019), which reduces the number of parameters of the BERT model by using factorization techniques and sharing parameters across different layers. This results in a more data-efficient model that can achieve similar or better performance than the larger BERT model.
GPT-Neo (Black et al., 2021) is another data-efficient language model that was trained on a large dataset of text, but it can be fine-tuned on smaller datasets with good results. It has demonstrated competitive performance on various natural language processing tasks, including language generation, summarization, and question-answering.
ELECTRA (Clark et al., 2020) is a novel pretraining approach for language models that is designed to be more data-efficient than traditional models like BERT. Instead of using a traditional masked language modelling task, ELECTRA uses a discriminator network to predict whether a given input is real or generated by another model. This approach allows for more efficient training and can achieve similar or better performance than traditional models.
TinyStories (Eldan and Li, 2023) is an artificial collection of short stories, specifically designed with words understandable to 3 to 4-year-olds. These stories are generated using GPT-3.5 and GPT-4 (OpenAI, 2023). TinyStories can effectively serve as a training and evaluation dataset for language models (LMs) that are considerably smaller than the current state-of-the-art models (less than 10 million parameters) or have simpler architectures (with just one transformer block). Despite their reduced size and simplicity, these LMs are capable of producing coherent and consistent stories spanning multiple paragraphs. The stories are diverse, exhibit nearly flawless grammar, and showcase impressive reasoning abilities.
BabyBERTa is a lightweight model for language acquisition (Huebner et al., 2021). BabyBERTa is similar to RoBERTa (Liu et al., 2019), but it is much smaller and simpler. BabyBERTa was trained on a dataset of 5M words of American-English child-directed input, and it can be run on a single desktop with a single GPU. BabyBERTa was able to achieve comparable performance to RoBERTa on a number of language acquisition tasks, including grammatical knowledge acquisition, generalization to novel grammatical contexts, syntactic structure learning, and semantic word and phrase learning. These results suggest that BabyBERTa could be a valuable tool for language acquisition research. It provides a new way to study how children learn a language, and it opens up new possibilities for developing language-learning technologies.
Small size: BabyBERTa is much smaller than RoBERTa, with only 8 layers, 8 attention heads, 256 hidden units, and an intermediate size of 1024. This makes it much faster and easier to train and use than RoBERTa.
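As a rough sanity check, the encoder parameter count implied by this configuration can be estimated in pure Python. The vocabulary size (8192) and maximum number of positions (130) used below are our assumptions, since they are not listed in the text:

```python
def roberta_encoder_params(n_layers, d_model, n_heads, d_ff, vocab, max_pos):
    """Rough parameter count for a RoBERTa-style encoder (LM head excluded).

    n_heads does not affect the count: the heads partition the same
    projection matrices, so it is listed only for completeness.
    """
    # embeddings: token + position + type vectors, plus one LayerNorm (weight + bias)
    emb = (vocab + max_pos + 1) * d_model + 2 * d_model
    # self-attention: Q, K, V and output projections, each with a bias
    attn = 4 * (d_model * d_model + d_model)
    # feed-forward: up-projection and down-projection, each with a bias
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # two LayerNorms per layer (weight + bias each)
    ln = 2 * 2 * d_model
    return emb + n_layers * (attn + ffn + ln)

# BabyBERTa-like configuration; vocab=8192 and max_pos=130 are assumptions
n_params = roberta_encoder_params(n_layers=8, d_model=256, n_heads=8,
                                  d_ff=1024, vocab=8192, max_pos=130)
print(f"{n_params / 1e6:.1f}M parameters")
```

With these assumed values the estimate lands in the 8-9 million range, consistent with the 8.5 million figure reported later for ToddlerBERTa-base.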
Comparable performance: Despite its smaller size and simpler training regime, BabyBERTa was able to achieve comparable performance to RoBERTa on a number of language acquisition tasks. This suggests that BabyBERTa could be a valuable tool for language acquisition research.
BabyBERTa makes a number of contributions to the field. First, it demonstrates that a small, lightweight model can be used to acquire grammatical knowledge from child-directed input. Second, it shows that BabyBERTa can generalize to novel grammatical contexts. Third, it shows that BabyBERTa is able to learn the syntactic structure of sentences. Fourth, it shows that BabyBERTa is able to learn the semantics of words and phrases.

Experiment Settings
Given the strong results reported for the BabyBERTa language model (Huebner et al., 2021), we adopt BabyBERTa as the foundational model for our research. Building upon this foundation, we systematically and rigorously explore an array of model sizes and diverse hyperparameters.
We construct five different models to validate and then further exploit the performance of BabyBERTa. All hyperparameters are kept the same except hidden size, intermediate size, number of attention heads, and number of layers. Model configurations can be found in Table 1.
Our study closely follows the established hyperparameters of BabyBERTa but with three key variations: mask patterns, epochs, and batch size. This yields around 180 models, enabling a comprehensive exploration of their true capabilities. This rigorous approach aims to reveal insights into the impact of hyperparameter choices on model performance.
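A sweep like this is just the Cartesian product of the varied settings across the five model sizes. The grid values below are hypothetical placeholders (the exact swept values are not listed in the text); they are chosen only to illustrate how a roughly 180-run grid arises:

```python
from itertools import product

# illustrative grid; the exact values swept in the paper are not listed,
# so these are placeholders chosen to reproduce the stated run count
sizes = ["xs", "s", "base", "l", "xl"]  # the five ToddlerBERTa sizes
epochs = [1, 5, 10]                      # hypothetical
mask_patterns = [1, 5, 10, 20]           # hypothetical
batch_sizes = [16, 64, 256]              # hypothetical

runs = [
    {"size": s, "epochs": e, "patterns": p, "batch": b}
    for s, e, p, b in product(sizes, epochs, mask_patterns, batch_sizes)
]
print(len(runs))  # 5 * 3 * 4 * 3 = 180
```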

Evaluation Setup
We adopt the official evaluation pipeline of the BabyLM Challenge (Warstadt et al., 2023; Gao et al., 2021), which combines BLiMP (Warstadt et al., 2019), SuperGLUE (Wang et al., 2019), MSGS (Warstadt et al., 2020), and a Supplement benchmark. Our best model is evaluated on all benchmarks, while other models are evaluated on BLiMP only, due to limited computing resources. This approach ensures a rigorous assessment of our model's performance across diverse tasks while remaining within our compute budget.
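BLiMP scores a model on minimal pairs: a prediction counts as correct when the model assigns a higher score (e.g. a pseudo-log-likelihood) to the acceptable sentence of the pair. A minimal sketch of that accuracy computation, with a toy lookup table standing in for a real model's scoring function:

```python
def blimp_accuracy(score_fn, pairs):
    """Fraction of minimal pairs where the acceptable sentence scores higher."""
    correct = sum(score_fn(good) > score_fn(bad) for good, bad in pairs)
    return correct / len(pairs)

# toy stand-in scores; a real evaluation would use model log-likelihoods
toy_scores = {
    "the cats sleep": -1.2, "the cats sleeps": -3.4,
    "who did you meet": -2.0, "who did you meet him": -1.5,
}
pairs = [("the cats sleep", "the cats sleeps"),
         ("who did you meet", "who did you meet him")]
print(blimp_accuracy(toy_scores.get, pairs))  # 0.5
```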

Baselines
The competition organizers provide a set of baseline models along with the designated evaluation pipeline. These baselines follow a straightforward approach: the hyperparameters of several established large language models, namely OPT (Zhang et al., 2022), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019), are adopted, and the models are trained from scratch on the fixed datasets. Because the competition dataset is released exclusively for the purposes of the competition, no external models are available for direct comparison. We therefore treat the provided baseline models as our reference points for evaluating the performance and efficacy of our models within the competition context.

Evaluation pipeline Critique
The evaluation pipeline, reliant on Hugging Face (Wolf et al., 2019), exclusively accommodates transformer-based language models, thereby constraining researchers from enhancing existing architectures or devising novel ones. While the widespread success of Transformers justifies their prevalent usage, it also introduces complexities when attempting to transfer model weights between transformer-based models. An illustrative instance is the incompatibility between Hugging Face and fairseq (Ott et al., 2019) for certain models, which presented obstacles in evaluating our model's performance.
Given these limitations, we reassessed our approach and redirected our focus towards optimizing existing models. This strategic shift allows us to navigate the challenges posed by the evaluation pipeline and to leverage the wealth of knowledge surrounding established models. By refining and fine-tuning these pre-existing architectures, we aim for performance improvements that advance the state of the art in natural language processing.
Embracing this revised perspective, we aim to contribute valuable insights to the field while optimizing existing models to their full potential. Despite the constraints posed by the evaluation framework, we hope to advance the collective understanding of transformer-based models.

Results and Analysis
As stipulated earlier, a substantial portion of our model evaluations is conducted under BLiMP (Warstadt et al., 2019), encompassing comparisons across various linguistic tasks. Additionally, we undertake a comprehensive evaluation of our best-performing model using the entire prescribed evaluation pipeline. As a result, we present our findings as two distinct sets of results: BLiMP results and main results.

ToddlerBERTa-xs
Our ToddlerBERTa-xs model, with approximately 750 thousand parameters, achieves competitive performance compared to the larger T5 baseline on the BLiMP benchmark, as shown in Figure 1. This scaling behavior highlights the potential benefits of optimizing smaller architectures for specific tasks, showcasing efficient language modeling approaches.

ToddlerBERTa-s
The ToddlerBERTa-s model, consisting of 1.8 million parameters, exhibits superior performance compared to the OPT baseline across various configurations (Figure 2). Remarkably, experimental results demonstrate that even models with smaller parameter counts can outperform larger counterparts in the low-data regime when leveraging the BabyBERTa training and preprocessing recipes.

ToddlerBERTa-base
ToddlerBERTa-base and BabyBERTa (Huebner et al., 2021) have the same number of parameters, 8.5 million. However, the best-performing ToddlerBERTa-base model scores 0.7407 with more epochs and mask patterns than the original, as shown in Figure 3. By comparison, the average score obtained by the vanilla BabyBERTa is 0.666.

ToddlerBERTa-l
Data scaling is evidently advantageous for grammar learning tasks. However, our research findings demonstrate that surpassing the RoBERTa baseline is achievable by increasing the number of model parameters. This observation prompts an inquiry into the sustainability of this trend. To address this question, we developed ToddlerBERTa-l, featuring a substantial parameter count of approximately 30 million. Our experimental results emphasize the importance of model size: despite the relatively modest increase in the top score (Figure 4), a significant performance boost is observed in the majority of models when larger architectures are employed. These findings underscore the critical role of model size in optimizing grammar learning capabilities.

ToddlerBERTa-xl
In our pursuit of exploring the capabilities of BabyBERTa within the strict-small portion of BabyLM, we introduce ToddlerBERTa-xl, a language model with 92 million parameters, similar to RoBERTa (Liu et al., 2019). Prior experiments have highlighted the significance of both data and model size; however, these studies have predominantly employed relatively small model sizes compared to baseline models, which exhibit exceptional results when trained on extensive corpora over extended periods. Such large models excel under substantial data volumes but tend to perform inadequately in low-data scenarios. Consequently, previous investigations (Eldan and Li, 2023) have favoured smaller models in comparable low-data settings.

BLiMP Summary
Through our comprehensive experimentation, we have observed that enhancing the BabyBERTa methodology involves employing numerous mask patterns, processing single sentences, and utilizing small context and vocabulary sizes while keeping batch sizes and the number of epochs limited. However, in pursuit of leveraging larger models to achieve superior performance, we deviate from these restrictions by increasing both batch sizes and the number of epochs. The adoption of larger batch sizes contributes to enhanced training stability for larger models, whereas a higher number of epochs allows the models to encounter training samples multiple times, leading to more effective learning outcomes. As a result, our finest model surpasses the original BabyBERTa model recipe by an impressive margin of 12 points in BLiMP, underscoring the efficacy of these modifications.
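The mask-pattern idea can be sketched as generating several independently masked copies of each sentence, so the model sees every training sample under different masks. The 15% masking rate and the `<mask>` token below are conventional assumptions, not values taken from the paper:

```python
import random

def make_mask_patterns(tokens, num_patterns, mask_prob=0.15,
                       mask_token="<mask>", seed=0):
    """Return num_patterns independently masked copies of one tokenized sentence."""
    rng = random.Random(seed)
    # mask at least one token even in very short sentences
    n_mask = max(1, round(mask_prob * len(tokens)))
    patterns = []
    for _ in range(num_patterns):
        idx = set(rng.sample(range(len(tokens)), n_mask))
        patterns.append([mask_token if i in idx else t
                         for i, t in enumerate(tokens)])
    return patterns

sent = "the child reads a short story".split()
patterns = make_mask_patterns(sent, num_patterns=3)
for p in patterns:
    print(" ".join(p))
```

Each epoch over ten such patterns effectively shows the model the corpus ten times under different masking, which is the data-amplification effect discussed above.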
In refining our models based on the BLiMP evaluation, we consider the average of the results to make informed adjustments. Nonetheless, we remain cognizant of the potential influence of outliers, which could inadvertently skew the data and compromise the reliability of our analysis. To explore the interplay between the various features in our dataset, which comprises nearly 180 results, we employ a Spearman correlation matrix as a robust analytical tool, shown in Figure 6. The Spearman correlation matrix facilitates a comprehensive examination of the relationships between the features, shedding light on patterns and dependencies that may not be apparent through other conventional statistical methods.

The majority of the tasks exhibit a strong positive correlation with the average, with the exception of Island Effects, Filler Gap, and Control/Raising. To gain insight into the underlying reasons behind this anomaly, we plot the scores of these specific tasks in ascending order of their respective average scores, as illustrated in Figure 7. The plot reveals that all task scores begin to converge and align closely after the average score reaches 0.7. This observation leads us to postulate that these particular tasks may be inherently more challenging, demanding a larger volume of data and more complex model architectures for optimal performance.
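Spearman correlation is simply the Pearson correlation computed on ranks, which is what makes it robust to outliers and monotone rescalings of the scores. A minimal dependency-free version of the pairwise statistic underlying the matrix:

```python
def spearman(x, y):
    """Spearman rank correlation with tie-aware average ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            # find the run of tied values starting at position i
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# any monotone relationship between task score and average gives 1.0
print(spearman([0.6, 0.7, 0.8], [0.1, 0.4, 0.9]))  # 1.0
```

In practice a library routine such as `scipy.stats.spearmanr` would be used; the sketch only makes the rank-based nature of the statistic explicit.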

Main Results
After evaluating various models on BLiMP (Warstadt et al., 2019), we select the best one as our final model. We then assess its performance on the BLiMP Supplement and fine-tune it on SuperGLUE (Wang et al., 2019) and MSGS (Warstadt et al., 2020) using the evaluation pipeline (Warstadt et al., 2023).
BLiMP: Our primary focus is the thorough evaluation of baseline models during the iterative model training process. Additionally, we enrich our analysis by incorporating the results of the RoBERTa-base model (Liu et al., 2019). As evidenced by the empirical findings presented in Table 2, RoBERTa-base distinctly outperforms our model. Notably, RoBERTa-base's exceptional proficiency can be attributed, in part, to its exposure to an expansive corpus of 3 billion words during training.
In contrast, our ToddlerBERTa model is trained on a comparably small dataset of a mere 10 million words. This significant difference in training data size presents a formidable challenge in achieving parity with RoBERTa-base's performance. To bridge this considerable gap and enhance the utility of the available data, we increase the number of mask patterns employed in ToddlerBERTa's training regime. This augmentation technique enables ToddlerBERTa to optimize the exploitation of the available data, yielding approximately 1 billion word exposures and more favourable learning outcomes.
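The word-exposure figure follows from simple arithmetic: exposures equal corpus size times the number of passes over it. The split between mask patterns and epochs below is hypothetical; only their product matters for the total:

```python
corpus_words = 10_000_000  # strict-small BabyLM corpus
num_mask_patterns = 10     # hypothetical
num_epochs = 10            # hypothetical; only the product with patterns matters
word_exposures = corpus_words * num_mask_patterns * num_epochs
print(f"{word_exposures:,}")  # 1,000,000,000 ≈ 1 billion, as stated in the text
```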
Despite the intrinsic limitations imposed by the data volume discrepancy, the results we have obtained underscore the robustness of our proposed methodology. Notably, ToddlerBERTa's performance gap with the state-of-the-art RoBERTa-base remains relatively narrow, suggesting a commendable degree of adaptability and learning potential within our model's constraints. The significant insights derived from our analysis reveal that even when confronted with limited data, skillful manipulation of mask patterns can effectively amplify data utilization, leading to more proficient language model training.

SuperGLUE:
The SuperGLUE benchmark comprises diverse natural language understanding tasks that encompass a broad spectrum of linguistic phenomena, demanding reasoning abilities surpassing mere pattern recognition. Our models grapple with a notable trade-off, as they exclusively process single sentences, rendering the SuperGLUE benchmark exceptionally challenging. The SuperGLUE dataset frequently comprises inputs with multiple sentences, while our model only encounters such inputs during the fine-tuning phase, which may not be sufficient to outperform models explicitly trained on multiple sentences. Despite this inherent constraint, our model demonstrates remarkable competitiveness when compared to baselines leveraging pretraining on multiple sentences, as shown in Table 5. Our findings indicate that even with the limitation of single-sentence processing during pretraining, our model exhibits a commendable capacity to grasp complex linguistic relationships and reasoning abilities, effectively aligning its performance with state-of-the-art baselines harnessing broader contextual information.

Table 2: BLiMP (Warstadt et al., 2019) benchmark results. Baseline scores are taken from the GitHub page of the evaluation pipeline; RoBERTa-base scores are from (Huebner et al., 2021).

MSGS: The MSGS benchmark is designed to assess the generalization propensities of language models concerning two critical categories: linguistic features, such as the presence of specific syntactic constructions, and surface features, such as the occurrence of a word before a particular position. This benchmark serves as a critical tool in probing the fundamental tendencies of pretrained language models during fine-tuning, helping to elucidate whether they exhibit a preference for one type of feature over the other. Upon thorough analysis, we speculate that the observed performance gap, shown in Table 4, might be partly attributed to the phenomenon of overexposure. To improve our training, we add additional mask patterns to augment the data and utilize them for a significant number of epochs, which inevitably introduces repeated patterns and examples in the training dataset. This overexposure to similar instances might inadvertently influence the model's learning process, leading to a preference for specific linguistic or surface-level features. Consequently, the model may struggle to adapt optimally to novel and less familiar patterns encountered in MSGS, resulting in suboptimal performance compared to the RoBERTa baseline.
BLiMP Supplement: The challenge has been enriched with an extra benchmark, the details of which have not been published yet, but it is presumed to be connected to the BLiMP evaluation framework. Analysis of the results presented in Table 3 leads us to speculate that decoder-based models may possess a competitive edge in these tasks. This assumption is supported by the substantial performance gap observed between the OPT model and our own, while our model significantly outperforms the RoBERTa baseline. However, it is essential to note that the specific characteristics and goals of this additional benchmark remain elusive until an official publication sheds light on its intricacies.

Conclusion
Our research has undertaken a systematic and rigorous exploration of language models, building upon the foundational work of BabyBERTa. Through the development and evaluation of five distinct ToddlerBERTa models, we have demonstrated the significance of hyperparameter choices and model sizes in the context of natural language processing.
Our experiments have revealed the potential benefits of optimizing smaller architectures for specific linguistic tasks, showcasing the efficiency of language modeling techniques in tackling various challenges. Additionally, our best-performing ToddlerBERTa models have exhibited competitive performance compared to established baselines, showcasing their adaptability and capacity to excel in diverse language understanding tasks.
The comprehensive evaluations conducted on BLiMP, the BLiMP Supplement, SuperGLUE, and MSGS contribute to the collective understanding of transformer-based models and their potential for natural language processing, and we hope our research inspires future investigations and innovations in the field. As the quest for advancements in language modeling continues, we emphasize the importance of replicability and reproducibility in research to facilitate the development of robust and reliable language models.

Limitations
Despite the valuable contributions of our research, it is essential to acknowledge its limitations. Firstly, the exploration of hyperparameters and model sizes may not have encompassed all possible configurations due to computational constraints. This leaves room for potential superior settings to be uncovered. Secondly, the evaluation framework's focus on transformer-based models may limit the comparability with other non-transformer architectures. Additionally, the fixed dataset used for training and evaluation may restrict the model's exposure to diverse linguistic patterns and contexts. Furthermore, the reliance on single-sentence processing during pretraining could impact the model's performance on tasks requiring broader contextual understanding. Lastly, our study did not extensively explore architectural innovations or novel training methodologies. Despite these limitations, our research provides valuable insights into language modeling, calling for further investigations to address these constraints and advance the field.

Ethics Statement
The model under consideration, ToddlerBERTa, is devoid of generative capabilities, thereby ensuring that it cannot engender unfair, biased, or harmful content.The datasets employed in this study have been sourced from widely acknowledged repositories with an established reputation for safety in research applications, being meticulously selected to preclude the inclusion of personal information or offensive material.

Figure 1: Average scores of the ToddlerBERTa-xs models on BLiMP are reported. We shorten the different configuration names as number of epochs: e, number of dynamic patterns: p, and batch size: b.

Figure 2: Average scores of the ToddlerBERTa-s models on BLiMP are reported. We shorten the different configuration names as number of epochs: e, number of dynamic patterns: p, and batch size: b.

Figure 3: Average scores of the ToddlerBERTa-base models on BLiMP are reported. We shorten the different configuration names as number of epochs: e, number of dynamic patterns: p, and batch size: b.

Figure 4: Average scores of the ToddlerBERTa-l models on BLiMP are reported. We shorten the different configuration names as number of epochs: e, number of dynamic patterns: p, and batch size: b.

Figure 5: Average scores of the ToddlerBERTa-xl models on BLiMP are reported.

Figure 6: Spearman correlation matrix on the scores of BLiMP tasks.

Figure 7: Models are ranked by the average BLiMP score in ascending order in the blue time series plot. The other time series plots represent how task scores vary while the average score consistently improves.

Table 3: BLiMP Supplement benchmark results. Baseline scores are taken from the GitHub page of the evaluation pipeline.