Mini Minds: Exploring Bebeshka and Zlata Baby Models

In this paper, we describe the University of Lyon 2 submission to the Strict-Small track of the BabyLM competition. The shared task emphasises small-scale language modelling from scratch on limited-size data and its connection to human language acquisition. The dataset released for the Strict-Small track contains 10M words, comparable in size to the linguistic input available to a child. We approach the task with an architecture search, minimizing masked language modelling loss on the shared task data. Having found an optimal configuration, we introduce two small-size language models (LMs) that were submitted for evaluation: a 4-layer encoder with 8 attention heads and a 6-layer decoder with 12 heads, which we term Bebeshka and Zlata, respectively. Despite being half the scale of the baseline LMs, our proposed models achieve comparable performance. We further explore the applicability of small-scale language models in tasks involving moral judgment, aligning their predictions with human values. These findings highlight the potential of compact LMs in addressing practical language understanding tasks.


Introduction
LMs accurately encode language-specific phenomena required for natural language understanding and for generating coherent continuations of text. During pre-training on large corpora, LMs acquire knowledge of morphosyntax and grammar. However, they demonstrate only partial functional linguistic competence when applying grammatical knowledge to novel expressions at inference time, which is caused by memorising the most frequent linguistic patterns in the training corpus and by the limited generalization ability of the learnt linguistic representations (Wu et al., 2022; Tucker et al., 2022; Mahowald et al., 2023).
Recent pre-training dynamics studies revealed that the performance of LMs can be seen as a function of the training corpus vocabulary: (1) grammatical knowledge improves as the vocabulary of the pre-training data expands (van Schijndel et al., 2019), and (2) small-scale LMs can perform on par with RoBERTa (Liu et al., 2019) if the vocabulary of the tokenizer used is close to an actual human's, or even a child's, vocabulary.
In this paper, we introduce small-scale LMs with an architecture optimized for the STRICT-SMALL track data of the BabyLM competition (Warstadt et al., 2023). Our objective is to estimate the general performance and capabilities of shallow LMs in downstream tasks beyond those suggested in the evaluation pipeline of the shared task. We achieve this through two main contributions.

Contribution 1. We determine an optimal architecture of encoder-based LMs using the Tree-structured Parzen Estimator algorithm with perplexity as the objective function to minimize. Our parameter search results suggest that optimal LMs have a ratio of attention heads to layers of around 2, whereas the ratio in previously tested and existing LMs at their base configuration is equal to one. We introduce new small-scale LMs submitted to the shared task: (i) the 4-layer encoder Bebeshka and (ii) the 6-layer decoder Zlata. The parameters of the models are presented in Table 1. Our LMs perform on par with the shared task baselines while being half their size.
Contribution 2. We investigate the alignment of small-scale LMs' predictions with shared human values in the context of moral judgment tasks. We find that shallow LMs, though trained on limited corpora, perform on par with base LMs in commonsense morality scenarios and, surprisingly, outperform existing baselines in tasks such as virtue and justice assessment. To the best of our knowledge, our work represents one of the earliest attempts to investigate how predictions made by tiny language models trained on a developmentally plausible corpus correlate with shared human values.

This paper has the following structure. After a short section dedicated to related work (§2), we first describe tokenizer training (§3.1), architecture search results and optimal model selection (§3.2), and the final architecture of the pre-trained LMs (§3.3). Then, we report scores on the datasets included in the shared task (§4) and present ethics evaluation results (§5).

Related Work
Recent large LMs have found applications in many NLP tasks, such as grammatical correction, text completion, and question answering; yet their usage is constrained by their computational cost. Previous works reduce model size and inference time with knowledge distillation, parameter quantization, and other compression techniques (Sanh et al., 2019; Yao et al., 2021; Tao et al., 2022). Other studies investigated the relationship between model parameter count and performance. Kaplan et al. (2020) introduced scaling laws, showing a power-law dependency between perplexity and model size, as well as between training loss and dataset size. The paradigm of scaling laws further formed the basis for recent research examining the behaviour of LMs at a small scale (Fedus et al., 2022; Fu et al., 2023). For instance, Puvis de Chavannes et al. (2021) presented results of Neural Architecture Search in a limited parameter space, suggesting that optimal LMs are smaller than the existing base configurations.
In parallel, numerous studies focus on the efficiency of dataset size, vocabulary, and representation, which can help reduce computation cost by minimizing the number of training steps (van Schijndel et al., 2019; Huebner et al., 2021; Schick and Schütze, 2021; Warstadt and Bowman, 2022). van Schijndel et al. (2019) demonstrated that LMs trained on a small-volume corpus can reach human performance under some grammatical knowledge evaluation scenarios, questioning the necessity of large datasets for pre-training. Huebner et al. (2021) introduced BabyBERTa, a small encoder-based LM with 5M parameters, and showcased the efficiency of small training data; that work bridged the gap between earlier studies on model size reduction and optimal data size.
The aforementioned related works mainly analyse the difference between compact LMs and their larger counterparts through throughput time measures and performance on the GLUE benchmark (Wang et al., 2018). In this paper, we evaluate LMs at a small scale trained on the 10M-word dataset of the BabyLM shared task and complement existing research with an additional evaluation on moral judgment tasks. The decision to focus on moral judgment is driven by recent studies that reveal human-like biases in the moral acceptability judgments made by large language models trained on extensive corpora (Schramowski et al., 2022); here, we conduct a corresponding evaluation for small language models.

Methodology
We follow the pre-training tasks of RoBERTa (Liu et al., 2019) and GPT-2 (Radford et al., 2019) and refer to these as the architecture baselines in this section. We train Bebeshka and Zlata with masked language modelling and causal language modelling objectives, respectively, and compare their vocabularies and architectures with the baselines.

Vocabulary
Training Data We use the data provided within the STRICT-SMALL track of the shared task. We report statistics of the training corpus in Table 6 (Appendix A). Transcribed speech, extracted from recordings of casual speech addressed to children and from educational movie subtitles, makes up the bulk of the corpus. The average length of the texts is around 30 tokens; considering that and the maximum text length, we lower the maximum sequence length from the base 512 to 128 tokens in the configuration of our LMs.

Input Representation
We follow the tokenization models of the baselines (GPT-2, RoBERTa) and BabyBERTa (Huebner et al., 2021) and use the byte-level Byte-Pair Encoding (BPE) algorithm (Sennrich et al., 2016), a tokenization method that iteratively merges the most frequent byte pairs into a shared vocabulary. For the encoder Bebeshka, we build a case-insensitive vocabulary of size 8K. We find a few mismatches between Bebeshka and RoBERTa tokenization and provide more details in Appendix B. The decoder Zlata has a 30K vocabulary constructed with the default parameter settings of the Tokenizers trainer; that value also allows us to bypass the inclusion of onomatopoeic words that prevail in some transcribed texts of the shared task data.
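The merge loop at the heart of BPE can be illustrated with a minimal, character-level sketch. Note that this toy version only learns the merge table over characters; real byte-level BPE operates on UTF-8 bytes and additionally handles special tokens and vocabulary construction.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols (characters stand in for bytes here).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges
```

For instance, on the corpus `["low", "lower", "lowest"]`, the first two merges learned are `("l", "o")` and then `("lo", "w")`, since those pairs occur in all three words.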

Model Selection
To determine an optimal configuration of the encoder LM, we use an Optuna-implemented Bayesian optimization algorithm (Akiba et al., 2019) and tune the architecture-defining parameters listed in Table 2. The upper bounds of the numerical parameters in the search space are chosen in accordance with the base RoBERTa configuration. We set the lower bounds to 1, ensuring a thorough exploration of architectural variations to find the optimal configuration for the masked language modelling task. Optuna features efficient implementations of optimization algorithms; in our optimization study, we use the standard Tree-structured Parzen Estimator (TPE) algorithm, which uses tree-structured representations and Parzen windows to model the probability distributions of hyper-parameters and estimate their density. We use TPE to sample parameter values from the search space, together with automated early stopping that prunes runs whose intermediate perplexity is higher than the median of the preceding runs.
As the objective function to minimize, we use the masked language modelling loss (reported as perplexity) of a RoBERTa model initialized with the TPE-sampled configuration parameters. The perplexity is calculated on the STRICT-SMALL validation set after training the model for 10 epochs on a sample of written English texts (the Gutenberg, Children's Book Test, and Wikipedia portions) from the BabyLM training corpus (see Table 6). We choose a corpus sample to reduce the parameter search execution time, since dataset size directly impacts LM training time at each optimization step. We found manually that training on written texts yields a better score. The optimization study, with an upper bound of 100 trial runs, ran for roughly two days on a single A100 GPU.
Table 2 reports parameter search results for the best and worst runs according to perplexity on the validation dataset.
The optimal configuration for encoder LMs can be summarized as follows: (1) the ratio of the number of attention heads to the number of layers fluctuates within the 1.5-2 range, (2) relative key-query positional embeddings are employed, and (3) the dropout ratio for attention probabilities is 0.3. We further use these three key configuration attributes to initialize Bebeshka. Parameters other than the positional embedding type, dropout ratio, and number of layers/heads vary significantly across the top 10% of runs. Specifically, all types of activation functions except ReLU appear evenly in the best range. The hidden size per head takes values from 65 to 85, with a mean of 81.6. We also observe a notable deviation of the intermediary size from its mean value. Altogether, our results show that the best-performing encoder LMs are smaller than the base configuration of RoBERTa, which aligns with Puvis de Chavannes et al. (2021).

Model Pre-training
We train our models on 4 Graphcore IPUs, with two encoder layers placed on each device, using mixed precision and the STRICT-SMALL training split.
Table 1 shows the configuration settings of our LMs.
Bebeshka This 16M-parameter model is based on the RoBERTa architecture with the optimal layer sizes determined in §3.2. We train Bebeshka on the 10M-word training corpus of the shared task. We decrease the probability of selecting masked tokens from the standard 15% to 13.5%, which is one way of approximating the effect of setting RoBERTa's unmasking probability to 0, as discussed by Huebner et al. (2021).
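The modified token-selection step can be sketched as a simple per-token Bernoulli draw. This is an illustrative simplification of the masking logic in a standard MLM data collator (which additionally replaces selected tokens with the mask token, a random token, or leaves them unchanged), and the helper name is ours.

```python
import random

MASK_PROB = 0.135  # lowered from the standard 15%

def select_masked_positions(token_ids, mask_prob=MASK_PROB, rng=random):
    """Return the indices of tokens selected for the MLM objective."""
    return [i for i, _ in enumerate(token_ids) if rng.random() < mask_prob]

# Sanity check: over many draws, roughly 13.5% of positions are selected.
rng = random.Random(0)
sequence = list(range(1000))
selected = sum(len(select_masked_positions(sequence, rng=rng)) for _ in range(100))
rate = selected / 100_000
```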
Zlata This decoder LM is a light 66M-parameter version of GPT-2 with 6 layers, trained for 10 epochs on the STRICT-SMALL training data. Motivated by the configuration of the best encoder LM, we use a ratio of attention heads to decoder layers equal to 2. We explain the parameter choice in Appendix C.

Experiments Results
In this section, we report the results submitted to the BabyLM shared task. All LMs discussed in this section, including the baselines, are pre-trained on the shared task data. We use the baselines that were created with existing tokenizers and released by the organizers of the BabyLM competition.

Pre-training Objective Loss
We present the evaluation results of our LMs in Table 3, where we compare their performance and evaluation runtime against the shared task baselines. While the baselines were trained for 20 epochs, we obtain competitive results by pre-training our small-scale models for ten epochs. One of the main advantages of the introduced models lies in their compact size, which makes them more efficient at inference time, as can be seen from the average run time, even though they do not outperform the baselines by a large margin.

Linguistic Minimal Pairs
Figure 1 depicts the evaluation results of our LMs on the BLiMP dataset (Warstadt et al., 2020a) in a zero-shot setting. The goal of this evaluation benchmark is to assess a model's ability to distinguish between grammatically acceptable and unacceptable sentences without task-specific fine-tuning. The dataset consists of minimal pairs annotated with a grammatical phenomenon. We report detailed LM accuracy scores across the BLiMP tasks in Table 7 (Appendix D). The general trend is that LMs trained on BabyLM data perform well on minimal pairs targeting morphological phenomena, such as Irregular Forms and Determiner-Noun Agreement.
Zlata achieves the best accuracy (92.1%) on Irregular Forms and outperforms the OPT-125M baseline on some morphological tasks (Anaphor Agreement, Subject-Verb Agreement), on minimal pairs with violations in phrasal movement (Filler Gap), and on other tasks, such as NPI Licensing. Bebeshka achieves the second-best accuracy (64.7%) on Filler Gap minimal pairs and distinguishes sentences with syntactic errors in the relationship between a pronoun and its antecedent or in syntactic islands (Binding, Island Effects). The results show that LMs trained on the BabyLM corpus have an understanding of syntax and morphology, which influences their behaviour on the downstream tasks discussed next.

GLUE
Table 4 shows the results of fine-tuned LM evaluation on a variety of tasks from the GLUE and SuperGLUE benchmarks. Submitted to the shared task, Bebeshka and Zlata were fine-tuned for ten epochs on most of the tasks (see Appendix C for more detail). The overall trend is that the introduced small-scale encoder Bebeshka and decoder Zlata demonstrate scores comparable with the large baseline LMs on downstream tasks. This highlights that LMs at a small scale can quickly adapt to the fine-tuning task, though they may achieve lower performance in zero-shot evaluation on BLiMP. Comparing decoder LMs, we observe that Zlata outperforms the OPT baseline on paraphrase detection (MRPC & QQP), entailment/contradiction detection (MNLI), and question answering (BoolQ). As for the encoder LMs, Bebeshka has moderate scores compared to RoBERTa, which, in general, achieves the best scores on GLUE. However, Bebeshka outperforms the OPT-125M baseline on the QQP and MRPC tasks with F1 scores of 73.5% and 66.4%, respectively.
The most difficult task for shallow LMs seems to be Recognizing Textual Entailment (RTE). We suppose that LMs trained on the STRICT-SMALL corpus, with an average text length of 28.65 tokens (Table 6, Appendix D), and restricted to a maximum sequence length of 128, perform well mainly on datasets with short sequences and contexts, which can explain the lower results on some fine-tuning tasks. Another issue may be the fine-tuning hyper-parameter search: perhaps shallow LMs require more epochs to improve upon the submitted scores.

Mixed Signals Generalization
The MSGS dataset introduced by Warstadt et al. (2020b) comprises 20 binary classification tasks and is used to test whether an LM prefers linguistic or surface generalizations. The evaluation pipeline of the shared task includes 11 MSGS tasks; we report the obtained accuracy scores for the fine-tuned LMs in Table 8 (Appendix D). The Matthews Correlation Coefficient (MCC; Matthews, 1975) scores suggest that all LMs fine-tuned in a controlled setting show better results (>0.9) than those fine-tuned in an ambiguous scenario, with the sole exception of the Control Raising category; the highest scores are achieved on the Lexical Content and Relative Position tasks. Lexical Content is the task of classifying sentences containing "the" (the mouse vs. a mouse), while Relative Position is the task of determining whether "the" precedes "a" in a sentence. Decoder LMs perform similarly on the MSGS tasks chosen for the BabyLM competition, except for the Syntactic Category-Lexical Content (SC-LC) classification task, where SC is the task of detecting sentences with adjectives. The decoder LM Zlata seems to adopt a surface generalization during fine-tuning on unambiguous data (SC-LC), whereas the baseline model OPT learns to represent linguistic features. Bebeshka behaves likewise on the Syntactic Category task and reaches scores close to RoBERTa on the Lexical Content and Main Verb classification problems, suggesting that Bebeshka tends to encode surface features.

Age of Acquisition
Portelance et al. (2023) introduced a method for measuring the age of acquisition of words by LMs, compared against the actual age of acquisition by American English-speaking children, on a word set from the CHILDES corpus. Table 9 (Appendix D) reports this deviation, measured in months, for the introduced and baseline LMs. Zlata and Bebeshka demonstrate scores comparable to the baselines.

Moral Judgments
In this section, we present the results of additional experiments on moral judgments that we conduct outside the main shared task evaluation. We evaluate small-scale LMs' understanding of fundamental moral principles in the various scenarios covered by the ETHICS benchmark (Hendrycks et al., 2020). The benchmark consists of 5 moral judgment tasks: reasonable and fair justice, virtue responses, permitted behaviour depending on context-specified constraints (deontological ethics), pleasant scenario choice (utilitarian ethics), and commonsense morality. We grid search hyper-parameters for our LMs and use the test splits for evaluation. We fine-tune Bebeshka for ten epochs on each of the tasks and evaluate Zlata in a few-shot setting (see Appendix C for more details). Table 5 outlines the moral judgment classification results. Our small LMs generally outperform the existing baselines in terms of accuracy on sentence-level tasks, and the best results are achieved on Virtue moral judgments.
We suggest that the efficiency of small LMs in these tasks can be explained by some properties of the pre-training data, such as a lower mean sequence length, the prevalence of transcribed speech with single-word reactions or responses, child-directed speech, and imperatives. For example, the Virtue task is a collection of scenario-trait pairs, such as "Jordan will never do harm to his friends. <sep> caring", which have a structure similar to one-word responses in transcribed dialogues.

Conclusion and Future Work
In this paper, we present our results for the STRICT-SMALL track of the BabyLM competition. Our submission to the shared task consists of two LMs, namely the encoder Bebeshka and the decoder Zlata. We first search for an optimal architecture, minimizing perplexity on the released training corpus, and find that the best models have around 6 encoder layers on average, down from the 12 layers of existing base models, and twice as many attention heads as layers. While the number of encoder layers fluctuates among the best models, we find that they all have an attention-heads-to-layers ratio of two, which we further use for building our LMs. Our final LMs, scaled-down versions of RoBERTa and GPT-2 with 16M and 66M parameters respectively, perform better than the baseline LMs on the development and test BabyLM corpora. Zero-shot evaluation results suggest that our shallow LMs have some basic grammatical knowledge of syntax and morphology. The introduced LMs also perform better than the OPT model on several downstream tasks while having 2 times fewer parameters. We also observe good performance of our small LMs on a range of ethics judgment tasks, showing that their vocabulary and post-training knowledge can positively contribute to the morality assessment of the described scenarios. These results can serve as baselines for the evaluation of ethical judgment capabilities in small language models. The achieved scores may be attributed to the interplay between ethical and linguistic rules, particularly in the encoding of action verbs used to describe moral and immoral behaviour. This aspect can be further explored by examining the usage of verbs in various syntactic contexts within the BabyLM corpus and their encoding by the trained language models.
In future work, we plan to explore further capabilities of small LMs trained on small-size corpora, such as short-story data containing only words that 4-year-old children can understand (Eldan and Li, 2023). We also plan to extend our experiments with an analysis of fine-tuning dynamics to investigate how small models adapt to the tasks.

Limitations
Despite achieving good performance on the BabyLM test data, our approach has some limitations. We use a variant of Bayesian optimization (the TPE algorithm, §3.2) to find an optimal range of parameters that we further use for building our LMs. We predefine constraints on the parameters (Table 2), which narrow down the search space and can influence the parameter distributions built with Parzen (kernel density) estimators and, thus, future candidate selection. Future work can benefit from both an expanded search space and wider parameter limits. The architecture of our small language models, including the number of layers, heads, and hidden layer size, can serve as a lower bound for the parameter search space.

B Tokenization Tests
We compare the tokenization of Bebeshka and RoBERTa on the corpus of the STRICT-SMALL track and find that the tokenizations coincide on 87% of the sequences. We manually analyse a random sample of 100 non-matching tokenization cases and find that they fall on transcribed speech sentences with no more than three words, or include two words missing from the RoBERTa vocabulary but processed as whole words by Bebeshka (sweetie and duke). We also find that the RoBERTa tokenizer splits non-capitalised first names and other terms used for addressing (th-omas, m-ister, mom-my), as opposed to Bebeshka.

C.1 Pre-training parameters
We experimented with the same configuration for our decoder LM Zlata as we used for Bebeshka, including 4 layers and the same type of positional embeddings; however, that always resulted in gradient underflow and a non-decreasing loss. We manually found the 6-layer and absolute positional embedding configuration by increasing and traversing the values of the parameters that were grid searched for Bebeshka (Table 2). We pre-train our LMs using 4x IPUs freely available in Paperspace and use the IPU Trainer API. We use auto loss scaling with an initial value of 16384 and half precision for training our LMs. Training with IPUs requires specifying an IPU configuration containing instructions for mapping layers between the devices; for Bebeshka, we map one layer per IPU, and for Zlata, two layers per IPU. For both LMs, we use a per-device training batch size of 1 and 64 gradient accumulation steps. Each batch consists of 1,000 concatenated data examples from the training corpus. Computational graph construction took under 10 minutes for each LM.

C.2 Fine-tuning parameters
BabyLM Evaluation For Bebeshka fine-tuning, we use the default parameters of the competition's evaluation pipeline, that is, a learning rate of 5e-5, a batch size of 64, and a maximum of 10 epochs. For Zlata fine-tuning, we use a learning rate of 1e-4 and fine-tune on the tasks for 5 epochs, which allowed us to reduce fine-tuning time. Note that the performance of our LMs could be improved upon the submitted results with a grid search over the optimal hyper-parameters.
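Such a hyper-parameter grid search can be sketched as an exhaustive sweep over candidate settings. The value ranges below are illustrative, and `evaluate` stands in for a full fine-tuning run that returns a validation metric.

```python
from itertools import product

# Candidate fine-tuning hyper-parameters (values are illustrative).
GRID = {
    "learning_rate": [5e-5, 1e-4],
    "num_epochs": [5, 10],
    "batch_size": [32, 64],
}

def grid_search(evaluate):
    """Try every combination and keep the one with the best validation score.

    `evaluate` is a placeholder for fine-tuning the LM with the given
    settings and returning a validation metric (higher is better).
    """
    best_score, best_cfg = float("-inf"), None
    keys = list(GRID)
    for values in product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

Each call to `evaluate` is a full fine-tuning run, so the grid is kept small; this exhaustiveness is also why we only grid search where fine-tuning is cheap.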
Moral Judgment We use a weighted loss for fine-tuning Bebeshka and grid search the optimal parameters using the official implementation by the authors of the dataset. For our GPT-2-based model Zlata, we use an existing evaluation harness benchmark in the k-shot setting with k equal to 15.

Table 1 :
Architecture and pre-training details of the Bebeshka and Zlata LMs compared to RoBERTa-base and GPT-2 medium. Our LMs have the optimal architecture configurations determined with an architecture search (§3.2). GPT-2 official training information has not been publicly disclosed; we report GPT-2 pre-training hardware details when using model parallelism as specified by Shoeybi et al., 2019. We use Graphcore Intelligence Processing Units for training our LMs (Jia et al., 2019 provide a detailed review of IPUs). MLM = Masked Language Modelling, CLM = Causal Language Modelling, L = Layers, A = Attention heads, H = Hidden size per head, F = Feed-forward (intermediary) layer size.

Table 2 :
Parameter search space of the Optuna study for pre-training encoder LMs on the STRICT-SMALL corpus, and mean parameter values across the 10 best and worst runs sorted by perplexity. For non-numerical parameters, we report the most common parameter values among the study runs.

Table 3 :
Pre-training objective loss on the validation and test data of Bebeshka and Zlata compared to the baseline models, and average run time in seconds. We run the evaluation of all LMs on the same V100 GPU and use the Hugging Face Trainer API for calculating the scores. The best score is in bold, and the second-best score is underlined.

Table 4 :
Evaluation results on the GLUE and SuperGLUE (BoolQ, MultiRC, WSC) benchmark datasets. We report the metrics suggested in the shared task evaluation pipeline and the baselines. The best score is in bold, and the second-best score is underlined.
Figure 1: Accuracy on BLiMP tasks of our LMs together with the RoBERTa-base, OPT-125M, and T5-base baselines. Lighter colours correspond to greater accuracy and, hence, better scores. Morphology: Anaphor Agr., D-N Agr., Irregular Forms, S-V Agr. Semantics: NPI Licensing, Quantifiers. Syntax-Semantics: Binding, Control/Raising. The remaining phenomena belong to the Syntax category.

Table 5 :
Evaluation results on the ETHICS benchmark. LMs trained on the STRICT-SMALL corpus reach results close to the large-model baselines reported by Hendrycks et al., 2020. We do not report results for fine-tuning tasks that require a maximum sequence length exceeding that of an LM. The best score is in bold, and the second-best score is underlined.

Table 6 :
Statistics of the training corpus offered in the STRICT-SMALL track of the BabyLM competition.