Increasing The Performance of Cognitively Inspired Data-Efficient Language Models via Implicit Structure Building

In this paper, we describe our submission to the BabyLM Challenge 2023 shared task on data-efficient language model (LM) pretraining (Warstadt et al., 2023). We train transformer-based masked language models that incorporate unsupervised predictions about hierarchical sentence structure into the model architecture. Concretely, we use the StructFormer architecture (Shen et al., 2021) and variants thereof. StructFormer models have been shown to perform well on unsupervised syntactic induction with limited pretraining data and to yield performance improvements over a vanilla transformer architecture (Shen et al., 2021). Evaluation of our models on the 39 tasks provided by the BabyLM challenge shows promising improvements on some particular tasks for models that integrate a hierarchical bias into the architecture, even though these models fail to consistently outperform the RoBERTa baseline provided by the shared task organizers across all tasks.


Introduction
Transformer-based Language Model (LM) performance is heavily influenced by three scaling factors: the number of model parameters, the pretraining dataset size, and the amount of compute. For optimal performance, all three factors must be scaled up simultaneously (Kaplan et al., 2020). This scaling law has introduced several challenges in advancing research on neural language modeling. One major obstacle lies in the unequal distribution of resources across languages. Consequently, the current approach of transformer-based models falls short of achieving equally high performance levels for models dedicated to different languages (Choudhury and Deshpande, 2021).
Moreover, we see a considerable difference when comparing the way LMs learn with the way humans acquire language. One difference concerns the data that is input to learning: LMs such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b) or GPT-3 (Brown et al., 2020) are exposed to billions of tokens during training, far surpassing what an individual human is exposed to when learning a language (Warstadt and Bowman, 2022). This fundamental discrepancy raises important considerations when drawing parallels between language learning in machines and humans.
To improve the data-efficiency of LMs, one direction is to adapt the model architecture. An effective approach in this endeavor involves incorporating an inductive bias into the models' architectures, which could potentially facilitate acquiring more knowledge from the same amount of data compared to standard models. However, the specific type of inductive bias to be added is still under exploration. Recently, there have been efforts to investigate the use of syntactic hierarchical inductive biases as a potential improvement (Mulligan et al., 2021; Papadimitriou and Jurafsky, 2023). One of these potential solutions is the StructFormer architecture (Shen et al., 2021), a transformer that is trained on the masked language modeling task. An additional convolutional neural network (CNN) component produces unlabeled dependency and constituency trees as a byproduct and influences the self-attention mechanism of the transformer layers. The model has demonstrated competitive results in structure induction evaluations and a decrease in perplexity over a vanilla transformer baseline (Vaswani et al., 2017). However, it is an open question whether the inductive bias learned in this architecture enhances performance on downstream NLP tasks.
We pretrain the StructFormer architecture on a dataset from a different domain on which the model had not been tested before. Moreover, we use a more sophisticated tokenizer compared to the most-frequent-words dictionary used to train the models in the original experiment. Additionally, we modify the model architecture to investigate whether injecting a hierarchical bias in the middle layers of the transformer architecture (rather than after the embedding layer) leads to improved downstream performance. Finally, we evaluate seven model variants through the evaluation pipeline of the shared task and submit our best-performing model to the shared task challenge.

The BabyLM Challenge
The BabyLM Challenge is a shared task with the aim of data-efficient language modeling for English. Participants pretrain an LM from scratch on data that corresponds to the amount of linguistic data available to a child. The task is a suitable setting for conducting our experiments: it provides us with a pretraining dataset, a thorough evaluation pipeline, and, furthermore, an environment where we can compare our models' performance to other interesting architectures from the systems participating in the shared task.
Dataset The shared task is conducted in two tracks with different dataset sizes: a 100M-word corpus, and a 10M-word corpus sampled from the larger corpus. The size is inspired by the assumption that children are exposed to 2M-7M words per year (Gilkerson et al., 2017). To account for the fact that children mostly interact with spoken rather than written language, the datasets include a high proportion of transcribed data from different domains. For more details regarding the source domains, please refer to Warstadt et al. (2023).
Evaluation A thorough evaluation pipeline that comprises 39 different tasks is used to evaluate every model participating in the shared task. These tasks are supposed to represent a model's performance with respect to efficiency and applied NLP, as well as cognitive science and linguistics. A group of 17 tasks, named BLiMP (Warstadt et al., 2020a), is performed via zero-shot predictions, while the other two groups of tasks, SuperGLUE (11 tasks; Wang et al., 2019) and MSGS (11 tasks; Warstadt et al., 2020b), require finetuning of the submitted models for classification. Refer to Appendix A for the complete list of tasks.

Language Modeling and Hierarchical Information
Transformer LMs use syntactic information in their predictions. This has been shown by work on interpreting their internal representations as well as by investigating the grammatical correctness of their predictions (Mahowald et al., 2023; Kulmizev and Nivre, 2022). However, the vanilla transformer architecture that underlies both encoder- and decoder-based LMs does not encode hierarchical information explicitly. Rather, objectives such as masked language modeling and next-token prediction are based on linear relationships between tokens. This has inspired two lines of work that incorporate hierarchical knowledge into LMs. The first group of papers introduces models in which the training objective involves syntactic labels explicitly (e.g. Dyer et al., 2016; Sartran et al., 2022). The second group introduces models in which hierarchical information is encoded implicitly as a byproduct of a language modeling task (Shen et al., 2018, 2021; Li et al., 2019; Kim et al., 2019; Choi et al., 2018; Williams et al., 2018). We consider the second group of models more relevant for this shared task since it allows us to train models with a hierarchical architecture bias on raw text data. In particular, we use the StructFormer model (Shen et al., 2021), a transformer in which one architecture component, the parser network, predicts the position of each token in the hierarchical structure of the sentence. The prediction of the parser network puts soft constraints on the attention mask of the transformer layers. The model is pretrained on the masked language modeling task, and we view two experimental contributions of Shen et al.
(2021) as most relevant for using this model: First, they show that a StructFormer achieves lower perplexity on limited training data than a transformer that replaces the parser network with standard self-attention. Second, the induced hierarchical structure corresponds to unlabeled dependency trees. Concretely, evaluation on the Penn Treebank (PTB) shows that 61.6% of the undirected dependency edges are recovered. We further implement a variant of the model in which the parser network predicts hierarchical information based on hidden states that are contextualized with classical transformer layers, rather than using uncontextualized token embeddings as direct input to the parser network (Sec. 3.2.4).
This section introduces the objectives of our experiment, describes the model architectures, and covers the technical aspects of the pretraining and evaluation process.

Objectives
In this work, we aim to validate the claim that the performance of LMs, in particular on syntax-sensitive tasks, can be improved through the implicit integration of an inductive bias into the model's architecture that yields a hierarchical structure of the tokens. Concretely, we conduct experiments towards pursuing the following three primary objectives: 1. Assess the robustness of the finding that LM performance is enhanced through the utilization of a linguistically informed model architecture (Shen et al., 2021).
2. Investigate whether the claim that transformer architectures better represent syntactic information in their middle attention layers is supported in a practical use case (Vig and Belinkov, 2019;Arps et al., 2022;Müller-Eberstein et al., 2022).
3. Develop models that surpass the performance of the baseline models offered by the organizers of the shared task.

Methodology
In order to address the questions posed by the experiment's objectives, we train a tokenizer, develop several model variants, and perform iterations of model pretraining, finetuning, and evaluation. Due to limited resources, we only conducted our experiments on the 10M-word dataset. Furthermore, from the model architectures provided by the shared task, we chose the encoder-type models due to their adaptability for integrating a hierarchical bias in the model architecture.

Tokenizer
We use the same tokenizer across all variations of our models. Specifically, we train a Byte Pair Encoding (BPE) tokenizer (Sennrich et al., 2016; Gage, 1994) from scratch on the 10M-word BabyLM corpus. Since BPE tokenizers require specifying the vocabulary size as a hyperparameter before training on the corpus, we carefully determined an appropriate size. Our goal was to obtain a tokenizer that accurately represents tokens in our relatively small dataset while adhering to best practices for LMs.
To achieve this, we train the tokenizer on the same corpus with different vocabulary sizes. We then inspect the resulting vocabularies and identify the least frequent tokens within each (Table 1).
Based on our analysis, a vocabulary size of 32K tokens provides a fair representation relative to the corpus size for the least frequent tokens. Additionally, Geiping and Goldstein (2022) found that a BPE tokenizer with 32K tokens yielded the best results.
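Since the vocabulary size caps the number of BPE merges, the selection procedure above can be illustrated with a toy merge loop. The following is a self-contained sketch in plain Python; the actual tokenizer was trained with a standard BPE implementation, and the word frequencies below are invented for the demonstration:

```python
from collections import Counter

def bpe_train(word_freqs, target_vocab_size):
    """Toy BPE training: repeatedly merge the most frequent adjacent
    symbol pair until the vocabulary reaches target_vocab_size
    (or no pairs remain)."""
    vocab = {s for w in word_freqs for s in w}
    merges = []
    words = dict(word_freqs)  # word as tuple of symbols -> frequency
    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        new_words = {}
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return vocab, merges

# Invented corpus: 10 distinct characters, so 4 merges reach a 14-symbol vocab.
corpus = {tuple("low"): 5, tuple("lower"): 2,
          tuple("newest"): 6, tuple("widest"): 3}
vocab, merges = bpe_train(corpus, target_vocab_size=14)
```

A larger target vocabulary simply admits more merges, which is why rarer, longer subwords appear only at larger vocabulary sizes, the effect our Table 1 experiments probe.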

Baseline model
To achieve objective 1, we pretrained a standard transformer architecture that we call transformer-base, using our custom-trained tokenizer and the same model and training hyperparameters as for our other models, to minimize any effects due to uncontrolled variables.

Hyperparameters
Due to resource limitations, and to ensure fair comparisons between models, we use one set of pretraining and finetuning hyperparameters: we chose the default hyperparameter settings that were used to pretrain the shared task baseline models (Warstadt et al., 2023). In order to speed up the evaluation of finetuning tasks, we made modifications to the finetuning hyperparameters that were used to evaluate the baseline models. Our main hyperparameters are reported in Appendix B. We pretrain all models with the same batch size and the same number of steps. We use the training pipeline that Warstadt et al. (2023) introduced to train their baseline models to minimize any effects due to uncontrolled variables.
However, one variable that could not be fixed during the experiment is the number of trainable parameters in each model: adding a convolutional parser network to a model inevitably increases its parameter count (parameter counts are listed in Appendix B). We are aware that this can have misleading effects on the results and conclusions; however, we still think that the experiment in its current setting can show interesting behaviors that may encourage further investigation in a fully controlled experiment.

Model Architectures
We develop two primary variants of model architectures for our experiment.
StructFormer This variant (Figure 1) closely follows the architecture in Shen et al. (2021). In brief, it incorporates a parser network that consists of 4 convolution layers. The input to the parser network is token embeddings, and the output is probability distributions for dependencies between tokens. These distributions are then integrated into the multi-head self-attention mechanism of a standard transformer model. For a complete description of the architecture, we refer readers to Shen et al. (2021). We name models of this variant by the prefix structformer.
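The integration of the parser's dependency distributions into self-attention can be pictured as a soft gating of the attention weights. The following plain-Python sketch is schematic only, not the exact formulation of Shen et al. (2021), whose constrained attention is more involved:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def gated_attention(scores, dep_prior):
    """For one query position: turn raw attention scores into weights,
    multiply them by the parser's dependency prior over key positions,
    and renormalize. The prior thus acts as a soft attention mask."""
    weights = softmax(scores)
    gated = [w * p for w, p in zip(weights, dep_prior)]
    z = sum(gated)
    return [g / z for g in gated]

# With uniform raw scores, the dependency prior fully determines attention:
att = gated_attention([0.0, 0.0, 0.0], dep_prior=[0.8, 0.15, 0.05])
```

The gating leaves attention differentiable, so the parser network is trained end-to-end from the masked language modeling loss alone.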
StructRoBERTa The second variant (Figure 1) is similar to the StructFormer, but instead of employing a standard transformer, it utilizes a base RoBERTa encoder (Liu et al., 2019b). We modify the HuggingFace (Wolf et al., 2020) implementation, which has a few differences from the vanilla transformer implementation, mainly adding normalization and dropout layers after the embedding layer, as well as an additional intermediate block within each layer. Models following this architecture are identified with the prefix structroberta.
Vanilla transformer For transformers without parser networks, we reuse the implementation by Shen et al. (2021), which follows the standard transformer introduced by Vaswani et al. (2017), except that a layer normalization is added in front of each layer.
Variants Subsequently, for each of the main variants, structformer and structroberta, we create two sub-variants to explore a different placement of the parser network within the architecture (Figure 2). This decision is based on insights from previous experiments, which indicate that syntactic information tends to be better represented in the middle layers of the transformer (Liu et al., 2019a; Vig and Belinkov, 2019; Arps et al., 2022).
In our approach, we split off the initial n layers of the transformer or RoBERTa component of structformer or structroberta, respectively. We label these n layers the Front Attention Layers, while the remaining attention layers are labeled the Rear Attention Layers. The input embeddings pass through the Front component, generating contextualized embeddings that are subsequently fed into the parser network. The parser network, in turn, outputs dependency distributions that are integrated into the Rear component of the architecture. To distinguish between the two sub-variants, we append the suffix s1 to models with the parser network before the attention layers (Figure 1), and the suffix s2 to models with the parser network in between the middle attention layers (Figure 2).
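Schematically, the forward pass of an s2 model runs the Front layers, calls the parser on their output, and passes the parser's distributions to the Rear layers. The sketch below uses stand-in callables for the layers and the parser; it illustrates only the data flow, not the actual implementation:

```python
def forward_s2(embeddings, front_layers, parser, rear_layers):
    """s2 data flow: Front Attention Layers contextualize the input,
    the parser network predicts dependency distributions from those
    hidden states, and the Rear Attention Layers attend under that prior."""
    h = embeddings
    for layer in front_layers:              # Front Attention Layers
        h = layer(h)
    dep_prior = parser(h)                   # parser sees contextualized states
    for layer in rear_layers:               # Rear Attention Layers
        h = layer(h, dep_prior)
    return h

# Dummy components standing in for real transformer layers and the parser:
out = forward_s2(
    [1.0, 2.0],
    front_layers=[lambda h: [x + 1 for x in h]],
    parser=lambda h: [1.0 / len(h)] * len(h),
    rear_layers=[lambda h, prior: [x * 2 for x in h]],
)
```

In the s1 variant, front_layers is empty, so the parser operates directly on the uncontextualized token embeddings.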
To achieve objective 3, we introduce two additional models, structroberta s1′ and structroberta s2′, to enhance the evaluation scores so we could submit the best attainable results to the shared task. These two models increase the number of convolution layers in the parser network (from 4 to 6) of structroberta s1 and structroberta s2, respectively.

Results
After completing the pretraining process of the 7 investigated models, a comprehensive linguistic evaluation is conducted for the seven models under study, using the shared task evaluation pipeline. Detailed evaluation results are presented in Tables 2, 3, 4, and 5. We compare the scores of the following models: transformer-base (TF base), structformer s1 (SF s1), structformer s2 (SF s2), structroberta s1 (SR s1), structroberta s2 (SR s2), structroberta s1′ (SR s1′) and structroberta s2′ (SR s2′). We are particularly interested in assessing to what extent the introduction of a hierarchical bias improves a model's performance on a specific task. Therefore, in addition to the scores of the individual models, we also report the differences in scores between models. All numerical values in the result tables are measures of accuracy unless explicitly stated otherwise.

Pseudo-perplexity
We report the corpus-level pseudo-perplexity (PPPL; Salazar et al., 2020) on the test split of the BabyLM shared task dataset (Table 2). PPPL is computed by masking out each token in turn and collecting the log-likelihoods. This evaluation contributes to objective 1 in our experiment. Shen et al. (2021) found that structformer models incorporating a hierarchical inductive bias achieve lower PPPL than their baseline transformer model. We want to assess this finding on the BabyLM dataset and using our custom-trained tokenizer. SF s1 shows lower PPPL compared to TF base, which follows the previous findings. However, the model with a parser network within the middle layers shows a higher PPPL than the baseline TF base. The addition of more convolution layers in the parser network shows an improvement for SR s2′ but, surprisingly, a deterioration for SR s1′.
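Given the per-token masked log-likelihoods, the corpus-level PPPL reduces to exponentiating the negative mean. A minimal sketch (the log-likelihood values below are invented; in practice each one comes from a forward pass with the corresponding token replaced by the mask token):

```python
import math

def pseudo_perplexity(token_log_likelihoods):
    """Corpus-level pseudo-perplexity (Salazar et al., 2020):
    PPPL = exp(-(1/N) * sum of masked-token log-likelihoods),
    where N is the total number of tokens in the corpus."""
    n = len(token_log_likelihoods)
    return math.exp(-sum(token_log_likelihoods) / n)

# Two hypothetical tokens to which the model assigns probabilities 0.5 and 0.25:
pppl = pseudo_perplexity([math.log(0.5), math.log(0.25)])  # sqrt(8) ≈ 2.83
```

Lower PPPL means the masked language model assigns higher probability to the held-out tokens, analogous to perplexity for autoregressive models.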

BLiMP
BLiMP is a challenging benchmark comprising a set of tests designed to evaluate the linguistic knowledge of LMs, with a specific focus on linguistic phenomena encompassing syntax, morphology, and semantics (Warstadt et al., 2020a). Originally, the benchmark consisted of 12 tasks (see Appendix A). Additionally, in the shared task (Warstadt et al., 2023), 5 more tasks were added to BLiMP as held-out tasks, aiming to assess the generalization capabilities of the submitted models. The random chance accuracy for all original BLiMP tasks is 50, while chance was not reported for the 5 additional supplement tasks.
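Zero-shot BLiMP evaluation reduces to checking, for each minimal pair, whether the model scores the grammatical sentence higher (e.g. by pseudo-log-likelihood). A sketch with a deliberately trivial stand-in scorer; the pairs and the scorer are invented for the demonstration:

```python
def blimp_accuracy(pairs, score):
    """Zero-shot minimal-pair evaluation: a model 'passes' a
    (grammatical, ungrammatical) pair if it assigns the grammatical
    sentence a higher score. `score` stands in for the model's
    sentence scorer (e.g. pseudo-log-likelihood)."""
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)

# Toy pairs; the stand-in scorer just prefers shorter strings.
toy_pairs = [("the cats sleep", "the cats sleeps there"),
             ("she runs", "she run far away")]
acc = blimp_accuracy(toy_pairs, score=lambda s: -len(s))
```

Because no finetuning is involved, this evaluation directly probes the linguistic preferences acquired during pretraining.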
According to the BLiMP scores in Table 3, within the Set A models, the models incorporating hierarchical inductive bias (SF s1 and SF s2 ) do not show consistent outperformance or underperformance in comparison to the baseline model TF base .
However, on average, the SF s1 model is on par with, and occasionally outperforms, the TF base model. In particular, SF s1 excels in the following tests: Argument Structure, Determiner Noun Agreement, Filler Gap, Irregular Forms, Quantifiers, and Subject Verb Agreement. Conversely, SF s1 underperforms TF base on the tasks of QA Congruence Easy, Subject Aux Inversion, and Turn Taking. We hypothesize that this is because syntactic knowledge is helpful for the former list of tasks, but to a lesser degree for the latter; for example, Turn Taking focuses on knowledge of discourse and dialogue structure, in particular of referential properties of NPs, which is not reflected in the syntactic structure. A sample pair from this dataset is "Should you quit?" - "No, I shouldn't." (good) versus "Should she quit?" - "No, I shouldn't." (bad). The negative and the positive data points have the same syntactic structure, and the dependents are perfectly fine as argument fillers.
While the model with a parser network in between the middle layers, SF s2, underperforms TF base on average, it interestingly demonstrates a noteworthy improvement on the specific task of Irregular Forms. Remarkably, similar to SF s1, SF s2 significantly outperforms TF base on this particular task. The task of Irregular Forms involves aspects of lexical decisions, but syntax of course also plays a role.
Within the RoBERTa model variations in Set B, the model with a parser network in between the middle layers, SR s2, again fails to improve over the one with a parser network ahead of the encoder layers, SR s1, on most of the tasks. It even gets worse with the increased number of convolution layers within the parser network in SR s2′. On the other hand, the increased number of convolution layers in SR s1′ also improves accuracies over SR s1. Overall, SR s1′ achieves the best average results among all the investigated models.
Moreover, the Set B models exhibit improvements over the Set A models in the tests of Binding, Det. Noun Agreement, Subject Verb Agreement, and QA Congruence Easy.
It is not so clear how to interpret the results of the two Question Answering (QA) Congruence tasks, where the baselines achieve only very low scores. For the QA Congruence Easy task, which tests for detecting selectional preference violations on object fillers in answers (e.g., "What did you sell? - A chair." (good) versus "What did you sell? - Sarah." (bad)), knowing about the syntactic structure of the first sentence probably helps to apply selectional restrictions and thereby assess the quality of the second as a possible reply. This might be the reason why we see an improvement in model performance in the SR models when adding implicit hierarchical information that reflects syntactic dependencies. The QA Congruence Tricky task is similar, except that the selectional preference that is violated in the negative data points does not refer to the direct object. Furthermore, the object is dropped in most examples, and sometimes the (incorrect) argument filler would be a plausible direct object (e.g., "Who ate? - Sarah ate." (good) versus "Who ate? - Pasta ate." (bad)). This is why the task is tricky. In this context, it is important to keep in mind that our StructFormer models learn only unlabeled dependencies and therefore cannot distinguish between object and subject. This means that for "Pasta ate", a structure would be implicitly predicted in which "pasta" is a dependent of "ate", which is perfectly fine semantically (as a direct object). This might be a reason why the structformer models struggle with this test, partly leading to a decrease in performance compared to our baseline, since the unlabeled dependency tree actually licenses the negative data points.

SuperGLUE
SuperGLUE consists of eleven diverse tasks (see Appendix A) which evaluate various performance aspects. These tasks include sentiment analysis, linguistic acceptability judgments, entailment detection, and semantic similarity evaluations of words within contexts, among others (Wang et al., 2019).
The scores (see Table 4) fall in a narrow range across all the investigated models for most of the tasks. The incorporation of a hierarchical inductive bias does not show clear improvements on most of the tasks. A noticeable result for the models with a parser network within the middle layers (s2) is the MRPC task, where the s2 models consistently outperform the s1 models in both sets. The increased number of convolution layers also does not show a clear improvement in most of the tasks for either the SR s1′ or the SR s2′ model.
Notably, in the case of the WSC task, we observe that all models' predictions heavily favored one specific class.This raises concerns about the success of the finetuning process for this particular task.

MSGS
The MSGS tasks, listed in Appendix A, were introduced by the shared task as held-out tests specifically designed to evaluate generalization capabilities. Detailed information and further insights about these tasks are expected to be disclosed in an upcoming publication. MSGS tasks are measured using the Matthews correlation coefficient (MCC), introduced by Matthews (1975) and used in machine learning as a measure of the quality of binary (two-class) classifications. The MSGS results (Table 5) resemble the SuperGLUE results. The models incorporating a hierarchical inductive bias show inconsistent behavior across the different tasks: while for some tasks, e.g. Control Raising (Control), Relative Position (Control), and Syntactic Category (Relative Position), SF s1 and SF s2 strengthen the correlation in comparison to the baseline model, for other tasks, e.g. Lexical Content (Control), Main Verb (Lexical Content) and Syntactic Category (Lexical Content), SF s1 and SF s2 weaken the correlation.
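MCC summarizes the full binary confusion matrix in a single value in [-1, 1], where 0 corresponds to chance-level prediction. A minimal implementation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (Matthews, 1975) from the
    counts of true positives, true negatives, false positives, and
    false negatives. Returns 0.0 when any marginal is empty, the
    common convention for the undefined case."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike plain accuracy, MCC stays at zero for a classifier that collapses onto one class, which is why it is the preferred metric for the heavily ambiguous MSGS classification settings.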

Aggregation
Analyzing the performance changes across 39 tasks for 7 different models is a complex process. To simplify the assessment and present a concise summary of each model's overall performance, we report an aggregate score over all 39 scores for each model (Table 6). This aggregation was computed internally by the shared task submission platform to represent each model with a single score, providing a more straightforward evaluation of overall performance. Subsequently, we select the model with the best aggregate score, SR s1′, to represent our submission in the shared task.

Discussion
Although the evaluation pipeline of the shared task was meticulously designed to encompass a comprehensive analysis of pretrained LMs, covering aspects of efficiency, applied NLP standards, cognitive science, linguistics, and language acquisition (Warstadt et al., 2023), it was discussed in Warstadt et al. (2020a) that some tasks that involve semantic phenomena, such as Island Effects and NPI Licensing, are very difficult for LMs in general. The consistently low performance observed across all models on these tests can be attributed to this. As a result, we refrain from considering the aggregate score as a single definitive metric for comparing one model's performance to another's. Instead, we advocate for a thorough investigation of individual tests while considering each test's objectives, dataset, and evaluation strategy.
Overall, the models incorporating a hierarchical inductive bias did not show significant improvements in the scores of the BabyLM evaluation tasks. However, some evaluation tasks do show improved scores for the structformer and structroberta models, which encourages a deeper investigation of patterns in the output predictions that might lead to a different conclusion. Namely, the tasks that we think are worth further investigation are: Argument Structure, Determiner Noun Agreement, Filler Gap, Irregular Forms, Quantifiers, Subject Verb Agreement, Control Raising (Control), Relative Position (Control), and Syntactic Category (Relative Position).
Contrary to our expectations, placing the parser in between the middle attention layers has not demonstrated notable improvements, but rather a decline in performance compared to the models with the parser placed right after the input embedding layer. We can only speculate about why this is so. It might be an advantage to push the model very early towards identifying structural relations between words, more precisely, to do so at a stage where the contributions of the single tokens are still separated from each other. The parser network placed between the middle layers acts at a moment where single-token contributions are already blurred.
To understand the effect of placing the parser network within the middle layers, we propose probing the layers of the Front and Rear modules and comparing them to the corresponding layers in the model where the parser network is placed ahead of the attention layers.Such a comparative analysis can provide valuable insights and either support or contradict our hypothesis regarding the learning of syntactic features in the middle layers of transformer models.
Regarding the aim of achieving competitive scores on the shared task challenge, the best score we could attain was from the model structroberta s1′, which is an upscaled version of structroberta s1.

Conclusion
In this paper, we extend the work of Shen et al. (2021) to explore the capabilities of the StructFormer architecture as an example of employing a hierarchical bias to address the challenges posed by relatively small LM pretraining datasets. Furthermore, we modify the StructFormer architecture to examine whether integrating the hierarchical bias within the middle attention layers leads to performance improvements. To accomplish these objectives, we pretrain seven model variants using the same dataset and configuration settings. We evaluate these models on 39 different tasks. The evaluation outcomes reveal varying behavior across the models, exhibiting inconsistencies in performance. We could not show strong evidence that models incorporating a hierarchical bias perform better in the context of this shared task, nor could we show practical evidence for the claim that syntactic information is better represented in the middle attention layers within the scope of our experiment. We have noted substantial enhancements on certain tasks when models incorporate a hierarchical bias in their architectural designs. Nonetheless, to ensure the reliability of our findings and to eliminate potential confounding factors related to the varying number of parameters in each model, as well as the distinct objectives and complexities of individual tasks, we intend to carry out an in-depth analysis of each model's performance on a task-by-task basis.

Figure 2:
Figure 2: In-between Parser Architectures (s2); dotted lines indicate the two positions at which the encoder layers are split, with the parser network connecting the two parts of the encoder

Table 1:
Tokenizer Vocabulary Size Experiments

Table 4:
(Super)GLUE Results. Values are not aggregated across each model due to the presence of different metrics (Accuracy, F1 score, and MCC)

Table 6 :
Shared Task Leaderboard Results