Evaluating Neural Language Models as Cognitive Models of Language Acquisition

The success of neural language models (LMs) on many technological tasks has raised their potential relevance as scientific theories of language, despite some clear differences between LM training and child language acquisition. In this paper we argue that some of the most prominent benchmarks for evaluating the syntactic capacities of LMs may not be sufficiently rigorous. In particular, we show that template-based benchmarks lack the structural diversity commonly found in theoretical and psychological studies of language. When trained on small-scale data modeling child language acquisition, LMs can be readily matched by simple baseline models. We advocate for the use of readily available, carefully curated datasets that have been evaluated for gradient acceptability by large pools of native speakers and are designed specifically to probe the structural basis of grammar. On one such dataset, the LI-Adger dataset, LMs evaluate sentences in a way inconsistent with human language users. We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.


Introduction
The growth of neural language models (LMs) for NLP over the past decade has been followed by a growth in research on the potential of these models to provide insights into the cognitive aspects of human language acquisition, representation, and processing (Linzen and Baroni, 2021). Good, even human-like, performance on NLP tasks does not necessarily imply that LMs solve these tasks in human-like ways, so computational linguists have designed a wide variety of experimental paradigms to probe specific properties of the models' linguistic knowledge (Linzen et al., 2016a; Chowdhury and Zamparelli, 2018; Gulordava et al., 2018; Wilcox et al., 2018; McCoy et al., 2020; Hu et al., 2020; Warstadt et al., 2020; Papadimitriou et al., 2021; Huebner et al., 2021). These range from ways of classifying or extracting structures from internal representations (e.g., Hewitt and Manning, 2019; Tenney et al., 2019; Tucker et al., 2021; Papadimitriou et al., 2021), to building tasks inspired by psycholinguistic processing studies and the classic acceptability rating task that theoretical linguists use to infer grammatical knowledge (e.g., Linzen et al., 2016a; Warstadt et al., 2020; Huebner et al., 2021; Sinclair et al., 2022).

[Figure 1 caption: Human performance is marked by the vertical line. Baby=BabyBERTa, CHI=AO-CHILDES, News=AO-NEWSELA, Wiki=Wikipedia-1.]
Of these approaches, acceptability rating may be the most popular. Large acceptability rating data sets focusing on syntax, semantics, and morphology, such as BLiMP (Warstadt et al., 2020), SyntaxGym (Gauthier et al., 2020), and CoLA (Warstadt et al., 2019), lend themselves to benchmarking, and these sit alongside myriad smaller-scale studies focused on specific linguistic phenomena (e.g., Linzen et al., 2016b; Marvin and Linzen, 2018; Wilcox et al., 2018). Results have been impressive for the most part. It appears, from the logic of these studies, that many state-of-the-art neural models are capable of inducing human-like grammatical knowledge from unannotated data, much as children do during language acquisition.

Implications for Language Acquisition?
Neural model training differs from human language acquisition in key ways, perhaps most obviously in that most models are trained on orders of magnitude more input (in plain text form) than humans receive (in spoken or signed form): BERT was trained on about 3.3B words, and Chinchilla on 1.4T, while an English-learning child receives only about 10M words per year, with a vocabulary measured in the hundreds at age three (Fenson et al., 1994; Bornstein et al., 2004).
Recent studies have begun to address this. Can we build models that learn from input on the scale of language acquisition? Would these models then inform our understanding of human language acquisition? Warstadt and Bowman (2022) favor this perspective. They argue that a computational model that performs well on behavioral probing benchmarks when trained on ablated input, at least as limited as a human learner's, is evidence that the model is a good proxy for human linguistic knowledge. Huebner et al. (2021) showed that a specially tuned model trained on only 5M tokens of child-directed speech (CDS) performs well on a purpose-designed data set. And in 2023, an aptly-named shared task, the CoNLL/CMCL BabyLM Challenge (https://babylm.github.io/), is asking participants to train on only 100M words (about the input of an adolescent) before testing on acceptability benchmarks.

Goals of the Paper
A push towards extracting performance from smaller training data is a welcome change for the field. In addition to its possible cognitive implications, the drive will also benefit efficient NLP and NLP for low-resource languages. However, while we look forward to the impending engineering advances, we also urge caution in the approaches used to draw scientific conclusions about the nature of neural models' linguistic knowledge. In particular, we take issue with Warstadt and Bowman (2022)'s assertion that "positive results from model learners are more meaningful than negative results." Their reasoning follows that of an existence proof. If a model that strictly lacks any advantages over humans nevertheless succeeds at a task that requires human-like linguistic knowledge, then it is proof that there exists at least one model with human-like linguistic knowledge. A failure only tells us that this model failed for some reason that may or may not be relevant to the question at hand. However, this line of reasoning requires faith in the evaluation. If there are any potentially unrecognized non-human-like ways to succeed at the task, or if the task does not truly reflect acquisition, or if the task does not actually test a relevant structural property of language, then a positive result becomes inconclusive at best. Unexpected shortcuts emerging from unforeseen biases in evaluation abound across NLP (Chao et al., 2018; McCoy et al., 2019; Wang et al., 2022), so this is a realistic concern. Even the underlying reasoning that "if a (neural) model X behaves like cognitive system Y, then it is equivalent to Y" may be fraught (Guest and Martin, 2023).
In this paper, we evaluate LMs as models of language acquisition on two benchmarking data sets: the widely used Benchmark of Linguistic Minimal Pairs (BLiMP; Warstadt et al., 2020), which also forms part of the evaluation for the BabyLM Challenge, and Zorro (Huebner et al., 2021), a data set inspired by BLiMP with a restricted vocabulary for acquisition-inspired models trained only on CDS.
Section 2 reviews the nature of linguistic knowledge and child language acquisition. In Section 3, we introduce the BLiMP and Zorro benchmarks and subject them to baseline tests by simple non-human-like models. These establish several weaknesses in the organization and content of both benchmarks. In Section 4, we evaluate neural models on a more challenging data set derived directly from theoretical linguistics papers. We find that LMs are not necessarily human-like in terms of within- and across-model variability. Finally, Section 5 concludes with a discussion of the logical problem of behavioral probing. We argue for (a) benchmarks that better probe the structural knowledge of syntax, (b) tests that reflect the developmental findings of language acquisition, and (c) more baseline models.

Knowledge of Language and its Acquisition
One of the goals of linguistic theory is to characterize the properties that distinguish grammatical from ungrammatical sentences in a language. The empirical study of grammaticality, however, mainly relies on native speakers' acceptability judgments, which interact with other cognitive and perceptual systems and generally produce gradient results. For example, longer and more complex sentences, even when fully grammatical, are rated as less acceptable than shorter and simpler sentences. Nevertheless, large-scale investigations have established the structural basis of a categorical grammar (Sprouse and Hornstein, 2013). For example, syntactic constraints that prohibit certain transformational processes are shown to have a "super-additive" effect that goes beyond the acceptability penalties due to sentence length and other non-structural factors. Furthermore, acceptability judgments collected at scale are highly consistent with the data reported in the theoretical literature, typically gathered informally with few consultants (Sprouse and Almeida, 2012; Sprouse et al., 2013; Sprouse and Almeida, 2017).
The structural basis of language and its uniformity across the linguistic community can be better appreciated from the perspective of child language acquisition. Recent years have seen renewed interest in individual differences across child learners (Kidd et al., 2018), especially with respect to vocabulary acquisition (Frank et al., 2021). It is at least possible that children differ in their cognitive abilities for language and learning, but it is empirically obvious that they differ in their experience with language. Longitudinal records of child language development have made it possible to track both children's vocabulary growth and the development of the structural aspects of their grammar. In the Providence Corpus (Demuth et al., 2006), for example, six children were recorded at regular intervals from age 1 to 3. On average, fewer than 20% of the first 100 words are shared between any two children. The overlap rises only to about 40% for the first 1,000 words, which is the upper limit of a three-year-old's vocabulary size (Hart and Risley, 1995; Bornstein et al., 2004). Yet these children's grammars are highly uniform even at this stage. Major syntactic categories, word order and argument structure, and the core morphological rules are firmly established before age three (Brown, 1973) on the basis of at most around 10M words per year (Hart and Risley, 1995) and a vocabulary of only a few hundred types (Fenson et al., 1994), and all children produce similar grammatical errors during this time.
Recent decades have also seen a convergence between the psychological and formal study of language development and the quantitative study of language variation in early childhood. The sociolinguist Bill Labov remarks that "The end result is a high degree of uniformity in both the categorical and variable aspects of language production, where individual variation is reduced below the level of linguistic significance" (2012).
The acquisition of vocabulary and grammar provides clues for investigating the capacities of LMs. Vocabulary learning is a matter of rote learning. This includes not just the arbitrary pairing of sounds and meanings, but also morphological processes (e.g., irregularity) and syntactic structures (e.g., sub-categorization, collocations, etc.). There is no escape from experience: more data results in better learning. But the structural aspects of the grammar are different. They require forming generalizations over the vocabulary.
The distinction between rote learning and structural learning (words vs. rules) is not well reflected by existing LM benchmarks, including those discussed in this paper. In practice, these benchmarks are a mixture of tests for both vocabulary learning and grammar learning. Moreover, they are stochastically generated by templates: as such, a large number of test sentences are immediately available, but they lack the structural diversity that has proven revealing in the theoretical study of grammar.
Furthermore, the sentences are sometimes highly unnatural and semantically/pragmatically uncontrolled, which is precisely the confounding factor that linguists seek to neutralize when attempting to uncover the structural basis of language. Warstadt et al. (2020) are aware that their templates generate unnatural sentences, presenting the BLiMP sentence 'Sam ran around some glaciers.' as an example. We found similar issues in Zorro, such as 'the lie on the foot is flat .', the first sentence in Zorro's across_prepositional_phrase paradigm (lie is a noun). The BLiMP authors state that this is not a problem because it affects both sentences in a pair, but how can we rule out unintended interactions between the grammatical phenomenon under evaluation and the semantic implausibility? Sprouse et al. (2018) find that semantic implausibility may affect judgments of sentence well-formedness, even in the Forced Choice (FC) task used to collect the human baselines in BLiMP.
Indeed, there already exists a large amount of carefully curated linguistic material that is not only structurally diverse but also minimizes lexical and semantic confounds. Furthermore, these datasets (e.g., the LI-Adger dataset; Section 4) have been evaluated for acceptability at an individual level by a large pool of native speaker subjects and show very high convergence rates across individuals. They will be especially informative if we are to explore the structural knowledge of LMs.
Re-examining the Benchmarks

BLiMP (Warstadt et al., 2020)

Warstadt et al. (2020) introduce the Benchmark of Linguistic Minimal Pairs (BLiMP) as a means of evaluating the linguistic knowledge of neural language models. BLiMP extends the reasoning of earlier studies (e.g., Linzen et al., 2016b; Marvin and Linzen, 2018; Wilcox et al., 2018) which use a minimal pair paradigm to approximate acceptability judgments. Instead of prompting for acceptability judgments on individual sentences, as is most commonly done with human subjects, they present an LM with two sentences that differ in only one structural or lexical property. For a given minimal pair m_i consisting of an acceptable sentence s_{i,1} and an unacceptable sentence s_{i,2}, if an LM evaluates P(s_{i,1}) > P(s_{i,2}), then the model has succeeded on m_i. An LM is scored according to the percentage of all the minimal pairs for which it identified the acceptable sentence. The minimal pair approach allows for the direct evaluation of LMs without training a binary classifier on top of them, as was necessary for previous acceptability benchmarks (e.g., CoLA; Warstadt et al., 2019).
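The pair-level criterion is simple to state procedurally. The sketch below illustrates the scoring scheme; the log-probability lookup and its toy values are hypothetical stand-ins for a real LM, not part of the BLiMP release.

```python
def pair_correct(logprob, good, bad):
    """A minimal pair counts as correct when the model assigns the
    acceptable sentence a higher log-probability than the unacceptable one."""
    return logprob(good) > logprob(bad)

def benchmark_accuracy(logprob, pairs):
    """Percentage of minimal pairs on which the model prefers the
    acceptable sentence -- the summary metric used by BLiMP and Zorro."""
    correct = sum(pair_correct(logprob, g, b) for g, b in pairs)
    return 100.0 * correct / len(pairs)

# Toy stand-in for an LM's sentence log-probabilities (hypothetical values).
toy_scores = {"the cats meow": -5.2, "the cats meows": -9.8,
              "he saw himself": -4.1, "he saw herself": -3.9}
pairs = [("the cats meow", "the cats meows"),
         ("he saw himself", "he saw herself")]
print(benchmark_accuracy(toy_scores.get, pairs))  # 50.0
```

Note that the criterion is purely relative: a model can "succeed" on a pair while assigning both sentences very low probability.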
Minimal pairs need to be carefully constructed to control for length and lexical frequencies. BLiMP aims to accomplish this with automatic generation from templates but, as we discuss, this often yields sentences with low structural diversity and implausible semantics. The benchmark corpus includes data sets for 12 linguistic phenomena, including ANAPHOR AGREEMENT, ARGUMENT STRUCTURE, BINDING, CONTROL/RAISING, and others listed in the Appendix. These are further divided into 67 paradigms, each containing 1,000 sentence pairs, which are meant to test variants of the phenomena; for example, the phenomenon DETERMINER-NOUN AGR. contains 6 paradigms for adjacent agreement, agreement with irregular nouns, and agreement with intervening adjectives. BLiMP has become a standard NLP benchmark for this task and will be used as part of the test data for the upcoming BabyLM Challenge.
Zorro (Huebner et al., 2021)

Huebner et al. (2021) explicitly aim to evaluate the relationship between LMs and the acquisition of grammar. They introduce BabyBERTa_AO-CHILDES, "an acquisition-friendly version of RoBERTa," trained on English child-directed/produced speech (CDS) approximating the total input of a typical English-learning six-year-old. They train variants on only the CDS from AO-CHILDES (Huebner and Willits, 2021), a pre-processed version of English CHILDES (MacWhinney, 1991), as well as variants on larger datasets from other sources.
Because BabyBERTa_AO-CHILDES (henceforth BabyBERTa) was trained on much less text than typical large transformer models are, its vocabulary is much smaller. To mitigate the impact of out-of-vocabulary (OOV) items on their tests, the authors introduce a new grammaticality test suite, Zorro, in the style of BLiMP. Sentence pairs are generated for one paradigm each for 11 of BLiMP's 12 phenomena, along with two additional phenomena. However, we show that the Zorro sentences are not only lexically simpler as intended, but their templates are also far less complex and even less varied than the sentences in the corresponding BLiMP phenomena. Full lists of paradigms for each data set can be found in the Appendix, and the full data sets themselves are made available by the benchmarks' original authors.

Linear Baselines
As noted earlier, the BLiMP and Zorro tests are stochastically generated with category-based templates. This way, a large number of examples can be generated and tested, but the drawback is that all examples within a paradigm share essentially the same structure. Moreover, many of the structures are simple, falling considerably below the coverage of modern syntactic analyses. In fact, many examples appear solvable by strictly linear methods. The observation that such template-generated examples can be solved this way is not new to the field. For example, Kam et al. (2008) demonstrated that a bigram model predicts the grammatical sentence from template-produced pairs featuring auxiliary inversion (a structural phenomenon) as well as the neural models of the time.
To take an example from BLiMP, within its SUBJECT-VERB AGR phenomenon, four of six paradigms evaluate string-adjacent subject-verb agreement that could be captured by a bigram model. The remaining two include intervening distractor nouns, but in both these and the string-adjacent paradigms, the target noun is consistently the first (leftmost) noun. A single linear rule, albeit a long-distance one, is sufficient to succeed on this phenomenon. In ANAPHOR AGREEMENT, none of the sentences has any distractors at all: the test is solely about whether the anaphor (e.g., himself/herself) agrees with the first, and only, noun preceding it in the sentence. Success on such simple tests tells us little about the genuine grammatical capacity of LMs and distorts or dilutes summary metrics calculated over the benchmark.
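To make the point concrete, the non-structural strategy can be written down directly. The sketch below uses simplified Penn-style tags, and the tagged sentence pair is a hypothetical construction of our own, not an actual BLiMP item.

```python
def leftmost_noun_rule(tagged):
    """Linear 'agree with the leftmost noun' rule: accept a sentence if
    its first finite verb matches the number of the first (leftmost)
    noun, ignoring any intervening distractor nouns."""
    first_noun = next(tag for _, tag in tagged if tag in ("NN", "NNS"))
    verb = next(tag for _, tag in tagged if tag in ("VBZ", "VBP"))
    return (first_noun == "NN") == (verb == "VBZ")

# Hypothetical pair with an intervening distractor noun ("senators"):
good = [("the", "DT"), ("author", "NN"), ("of", "IN"), ("the", "DT"),
        ("senators", "NNS"), ("is", "VBZ"), ("tall", "JJ")]
bad = [("the", "DT"), ("author", "NN"), ("of", "IN"), ("the", "DT"),
       ("senators", "NNS"), ("are", "VBP"), ("tall", "JJ")]
print(leftmost_noun_rule(good), leftmost_noun_rule(bad))  # True False
```

The rule is long-distance but entirely linear: no constituency or dependency structure is consulted, yet it suffices for the paradigm described above.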
We evaluate this problem quantitatively with two studies of linear rules that do not incorporate structural knowledge. We find that many, but certainly not all, paradigms are solvable with non-human-like linear approaches. These paradigms therefore do not contribute to the overall goal of evaluating whether an LM possesses linguistic knowledge. Additionally, we find that the paradigms of Zorro tend to be structurally even simpler and less internally varied than the parallel paradigms of BLiMP. It is a weaker benchmark even when accounting for the intended lexical simplicity.

N-Gram Models
The original BLiMP paper reports the accuracy of a 5-gram model trained on the 3.1B-token Gigaword Corpus (Graff et al., 2003) in addition to three neural LMs and human performance. They find that the 5-gram model scores above chance (50%) on all but two phenomena but is outclassed by most of the neural LMs on most paradigms. Performance for all LMs can vary widely across paradigms within one phenomenon. In some cases, there is a clear split between the 5-gram and neural models, suggesting that the latter capture some structural property of the paradigm that the 5-gram model does not, but in other cases, the 5-gram model performs well, demonstrating that linear rules can be sufficient for completing those tasks.
Revisiting SUBJECT-VERB AGR. as an illustrative example, the Gigaword 5-gram model performs only slightly behind the neural models on each string-adjacent paradigm but far below chance on the distractor paradigms. However, the neural models also perform up to 20.5 points better on the adjacent paradigms than on the distractor paradigms. The two distractor paradigms demonstrate that the neural models have learned a long-distance pattern (whether that be structural or "agree with the leftmost noun"), but the adjacent paradigms cannot show this. They, and about half of the BLiMP paradigms, are uninformative in this way.
We extend this approach to the language acquisition setting by training a 5-gram model only on AO-CHILDES and evaluating on both BLiMP and Zorro. We compare these results to BabyBERTa on these data sets. To further manage lexical effects while adding minimal complexity to the model, we evaluate both a 5-gram word model (5-word) and a 5-gram model trained only on POS tags (5-tag). AO-CHILDES was tagged using GPoSTTL, a rule-based POS tagger with tokenizer and lemmatizer based on the Brill Tagger (Brill, 1992). This was used to train sklearn's CRF POS-tagger, which was then used to label the benchmark corpora. This approach was taken to avoid bringing additional knowledge from a tagger trained on larger corpora into the benchmark corpora. The downside is that the tagger is not particularly accurate on the ungrammatical benchmark sentences, which may hurt performance for the 5-tag model. In addition to the 5-word and 5-tag models, we evaluate an oracle which marks a correct prediction if either 5-word or 5-tag makes a correct prediction. Our use of POS is motivated from a developmental perspective. Syntactic categories can be formed purely distributionally as early as infancy (Mintz, 2003; Shi and Melançon, 2010; Reeder et al., 2013), and children almost never make mistakes in their use of syntactic categories (Valian, 1986). It is thus plausible to assume that the acquisition of grammatical knowledge builds on a developmentally prior stage of syntactic category learning.
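The oracle is simply a disjunction over the two models' per-pair outcomes. A minimal sketch of the computation, with hypothetical outcome vectors standing in for real per-pair results:

```python
def oracle_accuracy(word_hits, tag_hits):
    """Oracle over the 5-word and 5-tag models: a minimal pair counts as
    solved if either model prefers the acceptable sentence. A high oracle
    score relative to both models signals that they make different errors."""
    assert len(word_hits) == len(tag_hits)
    solved = sum(w or t for w, t in zip(word_hits, tag_hits))
    return 100.0 * solved / len(word_hits)

# Hypothetical per-pair outcomes for six minimal pairs.
word_hits = [True, True, False, False, True, False]
tag_hits = [False, True, True, False, False, True]
print(round(oracle_accuracy(word_hits, tag_hits), 1))  # 83.3 (5 of 6 solved)
```

Because the oracle gets credit whenever the two models disagree and one is right, it upper-bounds both individual models and serves as a cheap measure of their error overlap.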
The results of the 5-gram experiments are summarized in Table 1 and laid out in detail in the Appendix. We draw three conclusions from these. First, the 5-gram models perform surprisingly well relative to the BabyBERTa transformer, despite their extremely non-human-like simplicity, when trained on the same AO-CHILDES data. Either 5-word or 5-tag, trained on the same data as BabyBERTa, outperformed BabyBERTa on 11 of 23 Zorro paradigms and 21 of 67 BLiMP paradigms. BabyBERTa's performance appears less impressive when presented alongside even this very weak baseline. (Refer to the Appendix for full details. We downloaded the publicly available model checkpoints from the BabyBERTa GitHub repository and replicated the BLiMP and Zorro results hosted on the Zorro GitHub repository.) The AO-CHILDES 5-gram models perform more poorly on BLiMP than the Gigaword 5-gram model does, but they still achieve high accuracy on several paradigms scattered across the phenomena. Second, the 5-gram oracle outperforms 5-word, 5-tag, and BabyBERTa. The 5-gram oracle is not a fair direct comparison but provides a summary metric for the correlation between 5-word and 5-tag. A high oracle score relative to the two 5-gram models indicates that they do not make the same errors. That is, errors are not necessarily attributable to the string-local limitations of 5-grams per se but rather to 5-gram sparsity or errors in tagging. The high oracle score is another sign that the paradigms often capture surface properties rather than structural properties that would stump 5-gram models.
Third, the 5-gram models outperform BabyBERTa on proportionately more Zorro paradigms than BLiMP paradigms. Additionally, the AO-CHILDES 5-word model achieved 78.91% performance on Zorro, while the Gigaword 5-gram model only reached 60.5% on BLiMP. If Zorro were merely accounting for the smaller vocabulary in the AO-CHILDES training data, we should expect much more similar performance on both of these measures. Taken together, these suggest that Zorro is a substantially weaker benchmark than BLiMP, and that it more greatly overestimates the apparent positive results of the acquisition-inspired BabyBERTa.

Hand-Written Simple Rules
In addition to reporting results on 5-gram models, we created simple hand-written rules which demonstrate that the probes are solvable in principle without reference to linguistic structure. While we do not claim that such rules are akin to the state of knowledge in LMs, it is also difficult to completely rule out this possibility. On the one hand, it is still unclear how to interpret the representation of linguistic knowledge in LMs. On the other, the vast majority of training data, at least the child-directed speech relevant to language acquisition, is structurally simple and can in fact be handled by rule-like pattern matchers. In English CDS, the distribution of anaphora is exceedingly straightforward: almost all instances of himself are preceded in the sentence by the subject pronoun he and a (male) noun phrase with no co-referential competitors. For comparison, Zorro adjunct_island can be solved perfectly by always selecting the sentence where the third-to-last word is the, and many of the paradigms can be solved by tracking the index of a specific word. Others can be solved by checking for the presence of a certain word. For example, the superlative paradigm can be solved by accepting the sentence that contains either more or fewer. For both Zorro and BLiMP, more than one paradigm can often be solved with the exact same rule. We write simple linear rules for each Zorro and BLiMP paradigm. See the Appendix for a full list of rules.
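Two of the rules just described, written out as code. Tokenization is plain whitespace splitting, and the example sentence pair is a hypothetical construction for illustration, not an actual Zorro item.

```python
def prefer_third_last_the(s1, s2):
    """adjunct_island rule: pick the sentence whose third-to-last
    token is 'the'."""
    def match(s):
        toks = s.split()
        return len(toks) >= 3 and toks[-3] == "the"
    return s1 if match(s1) else s2

def prefer_more_or_fewer(s1, s2):
    """superlative rule: accept the sentence containing 'more' or 'fewer'."""
    def match(s):
        return bool({"more", "fewer"} & set(s.split()))
    return s1 if match(s1) else s2

# Hypothetical illustration: the rule fires on token position alone.
print(prefer_third_last_the("they slept in the old barn",
                            "they slept in old the barn"))
```

Neither rule inspects anything beyond surface token identity and position, which is exactly why success by such rules says nothing about structural knowledge.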
In summary, these rules yielded 93.97% accuracy on Zorro and solved 14 of 23 Zorro paradigms with 100% accuracy. Each agreement_ paradigm is solved with at least 96% accuracy, with the remaining errors due to two irregular nouns, feet and children, which do not end in the -s referenced by these rules. The lowest performance is 52.75% on anaphor_agreement-pronoun_gender, a paradigm that requires an LM to 'know' the canonical gender of English names in order to choose himself or herself. The test sentence pairs were not quite balanced, so always guessing himself earns more than 50%.
BLiMP proved more challenging. The rules yielded only 84.35% accuracy on average and achieved perfect scores on 14 of 67 paradigms. The overall high score of the hand-written simple linear rules suggests that BLiMP suffers from the same lack of sentence variety that Zorro does, but the lower accuracy indicates that the problem is not quite as severe. In principle, we could have composed more complex rules which achieved perfect accuracy on all paradigms; however, these simpler rules better illustrate our points. The success of non-human-like simple linear rules on most paradigms of both benchmarks further emphasizes that success on a template-based behavioral task does not necessarily imply that an LM possesses linguistic knowledge.

The LI-Adger Dataset

… (Adger) phenomena where each minimal pair is lexically matched. We provide an example of each in Table 2.
The LI-Adger dataset improves upon the prior two datasets in three key ways. First, unlike BLiMP and Zorro, the LI-Adger sentences are controlled for semantic implausibility, which has been shown to be a strong confounding factor when eliciting human judgments (Sprouse et al., 2018). Second, the 255 total pairwise and multi-condition phenomena achieve much more diverse coverage of syntactic phenomena than the 67 paradigms in BLiMP and the 23 paradigms in Zorro. Third, the human judgments were collected using the Magnitude Estimation (ME) task (and a Likert Scale (LS) in the case of the LI sentences) in addition to the Forced Choice (FC) task used for the BLiMP human baselines. We believe this to be a crucial advantage because the FC task treats sentence acceptability as functionally categorical: a sentence is only acceptable or not relative to its minimal pair counterpart, whereas tasks such as ME allow us to make comparisons within and across minimal pairs, thereby treating sentence acceptability as a gradient measure.
With this dataset, we conduct the following two tests. First, in line with Vázquez Martínez (2021), we sort the LI-Adger dataset into 2,391 unique minimal pairs. We then collect pseudo-log-likelihood scores for each sentence from several models evaluated by Huebner et al. (2021) and score them using the same criterion as BLiMP and Zorro. As a baseline for the models, we include Log-Likelihood and Syntactic Log-Odds Ratio (SLOR; Pauls and Klein, 2012; Lau et al., 2017) scores from a trigram model trained on the British National Corpus (BNC; 100M words) by Sprouse et al. (2018).
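For reference, SLOR normalizes a sentence's model log-probability by subtracting the summed unigram log-probabilities of its words and dividing by sentence length, so that rare vocabulary is not penalized as if it were ungrammaticality. A minimal sketch with hypothetical numbers:

```python
def slor(sent_logprob, unigram_logprob_sum, length):
    """Syntactic Log-Odds Ratio (SLOR): sentence log-probability with the
    unigram (lexical frequency) contribution subtracted out, normalized
    by sentence length."""
    return (sent_logprob - unigram_logprob_sum) / length

# A 5-token sentence scored at log P = -20.0 by the model, whose words'
# unigram log-probabilities sum to -30.0 (hypothetical values):
print(slor(-20.0, -30.0, 5))  # 2.0
```

A positive SLOR indicates the model finds the sentence more probable than its word frequencies alone would predict, which is the sense in which SLOR targets structure rather than vocabulary.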
We include the results of this test in Figure 1. We observe that all models fall further below the human baseline than they do on BLiMP (no human baselines were reported for Zorro). More importantly, we observe that the trigram model scored using SLOR performs on par with the BabyBERTa models and approaches the performance of RoBERTa (Liu et al., 2019) trained on 10M words. If we were to adopt the "positive results from model learners are more meaningful than negative results" argument, then the trigram model is as suitable a model of language acquisition as BabyBERTa is.
Raw accuracy notwithstanding, we proceed to conduct a novel test of judgment variability on our collection of LMs. We take advantage of the structure of the LI-Adger dataset in the following way: there are 519 sentence types, and for each type there are eight sentences that retain the same syntactic structure but vary the lexical items at the locus of the syntactic structure tested (e.g., the head of a verb phrase from which extraction takes place). These datasets thus allow us to contrast the consistency of human judgments across and within construction types against that of the LMs.
We z-score the LM judgments to make them comparable to the human judgments. Then, for each set of eight sentences, we take the mean and standard deviation of all the judgments for humans and for each LM. We find that the models are much more variable in their judgments: the human judgments, on average, vary by 0.288 standard deviation (std. dev.) units within a given set of sentences. The least variable LM is BabyBERTa Wiki, varying by 0.451 std. dev. units on average. The rest of the models nearly double the variability of the human judgments, ranging from 0.518 for RoBERTa-10M-1 to 0.554 for BERT-large-cased. Variability appears to increase rather than decrease as training size and performance increase. Surprisingly, the trigram model, when scored using log probabilities, is the closest in variability to the human judgments at 0.331 std. dev. units, but also the furthest when scored using SLOR, with a variability of 0.599. Once again we find that a positive result on one test or another is not enough to draw positive conclusions.
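The within-set variability measure can be sketched as follows. The judgment vectors are hypothetical, and the choice of the sample standard deviation is an assumption on our part.

```python
import statistics

def mean_within_set_std(judgments, set_size=8):
    """Average the standard deviation of (z-scored) judgments within each
    set of `set_size` lexically varied sentences sharing one syntactic
    structure. Lower values mean more consistent judgments across the
    lexical variants of a construction."""
    sets = [judgments[i:i + set_size]
            for i in range(0, len(judgments), set_size)]
    return statistics.mean(statistics.stdev(s) for s in sets)

# Hypothetical z-scored judgments: a consistent rater vs. a variable one,
# over two sets of eight sentences each.
consistent = [0.10, 0.20, 0.10, 0.15, 0.20, 0.10, 0.12, 0.18] * 2
variable = [1.2, -0.8, 0.3, -1.1, 0.9, -0.2, 0.7, -1.0] * 2
print(mean_within_set_std(consistent) < mean_within_set_std(variable))  # True
```

Because the eight sentences in a set share their structure, a rater (human or model) whose judgments track structure rather than lexical items should show a low within-set standard deviation.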
For further illustration, we correlate the means and standard deviations of the 512 sentence types across each LM and humans and plot the results in Figure 2. In terms of both means and standard deviations, we observe generally high correlations between the various neural LMs, but much lower correlations between the LMs and humans. This suggests that whatever the LMs are doing, good or bad, does not appear to be human-like. Interestingly, the BabyBERTa LMs show very high correlations with the naive trigram log-likelihood scores and very low correlations with the trigram SLOR scores, raising further suspicions that these small acquisition-inspired LMs behave like a very non-human-like model.

Discussion
It is widely recognized that children acquire language in ways that appear quite different from LM training. There is a growing realization that the cognitive relevance of LMs can only be established in a comparable setting. Bringing down training size requirements stands not only to improve the applicability of such models to the study of language acquisition but also to benefit efficient NLP and NLP for low-resource languages.
However, in this paper, we observed several weaknesses in BLiMP and Zorro, two minimal pair benchmarks for evaluating the linguistic knowledge of neural language models. We believe that it is worth critically revisiting the underlying assumption that positive results on such benchmarks demonstrate human-like representations or human-like language acquisition, especially if an evaluation can be solved in unintended ways or if it does not reflect an adequately broad range of linguistic structures. It is unlikely that a behavioral probe, such as these large binarized benchmarks, can fully capture the complexity of linguistic knowledge. To this end, we made a case for also evaluating with curated benchmarking datasets: the gradient acceptability judgments from human subjects make these effective probes for the structural basis of grammar. A range of tests, from carefully constructed tests of grammaticality to probes correlating the internal states of LMs with their predictions, needs to complement theoretical, psycholinguistic, and neurolinguistic studies before a meaningful cognitive claim about the nature of neural language models can be made.
We end with some broader discussion of language acquisition and the cognitive interpretation of computational models. While it is now widely recognized that children learn language with only a fraction of the data needed for large LM training, merely reducing the amount of training data, such as to the 100M-word threshold in the BabyLM Challenge, still falls short of the requirements for an adequate model of language acquisition. While it is true that a native speaker's knowledge of language can be established on the basis of approximately 100 million words, child language research makes clear that not all aspects of linguistic knowledge are learned at the same time. Some, such as inflectional morphology, case marking, word order, and major transformations, are acquired very early in all languages studied so far (e.g., Brown, 1973; Slobin, 2022), at an order of magnitude fewer words of input, while others are learned rather late: these include derivational morphology (Jarmulowicz, 2002), passivization (Pinker et al., 1987), control and cleft structures (Chomsky, 1969), and the dative constructions (Gropen et al., 1989) in the case of English, though these may emerge much earlier in other languages. This suggests that successful learning in the limit (e.g., at 100M words) is not sufficient. For example, while a neural model of the English past tense (Kirov and Cotterell, 2018) eventually learns the "add -ed" rule, it does so only with over 3,000 verb lemmas. By contrast, children learn that rule before or around age three (Kuczaj, 1977), when their vocabulary contains only around 300 verbs (Marcus et al., 1992). To serve as cognitive models of language, it is thus important to compare the training trajectory of LMs, as a function of training data volume, against the developmental benchmarks of specific linguistic phenomena that have been amply documented in the empirical literature on child language.
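One minimal way to operationalize such a trajectory comparison is sketched below; the checkpoint accuracies and the 0.9 criterion are invented for illustration, not measurements.

```python
# Hypothetical probe accuracies for checkpoints of one LM, keyed by
# the number of word tokens seen in training (not real results).
trajectory = {1_000_000: 0.52, 5_000_000: 0.71,
              30_000_000: 0.90, 100_000_000: 0.95}

def words_to_criterion(trajectory, threshold):
    # Smallest training volume at which the model first reaches the
    # criterion -- comparable to the input volume at which children
    # are documented to command the same phenomenon.
    for n_words in sorted(trajectory):
        if trajectory[n_words] >= threshold:
            return n_words
    return None  # criterion never reached

volume = words_to_criterion(trajectory, 0.9)  # 30_000_000 here
```

Repeating this per phenomenon yields a model "acquisition schedule" that can be laid against the developmental orderings reported in the child language literature.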

Limitations
Our study is about the limitations of evaluation, so it is to be expected that our study has limits as well. Most obviously, ours, like any study, would benefit from testing and reporting on a wider range of neural models and a wider range of baselines. And like most work in this area, our evaluations were performed only on English. We recommend the use of the LI-Adger dataset. Like any behavioral probe, including the ones we criticize, it can be subject to ambiguous interpretation. It has some substantial advantages, as we discuss in this paper, but also a couple of additional drawbacks.

It is smaller than BLiMP or Zorro, and it has not been annotated by phenomenon. Nevertheless, it provides additional insights that those benchmarks do not. As discussed in the paper, we recommend its use in conjunction with a wide range of other evaluation methods.

Table 2: Top: Two pairwise phenomena from the Linguistic Inquiry (LI) dataset. Bottom: One multi-condition phenomenon from the Adger dataset. The ME Z-score is the averaged Z-score transformation of the human Magnitude Estimation judgments for each sentence across all experimental participants.

Table 3: Word and tag-level 5-gram models trained on AO-CHILDES, plus the 5-Gram Oracle and Simple Linear Rule Oracle, for Zorro. 5-Gram and Simple Rule scores greater than those of BabyBERTa_AO-CHILDES are bolded.

Table 4: Simple Linear Rule descriptions for Zorro. Rules that require sentences to be compared are marked with an asterisk.
in_simple_question: Word 2 right of {are, were} ends in s; word 2 right of {is, are} does not.
in_question_with_aux: 4th word ends in s iff 2nd word is in {are, were, do}.
across_prep_phrase: 2nd word ends in s iff 3rd-last word is in {are, were, do}.
agreement_determiner_noun, between_neighbors: If {these, those} in sentence, the next word ends in s; if {this, that} in sentence, it does not.
agreement_determiner_noun, across_1_adjective: If {these, those} in sentence, the word 2 to the right ends in s.

Table 5: Word and tag-level 5-gram models trained on AO-CHILDES, plus the 5-Gram Oracle and Simple Linear Rule Oracle, for BLiMP. 5-Gram and Simple Rule scores greater than those of BabyBERTa_AO-CHILDES are bolded.