Frequency Effects on Syntactic Rule Learning in Transformers

Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT’s performance on English subject–verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject–verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT’s behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.


Introduction
Many natural language phenomena are best described as the product of applying rules to abstract symbols, without access to the content of these symbols (Smolensky, 1988;Fodor and Pylyshyn, 1988). Most speakers of English will agree, for example, that if "gorp" is a singular noun, then, regardless of the meaning of "gorp", the utterance "the gorp adds nothing" is grammatical, but "the gorp add nothing" is not.
The success of contemporary neural language models such as BERT (Devlin et al., 2019) on language understanding tasks, as well as in more targeted linguistic evaluations (Marvin and Linzen, * Work done while visiting Google. 2018; Goldberg, 2019), raises the question of whether these systems acquire such symbolic rules. While previous studies have attempted to address such questions, particularly in relation to BERT (Rogers et al., 2020), prior work has generally not analyzed the relationship between the model's pretraining data and its behavior. As a result, it has been difficult to tease apart the many factors that may influence a model's test time performance.
In this paper, we investigate whether pre-trained transformer-based language models learn and apply symbolic rules, focusing on BERT's ability to follow the English subject-verb number agreement rule ( §3) as a case study. On our evaluation stimuli ( §4), we find that BERT achieves high performance, even on subject-verb pairs that never occurred together in the training set ( §5.1- §5.2). In exploratory data analysis, however, we find that this performance is also influenced by effects from both absolute and relative frequency of verb forms in the training data ( §5.3- §5.4). To confirm these phenomena causally, we perform a series of training interventions where we pre-train BERT models on training data for which we have carefully manipulated the frequencies of verb forms ( §6). We further use probing classifiers to attribute observed mistakes either to errors in rule-following or to errors in lexical categorization ( §7).
These experiments reveal several insights about BERT in the context of rule-governed tasks. First, the high performance of BERT on subject-verb combinations that never occurred in the training set is consistent with a model that learns abstract representations of lexical items and patterns, i.e., abstract features and rules. Second, BERT's performance is influenced by absolute frequency effects, but probing classifiers show that this influence can be explained by the model's inability to learn the features of a verb form (singular vs. plural) for infrequent lexical items, rather than a failure to apply the rule when the verb form has been classified. Fi-nally, although BERT generally applies rules with high accuracy, it fails to overcome strong priors during training-when one verb form is much more frequent than another, BERT tends to produce the more common form, even when it is not consistent with the rule.
2 Experimental Logic

Hypotheses
We aim to investigate BERT's ability to reason over abstract symbols. As a case study, we focus on subject-verb agreement (SVA) in English, for which the grammaticality rule of interest is: NUMBER(subject) = NUMBER(verb) We consider three alternative hypotheses about the process underlying BERT's behavior on SVA.
H1: Idealized Symbolic Learner. In theory, symbolic reasoners operate over abstract categories, such as the agreement feature NUMBER, and rules, such as "if NUMBER(subject) = SINGULAR, then NUMBER(verb) = SINGULAR." Early work (Fodor and Pylyshyn, 1988) which discusses the behavior of such symbolic systems often presents an idealized version, for the sake of theoretical argument. Thus, under H1, this system would not make errors such as misclassifying inputs or erroneously parsing the sentence, and is not affected by wordspecific properties (e.g., frequency).
H2: Item-Specific Learner. The antithesis of the idealized symbolic learner is a model that reasons entirely using word co-occurrences. This system does not represent any abstractions over the immediate inputs it receives, and thus cannot reason over features such as singular/plural. Conceptually, it is analogous to early phrase-based MT systems (Brown et al., 1990) that build a literal string lookup table in order to predict the most likely output given an input. By definition, it performs poorly on noun-verb pairs that never co-occurred in training, as the lookup table will not have the relevant entry. 1 H3: Symbolic Learner with Noisy Observations. Both H1 and H2 represent extreme, largely theoretical models of system behavior. In practice, 1 We do not specify whether such a model has access to abstract features other than agreement because such features (e.g., the notion of subject) do not affect the specific hypotheses we consider. For example, a model that does not represent agreement feature and only learns word co-occurrences will perform poorly on unseen items, regardless of whether it has access to correct parses. we expect systems like BERT to display some hybrid of the two. However, to our knowledge, there has been no work to date which proposes a specific hypothesis of what type of hybridization best explains BERT's behavior. In this work, we consider one such hybrid: a system that is symbolic at its core but has noisy observations. 2 That is, under H3, the system represents symbols (e.g. singular/plural word categories) and rules (e.g., SVA) correctly, but can make errors in mapping from inputs to symbols. Conceptually, it is analogous to a BayesNet (Pearl, 1988) that correctly represents nodes and causal connections internally but may nonetheless incorrectly process an input, activating the wrong nodes and thus producing the wrong output. Thus, unlike in H1, systems consistent with H3 make errors when they cannot identify whether a subject or verb is singular or plural, potentially due to frequency effects (present at all levels of processing; Marantz, 2013).

Predictions and Summary of Findings
We use three diagnostics to differentiate the above hypotheses: (1) generalization to unseen noun-verb pairs, (2) the presence of frequency effects when making predictions for seen noun-verb pairs, (3) and correlation between specific types of errors.
Generalization to unseen noun-verb pairs allows us to differentiate H2 from H1 and H3. For instance, since whether the sentence "the section adds nothing" obeys the SVA rule depends only on NUMBER("section") and NUMBER("adds"), a symbolic reasoner's ability to assess grammaticality should not depend on how frequently the words "section" and "adds" have been seen together in the data. Instead, we would expect such a system to learn the correct agreement features of the two words independently and apply a general SVA rule to them. In contrast, an item-specific learner, which does not represent abstract agreement features, would rely on probabilities defined over specific lexical items, and thus may fail to reason correctly about rare or unseen situations, for which such probabilities are poorly calibrated.
The presence of frequency effects in BERT's performance allows us to differentiate H1 from H2 and H3. That is, under both H2 and H3, the model may perform worse on less frequent words (albeit for different reasons). In contrast, a system consistent with H1 should not exhibit any differences in performance due to differences in inputs below the abstraction of singular/plural.
Our experiments show that BERT generalizes well (though not perfectly) to unseen noun-verb pairs ( §5.1- §5.2) and exhibits clear frequency effects ( §6). Together, these results are most consistent with a hybrid system like H3. To confirm this, we use probing classifiers to investigate H3's specific prediction about correlations between types of errors, i.e., that errors on SVA should be explained by errors in classifying singular vs. plural ( §7). We find that the expected error patterns explain some frequency effects (those due to absolute frequency) but not others (those due to relative frequency). Thus, we ultimately conclude that, of the hypotheses considered, H3 is the best model of BERT's behavior, though BERT exhibits additional sensitivity to frequency imbalances between competing word forms that H3 leaves underspecified.

Related Work
Targeted Syntactic Evaluation. We use the targeted syntactic evaluation framework of Linzen et al. (2016) and Marvin and Linzen (2018) to measure the model's ability to learn and apply the SVA rule. Following the setup from Goldberg (2019), each test instance consists of a sentence in which a verb has been masked out, and BERT's masked language modeling (MLM) parameters are used to score whether the singular or plural form of the verb is a better fit for the masked position. For example, given the sentence "The section [MASK] nothing to the info." and set of verb inflections {"add", "adds"}, the model would be considered correct if the MLM prediction assigns a higher score to the singular form "adds" than the plural form "add" since the subject of the masked verb position is "section," which is singular.
Due to the particulars of BERT's MLM task setup, the model is only able to score words that are represented by a single wordpiece. While Goldberg (2019) dealt with this limitation by restricting evaluation to just those verbs that appear in the original BERT model's vocabulary as a single wordpiece, we are able to avoid such compromises because pre-training the models ourselves means that we can add any entries we want to the vocabulary.
Syntactic Reasoning in LMs. There has been substantial prior work on the ability of language models to perform abstract syntactic processing tasks (Hu et al., 2020) (see Linzen and Baroni (2020) for a review). On SVA specifically, Goldberg (2019) found that BERT achieves high accuracy on both natural sentences (97%) and nonce sentences (83%), and that error rate was independent of the number of "distractor" words between the subject and verb; Yu et al. (2020) showed that language models do not exhibit better grammatical knowledge of more frequent nouns. Other work has found that BERT's performance is sensitive to factors that may suggest item-specific learning; Chaves and Richter (2021) found that BERT's performance on number agreement is sensitive to the verb, across seven different verbs, and Newman et al. (2021) found that language models performed better on verbs that they predicted were likely in context. The focus on frequency effects also relates to a more general line of work on understanding the effect of training size and distribution on neural language models' generalization (Warstadt et al., 2020;Lovering et al., 2021). To our knowledge, our present study is the first to investigate these questions via controlled interventions on the model's pre-training data, making it possible to draw stronger conclusions.
Our formulation of the SVA task also relates to work which investigates neural networks' abilities to learn lexical abstractions (Chronis and Erk, 2020;Kim and Smolensky, 2021) and to reason systematically (Lake and Baroni, 2018;Yanaka et al., 2019;Kim and Linzen, 2020;Goodwin et al., 2020). These studies on systematicity, however, run controlled experiments by training small models on toy data. Our work studies the widely-used BERT model, trained on real data and at scale.

Model
Differentiating between the hypotheses presented in §2 requires analyzing model performance on individual items as a function of frequencies in the training data. The original BERT model was trained on both English Wikipedia and BooksCorpus (Zhu et al., 2015). However, BooksCorpus is not publicly available (Bandy and Vincent, 2021), so when we pre-train our BERT models, we use only the Wikipedia data (2.3 billion tokens). Despite this difference in training data, our models Natural Addition of such minor characters seem/seems more promotional than encyclopedic.
Other popular trade items of the area include/includes sandalwood, rubber, and teak. The party that originally buys the securities effectively act/acts as a lender.

Nonce
The astronomer of the first session of the court during that year perform/performs a... The isometry in the gulf market/markets santa catalina island. The sheepdog of basic needs providers ... review/reviews a damaging effect.

Evaluation Stimuli
We evaluate the model's SVA ability on two classes of stimuli: (1) natural sentences, which are generally both syntactically and semantically coherent, and (2) nonce sentences (following Gulordava et al. 2018) which are grammatically valid but not necessarily semantically coherent ("colorless green ideas sleep furiously", Chomsky, 1956). Examples of each are shown in Table 1. Evaluating on natural sentences provides a measurement of how well the model can be expected to perform in realistic settings, but these sentences are not ideal for a targeted SVA evaluation since they often contain additional cues relevant to verb inflection, such as other plural verbs or plural determiners, as in "two [SUBJECT] and their dogs [VERB]," making it difficult to discern whether a model has chosen a particular verb inflection based on the subject. In contrast, performance on synthetic nonce sentences allows us to ensure that the only source of information about the verb's correct inflection is the subject itself.
Natural Stimuli. Following Goldberg (2019), for natural stimuli we use the dataset from Linzen et al. (2016), which comprises 23,298 sentences from Wikipedia. The target verb is plural in 16,232 of these sentences and singular in 7,064 of these sentences. These evaluation sentences span 176 verb lemmas and 329 verb forms.
Nonce Stimuli. For our nonce stimuli, we compiled a list of 200 nouns, 336 verbs, and 56 sentential contexts-sentence templates where we remove the original subject and verb-such that any given (noun, verb, sentential context) triplet yields a grammatically correct nonce sentence. E.g., given the sentence "the investigation of chaperones has a long history", we can create a sentential context: "the [SUBJECT] of chaperones [VERB] a long his-tory." We can then randomly chose a noun and verb from our noun and verb lists (e.g., cities and play) to construct a nonce sentence: "The cities of chaperones play a long history." Considering all possible combinations of nouns, verbs, inflections, and contexts yields a dataset of 7,526,400 sentences which is is both large (c.f., 383 sentences in Gulordava et al. (2018)) and balanced in terms of number form (50% singular and 50% plural).
To ensure that constructed sentences are grammatically correct, we apply several manual filters (e.g., removing verbs that have ambiguous inflections), which are described in detail in Appendix B (with a list of all nouns, verbs, and sentential contexts in Appendix D). To verify the quality of the resulting stimuli, one of the authors manually examined 154 randomly generated nonce sentences in the same way that they would be presented to the model. The verb inflection was correctly predicted in all but one of the instances (with the single error attributed to carelessness), and the annotator confirmed that all generated sentences were grammatically correct. Our stimuli and code are available at https://github.com/ google-research/language/tree/master/ language/bertology/frequency_effects.

Exploratory Analyses: What Factors
Correlate with Error Rates?
We first perform an exploratory analysis of how the model's abilities on the SVA task vary as a function of pre-training frequency. As discussed in §2, we consider generalization to unseen subject-verb pairs to be evidence of symbolic reasoning (H1 or H3), and strong frequency effects to suggest itemspecific learning (H2 or H3). Note that in these experiments, it is not the individual lexical itemsthe subject and verb-that are unseen, only the combination of them in a single sentence. Therefore, this analysis evaluates the model's ability to perform abstract reasoning about individual items for which it has learned representations.  Table 2: Error rate on natural and nonce evaluation stimuli, stratified by whether the subject-verb pair was seen (frequency ≥ 1) or unseen (frequency = 0) during pre-training. Heuristics (argmaxes) show performance for a item-specific learner that memorizes probabilities of specific verbs (V) or subject-verb (SV) pairs. BERT's performance degrades on unseen pairs, but is significantly better than these heuristics.

Overall Performance
Overall, the model's error rate is 3.2% on natural stimuli and 16.8% on nonce stimuli. This is similar to Goldberg (2019)

Unseen Subject-Verb Pairs
Table 2 stratifies error rate by seen and unseen subject-verb pairs. Compared with subject-verb pairs seen at least once during training, error rates on unseen subject-verb pairs are 5% higher on natural sentences and 2% higher on nonce sentences. This degradation, however, is minimal compared with what we might expect from a naive itemspecific learner (H2), represented by the heuristic baselines in Table 2. These results thus suggest that BERT reasons over representations that abstract to some degree over individual words, though it does not meet the definition of a fully-abstract symbolic learner (H1), which would have no degredation in performance.

Frequency of the Target Form
To further examine the effect of frequency, we draw inspiration from the human language processing literature. One of the most widely-observed phenomena in such research is that high-frequency words are learned better (Ambridge et al. 2015): Reduce Error Hypothesis. High-frequency forms reduce errors in contexts where they are the target. subject-verb pairs and more-frequent verbs, consistent with the Reduce Error hypothesis.

Frequency of the Competing Form
Although seeing a verb more often in training often reduces errors when that verb is the target, when high-frequency forms are not the target, they can act as distractors and reduce accuracy: Cause Error Hypothesis. High-frequency forms cause errors when a competing, lower-frequency form is the target (Ambridge et al., 2015).
Is BERT's error rate similarly affected by distractor frequency effects? For instance, the word "combat," which is not only the plural form of the verb "combat" but also a fairly frequent noun, appears 102× more often in the training set than "combats." If word frequency influences BERT's predictions, then such asymmetries may cause a high error rate when the target form is "combats." As Figure 2 shows, error rate is lower when the target form is more frequent relative to the competing form. For nonce sentences, for example, error rate was only 2.2% when the target form was 16 times or more as frequent than the competing form, compared with 62.5% when the competing form was 16 times or more frequent than the target form.

Takeaways
The above exploratory analyses suggest that BERT is influenced by both the absolute frequency of the target form (Reduce Error Hypothesis), as well as the frequency of the target relative to the competing form (Cause Error Hypothesis). Although these results are strong correlational evidence, absolute and relative frequency are highly correlated with one another (i.e., as the absolute frequency of a word increases, so does its frequency relative to other words). 4 Thus, more controlled studies are needed to establish which effects have a causal effect on BERT's rule-learning.

Confirmatory Analysis: Manipulating the Training Data
To better understand the above trends, we design a set of experiments in which we manipulate one variable (absolute or relative frequency of a verb form in pre-training) while holding the other fixed.

Experimental Setup
We select 60 verbs of interest (VOIs) and manipulate their training set frequencies. We choose the VOIs by taking 60 transitive verbs for which both singular and plural forms of each verb occur at least 10 4 times in the corpus (the full list is shown in Appendix D.4). We remove all sentences Frequency of VOI in training set Error rate (%) Nonce Natural Figure 3: Effect of absolute frequency of a verb of interest (VOI) when the ratio between singular and plural forms is held constant at 1:1. The error rate for sixty VOI is shown for BERT models that have seen the sixty VOI at different frequencies in the pre-training dataset.
containing these VOIs from the training set, and, based on the experiment, add them back in such that VOIs appear at a specified (absolute or relative) frequency. We evaluate the model's performance on these VOIs by using both a natural dataset of approximately 100 examples per VOI, as well as by inserting the VOIs into the nonce (noun + sentential context) constructions from §4.2.
We note that the exact size of the training set in our manipulations changes depending on how many sentences containing VOIs are added in (e.g., models which see 10,000 examples per VOI see more total training examples than models that see only 1 example per VOI). The difference in absolute terms, however, is small (less than 1% of the total training set). Thus, we consider it unlikely that any observed difference in performance is due to a difference in the total size of the training corpus. 5

Absolute Frequency of Verb Form
We first examine how the absolute frequency of a verb form affects the model's number agreement ability on that form. For each of nine frequencies n = 1, 3, 10, 30, 100, 300, 1,000, 3,000, and 10,000, we train a new BERT model that sees the verbs n times each during training. To isolate the effect of absolute frequency, we fix the relative frequency to be balanced-for each VOI, n instances are singular and n are plural.
The results of this experiment are shown in Figure 3. When the occurrences of a VOI are balanced between inflections (singular and plural), error rate decreases monotonically when target form is more frequent in training. Error rate overall (%) 10 −2 10 0 10 2 Frequency ratio between competing and target verb form Figure 4: When the absolute frequency of the target verb form is held constant, increasing the relative frequency of the competing verb form increases error for the target form and decreases error for the competing form. This behavior is in the same direction as a heuristic that, at inference time, picks the more frequent verb.

Relative Frequency of Verb Form
We next analyze whether the frequency ratio between a target verb form v and its competing form v affects the model's ability to produce v in context. To balance how often the target v is singular vs. plural, we use the following procedure. We randomly split our 60 VOI into two groups of 30 verbs each, which we denote as S and P. In each experiment, we set the frequency of the singular verbs in S to N vary , while holding the frequency of the plural forms of the verbs in S constant at N constant . Likewise, we set the frequency of the plural verbs in P to N vary , and hold the frequency of the singular form of these verbs constant at N constant . We run experiments with N vary = {1, 10, 30, 100, 300, 1,000, 3,000, 10,000}, and do this twice for N constant set to 100 and 1,000. As our evaluation stimuli are balanced such that both v and v occur as the target in every template for every VOI, we are able to analyze the effect of the v:v frequency ratio-holding the absolute frequency of v fixed-for v:v ranging from 1:100 to 100:1. Figure 4 shows the results. When the competing form occurs more frequently (with respect to the target form), error rate increases for the target form and decreases for the competing form.

Setup
Our goal is to characterize BERT's rule-learning behavior in terms of the three hypotheses H1-H3 described in §2. The frequency effects observed in §6 rule out H1 (Idealized Symbolic Learner). However, BERT's generalization to unseen nounverb pairs ( §5.2) is too good to be explained by H2 (Item-Specific Learner). Hence, the hybrid H3 (Symbolic Learner with Noisy Observations) seems like the most plausible candidate.
H3 is not simply a catch-all compromise between rule-based and item-specific learners-H3 makes specific predictions about the nature of the errors BERT will make. Under H3, BERT represents the SVA rule and the concept of agreement features, and follows the rule as long as it identifies the number of the subject and verb correctly. Thus, H3 predicts that observed errors are due to failures to identify the number of either the subject or verb. Given such a model, we might observe frequency effects because training frequency influences the model's ability to predict the agreement feature for a given verb form. That is, we might observe a trend like the following: if a verb v occurs in fewer than some n training examples, BERT mispredicts the agreement feature (e.g., predicting v to be singular when v is plural); if v occurs more than n times, BERT correctly predicts v's agreement feature and correctly produces v in context. In this scenario, we expect that SVA errors will correlate with frequency, but the frequency of these errors should not exceed the error rate in predicting agreement features.

Predicting Agreement Feature
To test whether the above predicted pattern holds, we use two probing classifiers (see Veldhoen et al., 2016;Ettinger et al., 2016, on probing) which we describe below.
Subject agreement feature probe. Our first probe evaluates whether, given a sentence with the verb masked, the embedding at the masked position contains information as to whether a singular or plural verb is required. This setup actually evaluates two subtasks: identifying the subject of the verb (i.e., parsing the sentence) and predicting the agreement feature of the identified subject. For simplicity, however, we use a single probing classifier because our interpretation does not hinge on differentiating these subtasks. We use our sentential templates for this experiment, for which the only cue for the number of the subject (and hence the verb) is the subject itself. Hence, if the embedding at the masked position can be used to predict number, it follows that both the subject has been identified correctly and that the agreement feature of the subject was identified correctly.
We feed our nonce sentences from §4.2 with the verb masked into the model, retrieve the final hidden state representation of the masked token, and train an MLP to classify the desired verb form (singular or plural). We train the probe using crossvalidation, using sentences constructed from 150 subjects × 50 sentential for training, and the remainder for evaluation. Subjects and sentential contexts in evaluation sentences are not seen by the probe during training.
Verb agreement feature probe. Our second probe predicts the number of a verb from its contextual word embedding. If a probe can predict the number of a verb given its contextual word embedding, we can conclude that the model represents the agreement feature of that verb form and thus its predictions about SVA can, in principle, depend on the agreement feature. We obtain contextual embeddings of verbs by inserting them into the nonce sentential contexts, with the noun masked so that there are no external clues aside from the form of the verb that indicate whether the verb is singular or plural. We then train an MLP to, given the embedding, classify the verb as singular or plural. We use the 331 verbs from our nonce stimuli that were not VOI (×56 sentential contexts per verb) to train the probe and the 60 VOI (×56 sentential contexts per verb) to evaluate it.
Results. We evaluate these two probes as a function of both absolute frequency (using models from §6.2) and relative frequency (using models from §6.3). Results are shown in Figure 5 and Figure 6, respectively.
For absolute frequency ( Figure 5), we see that the accuracy of the verb agreement feature probe is highly dependent on absolute frequency of the VOI. The probe has lower error for models that saw the verb form more often, implying that seeing forms more frequently in training led to embeddings of those forms that better encode the number agreement feature. In constrast, the accuracy of the subject agreement feature probe is constant, which is expected because identifying the number feature of a subject should not be affected by absolute frequency of VOI. Notably, the combined error rate of our two probes falls close to the model's observed overall error rate on the SVA task, as predicted by our "Symbolic Reasoner with Noisy Observations" hypothesis (H3). For relative frequency (Figure 6), on the other hand, we see no clear increase or decrease in the accuracy of predicting the agreement feature for a target verb form v in response to changes in frequency of the competing verb form v . In other words, when one verb is much more frequent than the other, BERT produces the more common verb form despite having access (in principle) to the information (rule + agreement features) that would allow it to infer the correct form. Such behavior is not explicitly accounted for by the "noisy observations" in H3, and thus appears more as evidence of item-specific learning (in line with H2).

Discussion
The goal of this work is to determine whether BERT performs SVA by implicitly representing rules defined over abstract agreement features and characterize the training conditions under which such representations emerge. We differentiate between the representation of the rule ("if x then y") and that of observations (containing the correct agreement features). We draw two conclusions, which suggest a mix of systematic rule-like generalization and unsystematic item-specific inferences.
BERT appears to represent the correct rule, but fails to predict agreement features for lowfrequency verb forms. Although the error rate decreases as a function of frequency of target verb v ( §6.2), BERT's ability to predict the agreement feature of v ( §7.2) follows the same trend. This observed behavior is thus consistent with a model that correctly represents the SVA rule ( §2), but makes mistakes at inference time due to noise in the represented observations for low frequency verb forms (for example, producing "run" in the context "the dog run" because "run" is incorrectly encoded as singular), rather than due to a failure to represent the concept of singular altogether.
BERT fails to apply the rule when doing so requires overcoming strong item-specific priors. Similar to the absolute frequency trend, we see that BERT's error rate on SVA also decreases as a function of the frequency of the target verb v relative to its competing form v ( §6.3). Unlike above, however, we see no effect of the frequency ratio N v : N v on BERT's ability to predict the agreement feature of v when the frequency of v is fixed ( §7.2). These results suggest that BERT is heavily influenced by skewed training distributions, preferring to produce more common verb forms over forms consistent with the rule. Such behavior could either mean that, when P (v) << P (v ), (1) BERT represents the correct SVA rule but it is overridden in favor of the prior, or (2) BERT does not represent the rule at all. Teasing apart these possibilities is a valuable direction for future work.
Open questions. Our results on absolute frequency effects indicate that BERT does not infer agreement features until it sees 10-100 examples of a verb, even though it is possible, in principle, to infer agreement features from a single training example (e.g., "All of the dogs dax" implies "dax" is a plural verb form). Future controlled studies could investigate how the sample efficiency of inferring agreement information depends on factors such as architecture (e.g., access to morphological signals), size of the model, and amount of training data. Analysis of such patterns would elucidate how models like BERT (and by extension, transformers and neural networks more generally) learn and generalize, enabling more principled development and deployment.

Conclusions
We have studied whether BERT's performance on subject-verb agreement exhibits rule-governed behavior. We focus on frequency effects, pre-training multiple BERT instances in order to isolate how the model's predictions are affected by absolute and relative verb frequency. Our results suggest that BERT's behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.

A BERT Model
To analyze the effect of word frequency on BERT's ability to follow SVA, we need to know the exact number of occurrences of each word in the dataset. The original BERT checkpoint (Devlin et al., 2019) uses both Wikipedia and BooksCorpus (Zhu et al., 2015), but BooksCorpus is no longer publicly available (Bandy and Vincent, 2021). So we train a replicated version of BERT on only wikipedia data. Our version of BERT largely follows the procedure of the original, differing only in that we use dynamic masking and pre-train for 4 million updates at a learning rate of 3e-4. 6 Table 3 shows the performance of our replicated version of BERT.

B Nonce Stimuli Collection Details
This appendix section details our nonce stimuli collection process. Our goal is to create a large set of evaluation stimuli in which we can analyze how properties of certain stimuli (e.g., subjects, verbs, and sentential contexts) affect the model's ability to perform number agreement. Therefore, we create a list of 200 nouns, 336 verbs, and 56 sentential contexts such that any given (noun, verb, sentential context) triplet where the noun is used as the subject yields a grammatically correct nonce sentence. By considering both singular and plural inflections for possible triplet, we analyze a dataset of 2 · 200 · 336 · 56 = 7,526,400 sentences. To facilitate further use of our dataset, we make a plain-text version available at http://anonymized.

B.1 Nouns
To propose candidate nouns, we first ran a POS tagger (Tsai et al., 2019) over the pre-training dataset, and retrieved all nouns occurring at least 100 times. Then, we randomly sampled 200 nouns from this set of candidate nouns that were evenly distributed into four buckets of training set frequency (100-999, 1,000-9,999, 10,000-99,999, and 100,000+). All nouns were common nouns and were manually validated to have correct, unambiguous singular and plural inflections.

B.2 Verbs
To propose candidate verbs, we similarly retrieved all verbs that occurred at least 100 times in the training set. The masked-LM evaluation procedure for SVA requires that both the singular and plural inflections of the verb exist directly in the model's vocabulary, and so we filtered out verbs that did not match this criteria, leaving us with 379 candidate verbs.
Unlike for nouns, we generally cannot indiscriminately swap out verbs in a sentence while maintaining grammatical correctness, since some verbs are exclusively transitive (used with an object) or intransitive (used without an object). In English, more verbs can be used transitively than intransitively, and so we decided to consider only transitive verbs. We manually filtered out strictly intransitive verbs and ensured that each verb had correct, unambiguous singular and plural inflections, leaving us with 336 verbs that can be used transitively.

B.2.1 Sentential Contexts
Finally, we curated a list of sentential contexts (sentences with the subject and verb removed) that would maintain grammatical validity for both singular and plural forms of any given subject-verb pair from our list of nouns and list of transitive verbs. To get candidate sentential contexts, we randomly sampled 600 sentences from the Linzen et al. (2016) dataset of Wikipedia sentences to be manually examined. We kept only 56 of these 600 candidate sentential contexts, filtering out 544 for the following reasons: • Sentential contexts that contained additional cues for number outside of the subject and verb inflection cannot form grammatical sentences for both singular and plural subject-verb pairs. For instance, the sentential context " [SUBJECT], who thinks roses are red, [VERB] ..." can only be used with singular subjects and verbs because of the modifying clause "who thinks roses are red"; and the sentential context " [SUBJECT] in the park [VERB] ..." can only be used with plural subjects and verbs because there is no determiner for the subject. 381 sentences like the above had such cues for number inflection and were removed.
• 64 sentential contexts contained verb usages that were hostile to swapping in most transitive verbs (e.g., in " [SUBJECT] shows that ...", "shows" could not be replaced with most transitive verbs).
• 15 sentential contexts contained noun usages that were hostile to swapping in most nouns (e.g., in the fact that she likes him ..., the subject fact cannot be replaced with most nouns).
• 21 sentential contexts were ungrammatical or incomprehensible to our human annotator.
• In 36 sentential contexts, the subject and verb parsed by Linzen et al. (2016) was incorrect.
• 27 sentential contexts for which the original verb was used intransitively were removed.

B.2.2 Human evaluation
To check the validity of the test set, the first author manually examined 160 generated nonce sentences in the same fashion that the model would evaluate them. That is, each example comprised either a singular or plural noun inserted into a template, and the first author had to predict the correct number inflection of a given verb. In addition, the first author had to verify that the sentence was grammatical and contained no number inflection cues other than the inflection of the subject. The sentences were presented in random order, with the first author blinded from the ground-truth label. The first author found that 6 sentences (3 templates, since each template had two inflections) contained additional number inflection cues that were missed in the first round of annotation, and so these sentences were removed. In terms of accuracy, the first author correctly predicted the verb inflection 153 of the 154 instances (99.4% accuracy) and attributed the single error to carelessness. We take these manual evaluation results as evidence that our test set is grammatical and tests a syntactic rule that can be consistently applied by humans.

C Additional Figures and Tables
We show several auxiliary experiments that elucidate BERT's performance with respect to various characteristics of evaluation stimuli.

C.1 Relative Frequency of Forms
Following the results from §5.4, we show the error rate for singular and plural forms of all verbs in our nonce stimuli in Figure 7. Additionally, Table 4 shows the five verbs with the highest and lowest error rates, as well as their frequency ratios.

C.2 Comparison with prior work
Because our work proposes a new set of nonce stimuli (which is larger than prior work, e.g., 383 sentences from Gulordava et al. (2018) or 7 verbs from Chaves and Richter (2021)), we run several analyses from prior work on our dataset. The results are largely consistent with conclusions from prior work.
Attractors. As shown in Table 5, BERT did not perform worse on templates with more attractors (clauses between the subject and verb), corroborating Goldberg (2019).
Noun frequency. We find similar evidence, like Yu et al. (2020), that BERT did not perform better on nouns that were more frequent in the training set. Figure 8 shows these results-for each subject (we consider both singular and plural forms as a single subject), we plot that subject's error rate against its frequency in the training data.
High-and low-confidence predictions. As an auxiliary analysis to compare with Newman et al.
(2021), we plot the error rate of BERT with respect to different thresholds for how confident the model was about its prediction. Figure 9 shows error rates for all predictions with confidence above some threshold, and error rates for predictions with confidence below some threshold. This result concurs with Newman et al. (2021)'s finding that model performance drops when testing verbs that the model finds unlikely.

C.3 Absolute versus relative frequency
As additional background, Figure 10 shows the correlation between absolute frequency of a verb form and its frequency relative to its competing form. Figure 7: For all 366 verbs (×2 inflections per verb), we plot the error rate against inflection skew, which is how much more frequent the correct inflection of the verb occurred than the incorrect inflection in the pretraining data. Several of the most skewed words are shown-for instance, long appears 490 times more often in the pre-training data than longs, so its inflection skew is log 10 (490) Figure 3 in the main body showed the performance of models that have seen the VOI n times in training, where n varies from 1 to 10,000. Table 6 stratifies this performance on the nonce evaluation stimuli by seen and unseen subject-verb pairs in the evaluation stimuli. Note, though, that this stratification differs for each VOI frequency. That is, there will be more unseen subject-verb pairs when the VOIs are less frequent in the training set.

D.2 Nouns
The 200 nouns used in our nonce stimuli are listed below (only singular inflections are shown): abuser,