Refining Targeted Syntactic Evaluation of Language Models

Targeted syntactic evaluation of subject-verb number agreement in English (TSE) evaluates language models’ syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb’s conjugation. The method evaluates whether language models rate each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model’s syntactic knowledge: given a sentence, can it conjugate arbitrary verbs correctly? Second, evaluating a model’s likely behavior: given a sentence, does the model concentrate its probability mass on correctly conjugated verbs, even if only on a subset of the possible verbs? We argue that current implementations of TSE do not directly capture either of these goals, and propose new metrics to capture each goal separately. Under our metrics, we find that TSE overestimates the systematicity of language models, but that models score up to 40% better on verbs that they predict are likely in context.


Introduction
As neural language models have emerged as both broadly useful engineering tools (Devlin et al., 2018; Radford et al., 2019) and potential models of human language processing (Linzen and Leonard, 2018; Ettinger et al., 2018; Futrell et al., 2019), evaluations targeting their syntactic ability have been developed to better understand their capabilities.
One such method for syntactic evaluation tests models' knowledge of English subject-verb (S/V) number agreement (Linzen et al., 2016; Gulordava et al., 2018). These studies consider minimal pairs of sentences, such as The keys to the cabinet is/are on the table, that differ only in verb number, and test whether models rate grammatical sentences as more probable. The grammatical sentence of each pair is either sampled from natural corpora (Linzen et al., 2016; Kuncoro et al., 2018) or constructed from templates. The use of templates, known as Targeted Syntactic Evaluation (TSE), allows for the fine-grained evaluation of models on specific, often rare, syntactic phenomena (Marvin and Linzen, 2018; Ettinger et al., 2018; Warstadt et al., 2020), but (when evaluating S/V number agreement) relies on researchers hand-specifying a small set of verb lemmas that are substituted into each template.
In this work, we improve the TSE methodology by disentangling its broad objective of evaluating syntactic ability into two distinct goals, and we introduce two variants of TSE to separately capture each goal. These evaluations demonstrate that neural models do not generally consider well-conjugated verbs more likely than their incorrect conjugations, but instead prefer to correctly conjugate verbs they deem likely.
We argue that the objective of evaluating syntactic ability can be decomposed into two goals and that current implementations of TSE do not achieve either of them. The first goal is measuring systematicity: for a specific syntactic construction, does the model correctly conjugate arbitrary verbs with the grammatical number of the subject? TSE currently fails to capture this because it evaluates models using only a small set of verbs for each syntactic construction. If models only conjugate these verbs correctly, they receive a high score, even if they conjugate other verbs incorrectly. The second goal is measuring likely behavior: when we sample verbs from the model in a specific syntactic construction, will they be properly conjugated? TSE fails to directly capture this because the small set of verbs used in evaluation might differ from the verbs that are likely in context under the model. If models conjugate these hand-specified verbs incorrectly, they receive a low score, even if they correctly conjugate more likely verbs.
To motivate these goals and the misspecification of TSE, consider evaluating a language model on the context The keys to the cabinet ___, where for simplicity we assert that the only possible verbs are is/are (be) and exists/exist (exist). Let the model assign higher probability mass to the correct conjugation for the be pair but not for the exist pair (Table 1).
First, consider evaluating systematicity. To reflect how TSE chooses a small subset of the possible verbs for evaluation, in this toy example let it choose only be. Thus, the model scores 1 out of 1, whereas a test of systematicity should penalize the model for incorrectly conjugating exist. Now, consider evaluating likely behavior. Let this same model generate either of the two correct conjugations (are/exist) with total probability of 0.7 and generate either of the incorrect conjugations with total probability 0.3. Thus, when we sample from the model, it generates a correct conjugation with probability 0.7, but TSE cannot measure this, since it gives a binary score to each verb pair.
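The arithmetic of this toy example can be sketched in a few lines. The individual verb probabilities below are our own illustrative choices; only their pairwise ordering and the 0.7/0.3 split come from the example above.

```python
# Hypothetical per-verb probabilities consistent with the toy example:
# the correct form wins for "be" (are > is), loses for "exist"
# (exists > exist), and the correct forms carry 0.7 total mass.
probs = {"are": 0.6, "is": 0.1, "exist": 0.1, "exists": 0.2}
pairs = {"be": ("are", "is"), "exist": ("exist", "exists")}  # (correct, incorrect)

def accuracy(lemmas):
    """Binary minimal-pair accuracy over a chosen set of lemmas."""
    return sum(probs[c] > probs[i] for c, i in (pairs[l] for l in lemmas)) / len(lemmas)

tse = accuracy(["be"])           # TSE with only the hand-picked lemma: 1.0
ew = accuracy(["be", "exist"])   # accuracy over both lemmas: 0.5

# Probability of sampling some correct conjugation: 0.6 + 0.1 = 0.7,
# which the binary per-pair score cannot reflect.
mw = probs["are"] + probs["exist"]
```

The gap between the three numbers is exactly the misspecification argued above: the hand-picked verb set yields a perfect score, the full verb set halves it, and the sampling behavior sits at 0.7.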
The first of our proposed evaluations, equally-weighted syntactic evaluation (EW), addresses systematicity. To better approximate a model's ability to conjugate any verb, EW expands TSE to consider a much larger set of verbs than given in the templates used by prior work.
The second of our proposed evaluations, model-weighted syntactic evaluation (MW), addresses likely behavior. This method computes the probability mass that models put on producing the correct verb conjugation given a particular syntactic context. It rates the syntactic quality of samples: models need not conjugate all verbs, but instead must be likely to generate some well-conjugated verb. We conduct these evaluations on four pretrained language models using two template datasets: M&L (Marvin and Linzen, 2018) and BLiMP (Warstadt et al., 2020). Overall, we find that the EW scores are lower than the TSE scores, indicating that the verb choices in these templates overestimate models' systematicity with respect to subject-verb number agreement. This lack of systematicity is particularly apparent when we test verb lemmas that models find unlikely, with scores dropping by up to 40%. In contrast, the MW scores are high, suggesting that language models preferentially conjugate verbs they deem likely. Moreover, this ability improves when the tail of the distribution is truncated, as it is in decoding strategies like nucleus sampling (Holtzman et al., 2020).

Methods
To define our metrics, we introduce some notation. TSE has two components: the model M to evaluate, and the set of templates T with interesting syntactic phenomena (e.g., from Marvin and Linzen (2018)). In S/V number agreement, each template contains a context c, including the subject that specifies the correct verb inflection, and a verb lemma ℓ with correct and incorrect inflections in the third-person present tense (ℓ+ and ℓ−, respectively). M takes c and produces a distribution P_M(· | c) over its vocabulary, which we assume includes ℓ+ and ℓ−. We then compute a score for each template and average the scores over all templates to get a final score for M. The TSE score for a template can be expressed as:

TSE(c, ℓ) = 1[ P_M(ℓ+ | c) > P_M(ℓ− | c) ]    (1)

The crux of our proposal is to use a large set of lemmas, L, while drawing contexts c from a predefined set of templates T. We define two evaluation methods using L:

Equally-Weighted (EW) Here we average (1) over all lemmas ℓ in L, evaluating systematicity.
Model-Weighted (MW) Here we compute the total probability of generating a correct inflection, conditioned on generating an inflection of some lemma in L:

MW(c) = Σ_{ℓ ∈ L} P_M(ℓ+ | c) / Σ_{ℓ ∈ L} [ P_M(ℓ+ | c) + P_M(ℓ− | c) ]

This evaluates likely behavior. See Table 1 for how these metrics are computed in the toy example. Our code is available at https://github.com/bnewm0609/refining-tse.
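For a single context, the three scores can be sketched as follows. The function and variable names are ours, not from the authors' released code; `p` stands in for P_M(· | c), and each pair holds the (correct, incorrect) inflections of one lemma.

```python
def tse_score(p, pairs):
    """Binary minimal-pair accuracy: the fraction of (correct, incorrect)
    inflection pairs where the correct form is more probable under p."""
    return sum(p[pos] > p[neg] for pos, neg in pairs) / len(pairs)

def ew_score(p, all_pairs):
    """Equally-weighted: the same binary accuracy, taken over the full lemma set L."""
    return tse_score(p, all_pairs)

def mw_score(p, all_pairs):
    """Model-weighted: probability of a correct inflection, conditioned on
    generating some inflection of a lemma in L."""
    correct = sum(p[pos] for pos, _ in all_pairs)
    total = sum(p[pos] + p[neg] for pos, neg in all_pairs)
    return correct / total

# Toy distribution from the running example (illustrative numbers).
p = {"are": 0.6, "is": 0.1, "exist": 0.1, "exists": 0.2}
L = [("are", "is"), ("exist", "exists")]
tse_score(p, [("are", "is")])  # 1.0 -- only the hand-picked "be" pair
ew_score(p, L)                 # 0.5
mw_score(p, L)                 # 0.7
```

Note that MW renormalizes over the inflections of lemmas in L, so the score is insensitive to mass the model places outside the verb set (e.g., on nouns or other tenses).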

Experiments
Data We use S/V number agreement TSE templates from Marvin and Linzen (2018) and BLiMP (Warstadt et al., 2020) (for BLiMP, we use the minimal pairs differing in the verb, not the subject). For our MW and EW evaluations, we keep only templates with unique contexts (i.e., templates that do not differ solely in the verb lemma). We also ensure that all sentences start with a capital letter (for cased models) and end with a sentence-final period (for bidirectional models). Our list of English verb lemmas contains 3,562 lemmas, compiled by combining the 1,000 most frequent verb lemmas from COCA (Davies, 2008), all tokens with the VB part-of-speech tag in the Penn Treebank (1,951 lemmas; Marcus et al., 1993), and 3,250 lemmas scraped from the Giant Verb List (Essay, 2015).

Masked LMs may assign a different number of tokens to the plural and singular forms of the same lemma, and they may not model joint probabilities over multiple tokens. To enable a fairer comparison between LMs and masked LMs, we only consider lemmas where both inflections are in the wordpiece vocabulary of the models. This choice leaves 980 lemmas for cased BERT, 1,159 for uncased BERT, and 1,265 for GPT2 and RoBERTa (so results are not comparable across models). This verbal variety situates our work between Gulordava et al. (2018), whose nonce constructions abandon semantic plausibility entirely, and standard TSE, which relies on small sets of hand-picked, felicitous verbs.

To understand models' performance at the head and tail of their distributions, we additionally restrict L to the lemmas assigned high and low probabilities.
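The vocabulary filter described above can be sketched as follows. This is a simplified stand-in for the actual tokenizer check; the toy vocabulary and names are illustrative, not taken from the paper's code.

```python
def keep_lemma(singular, plural, wordpiece_vocab):
    """Keep a lemma only if both of its inflections are single tokens in the
    model's wordpiece vocabulary, so unidirectional LMs and masked LMs can
    be scored on the same one-token predictions."""
    return singular in wordpiece_vocab and plural in wordpiece_vocab

vocab = {"is", "are", "exists", "exist", "runs"}  # toy wordpiece vocabulary
inflections = [("is", "are"), ("exists", "exist"), ("runs", "run")]
kept = [pair for pair in inflections if keep_lemma(*pair, vocab)]
# "run" is missing from the toy vocabulary, so ("runs", "run") is dropped
```

Because each model has its own vocabulary, this filter yields a different lemma set per model, which is why the resulting scores are not comparable across models.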
To consider the high-confidence lemmas, for each template in the dataset, we record the MW and EW scores computed using the inflections that fall into the top p percentile of the model's distribution. We choose p ∈ {10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 97, 100}, noting that for each p, the distribution we use is the same as the one used by nucleus sampling (with a nucleus of size p).
Analogously, to focus on the low-confidence lemmas, we consider the lemmas where both inflections fall into the bottom p percentile of the model's distribution. Here, we choose p ∈ {50, 10, 1, 0.1, 0.01, 0.001, 0.0001}.
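The head restriction is the same truncation nucleus sampling applies: sort the vocabulary by probability and keep words until cumulative mass p is reached. A minimal sketch, with p given as a fraction and with our own illustrative names and numbers:

```python
def top_p_words(p_dist, p):
    """Return the top-p nucleus of a distribution: the highest-probability
    words whose cumulative mass first reaches p (p as a fraction in (0, 1])."""
    kept, mass = set(), 0.0
    for w in sorted(p_dist, key=p_dist.get, reverse=True):
        kept.add(w)
        mass += p_dist[w]
        if mass >= p:
            break
    return kept

p_dist = {"are": 0.6, "exists": 0.2, "is": 0.1, "exist": 0.1}
top_p_words(p_dist, 0.6)  # {"are"}
top_p_words(p_dist, 0.7)  # {"are", "exists"}
```

The head scores then use only the inflections inside this nucleus; the tail scores, analogously, use only inflections that fall in the bottom p percent of the sorted distribution.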

Results
Our results can be found in Table 2. We find that EW scores are almost always lower than TSE scores, indicating that TSE overestimates systematicity. On the other hand, higher MW scores reveal that sampling from the models is likely to result in correct conjugations. A potential confounder for unidirectional LMs (GPT2) is that they only receive the left context, and subject-verb sequences sometimes look like noun phrases. For example, a sentence starting with The officer can be continued by experiences joy or by experience is overwhelming. This is not an issue when phrases or clauses intervene between the subject and the verb, and it may not occur for other English syntactic phenomena or in other languages.
To investigate the extent to which models perform well on likely lemmas and poorly on unlikely lemmas, we plot these scores for the top and bottom p percentiles in Figure 1. In general, the models perform better on lemmas that they assign high probability to in both evaluations.
For example, consider the BERT cased model assessed on object relative clause constructions. The MW plot illustrates that sampling from the top 60% of the distribution will produce a grammatical output with 97% probability, while sampling from the entire distribution only does so with 91% probability. The EW plot shows that the model attains a score under 80% when assessed on verbs in the bottom 0.001% of the model's probability mass, even though considering verbs in the top 90% of the model's probability mass would yield a score over 94%. These observations extend previous work on nucleus sampling, showing that cutting off the tails of the distribution generates more syntactically correct outputs (Holtzman et al., 2020).
There are two additional factors to keep in mind for these plots. First, the heads and tails of the distributions often contain very few lemmas eligible for use in score calculation. Second, models often assign probability mass to other lemma inflections (e.g. the past tense) that do not allow us to assess models' S/V number agreement ability. See the Appendix for related plots.

Qualitative Results
Earlier, we motivated MW with the observation that the lemmas TSE uses might be unlikely, and therefore give an unrealistic depiction of models' likely syntactic behavior. Table 3 shows two examples where this happens and leads to a deceptively low score on a template for a model (here BERT-large-cased).
In the first column, the lemma set used by TSE contains like, hate, and love, and the model puts more probability on like than likes, leading to a TSE score of 0.67. However, the most probable lemmas are meet, encounter, see, and face, all of which the model conjugates correctly.
In the second column, the MW score again rewards the model for correct conjugations while TSE does not. As in the first example, the lemma set used by TSE contains like, hate, and love, and like is conjugated incorrectly. However, the more probable lemmas pilot, control, employ, train, use, include, have, order, command, and feature are all conjugated correctly.

Related Work
Evaluating Models Some previous work uses minimal pairs to evaluate the syntactic representations of models (e.g., Goldberg, 2019; Linzen et al., 2016). In these cases, minimal-pair evaluations should align with models' performance as language models, which is measured by our MW score.
Psycholinguistics Recent work has also applied experimental procedures from psycholinguistics to compare human and neural model language processing (Futrell et al., 2019). Experiments investigating garden path sentences' surprisal, S/V number agreement, and other specific syntactic phenomena reveal that models and humans have different patterns of errors and processing (Linzen and Leonard, 2018;Ettinger et al., 2018;Wilcox et al., 2020;van Schijndel and Linzen, 2020). Many of these phenomena are rare, so evaluations with templated minimal pairs complement perplexity as a metric for evaluating models' syntactic generalization (Hu et al., 2020). When measuring syntactic ability on arbitrary lemmas, our EW metric would be preferred.

Lexical Choice in Syntactic Evaluation
Prior work has also considered how the lexical items in minimal pairs affect the syntactic evaluation of models. Marvin and Linzen (2018) note that certain verbs are preferentially conjugated correctly (they observe that RNNs conjugate be correctly more often than swim) and claim that this is due to the unigram frequency of the verbs. Similarly, we observe that models succeed on our MW metric, indicating that they correctly inflect verbs with high in-context probability under the model. Relatedly, Yu et al. (2020) investigate the nouns used in TSE minimal pairs and find that language model performance at subject-verb number agreement is uncorrelated with the unigram probability of the noun. We instead focus on the model-estimated in-context probability of the verb in minimal pairs, finding that model performance increases with the model probability.
Finally, Gulordava et al. (2018) argue that the results of syntactic evaluations are influenced by semantic associations between tokens, so they remove these associations by substituting each token with a different, randomly selected token with the same syntactic role. Although the resulting minimal pairs are infelicitous, models are still able to predict the correct inflection with above-chance accuracy. Our methods are similar in that some of the verbs in our evaluation set are infelicitous; however, the contexts we use are semantically coherent. Rather than avoiding semantic effects by creating infelicitous contexts, we marginalize them out by using a large set of verb lemmas. This makes our metrics less stringent than those of Gulordava et al. (2018), but it captures a potentially more realistic setting in which we expect our models to perform systematically.

Conclusion
As neural models have proven successful at NLP tasks and as potential psycholinguistic models, we continue to refine our understanding of how and whether they capture human-like language faculties. TSE provides a useful framework for addressing this question, but its current implementation focuses on a limited group of hand-chosen verbs, so it inaccurately reflects models' syntactic generalization abilities. In response, we propose two minimal-pair evaluations: equally-weighted and model-weighted syntactic evaluation. The first focuses on systematicity by expanding the set of verbs TSE considers, and illustrates that language models still struggle with S/V agreement for unlikely verbs. The second focuses on likely behavior by computing the probability of producing a correctly conjugated verb, and illustrates that despite systematic shortcomings, language models generate syntactically valid utterances with high probability. By introducing these metrics, we hope to arrive at a clearer picture of the syntactic abilities of language models.

The metrics we propose have been developed specifically with corpora using Standard American English in order to evaluate models' abilities to understand Standard American English syntax. This focus means that models performing well under these evaluations may perform poorly in other English dialects, and they may not understand all syntactic systems, for example in other languages. Finally, our MW metric concerns itself with how models are likely to perform during generative processes (such as beam search or sampling). Performing well on this metric means models will be able to generate more human-like text, which has potential downstream harms such as misinformation generation or other inauthentic behavior in situations where written language is the medium used for communication.


Appendix: Additional Plots

[Figure: the total probability mass the lemmas account for (y-axis) against the percentile cutoff (x-axis).]
Note that even when considering all of the lemmas (at p = 100%), there is probability mass not covered by our inflections. This mass often falls on other inflections of the verbs (e.g., past-tense forms) or on other vocabulary items.

Figure 4: The proportion of templates in the datasets where models assign no probability mass to lemmas in the top or bottom p% of their distributions. The y-axis is the proportion of lemmas that are rejected (i.e., values closer to one mean that the scores are calculated from fewer templates); the x-axis is again the percentile cutoff. Note that the bottom-most cutoffs often have a large proportion of invalid lemmas, so these scores are based on fewer lemmas.