Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers

How well do language models deal with quantification? In this study, we focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models because the sentence components without the quantifier are likely to co-occur, and 'few'-type quantifiers are rare. We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do all the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. This inverse scaling is consistent with previous work suggesting that larger models increasingly reflect online rather than offline human processing, and we argue that the decreasing performance of larger models may challenge uses of language models as the basis for natural language systems.


Introduction
Quantifiers can dramatically alter the meaning of an utterance. Consider the sentences in (1).

(1) (a) Most sharks are harmless.
(b) Most sharks are dangerous.
(c) Few sharks are harmless.
(d) Few sharks are dangerous.

Despite the fact that (a) and (c) have the same content words in the same syntactic arrangement, the statements have starkly different meanings. The same is true of (b) and (d). Being able to successfully comprehend these differences is useful, and in an example such as this one, vitally important.
Yet current work suggests that language models deal poorly with quantifiers - they struggle to predict which quantifier is used in a given context (Pezzelle et al., 2018; Talmor et al., 2020), and also perform poorly at generating appropriate continuations following logical quantifiers (Kalouli et al., 2022). This is especially concerning given the recent trend of using large language models (sometimes referred to as 'foundation models'; Bommasani et al., 2021) as general systems that can perform multiple tasks, including question answering, without specific training (Brown et al., 2020; Raffel et al., 2020; Lin et al., 2021; Srivastava et al., 2022; Hoffmann et al., 2022; Rae et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022). It is thus crucial that such systems be able to distinguish among sentences like those in (1) in human-like ways both during training and when generating responses.
The aim of the present study is to evaluate how well language models take into account the meaning of a quantifier when generating the text that follows it, and to investigate whether this scales with model size. We are particularly interested in the question of whether language models exhibit inverse scaling - that is, whether as model size increases, performance decreases rather than increases (Perez et al., 2022; McKenzie et al., 2022a). Inverse scaling is an issue of serious concern for developing and training new language models, since inverse scaling could indicate 'outer misalignment' (Perez et al., 2022) - that the training approach is leading to models that produce undesirable outputs, which may get worse as performance at training objectives increases. Inverse scaling is also a concern for models' ultimate use. As models increase in size and perform better at a wider range of benchmarks (for recent examples, see, e.g., Srivastava et al., 2022; Chowdhery et al., 2022), they may be increasingly assumed to be trustworthy and general-purpose, and thus able to perform well on tasks on which they have not been tested (Raji et al., 2021). This could lead to a range of possible harms, from misidentifying whether something is dangerous or not (as in the opening example), to amplifying biases (Bender et al., 2021).
To test how well language models deal with quantifiers, we follow the approach of Ettinger (2020) in using sentences from a study on human language comprehension to inform our evaluation. Ettinger (2020) found that following a negation, the predictions of BERT-base and BERT-large in simple sentences expressing a proposition with or without negation (from Fischler et al., 1984) do not appear sensitive to negation - for example, BERT-large predicts the final word of a robin is a bird to be more likely than a robin is a tree, but also predicts that a robin is not a bird is more likely than a robin is not a tree. In this way, the models' predictions more closely match those made by humans 'online' - that is, incrementally during the process of language comprehension - than our fully-formed 'offline' judgements: in their original study, Fischler et al. (1984) found that the word bird elicited an N400 response of smaller amplitude than tree in both contexts, indicating that it was more strongly predicted.
Similar effects have been reported (Kassner and Schütze, 2020; Kalouli et al., 2022) for other transformers such as Transformer-XL (Dai et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020), as well as ELMo (Peters et al., 2018). Worse, recent work suggests that as language models increase in size, their ability to deal with negation may degrade: an inverse scaling relationship has been reported for performance at a wide range of tasks when prompts include negation (McKenzie et al., 2022b; Jang et al., 2023), though it is possible that this may reverse at extremely large scales (Wei et al., 2022).
Negation may be particularly challenging for statistical language models because its presence radically alters the meaning of a sentence, but negation occurs in only about 10% of sentences (Jiménez-Zafra et al., 2020). Quantifiers similarly impose radical modulations to meaning while also being relatively infrequent (see Appendix B). In the present study, we focus on quantifiers indicating typicality such as most and few. To the best of our knowledge, only one study has evaluated model predictions following any quantifiers (Kalouli et al., 2022), and it focused on words corresponding to logical quantifiers such as all, every, and some. The few studies involving the quantifiers we address either focus on predicting the quantifier itself (Pezzelle et al., 2018; Talmor et al., 2020), or use RNNs to investigate modeling significant effects on the N400 without any form of evaluation (Michaelov and Bergen, 2020). This study, therefore, represents the first attempt to explicitly evaluate the predictions of language models following most- and few-type quantifiers.
In the present study, we carry out two experiments. In the first, following Ettinger (2020), we use the stimuli from a previously published N400 study (Urbach and Kutas, 2010). In it, Urbach and Kutas (2010) found that while most- and few-type quantifiers do impact N400 amplitude, it is not enough to reverse predictions - few farmers grow crops elicits a smaller N400 response than few farmers grow worms, indicating that crops was more strongly predicted than worms, even though experimental participants judged it to be less plausible offline. We test whether language models show the same pattern of insensitivity towards the quantifiers that humans do in online measures. In this way, we test how closely the predictions of language models correlate with those underlying the human N400 response.
In our second experiment, we extend our study further. Experiment 1 aims to replicate the original N400 results of Urbach and Kutas (2010); however, one thing that it does not account for is that while a given complete sentence (e.g., few farmers grow crops.) can be highly unlikely and implausible, sentences beginning with the same words may not be (as in the plausible sentence few farmers grow crops in the winter). Experiment 1 does not distinguish between these possibilities, and while it is important to test the sensitivity of language models to few-type quantifiers in either case, a failure to show a difference for complete sentences including the final period (e.g., few farmers grow crops.) is more concerning. Thus, in Experiment 2, we run the same stimuli as Experiment 1, but including a period following the final word (e.g., crops./worms.).

Experiment 1

Materials
In this experiment, we use all the stimuli from two experiments carried out by Urbach and Kutas (2010). These are made up of 120 sentence frames with 8 different sentence types falling into 4 experimental conditions, for a total of 960 sentences. The 4 conditions had a 2x2 design - each stimulus was either typical (T) or atypical (A), and had either a most-type or a few-type quantifier. The quantifiers used in sentences (a)-(d) differed by sentence frame; see Appendix B for a full list.
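As a concrete illustration of this design, the sketch below builds the four conditions for a single sentence frame. The frame, words, and most/few pairing are taken from the farmers grow crops/worms example used elsewhere in the paper; the actual stimuli and the specific quantifiers used in each frame come from Urbach and Kutas (2010).

```python
# Illustrative construction of the 2x2 (quantifier type x typicality) design
# for one sentence frame; the frame and words follow the paper's running
# "farmers grow crops/worms" example, not the stimulus files themselves.
from itertools import product

frame = "{quantifier} farmers grow {word}"
quantifiers = {"most-type": "Most", "few-type": "Few"}
words = {"typical": "crops", "atypical": "worms"}

stimuli = {
    (q_type, w_type): frame.format(quantifier=q, word=w)
    for (q_type, q), (w_type, w) in product(quantifiers.items(), words.items())
}
# {("most-type", "typical"): "Most farmers grow crops", ...}
# Each frame additionally appeared with a second matched quantifier pairing,
# yielding the 8 sentence types per frame described above.
```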

Language Models
To cover a range of language models with different training data and numbers of parameters, we run our analyses on the GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), GPT-Neo (Black et al., 2021; including GPT-J, Wang and Komatsuzaki, 2021), and OPT (Zhang et al., 2022) language models. We also include an analysis of the first series of InstructGPT models (text-davinci-001 etc.), which were finetuned on human-written and highly-rated model-generated responses (OpenAI, 2023).

Evaluation
For each stimulus sentence, we calculate the surprisal of the critical word, that is, the word for which the N400 response was measured in the original study. Because humans only encounter the context preceding the critical word when processing the word, and because the language models we analyze are all autoregressive, we only consider the surprisal of the critical word given its preceding context. To do this, we truncated the sentence before the critical word, and then used the relevant language model to calculate the probability p of the target word given the preceding context, which was then converted to surprisal S following Equation 1.
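On the standard definition that Equation 1 presumably expresses, surprisal is the negative base-2 logarithm of this conditional probability, S = -log2(p), so less expected words receive higher surprisal values.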
In previous work of this type (e.g., Ettinger, 2020), only words that were single tokens in the models' vocabularies were used. In this study, all models are autoregressive, so for multi-token words, consecutive sub-word tokens can be predicted, and the product of their probabilities is a well-defined probability for the whole word. The surprisal of such words, then, is the sum of the surprisals of the sub-word tokens. Calculating surprisal this way allows us to compare the predictions of all the models for all the stimuli in the original experiment.
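As a concrete illustration, the following is a minimal sketch of this computation using the Hugging Face transformers library; it is not the paper's released code (which is available at https://osf.io/vjyw9), and the use of GPT-2 small here is purely illustrative.

```python
# A minimal sketch of whole-word surprisal with an autoregressive model;
# gpt2 is used only as an illustration of the procedure described above.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def word_surprisal(context: str, word: str) -> float:
    """Surprisal (in bits) of `word` given `context`, summed over its sub-word tokens."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(word, return_tensors="pt").input_ids  # may span several tokens
    input_ids = torch.cat([context_ids, word_ids], dim=-1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    n_context = context_ids.shape[-1]
    total = 0.0
    for i in range(word_ids.shape[-1]):
        # Logits at position t predict the token at position t + 1, so the
        # i-th word token is scored by the distribution at n_context + i - 1.
        token_log_prob = log_probs[0, n_context + i - 1, word_ids[0, i]]
        total += -token_log_prob.item() / math.log(2)  # nats -> bits
    return total


# GPT-2-style BPE vocabularies treat the leading space as part of the word,
# e.g. word_surprisal("Few farmers grow", " crops") vs. " worms".
```

For the GPT-3 and InstructGPT models, which are accessed through the OpenAI API rather than loaded locally, the analogous quantity would presumably be obtained from the per-token log probabilities the API returns.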
In order to evaluate how well each model takes into account the quantifier in its predictions, we compared which of the two possible critical words (typical or atypical) had a lower surprisal, i.e., was more strongly predicted by the model. To align with human plausibility judgements, following a most-type quantifier, the typical continuation was judged to be correct, and following a few-type quantifier, the atypical continuation was judged to be correct. Accuracy was calculated as the fraction of the stimulus pairs for which the model predicted the appropriate critical word - that is, predicted the correct continuation more strongly than the incorrect one. For example, the set of stimuli presented in (2) is made up of 4 pairs of stimuli, and for a model to achieve 100% accuracy (4/4), it would need to predict (a) over (b), (d) over (c), (e) over (f), and (h) over (g). This design intrinsically controls for any differences in unconditioned probability among the final words themselves.
Following Ettinger (2020), we also analyzed model sensitivity to the quantifiers. In the present study, this corresponds to the question of whether, for a given sentence frame, the model makes a different prediction following a few-type quantifier than it does following a most-type quantifier. We defined sensitivity as the proportion of stimuli for which the model correctly predicts the critical word following both the most-type and the few-type quantifier. Thus, the stimuli in each sentence frame provide 2 data points for sensitivity: in (2), sensitivity is calculated for (a)-(d) and for (e)-(h). For the (a)-(d) stimuli, a model would be considered sensitive to the quantifier if it correctly predicted (a) over (b) and (d) over (c). Code and data are available at https://osf.io/vjyw9.
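Taken together, the accuracy and sensitivity measures described above could be computed along the following lines. This is a hypothetical sketch rather than the released code: the data layout and field names are illustrative, and each element of quadruples is assumed to hold the critical-word surprisals for one sentence frame and one matched most-type/few-type quantifier pairing.

```python
# Hypothetical scoring sketch; field names and data layout are illustrative.
def score(quadruples):
    correct_pairs = 0
    sensitive = 0
    for q in quadruples:
        # Lower surprisal = more strongly predicted. After a most-type
        # quantifier the typical word should win; after a few-type
        # quantifier the atypical word should win.
        most_correct = q["most_typical"] < q["most_atypical"]
        few_correct = q["few_atypical"] < q["few_typical"]
        correct_pairs += int(most_correct) + int(few_correct)
        # A quadruple counts as sensitive only if both predictions are correct.
        sensitive += int(most_correct and few_correct)
    accuracy = correct_pairs / (2 * len(quadruples))
    sensitivity = sensitive / len(quadruples)
    return accuracy, sensitivity
```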

Results
Each model's accuracy at predicting the critical words following most- and few-type quantifiers is shown in Figure 1. All model series show the same general tendencies in accuracy: (1) they perform quite poorly for few-type quantifiers but relatively well for most-type quantifiers; and (2) as model size increases, word prediction following most-type quantifiers improves, but it degrades following few-type quantifiers. Figure 1 does show small exceptions to this pattern. From GPT-2 762M to 1542M and from InstructGPT 13B to 175B, while most-performance increases, few-performance does not decrease. Furthermore, from OPT 125M to 350M, and from OPT 2.7B to 6.7B, there is actually a slight improvement. Nonetheless, these differences are small compared to the overall decreases in performance, and the general trends are still clear - for example, no model performs better on few-type quantifiers than a model two or more sizes smaller.
With sensitivity, as shown in Figure 1, some models improve as they increase in size, and some get worse; however, even the greatest distance between the sensitivity of two models in the same series (InstructGPT 2.7B and 13B) is only 3.4%. Thus, other than the general fact that sensitivity is low across all models, there does not appear to be any clear pattern, suggesting that sensitivity does not drive the effects seen in accuracy. All accuracy and sensitivity scores can be found in Appendix A.

Discussion
These results show that contemporary autoregressive transformer models perform poorly on few-type quantifiers, and that as these models increase in size, they tend to improve at predicting words following most-type quantifiers but get worse at predicting words following few-type quantifiers. In fact, we see that models that better predicted the more typical word after a most-type quantifier were also worse at predicting the less typical word following a few-type quantifier. The fact that models were evaluated on which of the two options they predicted to be more likely, combined with generally poor and largely invariant sensitivity (peaking at 5%), suggests that the larger models generally made predictions increasingly in accordance with typicality, overwhelming any sensitivity to quantifier type. This aligns with previous work on negation and logical quantifiers in language models (Ettinger, 2020; Kassner and Schütze, 2020; Kalouli et al., 2022), as well as the N400 results of the original study by Urbach and Kutas (2010).

Experiment 2

Method
The models and evaluation approach were identical to Experiment 1. The materials were identical to Experiment 1 with the single difference that all nouns were followed by a period, and the surprisal of this period was included when calculating the total surprisal of the critical word (e.g., nuts. or nails. for the example presented in (2)). Thus, surprisal reflected both the surprisal of the critical word in context and the surprisal of the word being followed by a period, i.e., being the last word in the sentence. For a discussion of modeling the probability of sentence-final words in this way, see Szewczyk and Federmeier (2022).
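In terms of the hypothetical word_surprisal sketch given for Experiment 1, the only change is that the period is scored together with the critical word:

```python
# Experiment 1 scores the bare critical word; Experiment 2 appends the
# sentence-final period, so the total also reflects how likely the word
# is to end the sentence. The sentence is from the paper's running example.
exp1 = word_surprisal("Few squirrels gather", " nuts")
exp2 = word_surprisal("Few squirrels gather", " nuts.")
```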

Results
Results are shown in Figure 2. As in Experiment 1, larger models perform worse overall. However, there is a small improvement in the very largest GPT-3 and InstructGPT models relative to the second-largest models of the same type, both in few-type accuracy and sensitivity. Performance also increases on these metrics between OPT 2.7B and OPT 6.7B; however, it decreases again with OPT 13B. All accuracy and sensitivity scores can be found in Appendix A.

Discussion
Overall, the results are similar to those of Experiment 1: larger models of the same type perform worse than smaller models. Whether the small improvement of the largest GPT-3 and InstructGPT models relative to the second-largest models is a fluctuation like that seen for OPT or the beginnings of a U-shaped curve (see Wei et al., 2022) is a question for further research.

General Discussion
In this study, we investigated whether language models show the same insensitivity towards few-type and most-type quantifiers observed in the predictions made by humans during language comprehension, as indexed by the N400 response. We find that when tested on the same stimuli, they do, predicting the ostensibly implausible few squirrels gather nuts to be more likely than few squirrels gather nails. Moreover, we find that as language models increase in size, they tend to show this effect to a greater extent, an example of inverse scaling. Based on our analysis of sensitivity and accuracy with most-type quantifiers, we hypothesize that these results are due to a low degree of sensitivity to quantifiers and an increase in sensitivity to typicality. In other words, language models appear to be increasingly sensitive to the fact that squirrels gather nuts is more plausible than squirrels gather nails, but not to the effect on meaning that is caused by a preceding most or few.
It is often assumed that as models increase in size and are trained on more data, their performance on natural language tasks generally improves - indeed, evidence supports this (Brown et al., 2020; Raffel et al., 2020; Lin et al., 2021; Srivastava et al., 2022; Hoffmann et al., 2022; Rae et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022). However, the predictions of larger models and those trained on more data also increasingly correlate with human incremental online predictions, in particular those indexed by N400 amplitude (Frank et al., 2015; Aurnhammer and Frank, 2019a,b; Michaelov and Bergen, 2020; Merkx and Frank, 2021; Michaelov et al., 2021, 2022). The two are often aligned - it is easier for humans to process well-formed sentences with plausible semantics (Frisch and Schlesewsky, 2005; Nieuwland et al., 2020). But in cases such as the present study, the two are not aligned, and we see instead that the predictions of larger models correlate better with human online predictions, even when these are contrary to offline judgements. Thus, the increased performance we see at tasks corresponding to offline human judgements - and note that virtually all manually-annotated tasks are based on offline human judgements - may in fact be a by-product of the models' predictions resembling the online predictions.
Fortunately, the literature boasts a wealth of psycholinguistic studies where metrics of online prediction such as the N400 appear to conflict with offline judgements. Future work could use these to identify phenomena where language models may struggle to make predictions in line with human judgements. Such cases are important to detect as use of LMs becomes more widespread. But by the same token, the present study shows that as language models increase in size, even when augmented by finetuning on desirable responses, they can make predictions that align less and less with explicit human judgements.
This may be a clear indication of an inherent 'outer misalignment' present in language models: while humans might like language models to generate plausible sentences, by their nature they can only generate the most statistically probable ones. Just as there is no guarantee of accuracy or coherence (Bender et al., 2021), there is no guarantee of plausibility. While it may be possible to tailor training to avoid specific known issues, this misalignment between probability and plausibility may pose a fundamental challenge for current approaches that aim to use language models as general-purpose natural language systems.

Limitations
There are two main limitations to our study. The first is that the stimuli used were limited to those provided by Urbach and Kutas's (2010) study. This is because, as stated, we wanted to be able to compare the patterns in the language models' predictions to the patterns in the human N400 response. Thus, we do not look at logical quantifiers as Kalouli et al. (2022) do, or at any other quantifiers that have previously been studied (e.g., Pezzelle et al., 2018; Talmor et al., 2020).
The other (and perhaps more important) limitation is in the models we were able to use. Crucially, we were not able to access models larger than GPT-3 175B, such as PaLM 540B (Chowdhery et al., 2022). This is important because recent work has shown that some inverse scaling patterns become U-shaped (i.e., as language model size increases, performance degrades and then improves again) with such larger models (Wei et al., 2022).

Ethics Statement
Our work complies with the ACL Ethics Policy. Beyond this, we are not aware of any way in which the results of this study may be harmful - in fact, if anything, identifying the limitations of large language models is likely to reduce possible harms by demonstrating cases where their use is not suitable.
From an environmental perspective, we did not train any models; we only used pretrained models for analysis, limiting energy consumption. With the exception of the GPT-3 and InstructGPT models and OPT 13B, all analyses were run on an NVIDIA RTX A6000 GPU, taking a total of 43 minutes. OPT 13B was too large to run on this GPU, and was thus run on an Intel Dual Xeon E7-4870 CPU for a total of 22 hours and 39 minutes. Finally, the GPT-3 and InstructGPT models were run using the OpenAI API, and thus we do not have access to information about the GPUs used.

Figure 1: Accuracy and sensitivity of all models.

Figure 2: Accuracy and sensitivity of all models on stimuli with added periods (e.g., Few squirrels gather nuts.).

Table 1: Accuracy and sensitivity scores for all models.

B Quantifiers

Table 2 lists all quantifiers used and the proportion of sentences in WikiText-103 that contain them.

Table 2: In each sentence frame, most-type and few-type quantifiers were matched based on their meanings as well as length in number of words (Urbach and Kutas, 2010). Matched quantifiers are shown beside each other. As can be seen, few is matched to both most and many. The frequency of each quantifier is given in terms of the proportion of sentences in WikiText-103 (Merity et al., 2017) that contain it. The total frequencies are the number of sentences in WikiText-103 that contain at least one of either the few-type or most-type quantifiers, not the sum of the individual quantifier frequencies.
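As an illustration, frequencies of this kind could be estimated along the following lines. This is a sketch under stated assumptions rather than the paper's actual preprocessing: the Hugging Face wikitext-103-raw-v1 release, the crude regex sentence splitter, and the three quantifiers listed are all stand-ins (the full quantifier list is given in Table 2).

```python
# Illustrative estimate of the proportion of WikiText-103 sentences containing
# each quantifier; dataset release, sentence splitting, and the quantifier
# subset below are assumptions, not the paper's actual pipeline.
import re
from collections import Counter

from datasets import load_dataset

quantifiers = ["most", "many", "few"]  # illustrative subset; see Table 2

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
counts = Counter()
n_sentences = 0
for record in wikitext:
    # Crude sentence split on end-of-sentence punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", record["text"]):
        if not sentence.strip():
            continue
        n_sentences += 1
        lowered = sentence.lower()
        for q in quantifiers:
            if re.search(rf"\b{q}\b", lowered):
                counts[q] += 1

for q in quantifiers:
    print(f"{q}: {counts[q] / n_sentences:.4%} of sentences")
```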