Collateral facilitation in humans and language models

Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.


Introduction
Humans process words more easily when they are more contextually predictable, whether predictability is determined by humans (Fischler and Bloom, 1979; Brothers and Kuperberg, 2021) or language models (McDonald and Shillcock, 2003; Levy, 2008; Smith and Levy, 2013). Work on the N400, a neural signal of processing difficulty, has provided evidence that the neurocognitive system underlying human language comprehension preactivates words based on the extent to which they are predictable from the preceding context. Thus, predictable words are easier to process because they or their features have already been activated before they are encountered (Van Petten and Luka, 2012). This has led many to argue that we should consider the human language comprehension system to be engaging in prediction (DeLong et al., 2005; Kutas et al., 2011; Van Petten and Luka, 2012; Bornkessel-Schlesewsky and Schlesewsky, 2019; Kuperberg et al., 2020; DeLong and Kutas, 2020; Brothers and Kuperberg, 2021).
However, words that are either semantically related to the elements of the preceding context or to the most likely next word are also processed more easily, even if they are semantically implausible and ostensibly unpredictable. These are known as related anomaly effects. For an example of the former, consider the sentences in (1) that were used as experimental stimuli by Metusalem et al. (2012).
(1) My friend Mike went mountain biking recently. He lost control for a moment and ran right into a tree. It's a good thing he was wearing his ______.
Helmet is the most predictable continuation of the sentence, as determined based on cloze probability (Taylor, 1953, 1957), the proportion of people who fill in a gap in a sentence with a specific word. Thus, unsurprisingly, helmet elicited the smallest N400 response, indicating that it is most easily processed. Dirt and table are both implausible continuations, and equally improbable based on human responses (both have a cloze probability of zero). Yet Metusalem et al. (2012) found that dirt, which is semantically related to the preceding context of mountain biking, elicits a smaller N400 response than table, which is not. This suggests that something about dirt's relation to the mountain biking event causes it to be preactivated more than table, despite their seemingly equal implausibility and unpredictability.
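The definition of cloze probability given above can be illustrated with a short sketch. The responses below are hypothetical and are not data from Metusalem et al. (2012); the function name is likewise introduced here only for illustration.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Return the proportion of participants who produced each completion."""
    # Normalize case and whitespace so that "Helmet" and "helmet" count together.
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical completions from four participants for the sentence frame in (1).
responses = ["helmet", "Helmet", "helmet", "gear"]
cloze = cloze_probabilities(responses)

# A word that no participant produced has a cloze probability of zero.
cloze_dirt = cloze.get("dirt", 0.0)
cloze_table = cloze.get("table", 0.0)
```

Under this definition, any two words that no participant produces are indistinguishable: both receive a cloze probability of exactly zero, which is why cloze alone cannot separate dirt from table.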
The sentences in (2), used as experimental stimuli by Ito et al. (2016), provide an example of the other form of related anomaly mentioned above, in which a word semantically related to the most probable continuation (in this case, the one with the highest cloze) is easier to process than one that is not. Even though tail and tyre are both implausible continuations with a cloze probability of zero, Ito et al. (2016) find that tail, which is semantically related to the highest-cloze continuation dog, elicits a smaller N400 response than tyre, which is not.
(2) Meg will go to the park to walk her ______ tomorrow.
In sum, words related to elements of the preceding context or to the most probable continuation of a sequence appear to be more preactivated in the brain than words that are not, even when both are highly anomalous. This effect has been replicated many times (Kutas, 1993; Federmeier and Kutas, 1999; Metusalem et al., 2012; Rommers et al., 2013; Ito et al., 2016; DeLong et al., 2019; for review see DeLong et al., 2019).
The key question, therefore, is whether the same neurocognitive system underlying the predictability effects on the N400 also underlies related anomaly effects. Under one account (DeLong et al., 2019; DeLong and Kutas, 2020), the predictive system that underlies predictability effects also leads to these related anomalous words being 'collaterally facilitated' (DeLong and Kutas, 2020, p. 1045) due to their shared semantic features. Under this account, therefore, related anomaly effects can all be explained as by-products of our predictive system and the semantic organization of information in the brain. However, there is no direct evidence that this is the case. Indeed, given the metabolic costs of preactivation (Brothers and Kuperberg, 2021), it may intuitively seem unlikely that an efficient predictive system would lead to implausible and otherwise anomalous words being preactivated. Accordingly, many researchers have argued that one or more associative mechanisms are required to explain related anomaly and other similar effects (Lau et al., 2013; Ito et al., 2016; Frank and Willems, 2017; Federmeier, 2021).
As systems designed specifically to predict the probability of a word given its context, language models offer a means to test the viability of the former hypothesis. If language models calculate that related but anomalous words are more predictable than unrelated anomalous words, this would demonstrate that related anomaly effects can be produced by a system engaged in prediction alone. This would show that it is possible that related anomalies can be 'collaterally facilitated' (DeLong and Kutas, 2020, p. 1045) by a predictive mechanism in human language comprehension. Thus, it would remove the need to posit additional associative mechanisms on the basis of related anomaly effects, which could greatly simplify our understanding of human language comprehension.
This is what we test in the present study. We run the stimuli from 3 psycholinguistic experiments carried out in English (Ito et al., 2016; DeLong et al., 2019; Metusalem et al., 2012) through 8 contemporary transformer language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Lan et al., 2020; Conneau et al., 2020; Black et al., 2021; Wang and Komatsuzaki, 2021; Lin et al., 2021), calculating the surprisal (negative log-probability) of each word for which the N400 was measured. We then compare whether, in line with the N400 response, anomalous words that are semantically related to the context have significantly lower surprisals than unrelated words.

Related work
There has been a wide range of attempts to computationally model the N400 (Parviz et al., 2011; Laszlo and Plaut, 2012; Laszlo and Armstrong, 2014; Rabovsky and McRae, 2014; Frank et al., 2015; Ettinger et al., 2016; Cheyette and Plaut, 2017; Brouwer et al., 2017; Rabovsky et al., 2018; Venhuizen et al., 2019; Fitz and Chang, 2019; Aurnhammer and Frank, 2019; Michaelov and Bergen, 2020; Merkx and Frank, 2021; Uchida et al., 2021; Szewczyk and Federmeier, 2022; Michaelov et al., 2022). One of the most successful and influential approaches has been to model the N400 using the surprisal calculated from neural language models: surprisal has been found to be a significant predictor of single-trial N400 data (Frank et al., 2015; Aurnhammer and Frank, 2019; Merkx and Frank, 2021; Michaelov et al., 2021; Szewczyk and Federmeier, 2022; Michaelov et al., 2022), and has been found to be similar to the N400 response in how it is affected by a range of experimental manipulations (Michaelov and Bergen, 2020; Michaelov et al., 2021). A key finding is that better-performing and more sophisticated language models perform better at predicting the N400 (Frank et al., 2015; Aurnhammer and Frank, 2019; Michaelov and Bergen, 2020; Merkx and Frank, 2021; Michaelov et al., 2021, 2022). For this reason, we use contemporary transformer language models in the present study. We use experimental stimuli from 3 experiments. Stimuli from one of these experiments (Ito et al., 2016) have been previously used in computational analyses of the N400. This is one of several sets that Michaelov and Bergen (2020) attempt to model using recurrent neural network (RNN) language models, finding that they can indeed calculate that words related to the highest-cloze continuation are more predictable than unrelated words. In the present study, we test whether this result can be replicated on a larger number of language models, and specifically, transformer language models.
There has also been work looking at how language models deal with semantic relatedness to the highest-cloze continuation based on stimuli from other N400 experiments. Michaelov and Bergen (2020), for example, find that in cases where the related and unrelated words are both plausible, the related continuations are more strongly predicted by RNNs (Gulordava et al., 2018; Jozefowicz et al., 2016), in line with the original N400 results (Kutas, 1993). Michaelov et al. (2021) conceptually replicate this finding on a different dataset (Bardolph et al., 2018) using one of the same RNNs (Jozefowicz et al., 2016) and GPT-2 (Radford et al., 2019). However, these prior efforts differ from the present study in that they investigate N400s and surprisal for words that are all plausible continuations of the sentence, and that have a low but generally non-zero cloze probability. In the stimuli analyzed in the present study, by contrast, both the related and unrelated words are anomalous: they have a cloze probability of zero, and are implausible continuations. Thus, their preactivation does, at least intuitively, appear to be more clearly 'collateral'.
We are only aware of one previous study that directly compares the predictions of transformers and the human N400 response on related anomaly stimuli. Ettinger (2020) evaluates BERT in terms of its similarity to cloze, because the predictions of a language model, being incremental, may show similar effects to those found in the N400 (see also Michaelov and Bergen, 2020 for discussion). For this reason, Ettinger (2020) tests how good BERT is at predicting the highest-cloze (most probable) continuations in the stimuli over anomalous but semantically related continuations, but does not directly look at the related anomaly effect, that is, whether the related anomalous continuations are more strongly predicted than the unrelated anomalous continuations. Thus, to the best of our knowledge, the present study is the first to investigate whether the predictions of transformer language models display related anomaly effects like humans do.
Finally, there has been some work investigating whether language models display priming effects (e.g. Prasad et al., 2019; Misra et al., 2020; Kassner and Schütze, 2020; Lin et al., 2021; Lindborg and Rabovsky, 2021). The effect found by Metusalem et al. (2012), namely that words related to the events described in the context are preactivated more strongly than words that are not, is a form of semantic priming, as it results in the increased preactivation of a word based on the semantic content of a stimulus that has been recently encountered (i.e. the event described in the preceding linguistic context). Thus, our investigation of the patterns in the prediction of the stimuli from Metusalem et al. (2012) is intended to further our knowledge of priming in language models, specifically whether there are systematic ways in which context shapes the extent to which anomalous words are predicted.

General Method
In this study, we took the stimuli from a range of experiments (Ito et al., 2016; DeLong et al., 2019; Metusalem et al., 2012) and ran them through a number of transformer language models. We ran all eight models using the transformers package (Wolf et al., 2020). We chose these models to cover a number of both autoregressive (GPT-2, GPT-Neo, GPT-J, XGLM) and masked (BERT, RoBERTa, ALBERT, XLM-RoBERTa) language model architectures. Given the recent increase in popularity of multilingual language models, we also made sure to include one autoregressive (XGLM) and one masked (XLM-RoBERTa) multilingual language model, in case there is a difference based on the number of languages that a model is trained on.
All experimental stimuli used in the present study have been made available by the original authors of their respective papers as appendices or supplementary materials. In our analysis, we truncated all stimuli to be the preceding context of the critical word (the word for which the N400 was measured). We then used the language models to calculate the probability of the next word, and negative log-transformed (using a logarithm of base 2, following Futrell et al., 2019) these probabilities to calculate the surprisal of each word. For words not present in the vocabulary of a given model, we tokenized the word and then progressively calculated the surprisal of each sub-word token given the preceding context, with the sum of all the surprisals (equivalent to the negative logarithm of the product of all the probabilities) being used as the total surprisal for the word. In this way, we calculated the surprisal of each critical word given its preceding context only.
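The arithmetic of this procedure can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the sub-word probabilities are hypothetical values standing in for model outputs, and the function names are introduced here only for the example.

```python
import math

def surprisal_bits(probability):
    """Surprisal of a single token: the negative base-2 log of its probability."""
    return -math.log2(probability)

def word_surprisal(subword_probabilities):
    """Total surprisal of a word split into sub-word tokens.

    Each entry is assumed to be the probability of one sub-word token given
    the preceding context and all earlier sub-word tokens of the same word.
    """
    return sum(surprisal_bits(p) for p in subword_probabilities)

# Hypothetical conditional probabilities for a critical word split into two tokens.
subword_probabilities = [0.25, 0.5]

summed = word_surprisal(subword_probabilities)
# Summing surprisals is equivalent to taking the negative log of the joint probability.
joint = -math.log2(math.prod(subword_probabilities))
```

Here the two tokens contribute 2 bits and 1 bit respectively, so the word receives a total surprisal of 3 bits, the same value obtained from the joint probability of 0.125.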
All graphs and statistical analyses were created and run in R (R Core Team, 2020) using Rstudio (RStudio Team, 2020) and the tidyverse (Wickham et al., 2019), lme4 (Bates et al., 2015), and lmerTest (Kuznetsova et al., 2017) packages. All reported p-values are corrected for multiple comparisons based on false discovery rate across all statistical tests carried out (Benjamini and Hochberg, 1995). Because of this correction procedure, if any models display related anomaly effects, this is evidence that prediction alone can account for them.
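For readers unfamiliar with the Benjamini and Hochberg (1995) procedure, the following is a minimal sketch of how false-discovery-rate-adjusted p-values can be computed. It is written in Python for illustration only; the analyses reported here were run in R, and the input p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # taking a running minimum so adjusted values stay monotone in rank.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical uncorrected p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as the corresponding raw p-value, so an effect that remains significant after correction has survived a stricter test than the uncorrected one.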
All of the code for running the experiments and carrying out the statistical analyses is provided at https://github.com/jmichaelov/collateral-facilitation.

Experiment 1: Ito et al. (2016)

Introduction
We begin with Ito et al. (2016), who investigated whether relatedness to the highest-cloze continuation of a given sentence impacts the amplitude of the N400 response. They presented human participants with experimental stimuli that included a word that was either the highest-cloze continuation of a sentence, semantically related to that highest-cloze continuation, similar to the highest-cloze continuation in terms of form (e.g. hook and book), or unrelated. For the purposes of the present study, we are interested in semantic relatedness and thus do not consider the formal relatedness condition. We therefore look at the stimuli from the three experimental conditions exemplified in (3), which shows the Predictable, Related, and Unrelated continuations for one sentence frame.
(3) Lydia cannot eat anymore as she is so ______ now.
• full (Predictable)
• half (Related)
• mild (Unrelated)
Ito et al. (2016) find that related continuations elicit a smaller N400 response than unrelated continuations. As stated, this finding was successfully modeled using the surprisal of two RNN language models by Michaelov and Bergen (2020).
In the present study, we aim to investigate whether this can be replicated with contemporary transformer language models. Thus far, only one study (Merkx and Frank, 2021) has directly compared the N400 prediction capabilities of RNNs and transformers while matching number of parameters, training data, and language modeling performance, finding that transformers are better predictors of N400 amplitude overall. We might therefore expect that the transformers used in the present study should model the related anomaly effect found by Ito et al. (2016) at least as well as the RNNs used by Michaelov and Bergen (2020). However, a key feature of Merkx and Frank's (2021) study is that it uses naturalistic stimuli. This makes the experiment more ecologically valid, but as has been pointed out (Michaelov and Bergen, 2020; Brothers and Kuperberg, 2021), this means that we cannot tell whether the higher correlation between surprisal and N400 amplitude is due to any factors that we are interested in investigating: Merkx and Frank (2021) do not consider how relatedness to a previously-mentioned event or to the most predictable continuation impacts surprisal and the N400. For this reason, it is in fact far from clear that we should expect this specific related anomaly effect to be modeled as well by transformers as by RNNs. However, if it is, this would demonstrate the effect in two different language model architectures, further strengthening the idea that a predictive system alone can explain related anomaly effects.
Thus, in the present study, we investigate whether the results of Michaelov and Bergen (2020) replicate beyond the two RNNs tested, and crucially, whether the results replicate with transformer language models. Specifically, we test whether the surprisal elicited by implausible stimuli related to the highest-cloze continuation is lower than the surprisal elicited by implausible stimuli unrelated to the highest-cloze continuation.

Results
The results of the experiment are shown in Figure 1. As can be seen, numerically, related words elicit lower surprisals than unrelated words, indicating that they were more highly predicted by the language models. This in turn suggests that these models do in fact collaterally predict the related continuations.
In order to test this more directly, we ran statistical analyses of the surprisals elicited by the language models. This was done by constructing linear mixed-effects regressions for each language model surprisal with experimental condition as a main effect, and the maximal random effects structure that would successfully converge for all models (see Barr et al., 2013). For all regressions except for that predicting RoBERTa surprisal, this random effects structure was a random intercept of sentence frame and of critical word. For the RoBERTa surprisal regression, the latter random intercept was removed due to it causing a singular fit. As creating null models with only the random effects structure resulted in singular fits for multiple regressions, we were unable to run likelihood ratio tests to test whether experimental condition-that is, whether the word was semantically related or unrelated to the highest-cloze continuation-was a significant predictor of surprisal. For this reason, we instead tested whether experimental condition was a significant predictor of surprisal by running a Type III ANOVA using Satterthwaite's method for estimating degrees of freedom (Kuznetsova et al., 2017) on the aforementioned linear mixed-effects models that included experimental condition as a fixed effect.
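For concreteness, a regression with both random intercepts can be written as below. This is a sketch in notation introduced here for illustration (the symbols are not taken from the analysis code), with i indexing sentence frames and j indexing critical words.

```latex
S_{ij} = \beta_0 + \beta_1 \, \mathrm{condition}_{ij} + u_i + w_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
w_j \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here S_{ij} is the surprisal of critical word j in sentence frame i, \beta_1 is the fixed effect of experimental condition whose significance is tested, and u_i and w_j are the random intercepts of sentence frame and critical word. Dropping the w_j term gives the reduced structure described above for the RoBERTa regression.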
The results of the tests are shown in Table 1. As can be seen, condition is a significant predictor of the surprisal from every language model, confirming that language models predict related stimuli to be more likely than unrelated stimuli. The results of this experiment demonstrate that all the language models tested (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM) display the related anomaly effect in response to the Ito et al. (2016) stimuli. All eight models predict implausible continuations that are related to the most probable continuations to be more likely than those that are unrelated.

Experiment 2: DeLong et al. (2019)

Introduction
DeLong et al. (2019) also investigated the difference between the N400 amplitude elicited by implausible words that are related or unrelated to the most predictable (highest-cloze) continuation. As in Ito et al. (2016), these stimuli were chosen such that both related and unrelated words were highly implausible; in this case, 'unpredictable words were strategically chosen not to make sense in their given contexts' (DeLong et al., 2019, p. 4). These stimuli are exemplified by the set shown in (4).
(4) The commuter drove to work in her ______ after breakfast.

Results
As in Experiment 1, we ran the stimuli from the original experiment through the 8 language models and calculated the surprisal of each critical word. The results of the experiment are shown in Figure 2.
In all models except BERT, related stimuli elicit numerically lower surprisals than unrelated stimuli, indicating that they were more highly predicted by the language models.
We again ran the same statistical test as in Experiment 1, testing whether experimental condition (related or unrelated to the highest-cloze continuation) is a significant predictor of the surprisal elicited by the stimuli in each language model. The ALBERT, XLM-R, GPT-2, GPT-Neo, and GPT-J regressions had random intercepts of sentence frame and critical word, while the BERT, RoBERTa, and XGLM regressions had only random intercepts for sentence frame. The results of the Type III ANOVA are shown in Table 2. Condition is a significant predictor of the surprisal of every model except BERT; in these models, related stimuli are predicted to be more likely continuations of the sentence than unrelated stimuli. Thus, with the exception of BERT, we replicate the findings of Experiment 1.

Experiment 3: Metusalem et al. (2012)

Introduction
Metusalem et al. (2012) investigated the extent to which relatedness to the event described in the preceding context impacts the amplitude of the N400 response. Metusalem et al. (2012) presented human participants with experimental stimuli that included either the most probable (highest-cloze) continuation of a sentence, an implausible continuation that was related to the event described, or an implausible continuation that was unrelated to the event described. All of the implausible stimuli also had a cloze probability of zero. The stimuli are exemplified by the set for a single sentence frame shown in (5).
(5) We're lucky to live in a town with such a great art museum. Last week I went to see a special exhibit. I finally got in after waiting in a long ______.
• line (Predictable)
• painting (Related)
• toothbrush (Unrelated)
Metusalem et al. (2012) found that despite their implausibility and improbability (based on cloze), critical words related to the event described in the context preceding them elicited smaller N400 responses than words that were unrelated to the event, a clear example of a related anomaly effect.

Results
As in Experiments 1 and 2, we ran the stimuli from the original experiment through the 8 language models and calculated the surprisal of each critical word. The results of the experiment are shown in Figure 3. As in Experiment 1, numerically, in all models related stimuli elicit lower surprisals than unrelated stimuli, indicating that they were more highly predicted by the language models.
We again ran the same statistical analyses as in Experiments 1 and 2, constructing linear mixed-effects regression models, all of which had random intercepts of sentence frame and critical word. Using a Type III ANOVA, we tested whether experimental condition (related or unrelated to the event described in the preceding context) is a significant predictor of surprisal. The results are shown in Table 3. As can be seen, experimental condition was a significant predictor of the surprisal of all models.

Summary of Results
In all but one specific case (BERT in Experiment 2), experimental condition significantly predicted language model surprisal in the same direction as human N400 responses. The results of Experiments 1 and 2 therefore demonstrate convincingly that, like humans, language models do tend to predict that anomalous words related to the most probable continuation are more probable than anomalous words that are not. The results of Experiment 3, analogously, demonstrate that like humans, language models tend to predict that anomalous words related to a relevant event described in the preceding context are more probable than anomalous words that are not. Thus, like the human language comprehension system, language models exhibit related anomaly effects.

Psycholinguistic implications
These results have clear implications for psycholinguistic research on the effects of related anomalies on human language processing. First, a predictive system can display the effects; in fact, there is only one set of stimuli for which not all models do. This demonstrates the sufficiency of a predictive system for preactivating related anomalous stimuli to a greater degree than unrelated anomalous stimuli. In other words, based on a parsimony criterion, there is no need to posit that related anomaly effects on human language processing require something beyond a predictive system, such as an associative system, either instead of or in addition to a predictive one.
Second, both kinds of related anomaly effect explored (the reduction in N400 amplitude correlated with relatedness to the most probable continuation and that correlated with relatedness to the event in the preceding context) are explainable by a single mechanism. This may seem counterintuitive, given how different the two effects appear. Yet this finding is consistent with the idea in the literature that the two effects can be considered different variants of the same phenomenon (DeLong et al., 2019; DeLong and Kutas, 2020).
Given that this study is based on computational modeling, we should note that the results do not constitute direct proof of a neurocognitive predictive system or of the lack of the involvement of an additional associative mechanism. However, they are consistent with such accounts, and open the door for future research, both computational and experimental. For example, it may be the case that other phenomena that have been argued to constitute evidence for a separate associative mechanism (see Federmeier, 2021, for review) may also be explainable on the basis of prediction. On the other hand, the approach we use here can also be used to design stimuli that do not differ in probability in order to further test whether prediction can explain all related anomaly effects.

Implications for NLP
The results of the present study demonstrate that related anomaly effects occur in contemporary transformer language models. Based on the present study, this does not appear to be impacted by whether the model is an autoregressive or masked language model, or by whether the model is monolingual or multilingual. In fact, the only model that does not show the effect every time is BERT, the least powerful model tested (all other models are either larger, trained on more data, or both). Thus, in line with previous research showing that higher-quality language models better predict human processing metrics (Merkx and Frank, 2021), the present results suggest that better language models are also more likely to display human-like patterns of prediction.
The results of this study also have several implications for understanding how the predictions of humans and language models relate. As has been previously discussed, some researchers have argued that we should evaluate the predictions of language models based on cloze probability (Ettinger, 2020). In fact, some have suggested training models on cloze probabilities (Eisape et al., 2020). However, the results of this study, along with others (Frank et al., 2015; Aurnhammer and Frank, 2019; Michaelov and Bergen, 2020; Merkx and Frank, 2021; Szewczyk and Federmeier, 2022; Michaelov et al., 2022), suggest that the predictions of language models are highly correlated with N400 amplitude, and recent work has argued that the activation states of transformers are highly correlated with activation in the brain during language comprehension more generally (Schrimpf et al., 2020). Thus, while it may be useful for certain tasks to have cloze-like predictions, it may be the case that we are generally more likely to get N400-like predictions from language models.
If so, this is a cause for both optimism and pessimism. Given that humans are the gold standard in natural language tasks generally, if a language model can make predictions that closely match those that humans make as part of language comprehension, this may also suggest that the representations learned are at least in some ways functionally similar to those that humans use to generate the same predictions. On the other hand, by the same token, it may suggest a limit to the possibilities of language modeling alone, since there is much more to language comprehension than the kinds of prediction that underlie the N400 response (see, e.g., Ferreira and Yang, 2019; DeLong and Kutas, 2020; Kuperberg et al., 2020).

Conclusion
In order to better understand related anomaly effects in humans, we investigated whether contemporary transformer language models display them. We found that in all but one case, they do, suggesting that related anomaly effects in both humans and language models may be driven by prediction alone.