Surprisal Estimators for Human Reading Times Need Character Models

While the use of character models has been popular in NLP applications, it has not been explored much in the context of psycholinguistic modeling. This paper presents a character model that can be applied to a structural parser-based processing model to calculate word generation probabilities. Experimental results show that surprisal estimates from a structural processing model using this character model deliver substantially better fits to self-paced reading, eye-tracking, and fMRI data than those from large-scale language models trained on much more data. This suggests that the proposed processing model may provide a more humanlike account of sentence processing, one that assigns a larger role to morphology, phonotactics, and orthographic complexity than was previously thought.


Introduction and Related Work
Expectation-based theories of sentence processing (Hale, 2001; Levy, 2008) posit that processing difficulty is determined by predictability in context. In support of this position, predictability quantified through surprisal has been shown to correlate with behavioral measures of word processing difficulty (Goodkind and Bicknell, 2018; Hale, 2001; Levy, 2008; Shain, 2019; Smith and Levy, 2013). However, surprisal itself makes no representational assumptions about sentence processing, leaving open the question of how best to estimate its underlying probability model.
In natural language processing (NLP) applications, the use of character models has been popular for several years (Al-Rfou et al., 2019; Kim et al., 2016; Lee et al., 2017). Character models have been shown not only to alleviate problems with out-of-vocabulary words but also to embody morphological information available at the subword level. For this reason, they have been extensively used to model morphological processes (Elsner et al., 2019; Kann and Schütze, 2016) or to incorporate morphological information into models of syntactic acquisition. Nonetheless, the use of character models has been slow to catch on in psycholinguistic surprisal estimation, which has recently focused on evaluating large-scale language models that make predictions at the word level (e.g. Futrell et al. 2019; Goodkind and Bicknell 2018; Hale et al. 2018; Hao et al. 2020). This raises the question of whether incorporating character-level information into an incremental processing model will result in surprisal estimates that better characterize predictability in context.
To answer this question, this paper presents a character model that can be used to estimate word generation probabilities in a structural parser-based processing model. The proposed model defines a process of generating a word from an underlying lemma and a morphological rule, which allows the processing model to capture the predictability of a given word form in a fine-grained manner. Regression analyses on self-paced reading, eye-tracking, and fMRI data demonstrate that surprisal estimates calculated from this character-based structural processing model contribute to substantially better fits compared to those calculated from large-scale language models, despite the fact that these other models are trained on much more data and show lower perplexities on test data. This finding deviates from the monotonic relationship between test perplexity and predictive power observed in previous studies (Goodkind and Bicknell, 2018; Wilcox et al., 2020). Furthermore, it suggests that the character-based structural processing model may provide a more humanlike account of processing difficulty, and points to a larger role of morphology, phonotactics, and orthographic complexity than was previously thought.

Background
The experiments presented in this paper use surprisal predictors (Shannon, 1948) calculated by an incremental processing model based on a left-corner parser (Johnson-Laird, 1983; van Schijndel et al., 2013). This incremental processing model provides a probabilistic account of sentence processing by making a single lexical attachment decision and a single grammatical attachment decision for each input word.
Surprisal. Surprisal can be defined as the negative log ratio of prefix probabilities of word sequences $w_{1..t}$ at consecutive time steps $t-1$ and $t$:
$$\mathrm{S}(w_t) = -\log \frac{\mathsf{P}(w_{1..t})}{\mathsf{P}(w_{1..t-1})} \quad (1)$$
These prefix probabilities can be calculated by marginalizing the forward probabilities of an incremental processing model over its hidden states $q_t$:
$$\mathsf{P}(w_{1..t}) = \sum_{q_t} \mathsf{P}(w_{1..t}\, q_t) \quad (2)$$
These forward probabilities are in turn defined recursively using a transition model:
$$\mathsf{P}(w_{1..t}\, q_t) = \sum_{q_{t-1}} \mathsf{P}(w_t\, q_t \mid q_{t-1}) \cdot \mathsf{P}(w_{1..t-1}\, q_{t-1}) \quad (3)$$
Left-corner parsing. The transition model presented in this paper is based on a probabilistic left-corner parser (Johnson-Laird, 1983; van Schijndel et al., 2013). Left-corner parsers have been used to model human sentence processing because they define a fixed number of decisions at every time step and also require only a bounded amount of working memory, in keeping with experimental observations of human memory limits (Miller and Isard, 1963). The transition model maintains a distribution over possible working memory store states $q_t$ at every time step $t$, each of which consists of a bounded number $D$ of nested derivation fragments $a^d_t/b^d_t$. Each derivation fragment spans a part of a derivation tree from some apex node $a^d_t$ lacking a base node $b^d_t$ yet to come. Previous work has shown that large annotated corpora such as the Penn Treebank (Marcus et al., 1993) do not require more than $D = 4$ such fragments (Schuler et al., 2010).
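To make the relationship between forward probabilities, prefix probabilities, and surprisal concrete, the following minimal Python sketch computes surprisal by summing forward probabilities over a set of hypothesized parser states at consecutive time steps. It is illustrative only (the beam values and the choice of log base are not from the paper):

```python
import math

def surprisal(forward_prev, forward_curr):
    """Surprisal of word w_t: marginalize forward probabilities P(w_{1..t}, q_t)
    over hypothesized parser states q at steps t-1 and t, then take -log of the ratio."""
    prefix_prev = sum(forward_prev.values())  # P(w_{1..t-1})
    prefix_curr = sum(forward_curr.values())  # P(w_{1..t})
    return -math.log2(prefix_curr / prefix_prev)

# Toy example: two beam states before the word, three after.
forward_prev = {"q1": 0.006, "q2": 0.002}
forward_curr = {"q1'": 0.0009, "q2'": 0.0004, "q3'": 0.0001}
print(surprisal(forward_prev, forward_curr))  # about 2.51 bits
```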
At each time step, a left-corner parsing model generates a new word $w_t$ and a new store state $q_t$ in two phases (see Figure 1). First, it makes a lexical decision $\ell_t$ regarding whether to use the word to complete the most recent derivation fragment (match), or to use the word to create a new preterminal node $a^\ell_t$ (no-match). Subsequently, the model makes a grammatical decision $g_t$ regarding whether to use a predicted grammar rule to combine the node constructed in the lexical phase $a^\ell_t$ with the next most recent derivation fragment (match), or to use the grammar rule to convert this node into a new derivation fragment $a^g_t/b^g_t$ (no-match):
$$\mathsf{P}(w_t\, q_t \mid q_{t-1}) = \mathsf{P}(\ell_t \mid q_{t-1}) \cdot \mathsf{P}(w_t \mid q_{t-1}\, \ell_t) \cdot \mathsf{P}(g_t \mid q_{t-1}\, \ell_t\, w_t) \cdot \mathsf{P}(q_t \mid q_{t-1}\, \ell_t\, w_t\, g_t) \quad (4)$$
Thus, the parser creates a hierarchically organized sequence of derivation fragments and joins these fragments up whenever expectations are satisfied.
In order to update the store state based on the lexical and grammatical decisions, derivation fragments above the most recent nonterminal node are carried forward, and derivation fragments below it are set to null ($\bot$):
$$\mathsf{P}(q_t \mid q_{t-1}\, \ell_t\, w_t\, g_t) = \prod_{d < d'} [\![\, a^d_t/b^d_t = a^d_{t-1}/b^d_{t-1} \,]\!] \cdot [\![\, a^{d'}_t/b^{d'}_t = a^g_t/b^g_t \,]\!] \cdot \prod_{d > d'} [\![\, a^d_t/b^d_t = \bot/\bot \,]\!]$$
where the indicator function $[\![\phi]\!] = 1$ if $\phi$ is true and $0$ otherwise, and $d' = \operatorname{argmax}_d \{ a^d_{t-1} \neq \bot \} + 1 - m^\ell_t - m^g_t$. Together, these probabilistic decisions generate the $n$ unary branches and $n-1$ binary branches of a parse tree in Chomsky normal form for an $n$-word sentence.
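As a rough illustration of this store-state update, the sketch below represents the store as a fixed-depth list of apex/base fragment pairs and applies the carry-forward/replace/null-out pattern described above. The representation and names are illustrative assumptions, not the authors' implementation:

```python
D = 4  # maximum store depth

def update_store(prev_store, new_fragment, m_lex, m_gram):
    """Carry forward fragments above the affected depth, place the newly formed
    fragment there, and null out everything below.
    prev_store: list of D (apex, base) pairs, or None for empty slots."""
    depth_prev = max((i for i, frag in enumerate(prev_store) if frag is not None),
                     default=-1) + 1          # number of non-null fragments
    d_new = depth_prev + 1 - m_lex - m_gram   # target depth after this word
    store = [None] * D
    for i in range(min(d_new - 1, D)):
        store[i] = prev_store[i]              # fragments above are carried forward
    if 0 < d_new <= D:
        store[d_new - 1] = new_fragment       # new a^g_t / b^g_t fragment
    return store                              # fragments below stay null

# Toy example: two fragments on the store, lexical no-match, grammatical match.
prev = [("S", "VP"), ("NP", "NN"), None, None]
print(update_store(prev, ("NP", "NN'"), m_lex=0, m_gram=1))
```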

Processing Model
Figure 1: a) lexical decision $\ell_t$; b) grammatical decision $g_t$.

The processing model extends the above left-corner parser to maintain lemmatized predicate information by augmenting each preterminal, apex, and base node to consist not only of a syntactic category label $c_{p_t}$, $c_{a^d_t}$, or $c_{b^d_t}$, but also of a binary vector $h_{p_t}$, $h_{a^d_t}$, or $h_{b^d_t}$ of predicate contexts that are consistent with that category. (The valence of a category is the number of unsatisfied syntactic arguments it has; separate vectors for syntactic arguments are needed in order to correctly model cases such as passives, where syntactic arguments do not align with predicate arguments.) Each 0 or 1 element of this vector represents a unique predicate context, which consists of a ⟨predicate, role⟩ pair that specifies the content constraints of a node in a predicate-argument structure. These predicate contexts are obtained by reannotating the training corpus using a generalized categorial grammar of English (Nguyen et al., 2012), which is sensitive to syntactic valence and non-local dependencies. (The predicates in this annotation scheme come from words that have been lemmatized by a set of manually written and corrected rules that account for common irregular inflections.)

Lexical decisions. Each lexical decision of the parser includes a match decision $m^\ell_t$ and decisions about a syntactic category $c_{p_t}$ and a predicate context vector $h_{p_t}$ that together specify a preterminal node $p_t$. The probability of generating the match decision and the predicate context vector depends on the base node $b^d_{t-1}$ of the previous derivation fragment (i.e. its syntactic category and predicate context vector). The first term of Equation 4 can therefore be decomposed into the following:
$$\mathsf{P}(\ell_t \mid q_{t-1}) = \mathsf{P}(m^\ell_t\, c_{p_t}\, h_{p_t} \mid q_{t-1}) \approx \mathrm{FF}\!\left( \mathbf{E}_{\mathrm{L}}\, [\, \delta_{c_{b^{d}_{t-1}}} ;\ h_{b^{d}_{t-1}} \,] \right)_{[\, m^\ell_t,\ c_{p_t},\ h_{p_t} \,]}$$
where FF is a feedforward neural network, and $\delta_i$ is a Kronecker delta vector consisting of a one at element $i$ and zeros elsewhere. Depth $d = \operatorname{argmax}_d \{ a^d_{t-1} \neq \bot \}$ is the number of non-null derivation fragments at the previous time step, and $\mathbf{E}_{\mathrm{L}}$ is a matrix of jointly trained dense embeddings for each syntactic category and predicate context. The syntactic category and predicate context vector together define a complete preterminal node $p_t \stackrel{\text{def}}{=} \langle c_{p_t}, h_{p_t} \rangle$ for use in the word generation model, and a new apex node $a^\ell_t$ for use in the grammatical decision model.

Grammatical decisions. Each grammatical decision includes a match decision $m^g_t$ and decisions about a pair of syntactic category labels $c_{a^g_t}$ and $c_{b^g_t}$, as well as a predicate context composition operator $o^g_t$, which governs how the newly generated predicate context vector $h_{p_t}$ is propagated through its new derivation fragment $a^g_t/b^g_t$. The probability of generating the match decision and the composition operators depends on the base node $b^{d-m^\ell_t}_{t-1}$ of the previous derivation fragment and the apex node $a^\ell_t$ from the current lexical decision (i.e. their syntactic categories and predicate context vectors). The third term of Equation 4 can accordingly be decomposed into the following:
$$\mathsf{P}(g_t \mid q_{t-1}\, \ell_t\, w_t) \approx \mathrm{FF}\!\left( \mathbf{E}_{\mathrm{G}}\, [\, \delta_{c_{b^{d-m^\ell_t}_{t-1}}} ;\ h_{b^{d-m^\ell_t}_{t-1}} ;\ \delta_{c_{a^\ell_t}} ;\ h_{a^\ell_t} \,] \right)_{[\, m^g_t,\ c_{a^g_t},\ c_{b^g_t},\ o^g_t \,]}$$
where $\mathbf{E}_{\mathrm{G}}$ is a matrix of jointly trained dense embeddings for each syntactic category and predicate context. The composition operators are associated with sparse composition matrices $\mathbf{A}_{o^g_t}$, which can be used to compose predicate context vectors associated with the apex node $a^g_t$:
$$h_{a^g_t} = \mathbf{A}_{o^g_t}\, h_{p_t}$$
and sparse composition matrices $\mathbf{B}_{o^g_t}$, which can be used to compose predicate context vectors associated with the base node $b^g_t$:
$$h_{b^g_t} = \mathbf{B}_{o^g_t}\, h_{p_t}$$
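The lexical and grammatical decision models above are, in essence, classifiers over embedded representations of the relevant derivation-fragment nodes, and the composition operators are sparse linear maps over predicate context vectors. The numpy sketch below shows the general shape of such a scorer and of composing a predicate context vector with a sparse operator matrix; layer sizes, names, random initialization, and the exact input featurization are assumptions for illustration, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cats, n_contexts, emb, hid, n_out = 50, 200, 32, 64, 120

# Jointly trained dense embeddings for syntactic categories and predicate contexts.
E_L = rng.normal(size=(emb, n_cats + n_contexts))
W1, W2 = rng.normal(size=(hid, emb)), rng.normal(size=(n_out, hid))

def lexical_decision_scores(cat_id, context_vec):
    """Embed the previous base node (category + predicate context vector)
    and score candidate lexical decisions with a small feedforward net."""
    delta = np.zeros(n_cats)
    delta[cat_id] = 1.0                              # Kronecker delta for the category
    x = E_L @ np.concatenate([delta, context_vec])   # dense embedding of the base node
    h = np.tanh(W1 @ x)
    logits = W2 @ h
    return np.exp(logits) / np.exp(logits).sum()     # softmax over candidate decisions

# Composition: a sparse 0/1 operator matrix maps the new predicate context
# vector onto the context vector of the new apex (or base) node.
A_op = (rng.random((n_contexts, n_contexts)) < 0.01).astype(float)
h_new = rng.random(n_contexts) < 0.05
h_apex = (A_op @ h_new > 0).astype(float)

print(lexical_decision_scores(3, rng.random(n_contexts) < 0.05).shape, h_apex.sum())
```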

Character-based Word Model
The baseline version of the word model $\mathsf{P}(w_t \mid q_{t-1}\, \ell_t)$ uses relative frequency estimation, with backoff probabilities for out-of-vocabulary words trained using hapax legomena. The character-based version of this model instead applies a morphological rule $r_t$ to a lemma $x_t$ to generate an inflected form $w_t$. The rules model affixation through string substitution and are inverses of the lemmatization rules used to derive predicates in the generalized categorial grammar annotation (Nguyen et al., 2012). For example, the rule %ay→%aid can apply to the word say to derive its past tense form said. Around 600 such rules account for inflection in Sections 02 to 21 of the Wall Street Journal corpus of the Penn Treebank (Marcus et al., 1993), including an identity rule for words in bare form and a 'no semantics' rule for generating certain function words.
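For intuition, these rules can be thought of as pattern-based string substitutions that run in both directions: applied in reverse, a rule proposes candidate lemmas for an observed word form. The sketch below illustrates this with a tiny invented rule inventory (only the %ay→%aid rule comes from the text; the rest, and the rule notation, are illustrative):

```python
# Each rule maps a lemma pattern to an inflected pattern, e.g. %ay -> %aid.
# Applied in reverse, it proposes candidate lemmas for an observed word form.
RULES = [
    ("%", "%"),          # identity rule for bare forms
    ("%ay", "%aid"),     # say -> said, pay -> paid
    ("%y", "%ied"),      # try -> tried
    ("%", "%s"),         # plural / 3rd person -s
]

def candidate_lemma_rule_pairs(word):
    """Return all (lemma, rule) pairs that deterministically generate `word`."""
    pairs = []
    for lemma_pat, infl_pat in RULES:
        suffix = infl_pat.lstrip("%")
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] if suffix else word
            lemma = stem + lemma_pat.lstrip("%")
            pairs.append((lemma, f"{lemma_pat}->{infl_pat}"))
    return pairs

print(candidate_lemma_rule_pairs("said"))
# [('said', '%->%'), ('say', '%ay->%aid')]
```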
For an observed input word $w_t$, the model first generates a list of $\langle x_t, r_t \rangle$ pairs that deterministically generate $w_t$. This allows the model to capture morphological regularity and estimate how expected a word form is given its predicted syntactic category and predicate context, which have been generated as part of the preceding lexical decision. In addition, this lets the model hypothesize the underlying morphological structure of out-of-vocabulary words and assign probabilities to them. The second term of Equation 4 can thus be decomposed into the following:
$$\mathsf{P}(w_t \mid q_{t-1}\, \ell_t) = \sum_{x_t,\, r_t} \mathsf{P}(x_t \mid q_{t-1}\, \ell_t) \cdot \mathsf{P}(r_t \mid q_{t-1}\, \ell_t\, x_t) \cdot \mathsf{P}(w_t \mid q_{t-1}\, \ell_t\, x_t\, r_t)$$
The probability of generating the lemma sequence depends on the syntactic category $c_{p_t}$ and predicate context $h_{p_t}$ resulting from the preceding lexical decision $\ell_t$:
$$\mathsf{P}(x_t \mid q_{t-1}\, \ell_t) = \prod_{i=2}^{I} \mathrm{SoftMax}\!\left( \mathbf{W}_{\mathrm{X}}\, \mathbf{x}_{t,i-1} + \mathbf{b}_{\mathrm{X}} \right)_{[x_{t,i}]}$$
where $x_{t,1}, x_{t,2}, \ldots, x_{t,I}$ is the character sequence of lemma $x_t$, with $x_{t,1} = s$ and $x_{t,I} = e$ as special start and end characters, and $\mathbf{W}_{\mathrm{X}}$ and $\mathbf{b}_{\mathrm{X}}$ are respectively a weight matrix and bias vector of a softmax classifier. A recurrent neural network (RNN) calculates a hidden state $\mathbf{x}_{t,i}$ for each character from an input vector at that time step and the hidden state after the previous character $\mathbf{x}_{t,i-1}$:
$$\mathbf{x}_{t,i} = \mathrm{RNN}\!\left( \mathbf{E}_{\mathrm{X}}\, [\, \delta_{c_{p_t}} ;\ h_{p_t} ;\ \delta_{x_{t,i}} \,],\ \mathbf{x}_{t,i-1} \right)$$
where $\mathbf{E}_{\mathrm{X}}$ is a matrix of jointly trained dense embeddings for each syntactic category, predicate context, and character. Subsequently, the probability of applying a particular morphological rule to the generated lemma depends on the syntactic category $c_{p_t}$ and predicate context $h_{p_t}$ from the preceding lexical decision as well as the character sequence of the lemma:
$$\mathsf{P}(r_t \mid q_{t-1}\, \ell_t\, x_t) = \mathrm{SoftMax}\!\left( \mathbf{W}_{\mathrm{R}}\, \mathbf{r}_{t,I} + \mathbf{b}_{\mathrm{R}} \right)_{[r_t]}$$
where $\mathbf{W}_{\mathrm{R}}$ and $\mathbf{b}_{\mathrm{R}}$ are respectively a weight matrix and bias vector of a softmax classifier, and $\mathbf{r}_{t,I}$ is the last hidden state of an RNN that takes as input the syntactic category, predicate context, and character sequence of the lemma $x_{t,2}, x_{t,3}, \ldots, x_{t,I-1}$ without the special start and end characters:
$$\mathbf{r}_{t,i} = \mathrm{RNN}\!\left( \mathbf{E}_{\mathrm{R}}\, [\, \delta_{c_{p_t}} ;\ h_{p_t} ;\ \delta_{x_{t,i}} \,],\ \mathbf{r}_{t,i-1} \right)$$
where $\mathbf{E}_{\mathrm{R}}$ is a matrix of jointly trained dense embeddings for each syntactic category, predicate context, and character. Finally, as the model calculates probabilities only for $\langle x_t, r_t \rangle$ pairs that deterministically generate $w_t$, the word probability conditioned on these variables, $\mathsf{P}(w_t \mid q_{t-1}\, \ell_t\, x_t\, r_t)$, is deterministic.
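A minimal PyTorch sketch of the lemma generator conveys the conditioning structure described above; the layer sizes, the use of a GRU, and the way the category and predicate context are injected at every character step are illustrative assumptions, not the authors' implementation. A parallel classifier over rules and the marginalization over candidate ⟨lemma, rule⟩ analyses would complete the word model:

```python
import torch
import torch.nn as nn

class CharLemmaModel(nn.Module):
    """Character-level lemma generator conditioned on the predicted syntactic
    category and predicate context (a sketch under assumed sizes and wiring)."""
    def __init__(self, n_chars, n_cats, ctx_dim, emb=32, hid=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)
        self.cond = nn.Linear(n_cats + ctx_dim, emb)
        self.rnn = nn.GRU(2 * emb, hid, batch_first=True)
        self.out = nn.Linear(hid, n_chars)

    def log_prob(self, char_ids, cat_onehot, ctx_vec):
        """Log P(x_{t,2..I} | c_{p_t}, h_{p_t}): predict each character from the
        conditioned hidden state after the previous character."""
        cond = self.cond(torch.cat([cat_onehot, ctx_vec]))           # (emb,)
        chars = self.char_emb(char_ids)                               # (I, emb)
        inputs = torch.cat([chars, cond.expand(len(char_ids), -1)], dim=-1)
        hidden, _ = self.rnn(inputs.unsqueeze(0))                     # (1, I, hid)
        logits = self.out(hidden[0, :-1])                             # predict next char
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(1, char_ids[1:].unsqueeze(1)).sum()

# Hypothetical usage: character ids for <s> s a y </s> with invented indices.
model = CharLemmaModel(n_chars=30, n_cats=10, ctx_dim=20)
chars = torch.tensor([1, 5, 7, 2])
print(model.log_prob(chars, torch.eye(10)[3], torch.zeros(20)))

# The word probability then marginalizes over candidate analyses:
# P(w_t | ...) = sum over (x_t, r_t) generating w_t of P(x_t | ...) * P(r_t | ..., x_t)
```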

Experiment 1: Effect of Character Model
In order to assess the influence of the character-based word generation model over the baseline word generation model on the predictive quality of surprisal estimates, linear mixed-effects models containing common baseline predictors and one or more surprisal predictors were fitted to self-paced reading times. Subsequently, a series of likelihood ratio tests were conducted in order to evaluate the relative contribution of each surprisal predictor to regression model fit.

Response Data
The first experiment described in this paper used the Natural Stories Corpus (Futrell et al., 2018), which contains self-paced reading times from 181 subjects who read 10 naturalistic stories consisting of 10,245 tokens. The data were filtered to exclude observations corresponding to sentence-initial and sentence-final words, observations from subjects who answered fewer than four comprehension questions correctly, and observations with durations shorter than 100 ms or longer than 3000 ms. This resulted in a total of 768,584 observations, which were subsequently partitioned into an exploratory set of 383,906 observations and a held-out set of 384,678 observations. The partitioning allows model selection (e.g. making decisions about predictors and random effects structure) to be conducted on the exploratory set and a single hypothesis test to be conducted on the held-out set, thus eliminating the need for multiple trials correction. All observations were log-transformed prior to model fitting.
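The exclusion and partitioning logic might look like the following sketch, assuming a hypothetical long-format data frame with one row per word-by-subject observation; all column names, the per-subject accuracy column, and the even/odd split are invented for illustration and are not the paper's exact procedure:

```python
import numpy as np
import pandas as pd

def filter_and_partition(df):
    """Apply the exclusion criteria and split into exploratory/held-out halves."""
    good_subjects = df.groupby("subject")["n_correct"].first() >= 4
    df = df[df["subject"].map(good_subjects)]
    df = df[~df["is_sentence_initial"] & ~df["is_sentence_final"]]
    df = df[(df["rt_ms"] >= 100) & (df["rt_ms"] <= 3000)].copy()
    df["log_rt"] = np.log(df["rt_ms"])              # log-transform the response
    split = np.arange(len(df)) % 2 == 0             # illustrative deterministic split
    return df[split], df[~split]                    # exploratory, held-out
```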

Predictors
The baseline predictors commonly included in all regression models are word length measured in characters and index of word position within each sentence. In addition to the baseline predictors, surprisal predictors were calculated from two variants of the processing model in which word generation probabilities $\mathsf{P}(w_t \mid q_{t-1}\, \ell_t)$ are calculated using relative frequency estimation (FreqWSurp) and using the character-based model described in Section 3.2 (CharWSurp). Both variants of the processing model were trained on a generalized categorial grammar (Nguyen et al., 2012) reannotation of Sections 02 to 21 of the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1993). Beam search decoding with a beam size of 5,000 was used to estimate prefix probabilities and surprisal predictors for both variants.
To account for the time the brain takes to process and respond to linguistic input, it is standard practice in psycholinguistic modeling to include 'spillover' variants of predictors from preceding words (Rayner et al., 1983; Vasishth, 2006). However, as including multiple spillover variants of predictors leads to identifiability issues in mixed-effects modeling, CharWSurp and FreqWSurp were both spilled over by one position. All predictors were centered and scaled prior to model fitting, and all regression models included by-subject random slopes for all fixed effects as well as random intercepts for each word and subject-sentence interaction, following the convention of keeping the random effects structure maximal in psycholinguistic modeling (Barr et al., 2013).
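The spillover and standardization steps might look like the following sketch on the same hypothetical data frame; column names are invented, and rows are assumed to be sorted in presentation order within each subject:

```python
import pandas as pd

def add_spillover_and_scale(df: pd.DataFrame, predictors) -> pd.DataFrame:
    """Shift each predictor to the following word within a subject's reading
    sequence (spillover-1), then center and scale all continuous predictors."""
    df = df.copy()
    spilled = []
    for col in predictors:
        df[f"{col}_s1"] = df.groupby("subject")[col].shift(1)
        spilled.append(f"{col}_s1")
    df = df.dropna(subset=spilled)
    for col in spilled + ["word_length", "word_position"]:
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df
```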

Likelihood Ratio Testing
A total of three linear mixed-effects models were fitted to reading times in the held-out set using lme4 (Bates et al., 2015); the full model included the fixed effects of both CharWSurp and FreqWSurp, and the two ablated models included the fixed effect of either CharWSurp or FreqWSurp. This resulted in two pairs of nested models whose fit could be compared through a likelihood ratio test (LRT). The first LRT tested the contribution of CharWSurp by comparing the fit of the full regression model to that of the regression model without the fixed effect of CharWSurp. Similarly, the second LRT tested the contribution of FreqWSurp by comparing the fit of the full regression model to that of the regression model without its fixed effect.
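The regression models themselves are fitted with lme4 in R; the likelihood ratio test reduces to a chi-squared test on twice the difference in log-likelihoods between nested models, as in the generic Python sketch below (the log-likelihood values are hypothetical, and the degrees of freedom equal the number of ablated fixed effects):

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_full, loglik_ablated, df_diff=1):
    """Compare a full mixed-effects model against a nested ablated model."""
    statistic = 2.0 * (loglik_full - loglik_ablated)
    p_value = chi2.sf(statistic, df_diff)
    return statistic, p_value

# Hypothetical log-likelihoods for the full model and the model without CharWSurp.
print(likelihood_ratio_test(-12345.6, -12360.2))  # (29.2, ~6.5e-08)
```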

Results
The results in Table 1 show that the contribution of CharWSurp in predicting reading times is statistically significant over and above that of FreqWSurp (p < 0.0001), while the converse is not significant (p = 0.8779). This demonstrates that incorporating a character-based word generation model into the structural processing model better captures predictability in context, subsuming the effects of the processing model without it.

Experiment 2: Comparison to Other Models
To further examine the impact of the character-based word generation model, CharWSurp and FreqWSurp were evaluated against surprisal predictors calculated from a number of other large-scale pretrained language models and smaller parser-based models. To compare the predictive power of surprisal estimates from different language models on an equal footing, we calculated the increase in log-likelihood (∆LL) over a baseline regression model as a result of including a surprisal predictor, following recent work (Goodkind and Bicknell, 2018; Hao et al., 2020).

Surprisal Estimates from Other Models
Surprisal estimates at each word were additionally calculated from several pretrained language models and smaller parser-based models, including the following:
• GLSTMSurp (Gulordava et al., 2018): A two-layer LSTM model trained on ∼80M tokens of English Wikipedia.
• vSLCSurp (van Schijndel et al., 2013): A left-corner parser based on a PCFG with subcategorized syntactic categories (Petrov et al., 2006), trained on a generalized categorial grammar reannotation of Sections 02 to 21 of the WSJ corpus.

Procedures
The set of self-paced reading times from the Natural Stories Corpus, after applying the same data exclusion criteria as Experiment 1, provided the response variable for the regression models. In addition to the full dataset, regression models were also fitted to a 'no out-of-vocabulary (No-OOV)' version of the dataset, in which observations corresponding to out-of-vocabulary words for the LSTM language model with the smallest vocabulary (i.e. Gulordava et al., 2018) were also excluded. This criterion was included to avoid unfairly disadvantaging the LSTM language models, whose surprisal estimates for out-of-vocabulary words may be unreliable. This resulted in a total of 744,607 observations in the No-OOV dataset, which were subsequently partitioned into an exploratory set of 371,937 observations and a held-out set of 372,670 observations. All models were fitted to the held-out set, and all observations were log-transformed prior to model fitting.
The predictors included in the baseline linear mixed-effects model were word length, word position in sentence, and unigram surprisal. Unigram surprisal was calculated using the KenLM toolkit (Heafield et al., 2013) with parameters trained on the Gigaword 4 corpus (Parker et al., 2009). In order to calculate the increase in log-likelihood (∆LL) attributable to each surprisal predictor, a 'full' linear mixed-effects model, which includes one surprisal predictor on top of the baseline model, was fitted for each surprisal predictor. As with Experiment 1, the surprisal predictors were spilled over by one position. All predictors were centered and scaled prior to model fitting, and all regression models included by-subject random slopes for all fixed effects and random intercepts for each word and subject-sentence interaction.
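The ∆LL comparison itself is simple: fit the baseline model once, fit one full model per surprisal predictor, and take the difference in log-likelihoods. The sketch below assumes a hypothetical fitting function (`fit_model`) that wraps whatever mixed-effects implementation is used and exposes a `loglik` attribute:

```python
def delta_ll_by_predictor(fit_model, data, baseline_formula, surprisal_predictors):
    """Return {predictor: LL(baseline + predictor) - LL(baseline)}."""
    ll_baseline = fit_model(baseline_formula, data).loglik
    return {
        pred: fit_model(f"{baseline_formula} + {pred}", data).loglik - ll_baseline
        for pred in surprisal_predictors
    }
```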
Additionally, in order to examine whether any of the models fail to generalize across domains, their perplexity on the entire Natural Stories Corpus was also calculated.

Results
The results show that surprisal from the character-based structural model (CharWSurp) made the biggest contribution to model fit compared to surprisal from other models on both the full and No-OOV sets of self-paced reading times (Figure 2; the difference between the model with CharWSurp and the other models is significant with p < 0.001 by a paired permutation test using by-item errors). The exclusion of OOV words did not make a notable difference in the overall trend of ∆LL across models. This finding, despite the fact that the pretrained language models were trained on much larger datasets and also show lower perplexities on test data, suggests that this model may provide a more humanlike account of processing difficulty. In other words, accurately predicting the next word alone does not fully explain humanlike processing costs that manifest in self-paced reading times. An analysis of residuals grouped by the lowest base category of the previous time step ($c_{b^d_{t-1}}$) from manual annotations (Shain et al., 2018) shows that the improvement of CharWSurp over GPT2Surp was broad-based across categories (see Figure 3).
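The significance test over by-item errors can be sketched as a paired permutation test on per-item error differences; this is an illustrative implementation, and the exact error metric and item grouping follow the description above only loosely:

```python
import numpy as np

def paired_permutation_test(errors_a, errors_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on per-item error differences:
    randomly flip the sign of each item's difference and compare mean magnitudes."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(errors_a) - np.asarray(errors_b)
    observed = abs(diffs.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_permutations, len(diffs)))
    permuted = np.abs((flips * diffs).mean(axis=1))
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)
```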

Experiment 3: Eye-tracking Data
In order to examine whether these results generalize to other latency-based measures, linear mixed-effects models were fitted on the Dundee eye-tracking corpus (Kennedy et al., 2003) to test the contribution of each surprisal predictor, following procedures similar to Experiment 2.

Procedures
The set of go-past durations from the Dundee Corpus (Kennedy et al., 2003) provided the response variable for the regression models. The Dundee Corpus contains gaze durations from 10 subjects who read 20 newspaper editorials consisting of 51,502 tokens. The data were filtered to exclude unfixated words, words following saccades longer than four words, and words at the starts and ends of sentences, screens, documents, and lines. This resulted in a full set with a total of 195,296 observations, which were subsequently partitioned into an exploratory set of 97,391 observations and a held-out set of 97,905 observations. As with Experiment 2, regression models were also fitted to a No-OOV version of the dataset, in which observations corresponding to out-of-vocabulary words for the Gulordava et al. (2018) model were also excluded. This resulted in a subset with a total of 184,894 observations (an exploratory set of 92,272 observations and a held-out set of 92,622 observations). All models were fitted to the held-out set, and all observations were log-transformed prior to model fitting.
The predictors included in the baseline linear mixed-effects models were word length, word position, and saccade length. In order to calculate the increase in log-likelihood from including each surprisal predictor, a full model including one surprisal predictor on top of the baseline model was fitted for each surprisal predictor. All surprisal predictors were spilled over by one position, and all predictors were centered and scaled prior to model fitting. All regression models included by-subject random slopes for all fixed effects and random intercepts for each word and sentence.

Results
The results in Figure 4 show that, as with Experiment 2, surprisal from the character-based structural model (CharWSurp) made the biggest contribution to model fit on both the full and No-OOV sets of go-past durations (the difference between the model with CharWSurp and the other models is significant with p < 0.001 by a paired permutation test using by-item errors). In contrast to Natural Stories, surprisal from the two left-corner parsing models (i.e. vSLCSurp and JLCSurp) did not contribute as much to model fit as surprisal from other models. The exclusion of OOV words again did not make a notable difference in the general trend across models, although it led to an increase in ∆LL for GLSTMSurp and RNNGSurp. Residuals grouped by the lowest base category from the previous time step show that, similarly to Natural Stories, the improvement of CharWSurp over GPT2Surp was broad-based across categories (see Figure 5). These results provide further support for the observation that language models trained to predict the next word accurately do not fully explain processing cost in the form of latency-based measures.

Experiment 4: fMRI Data
Finally, to examine whether a similar tendency is observed in brain responses, we analyzed the time series of blood oxygenation level-dependent (BOLD) signals in the language network, identified using functional magnetic resonance imaging (fMRI). To this end, the statistical framework of continuous-time deconvolutional regression (CDR; Shain and Schuler, 2019) was employed. As CDR allows data-driven estimation of continuous impulse response functions from variably spaced linguistic input, it is well suited to modeling fMRI responses, which are measured at fixed time intervals even though the word stimuli are not. As in the previous experiments, the increase in CDR model log-likelihood as a result of including a surprisal predictor on top of a baseline CDR model was calculated for evaluation.

Procedures
This experiment used the same fMRI data as previous work, collected from 78 subjects who listened to a recorded version of the Natural Stories Corpus. The functional regions of interest (fROIs) corresponding to the domain-specific language network were identified for each subject based on the results of a localizer task conducted in that work. This resulted in a total of 202,295 observations, which were subsequently partitioned into an exploratory set of 100,325 observations and a held-out set of 101,970 observations by assigning alternate 60-second intervals of the BOLD series to different partitions for each participant. All models were fitted to the BOLD signals in the held-out set.
The predictors included in the baseline CDR model were the index of the current fMRI sample within the current scan, unigram surprisal, and the deconvolutional intercept, which captures the influence of stimulus timing. Following previous work, the CDR models assumed a two-parameter HRF based on the double-gamma canonical HRF (Lindquist et al., 2009). Furthermore, the two parameters of the HRF were tied across predictors, modeling the assumption that the shape of the blood oxygenation response to neural activity is identical in a given region. However, to allow the HRFs to have differing amplitudes, a coefficient that rescales the HRF was estimated for each predictor. The models also included a by-fROI random effect for the amplitude coefficient and a by-subject random intercept.
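For intuition, a canonical double-gamma HRF with a per-predictor amplitude coefficient can be sketched as follows; the shape parameters are conventional defaults used only to illustrate the tied-shape, rescaled-amplitude assumption, and are not the fitted values from the CDR models:

```python
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1.0 / 6.0):
    """Canonical double-gamma HRF: a peak gamma minus a scaled undershoot gamma."""
    peak = gamma.pdf(t, a=peak_shape, scale=1.0)
    undershoot = gamma.pdf(t, a=undershoot_shape, scale=1.0)
    return peak - ratio * undershoot

def predicted_bold(event_times, event_values, sample_times, amplitude=1.0):
    """Convolve variably spaced word-level impulses (e.g. surprisal values)
    with a shared-shape HRF, rescaled by a per-predictor amplitude coefficient."""
    lags = sample_times[:, None] - np.asarray(event_times)[None, :]   # seconds
    response = np.where(lags > 0, double_gamma_hrf(np.clip(lags, 0, None)), 0.0)
    return amplitude * response @ np.asarray(event_values)

# Toy example: three word onsets with surprisal values, BOLD sampled every 2 s.
samples = np.arange(0.0, 20.0, 2.0)
print(predicted_bold([0.3, 0.9, 1.6], [5.2, 3.1, 7.8], samples).round(3))
```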
To calculate the increase in log-likelihood from including each predictor, a full CDR model including the fixed effect of one surprisal predictor was also fitted for each surprisal predictor. All surprisal predictors were included without spillover, since CDR estimates continuous HRFs from variably spaced linguistic input and therefore does not require spillover variants; all predictors were centered prior to model fitting.

Results
The results in Figure 6 show that surprisal from GPT-2 (GPT2Surp) made the biggest contribution to model fit in comparison to surprisal from other models (the difference between the model with GPT2Surp and the other models is significant with p < 0.001 by a paired permutation test using by-item errors). Most notably, in contrast to self-paced reading times and eye-gaze durations, CharWSurp did not contribute as much to model fit on fMRI data, with a ∆LL lower than those of the LSTM language models. This differential contribution of CharWSurp across datasets suggests that latency-based measures and blood oxygenation levels may capture different aspects of online processing difficulty.

Conclusion
This paper presents a character model that can be used to estimate word generation probabilities in a structural parser-based processing model. Experiments demonstrate that surprisal estimates calculated from this processing model generally contribute to substantially better fits to human response data than those calculated from large-scale pretrained language models or other incremental parsers. These results add a new nuance to the relationship between perplexity and predictive power reported in previous work (Goodkind and Bicknell, 2018; Wilcox et al., 2020). In addition, they suggest that structural parser-based processing models may provide a more humanlike account of sentence processing, one that assigns a larger role to morphology, phonotactics, and orthographic complexity than was previously thought.