Analyzing Wrap-Up Effects through an Information-Theoretic Lens

Numerous analyses of reading time (RT) data have been undertaken in the effort to learn more about the internal processes that occur during reading comprehension. However, data measured on words at the end of a sentence (or even a clause) is often omitted due to the confounding factors introduced by so-called “wrap-up effects,” which manifest as a skewed distribution of RTs for these words. Consequently, the understanding of the cognitive processes that might be involved in these effects is limited. In this work, we attempt to learn more about these processes by looking for the existence (or absence) of a link between wrap-up effects and information-theoretic quantities, such as word and context information content. We find that the information distribution of prior context is often predictive of sentence- and clause-final RTs (while not of sentence-medial RTs), which lends support to several prior hypotheses about the processes involved in wrap-up effects.


Introduction
Reading puts the unfolding of linguistic input in the hands (or, really, the eyes) of the reader. Consequently, it presents a unique opportunity to gain a better understanding of how humans comprehend written language. The rate at which humans choose to read text (and process its information) should be determined by their goal of understanding it. Ergo, examining where a reader spends their time should help us to understand the nature of language comprehension processes themselves. Indeed, studies analyzing reading times have been employed to explore a number of psycholinguistic theories (e.g., Smith and Levy, 2013; Futrell et al., 2020; Van Schijndel and Linzen, 2021).
One behavior revealed by such studies is the tendency for humans to spend more time on the last word of a sentence or clause. While the existence of such wrap-up effects is well-known (Just et al., 1982; Hill and Murray, 2000; Rayner et al., 2000; Camblin et al., 2007), the cognitive processes giving rise to them are still not fully understood. This is likely (at least in part) due to the dearth of analyses targeting naturalistic sentence-final reading behavior. First, most studies of online processing omit data from these words to explicitly control for the confounding factors wrap-up effects introduce (e.g., Smith and Levy, 2013; Goodkind and Bicknell, 2018). Second, the few studies on wrap-up effects rely on small datasets, none of which analyze naturalistic text (Just and Carpenter, 1980; Rayner et al., 2000; Kuperberg et al., 2011).
This work addresses this gap, using several large corpora of reading time data. Specifically, we study whether information-theoretic concepts (such as surprisal) provide insights into the cognitive processes that occur at a sentence's boundary. Notably, information-theoretic approaches have proven effective for analyzing sentence-medial reading time behavior. We follow the long line of work that has connected information-theoretic measures and psychometric data (Frank et al., 2015; Goodkind and Bicknell, 2018; Wilcox et al., 2020; Meister et al., 2021, inter alia), employing similar methods to build models of sentence- and clause-final RTs. Using surprisal estimates from state-of-the-art language models, we search for a link between wrap-up effects and the information content within a sentence. We find that the distribution of surprisals of prior context is often predictive of sentence- and clause-final reading times (RTs), while not adding significant predictive power to models of sentence-medial RTs. This result suggests that the nature of cognitive processes involved during the reading of these boundary words may indeed be different from those at other positions. Such findings lend support to several prior hypotheses regarding which processes may underlie wrap-up effects (e.g., the resolution of prior ambiguities) while providing evidence against other speculations (e.g., that the time spent at sentence boundaries can be quantified with a constant factor, independent of the processing difficulty of the text itself).

The Process of Reading
Decades of research on reading behavior have improved our understanding of the cognitive processes involved in reading comprehension (Just and Carpenter, 1980; Rayner and Clifton, 2009, inter alia). Here, we will briefly describe overarching themes that are relevant for understanding wrap-up effects.

Incrementality and its Implications
It is widely accepted that language processing is incremental, i.e., readers process text one word at a time (Hale, 2001, 2006; Rayner and Clifton, 2009; Boston et al., 2011, inter alia). Consequently, much can be uncovered about reading comprehension via studies that analyze cognitive processing at the word level. Many psycholinguistic studies make use of this notion, taking per-word RTs in self-paced reading (SPR) or eye-tracking studies to be a direct reflection of the processing load of that word (e.g., Smith and Levy, 2013; Van Schijndel and Linzen, 2021). This relationship between RT and processing effort then allows us to identify relationships between a word's processing load and its attributes (e.g., surprisal or length), which in turn hints at the underlying cognitive processes involved in comprehension. One prominently studied attribute is word predictability, a notion naturally quantified by surprisal (also known as Shannon's (1948) information content). Formally, the surprisal of a word $w_t$ is defined as $s(w_t) \overset{\mathrm{def}}{=} -\log p(w_t \mid \mathbf{w}_{<t})$, i.e., a unit's negative log-probability given the prior sentential context $\mathbf{w}_{<t}$. Notably, this operationalization provides a way of quantifying how our prior expectations can affect our ability to process a linguistic signal.
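To make the definition of surprisal concrete, the following minimal sketch converts conditional probabilities into surprisal values. The probabilities are made up for illustration, and the choice of base 2 (bits) is an assumption; the definition above does not fix the base of the logarithm.

```python
import math


def surprisal(prob: float, base: float = 2.0) -> float:
    """Return -log_base(prob), the surprisal of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(prob) / math.log(base)


# Hypothetical values of p(w_t | w_<t) for a four-word sentence (illustrative only).
cond_probs = [0.25, 0.5, 0.03125, 0.125]

# Per-word surprisal in bits: less probable words carry more information.
surprisals = [surprisal(p) for p in cond_probs]

for p, s in zip(cond_probs, surprisals):
    print(f"p = {p:<8} surprisal = {s:.2f} bits")
```

A word that the context makes half as likely costs exactly one additional bit, which is why surprisal is a convenient scale for comparing processing difficulty across words.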
There are several hypotheses about the mathematical nature of the relationship between per-word surprisal and processing load. While there has been much empirical evidence that surprisal estimates serve as a good predictor of word-level RTs (Smith and Levy, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020), the data observed from sentence-final words appears not to follow the same relationship. Specifically, in comparison to sentence-medial words, sentence- or clause-final words are associated with increased RTs in self-paced studies (Just et al., 1982; Hill and Murray, 2000) and both increased fixation and regression times in eye-tracking studies (Rayner et al., 2000; Camblin et al., 2007). Such behavior has also been observed in controlled settings; for example, Rayner et al. (1989) found that readers fixated longer on a word when it ended a clause than when the same word did not end a clause.
Such widespread experimental evidence suggests that sentence-final and sentence-medial reading behaviors differ from each other, and that other cognitive processes (besides standard word-level processing) may be at play. Yet unfortunately, these wrap-up effects have received relatively little attention in the psycholinguistic community: Most reading time studies simply exclude sentence-final (or even clause-final) words from their analyses, claiming that the (poorly understood) effects are confounding factors in understanding the reading process (e.g., Frank et al., 2013, 2015; Wilcox et al., 2020). We believe instead that this data can potentially provide new insights in its own right.

Wrap-up Effects
It remains unclear what exactly occurs in the mind of the reader at the end of a sentence or clause. Which cognitive processes are encompassed by the term wrap-up effects? Several theories have been posited. First, Just and Carpenter (1980) hypothesize that wrap-up effects include actions such as "the constructions of inter-clause relations." Second, Rayner et al. (2000) suggest they might involve attempts to resolve previously postponed comprehension problems, which could have been deferred in the hope that upcoming words would resolve the problem. Third, Hirotani et al. (2006) posit that the hesitation when crossing clause boundaries is out of efficiency (Jarvella, 1971); readers do not want to have to return to the clause later, so they take the extra time to make sure there are no inconsistencies in the prior text.
While some prior hypotheses have been largely dismissed (see Stowe et al., 2018 for a more detailed summary) due to, e.g., the widespread support of theories of incremental processing, most others lack formal testing in naturalistic reading studies. We attempt to address this gap. Concretely, we posit that the relationship between a text's information-theoretic attributes and its observed wrap-up times can indicate the presence (or lack) of several cognitive processes that are potentially a part of sentence wrap-up. For example, high-surprisal words in the preceding context may correlate with the presence of ambiguities in the text; they may also correlate with complex linguistic relationships of the current text with prior sentences. Both are driving forces in the theories given above. Consequently, in this work, we ask whether the reading behavior observed at the end of a sentence or clause can be described (at least partially) by the distribution of information content in the preceding context, as this may give insights for several prior hypotheses about wrap-up effects.

Language Models as Predictors of Psychometric Data
Formally, a language model $q$ is a probability distribution over natural language sentences, i.e., over $\mathcal{V}^*$ for an alphabet $\mathcal{V}$ of linguistic units (typically words). In the case when $q$ is locally normalized, which is the predominant case for today's neural language models, $q$ is defined as the product of conditional probability distributions, i.e., for $\mathbf{w} \in \mathcal{V}^*$, we have $q(\mathbf{w}) = q(\text{EOS} \mid \mathbf{w}) \prod_{t=1}^{|\mathbf{w}|} q(w_t \mid \mathbf{w}_{<t})$, where each $q(\cdot \mid \mathbf{w}_{<t})$ is a distribution over $\mathcal{V} \cup \{\text{EOS}\}$. The symbol EOS is a special end-of-string token not in $\mathcal{V}$. Consequently, we can use $q$ to estimate the probability of individual words in context. The model parameters are typically estimated by minimizing the negative log-likelihood of a corpus of natural language strings $\mathcal{C}$, i.e., minimizing $\mathcal{L}(q) = -\sum_{\mathbf{w} \in \mathcal{C}} \log q(\mathbf{w})$ with respect to $q$.
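The factorization above can be illustrated with a toy locally normalized model. The sketch below is not one of the models used in this work: it assumes a hand-written bigram table (so each conditional looks only at the previous word, a simplification of the full context), and all probabilities, the `BOS` start symbol, and the function names are hypothetical.

```python
import math

BOS = "<bos>"  # hypothetical start-of-string symbol used only to index the table
EOS = "<eos>"  # end-of-string symbol, not part of the vocabulary

# Hypothetical conditionals q(next | previous); every row sums to 1.
BIGRAM = {
    BOS: {"the": 1.0},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"barks": 0.5, EOS: 0.5},
    "cat": {"barks": 0.25, EOS: 0.75},
    "barks": {EOS: 1.0},
}


def log_prob(sentence):
    """log q(w): sum of log-conditionals for each word, plus the EOS term."""
    total = 0.0
    prev = BOS
    for word in list(sentence) + [EOS]:
        total += math.log(BIGRAM[prev][word])
        prev = word
    return total


def negative_log_likelihood(corpus):
    """L(q) = -sum over sentences in the corpus of log q(w)."""
    return -sum(log_prob(sentence) for sentence in corpus)


corpus = [["the", "dog", "barks"], ["the", "cat"]]
print(math.exp(log_prob(corpus[0])))    # probability of the first sentence
print(negative_log_likelihood(corpus))  # training objective on the toy corpus
```

Because each conditional is a proper distribution over the vocabulary plus EOS, the product defines a valid distribution over strings, and each factor can be read off individually as a per-word probability.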
One widely embraced technique in information-theoretic psycholinguistics is the use of these language models to estimate the probabilities required for computing surprisal (Hale, 2001; Demberg and Keller, 2008; Mitchell et al., 2010; Fernandez Monsalve et al., 2012). It has even been observed that a language model's perplexity correlates negatively with the psychometric predictive power provided by its surprisal estimates (Frank and Bod, 2011; Goodkind and Bicknell, 2018; Wilcox et al., 2020). If these language models keep improving at their current fast pace (Radford et al., 2019; Brown et al., 2020), exciting new results in computational psycholinguistics may follow, connecting reading behavior to the statistics of natural language.
Predicting Reading Times. In the computational psycholinguistics literature, the RT-surprisal relationship is typically studied using predictive models: RTs are predicted using surprisal estimates (along with other attributes such as number of characters) for the current word. The predictive power of these models, together with the structure of the model itself (which defines a specific relationship between RTs and surprisal), is then used as evidence of the studied effect. While this paradigm is successful in modeling sentence-medial RTs (Smith and Levy, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020), its effectiveness for modeling sentence- and clause-final times is largely unknown due to the omission of this data from the majority of RT analyses.
A priori, we might expect per-word surprisal to be a similarly powerful predictor of sentence- and clause-final RTs. Yet in Fig. 1, we see that when our baseline linear model (described more precisely in §4) is fit to sentence-medial RTs, the residuals for predictions of clause-final RTs appear to be neither normally distributed nor centered around 0.
Further, these trends appear to be different for eye-tracking and SPR data, where the latter are skewed towards lower values for all datasets. These results provide further confirmation that clause-final data does not adhere to the same relationship with RT as sentence-medial data, a phenomenon that may perhaps be accounted for by additional factors at play in the comprehension of clause-final words. Thus, we ask whether taking into account information from the entire prior context can give us a better model of these clause-final RTs.
To this end, we operationalize the information content $\mathrm{INF}^{(k)}$ in text $\mathbf{w}$ (of length $T$) as:

$$\mathrm{INF}^{(k)}(\mathbf{w}) = \sum_{t=1}^{T} s(w_t)^k$$

where $\mathbf{w}$ may be an entire sentence or only its first $T$ words. Notably, the case of $k = 0$ returns $T$; under $k = 1$, we get the total information content of $\mathbf{w}$. For $k > 1$, moments of high surprisal will disproportionately drive up the value of $\mathrm{INF}^{(k)}(\mathbf{w})$.
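The behavior of this quantity at different exponents can be sketched in a few lines. The snippet assumes that INF at exponent k is the sum of per-word surprisals each raised to the power k, which is the form consistent with the k = 0 and k = 1 cases described above; the function name and the two surprisal profiles are hypothetical.

```python
def inf_k(surprisals, k):
    """Sum of per-word surprisals raised to the power k (assumed form of INF)."""
    return sum(s ** k for s in surprisals)


# Two hypothetical four-word contexts with the same total surprisal (8 bits).
even = [2.0, 2.0, 2.0, 2.0]    # information spread uniformly
peaked = [1.0, 1.0, 1.0, 5.0]  # one high-surprisal word

for k in (0, 1, 2):
    print(k, inf_k(even, k), inf_k(peaked, k))
```

At k = 0 both contexts yield their length and at k = 1 their shared total, so only exponents above 1 separate the evenly spread context from the one containing a surprisal spike.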

Experiments
Data. We use reading time data from 5 corpora over 2 modalities: the Natural Stories (Futrell et al., 2018), Brown (Smith and Levy, 2013), and UCL (SP) (Frank et al., 2013) Corpora, which contain SPR data, as well as the Provo (Luke and Christianson, 2018), Dundee (Kennedy et al., 2003) and UCL (ET) (Frank et al., 2013) Corpora, which contain eye movements during reading. All corpora are in English. For eye-tracking data, we take reading time to be the sum of all fixation times on that word. We provide an analysis of regression (a.k.a. go-past) time in App. B. We provide further details regarding pre-processing in App. A.
We compute per-word surprisal as the sum of subword surprisals, when applicable. Additionally, punctuation is included in these estimates, although see App. B for results omitting punctuation, which are qualitatively the same. More details are given in App. A.
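Summing subword surprisals is justified by the chain rule: the negative log-probability of a word equals the sum of the negative log-probabilities of its pieces, each conditioned on what precedes it. A minimal sketch of the aggregation step follows; the token split, surprisal values, and word indices are invented for illustration, and attaching the final punctuation mark to the preceding word is an assumption about how punctuation is folded into per-word estimates.

```python
def word_surprisals(subword_surprisals, word_ids):
    """Sum the surprisals of all subword tokens that share a word index."""
    totals = {}
    for value, word_id in zip(subword_surprisals, word_ids):
        totals[word_id] = totals.get(word_id, 0.0) + value
    return [totals[word_id] for word_id in sorted(totals)]


# Hypothetical tokenization of "readers pause." into subword pieces.
tokens = ["read", "ers", "pause", "."]
token_surprisals = [3.0, 1.5, 4.0, 0.5]  # illustrative values in bits
word_ids = [0, 0, 1, 1]                  # "." is attached to "pause"

print(word_surprisals(token_surprisals, word_ids))
```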
Evaluation. Following Wilcox et al. (2020) and Meister et al. (2021), we quantify the predictive power of a variable of interest as the mean difference in log-likelihood ∆LL of data points under models with and without access to that variable. A positive ∆LL value indicates the model with this predictor fits the observed data more closely than a model without this predictor. We use 10-fold cross-validation to compute ∆LL values, taking the mean across the held-out folds as our final metric.
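The evaluation procedure can be sketched end to end on synthetic data. The snippet below is an illustration under stated assumptions rather than the exact setup used here: it assumes ordinary least-squares regression with Gaussian noise, uses only the standard library, and generates made-up reading times in which an extra predictor truly matters. All function and variable names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, ys):
    """Return least-squares coefficients (intercept first) and residual variance."""
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, ys)]
    return beta, sum(e * e for e in resid) / len(ys)


def mean_log_lik(beta, var, xs, ys):
    """Mean Gaussian log-likelihood of held-out points under a fitted model."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
        total += -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)
    return total / len(ys)


def delta_ll(base_x, full_x, ys, n_folds=10, seed=0):
    """Mean over folds of (held-out LL with predictor) - (held-out LL without)."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    diffs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        scores = []
        for feats in (base_x, full_x):
            beta, var = fit_ols([feats[i] for i in train], [ys[i] for i in train])
            scores.append(mean_log_lik(beta, var,
                                       [feats[i] for i in held_out],
                                       [ys[i] for i in held_out]))
        diffs.append(scores[1] - scores[0])
    return sum(diffs) / n_folds


# Synthetic data: RT depends on word surprisal and on a context-level feature.
rng = random.Random(1)
word_s = [rng.uniform(0, 10) for _ in range(200)]
context_f = [rng.uniform(0, 50) for _ in range(200)]
rt = [200 + 5 * s + 2 * f + rng.gauss(0, 10) for s, f in zip(word_s, context_f)]

gain = delta_ll([[s] for s in word_s],
                [[s, f] for s, f in zip(word_s, context_f)], rt)
print(f"mean delta LL per data point: {gain:.3f}")
```

Because the log-likelihoods are computed on held-out folds, a predictor that merely overfits the training data would not be rewarded with a positive value.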
Our baseline model for predicting RTs contains predictors for surprisal, unigram log-frequency, character length, and the interaction of the latter two. The same quantities computed on the previous word are also included to account for spill-over effects (Smith and Levy, 2013). Surprisal from two words back is included for SPR datasets. Unless otherwise stated, GPT-2 estimates are used for baseline surprisal estimates in all models.
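One way to assemble these baseline predictors into a design matrix is sketched below. The dictionary keys, function names, and example values are hypothetical, and skipping words that lack the required preceding context is an assumption made for simplicity.

```python
def own_predictors(word):
    """Surprisal, log-frequency, length, and the log-frequency x length interaction."""
    return [
        word["surprisal"],
        word["log_freq"],
        word["length"],
        word["log_freq"] * word["length"],
    ]


def baseline_matrix(words, spr=False):
    """Rows of current-word and previous-word predictors (spill-over).

    For SPR data, the surprisal from two words back is appended as well.
    Words without enough preceding context are skipped (an assumption).
    """
    start = 2 if spr else 1
    rows = []
    for t in range(start, len(words)):
        row = own_predictors(words[t]) + own_predictors(words[t - 1])
        if spr:
            row.append(words[t - 2]["surprisal"])
        rows.append(row)
    return rows


# Illustrative values only.
example = [
    {"surprisal": 2.0, "log_freq": -3.0, "length": 3},
    {"surprisal": 6.5, "log_freq": -9.0, "length": 7},
    {"surprisal": 1.0, "log_freq": -2.5, "length": 2},
    {"surprisal": 4.0, "log_freq": -7.0, "length": 5},
]

print(baseline_matrix(example))            # eye-tracking style: 8 predictors per row
print(baseline_matrix(example, spr=True))  # SPR style: 9 predictors per row
```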
Results. Here we explore the additional predictive power that $\mathrm{INF}^{(k)}$ gives us when modeling clause-final RTs. In Fig. 2, we observe that often the additional information provided by $\mathrm{INF}^{(k)}(\mathbf{w})$ indeed leads to better models of clause-final RTs. Note that the estimated coefficients for $\mathrm{INF}^{(k)}$ are always positive when ∆LL > 0 (see App. B.2), suggesting that higher values of $\mathrm{INF}^{(k)}(\mathbf{w})$ correspond to longer wrap-up times. This finding is in line with other information-theoretic analyses of RTs (discussed in §2.1), which have consistently found positive relationships between information content and RT.
In most cases, $\mathrm{INF}^{(k)}$ at some value of $k > 0$ leads to larger gains in predictive power than $k = 0$. Ergo, the information content of the preceding text is more indicative of wrap-up behavior than length alone. Further, while often within standard error, $\mathrm{INF}^{(k)}(\mathbf{w})$ at $k > 1$ provides more predictive power than at $k = 1$ across the majority of datasets. This indicates that unevenness in the distribution of surprisal is a stronger predictor of clause-final RTs than the total surprisal content alone. The same experiments for sentence-medial words show these quantities are less helpful when modeling their RTs. Note that these effects hold above and beyond the spill-over effects from the window immediately preceding the sentence boundary. The effect of the distribution of surprisal throughout the sentence is stronger for eye-tracking data than for SPR; further, the trends are even more pronounced when measuring regression times for eye-tracking data (see App. B).

Figure 2: Mean ∆LL as a function of the exponent $k$ in $\mathrm{INF}^{(k)}$ for models of sentence- and clause-final (top row) and sentence-medial (bottom row) RTs using surprisal estimates from different language models. The shaded region connects standard error estimates. Vertical intercepts at $k = 0, 1$ are for reference. We see that our information-theoretic predictors contribute much less modeling power to the prediction of sentence-medial RTs in comparison to sentence- and clause-final RTs.
Notably, we see some variation in trends across datasets. Due to the nature of psycholinguistic studies, it is natural to expect some variation due to, e.g., data collection procedures or inaccuracies from measurement devices. Another (perhaps more influential) factor in the difference in trends comes from the variation in dataset sizes. We see that with the smaller datasets (e.g., UCL and Provo), there may not be enough data to learn accurate model parameters. This artifact may manifest as the noisiness or a lack of a significant increase in log-likelihood (on a held-out test set) over the baseline that we observe in some cases.
When considering prior theories of wrap-up processes, these results have several implications. For example, they can be interpreted as supporting and extending Rayner et al.'s (2000) hypothesis, which suggests the extra time at sentence boundaries is spent resolving prior ambiguities. In this case, the observed correlation between wrap-up times and $\mathrm{INF}^{(k)}(\mathbf{w})$ may potentially be linked to two factors: (1) contextual ambiguities increasing variation in per-word information content, and (2) contextual ambiguities being resolved at clause ends. On the other hand, these results provide evidence against the hypothesis that the cognitive processes occurring during the comprehension of sentence-medial and clause-final words are the same. Further, they also go against Hirotani et al.'s (2006) hypothesis (discussed in §2.2), as the differences in sentence-medial and clause-final times cannot be purely quantified by a constant factor.

Conclusion
We attempt to shed light on the nature of wrap-up effects by exploring the relationship between clause-final RTs and information-theoretic attributes of text. We find that operationalizations of the information contained in the preceding context lead to better predictions of these RTs, while not adding significant predictive power for sentence-medial RTs. This suggests that information-theoretic attributes of text can shed light on the cognitive processes happening during the comprehension of clause-final words. Further, these processes may indeed be different in nature from those required for sentence-medial words. In short, our results provide evidence (either for or against) several theories of the nature of wrap-up processes.

A Experimental Setup
A.1 Data Pre-processing
We use the Moses decoder⁹ tokenizer and punctuation normalizer to pre-process all text data. Some of the Hugging Face tokenizers for respective neural models performed additional tokenization; we refer the reader to the library documentation for more details. We determine clause-final words as all those ending in punctuation. Capitalization was kept intact, although the lowercase versions of words were used in unigram probability estimates. We estimate unigram log-probabilities on WikiText-103 using the KenLM (Heafield, 2011) library with default hyperparameters. We removed outlier word-level reading times (specifically those with a z-score > 3 when the distribution was modeled as log-linear).
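Two of the pre-processing steps above, flagging clause-final words and dropping outlier reading times, can be sketched as follows. This is an illustration under assumptions: the set of punctuation characters is taken from Python's `string.punctuation`, and the outlier rule is read as a one-sided z-score threshold applied to log-transformed reading times. The function names and example values are hypothetical.

```python
import math
import statistics
import string


def is_clause_final(token):
    """Treat a token as clause-final if its last character is punctuation."""
    return bool(token) and token[-1] in string.punctuation


def remove_outliers(rts, threshold=3.0):
    """Drop reading times whose z-score (computed on log RTs) exceeds `threshold`."""
    logs = [math.log(rt) for rt in rts]
    mean = statistics.fmean(logs)
    sd = statistics.pstdev(logs)
    if sd == 0:
        return list(rts)
    return [rt for rt, value in zip(rts, logs) if (value - mean) / sd <= threshold]


tokens = ["When", "it", "rains,", "readers", "slow", "down."]
print([t for t in tokens if is_clause_final(t)])

# Thirty typical reading times (ms) and one implausibly long one.
reading_times = [300.0] * 30 + [30000.0]
print(len(remove_outliers(reading_times)))
```

Note that `string.punctuation` also contains characters such as apostrophes and hyphens, so a stricter character set may be preferable depending on the tokenization.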

A.2 Surprisal Estimates
We use pre-trained neural language models to compute most surprisal estimates. For reproducibility, we employ the model checkpoints provided by Hugging Face (Wolf et al., 2020). Specifically, for GPT-2, we use the default OpenAI version (gpt2); for TransformerXL, we use a version of the model (architecture described in Dai et al. (2019)) that has been fine-tuned on WikiText-103 (transfo-xl-wt103); for BERT, we use the bert-base-cased version. Notably, BERT models the probability of a word given both prior and later context, which means it can only give us pseudo estimates of surprisal. Both GPT-2 and BERT use sub-word tokenization. We additionally use surprisal estimates from a 5-gram model trained on WikiText-103 using the KenLM (Heafield, 2011) library with default hyperparameters for Kneser-Essen-Ney smoothing.

⁹ http://www.statmt.org/moses/

Figure 1: Distributions of residuals when predicting either clause-final or non-clause-final times using our baseline linear models. Models are fit to (the log-transform of) non-clause-final average RTs. Outlier times (according to log-normal distribution) are excluded. The top-level datasets contain eye-tracking data while the bottom contain SPR data. Full distributions of RTs are shown in App. B, where we also show models fit to regression times, rather than full reading times.

Figure 3: Distributions of average RTs for clause-final and non-clause-final words.Outlier times (according to log-normal distribution) are excluded from averages for both graphs.The top-level datasets contain eye-tracking data while the bottom contain SPR data.

Figure 4: Version of Fig. 1 where surprisal estimates do not include the surprisal assigned to punctuation, which is often a large contributor to clause-final surprisal estimates.We see very little qualitative difference with Fig. 1.

Figure 5: Version of (a) Fig. 3 and (b) Fig. 1 for regression times for clause-final and non-clause-final words. Only applicable for eye-tracking datasets.