Revisiting the Uniform Information Density Hypothesis

The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal. While its implications on language production have been well explored, the hypothesis potentially makes predictions about language comprehension and linguistic acceptability as well. Further, it is unclear how uniformity in a linguistic signal—or lack thereof—should be measured, and over which linguistic unit, e.g., the sentence or language level, this uniformity should hold. Here we investigate these facets of the UID hypothesis using reading time and acceptability data. While our reading time results are generally consistent with previous work, they are also consistent with a weakly super-linear effect of surprisal, which would be compatible with UID’s predictions. For acceptability judgments, we find clearer evidence that non-uniformity in information density is predictive of lower acceptability. We then explore multiple operationalizations of UID, motivated by different interpretations of the original hypothesis, and analyze the scope over which the pressure towards uniformity is exerted. The explanatory power of a subset of the proposed operationalizations suggests that the strongest trend may be a regression towards a mean surprisal across the language, rather than the phrase, sentence, or document—a finding that supports a typical interpretation of UID, namely that it is the byproduct of language users maximizing the use of a (hypothetical) communication channel.


Introduction
The uniform information density (UID) hypothesis (Fenk and Fenk, 1980; Levy and Jaeger, 2007) states that language users prefer when information content (measured information-theoretically as surprisal) is distributed as smoothly as possible throughout an utterance. The studies adduced in support of this hypothesis in language production span levels of linguistic structure: from phonetics (Aylett and Turk, 2004) to lexical choice (Mahowald et al., 2013), to syntax (Jaeger, 2010), and to discourse (Torabi Asr and Demberg 2015) (though see Levy 2018, 2019). Despite this evidence, there are several aspects of the UID hypothesis that lack clarity or unity. For example, there is a dearth of converging evidence from studies in language comprehension. Furthermore, multiple candidate operationalizations of UID have been proposed, each without formal justification for their choices (Collins, 2014; Jain et al., 2018; Meister et al., 2020; Wei et al., 2021).
1 Analysis pipeline is publicly available and can be found at https://github.com/rycolab/revisiting-uid.
Figure 1: Correlation coefficient between (negative) sum of surprisals raised to the kth power and linguistic acceptability judgments of a sentence. The higher correlation when k > 1 implies sentences with a more uniform distribution of information are more acceptable.
In this work, we attempt to shed light on these issues: we first study the relationship between the distribution of information content throughout a sentence and native speakers' (i) sentence-level reading times and (ii) sentence acceptability judgments. While our results for sentence-level reading times do not contradict previous word-level reading time analyses (e.g., Smith and Levy 2013; Goodkind and Bicknell 2018a), which have shown a linear effect of surprisal, they suggest that a slight super-linear effect may likewise be a plausible explanation, which is in line with predictions of the UID hypothesis. For sentence acceptability judgments, we see more concrete signs of a super-linear effect of sentence-level surprisal (see Fig. 1), consistent with a preference for UID in language. Given these findings, we next ask how we can best measure UID. We review previous results supporting UID in search of an operationalization, and find that in most of these studies, adherence to UID is measured via an analysis of individual linguistic units, without direct consideration for the information content carried by surrounding units (Frank and Jaeger, 2008; Jaeger, 2010; Mahowald et al., 2013). Such a definition fails to account for the distribution across the signal as a whole.
Consequently, we present and motivate a set of plausible operationalizations, either taken from the literature or newly proposed. Given our earlier results, we posit that good operationalizations of UID should provide strong explanatory power for human judgments of linguistic acceptability and potentially reading times. In this search, we additionally explore with respect to which linguistic unit (a phrase, sentence, document, or language as a whole) uniformity should be measured. Our results provide initial evidence that the best definition of UID may be a super-linear function of word surprisal. Further, we see that a regression towards the mean information content of the entire language, rather than a local information rate, may better capture the pressure for UID in natural language, a theory that falls in line with its information-theoretical interpretation, i.e., that language users maximize the use of a hypothetical noisy channel during communication.

Processing Effort in Comprehension
In psycholinguistics, there are a number of theories that explain how the effort required to process language varies as a function of some perceived linguistic unit. Several of these are founded in information theory (Shannon, 1948), using the notion of language as a communication system in order to build computational models of processing. Under such a framework, linguistic units convey information, and the exact amount of information a unit carries can be quantified as its surprisal, also termed Shannon information content. Formally, let us consider a linguistic signal $u = u_1, \ldots, u_N$ as a sequence of linguistic units, e.g., words or morphemes; the standard definition of surprisal is then $s(u_n) \overset{\text{def}}{=} -\log p(u_n \mid u_{<n})$, i.e., a unit's negative log-probability conditioned on its prior context. Note that under this definition, low probability items are seen as more informative, which reflects the intuition that unpredictable items convey more information than predictable ones. With this background in mind, we now review two prominent examples of information-theoretic models of language processing: surprisal theory and the uniform information density hypothesis.
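As a minimal illustration of the surprisal definition above, the following Python sketch converts conditional probabilities into per-unit surprisals and sums them into the information content of a signal. The probabilities are hypothetical values chosen for the example, and the use of base-2 logarithms (bits) is an assumption made here for readability.

```python
import math


def surprisals(cond_probs):
    """Return s(u_n) = -log2 p(u_n | u_<n) for each unit, in bits."""
    return [-math.log2(p) for p in cond_probs]


# Hypothetical conditional probabilities p(u_n | u_<n) for a four-word signal.
cond_probs = [0.5, 0.25, 0.125, 0.5]

per_word = surprisals(cond_probs)
total_information = sum(per_word)  # information content of the whole signal

print(per_word)
print(total_information)
```

The least probable word (probability 0.125) receives the highest surprisal, matching the intuition that unpredictable items carry more information.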

Surprisal Theory
Surprisal theory (Hale, 2001) posits that the incremental load of processing a word is directly related to how unexpected the word is in its context, i.e., its surprisal. Mathematically formulated, the processing effort required for the word $u_n$ follows a linear relationship with respect to its surprisal:
$$\text{Effort}(u_n) \propto s(u_n) \tag{1}$$
Over the years, surprisal theory has been further motivated and received wide empirical support (Levy, 2008; Brouwer et al., 2010). Notably, a number of works give evidence that this relationship (between processing effort and surprisal) is indeed linear (equivalently, logarithmic in probability; Smith and Levy 2013; Frank et al. 2013; Goodkind and Bicknell 2018b, though see Brothers and Kuperberg 2021).

Uniform Information Density
Given the formal definition of surprisal, the information content of the entire linguistic signal $u$ can be quantified as the sum of individual surprisals. Following Eq. (1), the effort to process $u$ would thus be proportional to this sum, i.e.:
$$\text{Effort}(u) \propto \sum_{n=1}^{N} s(u_n) \tag{2}$$
But this has a counter-intuitive consequence. Suppose a speaker has a fixed number of bits of information to convey. Eq. (2) predicts that all ways of distributing that information in an utterance would involve equal processing effort: packing it all into a single, short utterance; spreading it out thinly in an extremely long utterance; dispersing it in a highly uneven profile throughout an utterance. The theory of uniform information density (UID; Fenk and Fenk 1980; Genzel and Charniak 2002; Bell et al. 2003; Aylett and Turk 2004; Levy and Jaeger 2007) attempts to reconcile the role of surprisal in determining processing effort with the intuition that perhaps not all ways of distributing information content have equal effect on overall processing effort. Rather, UID predicts that communicative efficiency is maximized when information (again quantified as per-unit surprisal) is distributed as uniformly as possible throughout a signal. One way of deriving this prediction is to hypothesize that the processing effort for a sentence is an additive function of (i) a super-linear function of surprisal; and (ii) utterance length:
$$\text{Effort}(u) \propto \sum_{n=1}^{N} s(u_n)^k + c \cdot N \tag{3}$$
for some constant $c > 0$ and $k > 1$. The above equation implies that high surprisal instances require disproportionately high processing effort from the language user. Consequently, a uniform distribution of $s(u_n)$, which for fixed $N$ and total information is the unique minimizer of Eq. (3), would incur the least processing effort. Proof given in App. A. Due to its support by a number of studies, the UID hypothesis has received considerable recognition in the cognitive science community. Such verifications, though, derive mostly from the tendencies implied by Eq. (3), as opposed to its direct verification. Take the original Levy and Jaeger (2007) as an example: while they propose a formal operationalization of UID, they evaluate their hypothesis by analyzing a surprisal vs. sentence length trade-off rather than assessing the operationalization directly. Furthermore, most UID studies investigate individual word surprisals, without regard for their distribution within the sequence (Aylett and Turk, 2004; Mahowald et al., 2013, inter alia).
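The contrast between the linear and super-linear cost models can be checked numerically. The sketch below assumes the cost in Eq. (3) has the form of a sum of surprisals raised to the power k plus a length penalty c · N, as described in the text; the two surprisal profiles and the value k = 1.25 are hypothetical choices for illustration only.

```python
def effort(surprisals, k=1.0, c=0.0):
    """Eq. (3)-style cost: sum of s(u_n)^k plus a length penalty c * N."""
    return sum(s ** k for s in surprisals) + c * len(surprisals)


# Two hypothetical profiles carrying the same 12 bits over the same 4 words.
uniform = [3.0, 3.0, 3.0, 3.0]  # information spread evenly
peaked = [1.0, 1.0, 1.0, 9.0]   # one very surprising word

# With k = 1 the cost depends only on total information, so the gap is zero.
linear_gap = effort(peaked, k=1.0) - effort(uniform, k=1.0)

# With k > 1 the peaked profile costs more than the uniform one.
superlinear_gap = effort(peaked, k=1.25) - effort(uniform, k=1.25)

print(linear_gap, superlinear_gap)
```

Because both profiles have the same length, the penalty c · N cancels in the comparison, so only the exponent k drives the difference.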

Quantifying Linguistic Uniformity
UID is, by its definition, a smoothing effect; it can be seen as a regression to a mean information rate, measured either as the surprisal per lexical unit (in written text, as we analyze here) or as the surprisal per time unit (in speech data). However, there are multiple ways the hypothesis may be interpreted. As a concrete example, we turn to Collins's (2014) fourth figure, which we recreate here in Fig. 2. In its perhaps better-known form, UID suggests that language transmission should happen at a roughly constant rate, close to the channel capacity, i.e., there is a fixed (and perhaps cross-linguistic; Coupé et al. 2019) value from which a unit's information density should never heavily deviate. Under this interpretation, S1 (red) adheres more closely to UID, as information content per word varies less, in absolute terms, across the sentence. We can formalize this notion of UID using an inverse relationship to some per-unit distance metric $\Delta(\cdot, \cdot)$ as follows:
$$\text{UID}^{-1}(u) \propto \sum_{n=1}^{N} \Delta\left(s(u_n), \mu_c\right) \tag{4}$$
where $\mu_c$ is a target (mean) information rate, presumably at a theoretical channel's capacity. This mathematical relationship reflects the intuition that the further the units in a linguistic signal are from the average information rate $\mu_c$, the less the signal adheres to UID. We may, however, also interpret UID as a pressure to avoid rapidly shifting from information dense (and therefore cognitively taxing) sections to sections requiring minimal processing effort. Instead, in an optimal setting, there should be a smooth transition between information sparse and dense components of a signal. Under this interpretation, we might believe S2 (blue) to adhere more closely to UID, as local changes are gradual. We can formalize this version of UID as
$$\text{UID}^{-1}(u) \propto \sum_{n=2}^{N} \Delta\left(s(u_n), s(u_{n-1})\right) \tag{5}$$
The difference between these two is concisely summarized as minimizing global vs. local variability. The former definition has arguably received more attention; studies such as Frank and Jaeger (2008), among others, analyze UID through regression towards a global mean. Yet, there are arguments that variability should instead be measured locally (Collins, 2014; Bloem, 2016).
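The two interpretations can rank the same pair of sentences differently. The following sketch assumes that the global measure sums a distance between each surprisal and a target rate, that the local measure sums a distance between adjacent surprisals, and that the distance is a squared difference. The surprisal profiles and the target rate are hypothetical values, not the data behind Collins's figure.

```python
def squared_distance(a, b):
    """One possible choice for the per-unit distance metric."""
    return (a - b) ** 2


def global_nonuniformity(s, mu_c, delta=squared_distance):
    """Sum of distances between each surprisal and a target rate mu_c."""
    return sum(delta(x, mu_c) for x in s)


def local_nonuniformity(s, delta=squared_distance):
    """Sum of distances between each surprisal and its predecessor."""
    return sum(delta(s[n], s[n - 1]) for n in range(1, len(s)))


mu_c = 5.0  # hypothetical target information rate

# s1 stays close to mu_c but jumps at every word;
# s2 drifts far from mu_c but changes gradually.
s1 = [4.0, 6.0, 4.0, 6.0, 4.0, 6.0]
s2 = [2.0, 3.5, 5.0, 6.5, 8.0, 6.5]

print(global_nonuniformity(s1, mu_c), global_nonuniformity(s2, mu_c))
print(local_nonuniformity(s1), local_nonuniformity(s2))
```

Under the global measure the first profile is the more uniform one, whereas under the local measure the second is, which mirrors the S1 versus S2 contrast discussed above.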

Regressing to Which Mean?
Notably, there is an aspect of the global variability presented in Eq. (4) that remains underspecified: what exactly is $\mu_c$? A mean information rate may be with respect to a phrase, a sentence or even a language as a whole; this rate could even span across languages, a definition that nicely aligns with recent cross-linguistic experiments on spoken language data that argue for a universal channel capacity (Pellegrino et al., 2011; Coupé et al., 2019). Yet, the former definitions likewise seem plausible.
To motivate this argument, consider the relationship between cadence in literary writing and UID. We loosely define cadence as the rhythm and speed of a piece of text, which should have a close relationship to the dispersion of information. When writing prose, authors typically vary cadence across sentences, interspersing short, impactful (i.e., high information) sentences within series of longer sentences to avoid repetitiveness. We have done so here, in this paper. Yet, intuitively, this practice does not lead to particularly high processing costs, at least for a native speaker. Indeed, some would argue that such fluctuations make text easier to read. This example motivates a pull towards a more context-dependent (perhaps sentence-level) rather than language-level mean information rate.
While a number of findings undoubtedly demonstrate a pressure against high (and sometimes even inordinately low) surprisal, which aligns with the first (global) interpretation of the UID hypothesis, their experimental setups, in general, do not provide evidence for or against a more local interpretation, such as the one just described. 4 We now define a number of UID operationalizations that encompass these different interpretations, subsequently analyzing them in §4.3.

Operationalizing UID
The first operationalization on which we will focus follows from Eq. (3), suggesting a super-linear effect of surprisal on processing effort:
$$\text{UID}^{-1}(u) = \frac{1}{N} \sum_{n=1}^{N} s(u_n)^k \tag{6}$$
where $k$ controls the strength of super-linearity. A second operationalization, similar to Eq. (4), implies a pressure for mean regression:
$$\text{UID}^{-1}(u) = \frac{1}{N} \sum_{n=1}^{N} \left(s(u_n) - \mu\right)^2 \tag{7}$$
Note that we may take $\mu$ from a number of different contexts. For example, $\mu_{\text{sent}} = \frac{1}{N} \sum_{n=1}^{N} s(u_n)$ for sentence $u_1, \ldots, u_N$ implies a sentence-level mean regression, whereas average surprisal over an entire language $\mu_{\text{lang}}$ suggests a regression to a (perhaps language-specific) channel capacity. Both definitions more closely align with our global interpretation of UID, i.e., that S1 (red) of Fig. 2 may exhibit a more "uniform" distribution of information.
4 We attribute this to the fact that most of these analyses were performed at the word- rather than sequence-level.
Similarly, we can compute the local variance in a sentence as
$$\text{UID}^{-1}(u) = \frac{1}{N-1} \sum_{n=2}^{N} \left(s(u_n) - s(u_{n-1})\right)^2 \tag{8}$$
which, in contrast to Eq. (6), aligns more with our local interpretation of UID.
We may also interpret UID as a pressure to minimize a signal's maximum per-unit surprisal, as this may be a point of inordinately high cognitive load for the comprehender:
$$\text{UID}^{-1}(u) = \max_{n=1}^{N} s(u_n) \tag{9}$$
For completeness, we further propose another potential measure of UID compliance inspired by the information-theoretic nature of UID. We consider the Rényi entropy (Rényi, 1961) of a probability distribution $p$, defined as:
$$H_k(p) = \frac{1}{1-k} \log \sum_{x \in \mathcal{X}} p(x)^k \tag{10}$$
where $\mathcal{X}$ is the support of the distribution $p$. Notably, the Rényi entropy, which is maximized when $p$ is uniform, becomes the Shannon entropy in the limit as $k \to 1$. However, for $k > 1$, high probability items contribute disproportionately to this sum, which in our context would translate to an emphasis on low-surprisal items. Thus, we do not expect it to be a good operationalization of (inverse) UID. However, the opposite holds for $k < 1$, where Rényi entropy can be seen as producing an extra cost for low-probability, i.e., high-surprisal items. Thus, in terms of UID, we take the Rényi entropy $H_k(p)$ as our operationalization (Eq. (11)), where $p$ is a distribution over $u_1, \ldots, u_N$ normalized to sum to 1.
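The operationalizations described in this section can be sketched as small Python functions over a list of per-word surprisals. The exact normalizations used here (dividing by N for the super-linear and variance measures, and by N - 1 for the local variance) are assumptions inferred from the surrounding text, and the function names are hypothetical. The Rényi entropy function takes an arbitrary probability distribution, since how the sentence-level distribution is constructed is not specified here.

```python
import math


def super_linear(s, k):
    """Mean of surprisals raised to the power k."""
    return sum(x ** k for x in s) / len(s)


def variance(s, mu=None):
    """Mean squared deviation from mu (the sentence mean if mu is None)."""
    if mu is None:
        mu = sum(s) / len(s)
    return sum((x - mu) ** 2 for x in s) / len(s)


def local_variance(s):
    """Mean squared difference between adjacent surprisals."""
    return sum((s[n] - s[n - 1]) ** 2 for n in range(1, len(s))) / (len(s) - 1)


def max_surprisal(s):
    """Largest per-unit surprisal in the signal."""
    return max(s)


def renyi_entropy(p, k):
    """Renyi entropy of order k (natural log); Shannon entropy at k = 1."""
    if abs(k - 1.0) < 1e-12:
        return -sum(q * math.log(q) for q in p if q > 0)
    return math.log(sum(q ** k for q in p if q > 0)) / (1.0 - k)


s = [1.0, 2.0, 3.0]  # hypothetical per-word surprisals
print(super_linear(s, 1.0), variance(s), local_variance(s), max_surprisal(s))
print(renyi_entropy([0.25] * 4, 0.5), renyi_entropy([0.7, 0.1, 0.1, 0.1], 0.5))
```

Passing a language-level mean as `mu` to `variance` gives the language-wide variant, while leaving it unset gives the sentence-level one.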

UID, Effort and Acceptability
We now revisit the processing effort of a sentence, rewriting it in terms of our UID operationalizations:
$$\text{Effort}(u) \propto \text{UID}^{-1}(u) \cdot N + c \cdot N \tag{12}$$
i.e., processing effort is proportional to the interaction between (i.e., multiplication by) $\text{UID}^{-1}$ and sentence length. Note that when using our operationalization of UID from Eq. (6), this equation reverts to Levy's (2005) original Eq. (3). Further, this equation with $k = 1$ and $c = 0$ recovers the hypothesis under surprisal theory. Following previous work (Frank and Bod, 2011; Goodkind and Bicknell, 2018a, inter alia), we then model reading time as $\text{ReadingTime}(u) \propto \text{Effort}(u)$; in words, (proportionally) more time is taken to read more cognitively demanding sentences. We further consider the relationship between UID and linguistic acceptability; we posit that
$$\text{Acceptability}(u) \propto -\text{UID}^{-1}(u) \cdot N \tag{13}$$
i.e., the linguistic acceptability of a sentence has an inverse relationship with processing effort (withholding the additional penalty for length). Intuitively, sentences that are easier to process are more probably acceptable sentences, and vice versa. While not comprehensive, there is evidence that this simple model (at least to some extent) captures the relationship between these two variables (Topolinski and Strack, 2009). Given these models, we now evaluate our different operationalizations based on their predictive power of psychometric variables.
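A short sketch can make the relationship between these quantities concrete. It assumes that effort is the product of inverse UID and sentence length plus a length penalty c · N, that the acceptability score is the negated product without the penalty, and that the super-linear operationalization is a mean of surprisals raised to the power k. All function names and surprisal values are hypothetical.

```python
def uid_inverse_super_linear(s, k):
    """Super-linear operationalization: mean of s(u_n)^k."""
    return sum(x ** k for x in s) / len(s)


def effort(s, uid_inverse, c=0.0):
    """Processing effort: UID^-1(u) * N + c * N."""
    n = len(s)
    return uid_inverse(s) * n + c * n


def acceptability_score(s, uid_inverse):
    """Acceptability score: -UID^-1(u) * N, with no length penalty."""
    return -uid_inverse(s) * len(s)


s = [2.0, 4.0, 1.0]  # hypothetical per-word surprisals

# k = 1 and c = 0: effort equals total surprisal (surprisal theory).
linear_effort = effort(s, lambda x: uid_inverse_super_linear(x, 1.0))

# k = 2 and c = 0.5: effort equals sum of squared surprisals plus 0.5 * N.
quadratic_effort = effort(s, lambda x: uid_inverse_super_linear(x, 2.0), c=0.5)

score = acceptability_score(s, lambda x: uid_inverse_super_linear(x, 2.0))
print(linear_effort, quadratic_effort, score)
```

Passing the operationalization as a function keeps the effort model fixed while the measure of uniformity is swapped out, which is the comparison the experiments rely on.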

Experiments
Data. We employ reading time data in English from 4 corpora over 2 modalities: the Natural Stories (Futrell et al., 2018) and Brown (Smith and Levy, 2013) Corpora, which contain self-paced reading time data, as well as the Provo (Luke and Christianson, 2018) and Dundee Corpora (Kennedy et al., 2003), which contain eye movements during reading. For acceptability judgments, also in English, we use the Corpus of Linguistic Acceptability (CoLA; Warstadt et al. 2019) and the BNC dataset (Lau et al., 2017). Notably, Natural Stories and CoLA by design contain wide coverage of syntactic and semantic phenomena. We provide further details of each of these datasets, including pre-processing, statistics and data-gathering processes, in App. B.

Estimating Surprisal
Since we do not have access to the ground-truth values of conditional probabilities of observing linguistic units given their context (i.e., surprisals), we must instead estimate these probabilities. This is typical practice in psycholinguistic studies (Demberg and Keller, 2008;Mitchell et al., 2010;Fernandez Monsalve et al., 2012). For example, Hale (2001) uses a probabilistic context-free grammar; Smith and Levy (2013) use n-gram language models.
In general, the psychometric predictive power of surprisal estimates from a model correlates highly with model quality (as traditionally measured by perplexity; Frank and Bod, 2011; Fossum and Levy, 2012; Goodkind and Bicknell, 2018a). Further, Transformer-based models appear to have superior psychometric predictive power in comparison to other architectures (Wilcox et al., 2020). We employ GPT-2 (Radford et al., 2019), TransformerXL (Dai et al., 2019), and BERT (Devlin et al., 2019), all state-of-the-art language models. We additionally include results using a 5-gram model, estimated using Modified Kneser-Essen-Ney Smoothing (Ney et al., 1994), to allow for an easier comparison with results from earlier works exploring UID in reading time data. All probability estimates are computed at the word-level. 10 Further details are given in App. B.
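To show what estimating word-level surprisal from a language model involves, the sketch below trains a toy bigram model with add-one smoothing and scores a sentence with it. This is a deliberately simplified stand-in chosen for illustration: it is neither the 5-gram model with modified Kneser-Ney smoothing nor any of the neural models mentioned above, and the three-sentence corpus is hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, len(vocab)


def word_surprisals(sentence, model):
    """Per-word surprisal in bits under add-one smoothed bigram estimates."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split()
    result = []
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        result.append(-math.log2(p))
    return result


corpus = ["the dog barks", "the dog sleeps", "the cat sleeps"]
model = train_bigram(corpus)

seen = word_surprisals("the cat sleeps", model)
unseen = word_surprisals("the cat barks", model)
print(seen)
print(unseen)
```

The continuation that never occurred in the toy corpus ("cat barks") receives a higher surprisal than the attested one ("cat sleeps"), which is the behavior any of the estimators above is expected to show at a much larger scale.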

Assessing Predictive Power
In our experiments, we analyze the ability of different functions of surprisal to predict psychometric data, namely the total time spent reading sentences in self-paced reading and eye tracking studies (see App. B) and perceived linguistic acceptability, 11 in order to better understand the relationship of surprisal with language processing. For reading times, we use the sum across word-level times as our sentence-level metric. Notably for eye movement datasets, our analysis of sentence-level reading times is novel: previous work has generally focused on how long readers spend on a word before progressing beyond it (often called the "first pass"; Rayner, 1998), but sentence-level measures include time re-reading content after having progressed beyond it. Linguistic acceptability data are available and assessed only at the sentence-level.
As we are interested in the relationship between UID and both reading times and acceptability judgments (in particular, the relationships described by Eqs. (12) and (13)), we turn to linear regression models. 12 For reading time data, as our baseline models, we specifically use linear mixed-effects models, with random effect terms (slopes for total word count at the sentence-level and intercepts at the word-level) for each subject to control for individual reading behaviors. 13 We additionally control for other variables known to influence reading time: at the sentence-level, our fixed effects include total word count and number of words with recorded fixations (per subject and sentence); results including fixed effects for sums of both individual word character lengths and word unigram log-probabilities (as estimated from WikiText-103; Merity et al. 2017) are given in App. C. At the word-level (only our last set of experiments), our fixed effects include linear terms for word log-probability, unigram log-probability, and character length, and the interaction of the latter two. We additionally include the same predictors from the previous word, a common practice due to known spillover effects observed in both types of measurement. These are standard predictors in reading time analyses (Smith and Levy, 2013; Goodkind and Bicknell, 2018b; Wilcox et al., 2020). For linguistic acceptability data, we use logistic regression models with solely an intercept term as our baseline predictor; results when including summed unigram log-probability or sentence length as predictors yielded similar trends (see App. C).
10 Given the hierarchical structure of language, there is not a single "correct" choice of linguistic unit over which language processing should be analyzed. Here we consider the primary units in a linguistic signal to be words, where we take a sentence to be a complete linguistic signal. We believe similar analyses at the morpheme, subword or phrase level, which we leave for future work, may shed further light on this topic.
11 Language models are trained to predict the probability of a sentence; the concept of linguistic acceptability is not explicitly part of their objective. As such, probability under a language model alone does not necessarily correlate well with acceptability (Lau et al., 2017).
12 While, for example, a multi-layer perceptron may provide more predictive power given the same variables, we may not be able to interpret the learned relationship, as additional transformations of our independent variables would likely be learned. Using linear regression allows us to directly assess which functions of surprisal more accurately explain data under our linearity assumptions in Eqs. (12) and (13).
13 Mixed-effects models allow us to incorporate both fixed and random effects into the modeling process, helping bring the conditional independence assumptions of the regression analysis better in line with the grouping structure of repeated-measures data.
We evaluate each model relative to a baseline, containing only the control features just mentioned. Specifically, performance assessments are computed between models that differ by solely a single predictor; for reading time data, we include both a fixed and (per-subject) random slope for this predictor. Following Wilcox et al. (2020), we report ∆LogLik: the mean difference in log-likelihood of the response variable between the two models. A positive ∆LogLik value indicates that a given data point is more probable under the comparison model, i.e., it more closely fits the observed data. To avoid overfitting, we compute ∆LogLik solely on held-out test data, averaged over 10-fold cross validation. See App. B for evaluation details.
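The evaluation procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: it uses ordinary least squares with Gaussian noise instead of the mixed-effects models described above (so there are no per-subject random effects), the data are synthetic, and all function names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, ys):
    """Least-squares coefficients and maximum-likelihood residual variance."""
    p = len(xs[0])
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(xs, ys)]
    return beta, sum(e * e for e in resid) / len(ys)


def mean_loglik(xs, ys, beta, sigma2):
    """Mean Gaussian log-likelihood of held-out responses."""
    total = 0.0
    for r, y in zip(xs, ys):
        e = y - sum(b * v for b, v in zip(beta, r))
        total += -0.5 * math.log(2 * math.pi * sigma2) - e * e / (2 * sigma2)
    return total / len(ys)


def delta_loglik(base_x, full_x, ys, folds=10):
    """Held-out log-likelihood gain of the augmented model, averaged over folds."""
    n, deltas = len(ys), []
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        scores = []
        for xs in (base_x, full_x):
            beta, s2 = fit_ols([xs[i] for i in train], [ys[i] for i in train])
            scores.append(mean_loglik([xs[i] for i in test], [ys[i] for i in test], beta, s2))
        deltas.append(scores[1] - scores[0])
    return sum(deltas) / folds


# Synthetic data: reading time depends on length and on a UID-style predictor.
random.seed(0)
lengths = [random.randint(5, 30) for _ in range(200)]
predictor = [random.uniform(0.0, 20.0) for _ in range(200)]
times = [50.0 * l + 20.0 * x + random.gauss(0.0, 30.0) for l, x in zip(lengths, predictor)]
base_x = [[1.0, float(l)] for l in lengths]                        # controls only
full_x = [[1.0, float(l), x] for l, x in zip(lengths, predictor)]  # plus predictor
delta = delta_loglik(base_x, full_x, times)
print(delta)
```

A positive value indicates that held-out reading times are more probable under the model that includes the extra predictor, which is how the comparison above is read.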

Results
Evidence of UID in Reading Times and Acceptability Judgments. We first assess the ability of our processing cost model (Eq. (3)) to predict reading times. In a similar fashion, we use Eq. (13) with Eq. (6) to predict acceptability scores. Recall from §2 that if the true relationship between surprisal and sequence-level processing effort is expressed by Eq. (3) with $k > 1$, then there must exist a pressure towards uniform information density. Thus, if we observe that a linear model using $\sum_{n=1}^{N} s(u_n)^k$ as a predictor explains the observed data better when $k > 1$, it suggests a preference for the uniform distribution of information in text. We report results for multiple corpora in Fig. 3. We see that in general, the best fit to the data is achieved not when our cost equations use $k = 1$, but rather a slightly larger value of $k$ (see also Tab. 1). Notably for reading time data, a conclusion that $k > 1$ is optimal contradicts a number of prior works that have judged the relationship between surprisal and reading time to be linear. We discuss this point further in §5. Yet for the reading time datasets, the $k = 1$ predictor is typically still within the standard error of the best predictor, meaning that the linear hypothesis is not ruled out. For acceptability data, we see more distinctly that $k > 1$ leads to the best predictor, especially when using true surprisal estimates (i.e., models aside from BERT). This result suggests that a more uniform distribution of information more strongly correlates with linguistic acceptability (see also Fig. 1 for explicit correlation analysis).
Figure 3: Mean ∆LogLik as a function of the exponent k for the sentence-level predictor (Eq. (3)) of reading time and linguistic acceptability. Shaded region connects standard error estimates from each point. We observe that often, our predictor with k > 1 explains the data at least as well as k = 1. Baseline models against which ∆LogLik is computed are specified in §4.2. For reading times, the augmented models additionally contain fixed effects and per-subject random effects slopes for the UID operationalization; for acceptability judgments, only a fixed effect for the UID operationalization is added.
We perform hypothesis tests to formally test whether our models of processing cost and linguistic acceptability have higher predictive power, as measured by ∆LogLik, when using a super-linear vs. linear function of surprisal. Specifically, we take our null hypothesis to be that k = 1 provides better or equivalent predictive power to k > 1. We use paired t-tests, where we aggregate sentence-level data across subjects for reading time datasets so as not to violate independence assumptions. We use a Bonferroni correction to account for the consideration of multiple models with k > 1. We find that we consistently reject the null hypothesis at significance level α = 0.001 for acceptability data experiments (aside from under the n-gram model). For reading time data, we never reject the null hypothesis, again confirming that the linear hypothesis may hold true in this setting.
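The testing logic can be illustrated with a short standard-library sketch. It computes the paired t statistic on per-item differences and, as a simplifying assumption, approximates the one-sided p-value with a standard normal distribution rather than the t distribution (reasonable only for large samples). The per-item log-likelihoods are hypothetical, constructed so that one candidate clearly beats the linear model and the other does not.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_test(loglik_k, loglik_linear):
    """Paired t statistic and approximate one-sided p-value.

    Null hypothesis: the k > 1 model is no better than the k = 1 model.
    The p-value uses a normal approximation to the t distribution.
    """
    d = [a - b for a, b in zip(loglik_k, loglik_linear)]
    t = mean(d) / (stdev(d) / math.sqrt(len(d)))
    return t, 1.0 - NormalDist().cdf(t)


def bonferroni_reject(p_values, alpha=0.001):
    """Reject each null only if its p-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical per-item log-likelihoods for the linear model and two candidates.
linear = [0.0] * 100
better = [0.2 + (0.1 if i % 2 == 0 else -0.1) for i in range(100)]
same = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]

p_values = [paired_test(better, linear)[1], paired_test(same, linear)[1]]
decisions = bonferroni_reject(p_values)
print(p_values, decisions)
```

Dividing the significance level by the number of candidate exponents is what keeps the overall false-positive rate controlled when several values of k > 1 are compared against k = 1.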
Another important observation is that the pseudo log-probability estimates from a cloze language model (BERT) work remarkably well when used to predict acceptability judgments, yet remarkably poorly for reading time estimates. We also see a less super-linear effect (higher predictive power for k ≈ 1) of surprisal in sentence acceptability for cloze than for auto-regressive models.
Evaluating Operationalizations of UID. We next ask: what are appropriate measures of UID in a linguistic signal? In an effort to answer this question, we explore the predictive power of the different operationalizations of UID proposed in §3 for our psycholinguistic data; given our evidence of UID in the prior section, we posit that better operationalizations should likewise provide stronger explanatory power than poor ones. We again fit linear models using Eqs. (12) and (13), albeit with each analyzed UID operationalization as our predictor. We use surprisal estimates from GPT-2, as it was consistently the autoregressive language model with the best predictive power. Results in Tab. 1 show that, in general, the family of Super-Linear (Eq. (6)) operationalizations (for k ≥ 1) and a language-wide notion of Variance (Eq. (7)) provide the largest increase in explanatory power relative to the baseline models, suggesting they may be the best quantifications of UID. While the Max (Eq. (9)) and Variance (Eq. (7)) predictors also provide good explanatory power, they are consistently lower across datasets. Further, language-level Variance seems to produce stronger predictors for psychometric data than sentence-level and Local Variance, an observation driving our next set of experiments. Notably, the Entropy predictors do quite poorly in comparison to other operationalizations, especially for k ≥ 1. These results suggest that a sentence-level notion of entropy may not capture the UID phenomenon well, which is perhaps surprising, given that it is a natural measure of the uniformity of information.
Table 1: ∆LogLik in 10e-2 nats when adding different UID operationalizations as predictors of reading time and linguistic acceptability. Surprisal estimates from GPT-2 are used. We use the same paradigm for baseline and augmented models as in Fig. 3. Other setups show similar trends (App. C).
Exploring the Scope of UID's Pressure. Each of our operationalizations in §3 is computed at the sequence-level. Thus, it is natural to ask, what should be the scope of a sequence when considering information uniformity? In an effort to answer this question, we explore how the predictive power of our UID operationalizations changes as we vary the window sizes over which they are computed. Specifically, we will look at the ability to predict per-word reading times; we make use of the Variance operationalization as our predictor (which demonstrated good performance in our sentence-level experiments) albeit with a word-level version:
$$\text{UID}^{-1}(u_n) = \left(s(u_n) - \mu\right)^2 \tag{14}$$
where $\mu$ is mean surprisal computed across the previous 1, 2, 3, 4 or n words or across the sentence, document, or language as a whole (as with unigram probabilities, $\mu_{\text{lang}}$ is computed per model over WikiText-103). Tab. 1 and Fig. 4 show evidence that the pressure for uniformity may in fact be at a more global scale. Under each corpus, the higher-level predictors of UID appear to provide better explanatory power of reading times than more local predictors.
Figure 4: Per-token ∆LogLik when changing the scope over which UID variance is computed (see Eq. (14)). Surprisal estimates from GPT-2 are used. Baseline predictors are specified in §4.2.
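The effect of changing the scope of the mean can be seen in a small sketch. It assumes the word-level predictor is the squared deviation of a word's surprisal from a mean, and that a window of size w covers the w words preceding the current one (excluding the current word); both the window convention and the numeric values, including the language-level mean, are hypothetical.

```python
def context_mean(s, n, window):
    """Mean surprisal over the `window` words preceding position n."""
    context = s[max(0, n - window):n]
    if not context:
        raise ValueError("no preceding words at position %d" % n)
    return sum(context) / len(context)


def word_level_variance(s, n, mu):
    """Squared deviation of the n-th word's surprisal from the mean mu."""
    return (s[n] - mu) ** 2


s = [2.0, 6.0, 3.0, 9.0, 4.0]  # hypothetical per-word surprisals
n = 3                          # score the fourth word
mu_lang = 5.0                  # hypothetical language-level mean surprisal

scores = {
    "prev_1": word_level_variance(s, n, context_mean(s, n, 1)),
    "prev_2": word_level_variance(s, n, context_mean(s, n, 2)),
    "sentence": word_level_variance(s, n, sum(s) / len(s)),
    "language": word_level_variance(s, n, mu_lang),
}
print(scores)
```

The same word receives a different predictor value under each scope, and it is these alternative predictors that are compared for explanatory power.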

Discussion
Most previous works investigating UID have looked for its presence in language production (Bell et al., 2003; Aylett and Turk, 2004; Levy and Jaeger, 2007; Mahowald et al., 2013, inter alia), while comprehension has received little attention. Collins (2014) and Sikos et al. (2017) are perhaps the only other works to find results in support of UID in this setting. Our findings are complementary to theirs; we take different analytical approaches but both observe a preference for the uniform distribution of information in a linguistic signal, although a similar analysis should be performed in the spoken domain before stronger conclusions can be drawn. While our reading time results do not refute previous work showing linear effects of surprisal on word-level reading times (Smith and Levy, 2013; Goodkind and Bicknell, 2018b; Wilcox et al., 2020), 19 we see some suggestions that a super-linear hypothesis is also plausible, especially in the Provo corpus. Notably, most of these works did not test a parametric space of non-linear functional forms, instead confirming linearity by visual inspection of the results of nonparametric fits. One exception, Smith and Levy (2013), explored the effects of adding a quadratic term for surprisal as a predictor of per-word reading times. Yet, if the true k that describes the reading times-surprisal relationship were only slightly greater than 1, as our results suggest, this quadratic test might be too restrictive. Our approach, which explores a more fine-grained range of k, is potentially more comprehensive, and indeed we find that values of k slightly greater than 1 often fit the data at least as well as k = 1, and can certainly not be ruled out. Other potential virtues of our analysis are that (1) it is performed at the sentence (rather than word) level, which is arguably a better method for analyzing a sequence-level phenomenon, i.e., UID, and (2) specifically for eye movement data, we include re-reading times after the first pass.
Limitations and Future Directions. A major limitation of this work is that the experimental analysis is limited to English (and Dutch, in the Appendix). While the pressure for uniformity, since it is explained by a cognitive process, should hold across languages, further experiments should be performed to verify these findings, especially since the relationship between model quality and psychometric predictive power has recently been called into question (Kuribayashi et al., 2021). As such, while we find convincing preliminary evidence in our analyzed languages, we are not able to fully test the hypothesis that the pressure for UID is at the language level. Further, we have no evidence as to whether there may be pressure towards a cross-linguistic $\mu_c$, which would be relevant to cross-linguistic interpretations of UID.

19 Notably, Brothers and Kuperberg (2021) have recently reported a linear effect of word probability on (self-paced) reading times in a controlled experiment where, within each experimental item, the target word was held constant and predictability was manipulated across a wide range by varying the preceding context. Motivated by this result, we repeated our analytic pipeline, testing a range of values of k but replacing surprisals with negative raw probabilities. The resulting regression model fits are not as good as those achieved when using surprisals (Fig. 11; compare y-axis ranges with Fig. 3).
Another important limitation of this work is the restriction to psychometric data from the written domain. To fully grasp the effects of the distribution of information in linguistic signals on language comprehension, spoken language data should be analyzed in the same way. Of course, different factors are likely at play in language comprehension in the spoken domain, including, e.g., the cognitive load of the speaker (Pijpops et al., 2018); such factors may make it even more difficult to disentangle the contributions of different effects to comprehension. We leave this analysis for future work.

Conclusion
In this work, we revisit the UID hypothesis, providing both a quantitative and qualitative assessment of its various interpretations. We find suggestions that the UID formulation proposed in Levy (2005) may better predict processing effort in language comprehension than alternative formulations since proposed. We additionally find that a similar model explains linguistic acceptability judgments well, confirming a preference for UID in written language. We subsequently evaluate different operationalizations of UID, observing that a super-linear function of surprisal best explains psychometric data. Further, operationalizations associated with global interpretations of UID appear to provide better explanatory power than those of local interpretations, suggesting that perhaps the most accurate interpretation of UID should be the regression towards the mean information rate of a language.

A Theory
We use the standard definition of surprisal, $s(u_n) \overset{\text{def}}{=} -\log p(u_n \mid u_{<n})$, and define $s(u) = \sum_{n=1}^{N} s(u_n)$ as the total surprisal of the entire signal $u$.
Theorem A.1. Assume a fixed $k > 1$ and $c > 0$, and assume $N \geq 1$. Then, i) the objective $\sum_{n=1}^{N} s(u_n)^k + c \cdot N$, i.e., Eq. (3), subject to the constraint of a fixed $s(u) = \sum_{n=1}^{N} s(u_n)$, is minimized when information is uniformly distributed, i.e., $s(u_1) = s(u_2) = \cdots = s(u_N) = s(u)/N$; ii) furthermore, this minimal value is found for either one or two choices of finite $N$.
Proof. We prove i) and ii) separately.
i). This was proven in the Appendix of Levy and Jaeger (2007) as a simple application of Jensen's inequality, which we reproduce here in largely similar form (adapting to our notation). First note that the function $(\cdot)^k$ is convex on the interval $[0, \infty)$ for $k > 1$; as surprisal can only take on non-negative values, this is the interval we operate over. Since $\sum_{n=1}^{N} \frac{1}{N} = 1$ and $\frac{1}{N} \geq 0$, we have that $\sum_{n=1}^{N} \frac{s(u_n)^k}{N}$ is a convex combination of the exponentiated surprisals $s(u_n)^k$. Thus, as we have a convex combination of values of a convex function, we may invoke Jensen's inequality, which yields

$$\sum_{n=1}^{N} \frac{s(u_n)^k}{N} \;\geq\; \left( \sum_{n=1}^{N} \frac{s(u_n)}{N} \right)^{k} = \left( \frac{s(u)}{N} \right)^{k}.$$

Multiplying both sides by $N$ gives

$$\sum_{n=1}^{N} s(u_n)^k \;\geq\; N \left( \frac{s(u)}{N} \right)^{k}. \tag{16}$$

The lower bound of Eq. (16) tells us that uniformly distributed information, i.e., where each $s(u_n) = s(u)/N$, is the lowest-cost manner of distributing total surprisal over the utterance. Conversely, when $0 < k < 1$, $(\cdot)^k$ is concave on the interval $[0, \infty)$. Therefore, the same logic gives us the opposite result: uniform information density is the highest-cost way to distribute total surprisal over the utterance.
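Part i) can be checked numerically on a toy example. The following is a minimal sketch, not part of the original analysis pipeline: the function name `power_cost` and the example surprisal values are hypothetical, chosen only so that both distributions share the same total surprisal.

```python
def power_cost(surprisals, k):
    """Sum of per-unit surprisals, each raised to the k-th power."""
    return sum(s ** k for s in surprisals)


# Two hypothetical ways of spreading the same total surprisal (12.0)
# over N = 4 units.
even = [3.0, 3.0, 3.0, 3.0]    # uniform: s(u_n) = s(u) / N
uneven = [6.0, 3.0, 2.0, 1.0]  # same total, non-uniform

for k in (0.5, 1.0, 1.5, 2.0):
    # For k > 1 the uniform cost is the smaller one; for 0 < k < 1 it is
    # the larger one; for k = 1 both equal the total surprisal.
    print(k, power_cost(even, k), power_cost(uneven, k))
```

The comparison flips direction at k = 1, mirroring the convex and concave cases of the argument above.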
ii). As shown in the previous step, regardless of the value of $N$, Effort is minimized when information density is uniform, that is, when $s(u_n) = s(u)/N$, giving us

$$\mathrm{Effort}(N) = N \left( \frac{s(u)}{N} \right)^{k} + c \cdot N = s(u)^k N^{1-k} + c \cdot N.$$

We now consider the question of what value of $N$ minimizes Effort. A continuous extension of Effort to real-valued $N$ has the following first and second derivatives:

$$\frac{\partial\, \mathrm{Effort}}{\partial N} = (1 - k)\, s(u)^k N^{-k} + c, \tag{18a}$$

$$\frac{\partial^2\, \mathrm{Effort}}{\partial N^2} = k (k - 1)\, s(u)^k N^{-k-1}. \tag{18b}$$

We can use these derivatives to inspect the behavior of the function. First, the second derivative is strictly positive (for $k > 1$ and $s(u) > 0$); thus processing effort is strictly convex in $N$, so it has at most one global minimum. Second, we can find the minimizing value of $N$ by setting the first derivative to zero, giving us

$$N = \left( \frac{k - 1}{c} \right)^{\frac{1}{k}} s(u).$$

However, since this is a constrained optimization problem ($N \geq 1$), we arrive at the solution

$$N^{*} = \max\left\{ 1,\; \left( \frac{k - 1}{c} \right)^{\frac{1}{k}} s(u) \right\},$$

which is true because the first derivative will be strictly positive for any value of $N$ above its global minimum $\left( \frac{k-1}{c} \right)^{\frac{1}{k}} s(u)$. Now, to address the finiteness of $N$, we observe that as $N \to \infty$, we have $\frac{\partial\, \mathrm{Effort}}{\partial N} \to c > 0$, so the function cannot achieve its minimum as $N \to \infty$. Returning to integer-valued $N$, we have that processing effort is minimized either at $\lfloor N^{*} \rfloor$, $\lceil N^{*} \rceil$, or both. Finally, it is important to highlight that if the first derivative (i.e., Eq. (18a)) is positive at $N = 1$, we arrive at the result that processing effort is minimized at $N = 1$. This will happen when $s(u)$ is sufficiently small and/or $c$ is sufficiently large: the amount of information to be communicated is not worth the cost of using more than a minimal-length utterance.
Note also that for $0 < k < 1$, when $(\cdot)^k$ is concave, we obtain a different and counter-intuitive result: the first derivative is always positive, meaning that processing effort is minimized at $N = 1$ regardless of $s(u)$ or $c$.
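The result of part ii) can be illustrated with a short sketch, assuming uniformly distributed information so that effort depends only on the total surprisal, N, k and c. The function names `effort` and `optimal_length` are hypothetical; the closed form is the minimizer obtained by setting the first derivative of Effort to zero and clamping to N ≥ 1, followed by a comparison of the two neighbouring integers.

```python
import math


def effort(total_surprisal, n, k, c):
    """Effort of a uniform utterance of length n: n * (s / n)^k + c * n."""
    return n * (total_surprisal / n) ** k + c * n


def optimal_length(total_surprisal, k, c):
    """Integer n >= 1 that minimises effort, for k > 1 and c > 0."""
    if k <= 1 or c <= 0:
        raise ValueError("this sketch only covers k > 1 and c > 0")
    # Real-valued minimiser, clamped to the constraint n >= 1.
    n_star = max(1.0, ((k - 1) / c) ** (1 / k) * total_surprisal)
    # Effort is convex in n, so the best integer is floor or ceiling.
    candidates = {math.floor(n_star), math.ceil(n_star)}
    return min(candidates, key=lambda n: (effort(total_surprisal, n, k, c), n))


# Hypothetical values: 20 nats of total surprisal, quadratic cost.
print(optimal_length(20.0, k=2.0, c=1.0))
print(optimal_length(0.5, k=2.0, c=1.0))
```

In the second call the total surprisal is small relative to c, so the clamped solution N = 1 applies, which corresponds to the minimal-length case discussed above.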

B Datasets and Language Models
Data pre-processing. Text from all corpora was pre-processed using the Moses decoder tokenizer and punctuation normalizer.20 Additional pre-processing was performed by the Hugging Face tokenizers for the respective neural models. Capitalization was kept intact, although the lowercased versions of words were used in unigram probability estimates. We estimate the unigram distribution following Nikkarinen et al. (2021). Sentences were delimited using the NLTK sentence tokenizer.21 For reading time datasets, we removed outlier word-level data points (specifically, those with a z-score > 3 when the distribution of reading times was modeled as log-linear). We omitted the sentence-level reading time for a specific subject from our analysis if it contained any outlier data points.

The Natural Stories Corpus consists of a series of English texts that were hand-edited to contain low-frequency syntactic constructions while still sounding fluent to native speakers. It contains 10 stories with a total of 485 sentences. Self-paced reading data from these texts was collected from 181 native English speakers. The appeal of this corpus is that it provides psychometric data on unlikely, but still grammatically correct, sentences, which in theory should provide broader coverage of the sentence processing spectrum.
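The outlier filter from the pre-processing paragraph above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that z-scores are computed on log-transformed reading times using the corpus-wide mean and standard deviation, that the cutoff applies to the absolute z-score, and that all reading times are strictly positive. The function name and the record layout are hypothetical.

```python
import math
from statistics import mean, pstdev


def sentence_reading_times(records, z_cutoff=3.0):
    """Aggregate word-level reading times to the sentence level.

    records: list of (subject, sentence_id, reading_time_ms), one per word.
    Returns {(subject, sentence_id): summed reading time}, omitting every
    subject-sentence pair that contains at least one outlier word.
    """
    # Assumption: outliers are defined on log reading times (must be > 0).
    logs = [math.log(rt) for _, _, rt in records]
    mu, sigma = mean(logs), pstdev(logs)

    totals, flagged = {}, set()
    for (subject, sentence_id, rt), log_rt in zip(records, logs):
        key = (subject, sentence_id)
        totals[key] = totals.get(key, 0.0) + rt
        # Assumption: the cutoff is applied to the absolute z-score.
        if sigma > 0 and abs(log_rt - mu) / sigma > z_cutoff:
            flagged.add(key)

    return {key: total for key, total in totals.items() if key not in flagged}
```

Dropping the whole subject-sentence pair, rather than only the outlying word, keeps the sentence-level sums comparable across subjects.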
The Provo Corpus consists of 55 paragraphs of English text (with a total of 2,689 sentences) taken from various sources and genres, including online news articles, popular science, and fiction. Eye movement data from 84 native speakers of American English were collected during reading using a high-resolution eye tracker (1000 Hz). We use the IA-DWELL-TIME attribute as our measure of per-word reading time, i.e., the summed duration of all fixations on that word. We find noisier trends when using IA-FIRST-RUN-DWELL-TIME and IA-FIRST-FIXATION-DURATION (see App. C).

20 http://www.statmt.org/moses/
21 https://www.nltk.org/api/nltk.tokenize.html
The English portion of the Dundee Corpus contains eye-tracking recordings (1000 Hz) of 10 native English speakers each reading 20 newspaper articles from The Independent, with a total of 2,377 sentences. Unlike previous studies (e.g., Goodkind and Bicknell, 2018b), we did not exclude any words from the dataset, as we were interested in sentence-level measures. As with the Provo corpus, we use total dwell time as our dependent variable.

The Brown Corpus dataset consists of moving-window self-paced reading times measured for 35 UCSD undergraduate native English speakers, each reading short (292-902 word) passages drawn from the Brown corpus of American English (a total of 1,800 unique sentences). Data from participants were excluded if comprehension-question performance was at chance. Further details about the procurement of the dataset are described in Smith and Levy (2013).
The Dutch portion of GECO (the Ghent Eye-Tracking Corpus) contains eye-tracking recordings from bilingual (Dutch/English) participants reading a portion of a novel, presented in paragraphs on the screen.
For CoLA, sentences are taken from published linguistics literature and labeled by expert human annotators. According to the authors, "unacceptable sentences in CoLA tend to be maximally similar to acceptable sentences and are unacceptable for a single identifiable reason," which implies that the distinction between acceptable and unacceptable sentences should be nuanced rather than stemming from, e.g., a blatant disregard for grammaticality. We also utilize the BNC dataset (Lau et al., 2017), which consists of 2,500 sentences taken from the British National Corpus. Each sentence is round-trip machine-translated, and the resulting sentence is annotated with acceptability judgments through crowd-sourcing. Two rating systems are provided for this corpus: MOP2 and MOP4. The former provides binary judgments of acceptability, while the latter provides a score from 1 to 4. We employ the former in our predictive power experiments so as to share the same setup as for the CoLA dataset; we use the latter in computations of correlation.
For probability estimates from neural models, we use pre-trained models provided by Hugging Face (Wolf et al., 2020). Specifically, for GPT-2, we use the default OpenAI version (gpt2). The model was trained on the WebText dataset (a diverse collection of approximately 8 million web pages); it uses byte-pair encoding (Sennrich et al., 2016) with a vocabulary size of 50,257. For the TransformerXL, we use a version of the model (architecture described in Dai et al. (2019)) that has been fine-tuned on WikiText-103 (transfo-xl-wt103). We use the bert-base-cased version of BERT. In all cases, per-word surprisal is computed as the sum of subword surprisals. We additionally train a 5-gram model on WikiText-103 using the KenLM (Heafield, 2011) library with default hyperparameters for Kneser-Essen-Ney smoothing.
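Summing subword surprisals within a word is equivalent to taking the negative log of the product of the subword probabilities. The aggregation step can be sketched as below; this is a minimal illustration that assumes the subword log-probabilities and a subword-to-word alignment have already been obtained from a language model and tokenizer. The function name, the example tokens and the probabilities are hypothetical.

```python
import math


def word_surprisals(subword_logprobs, word_ids):
    """Sum subword surprisals (-log p) within each word.

    subword_logprobs: natural-log probabilities, one per subword token.
    word_ids: for each subword, the index of the word it belongs to.
    Returns one surprisal (in nats) per word, ordered by word index.
    """
    if len(subword_logprobs) != len(word_ids):
        raise ValueError("need exactly one word id per subword")
    totals = {}
    for logp, word_id in zip(subword_logprobs, word_ids):
        totals[word_id] = totals.get(word_id, 0.0) - logp
    return [totals[word_id] for word_id in sorted(totals)]


# Hypothetical example: "unbelievable" split into three subwords,
# followed by the single-subword word "story".
logprobs = [math.log(0.1), math.log(0.5), math.log(0.8), math.log(0.2)]
per_word = word_surprisals(logprobs, word_ids=[0, 0, 0, 1])
print(per_word)
```

Sentence-level predictors, such as the sum of word surprisals raised to the k-th power, can then be computed from the returned per-word values.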
Evaluation. For our evaluation metric, we use ∆LogLik: the mean difference in log-likelihood of the response variable between a baseline model and a model with an additional predictor. A positive ∆LogLik value indicates that a given data point is more probable under the comparison model, i.e., the comparison model more closely fits the observed data. To compute ∆LogLik for each data point, we split our corpus into 10 folds. Folds are chosen randomly, i.e., they are not based on subject or sentence for mixed-effects models. The same splits are used for each model. We take the ∆LogLik value for a data point to be the difference in log-likelihood between models trained on the 9 folds that do not contain that data point, so as to avoid overfitting. We then take the mean ∆LogLik over the corpus as our final metric.

Entropy (Shannon): −0.01 (±0), 0 (±0.01), 0.01 (±0), 0 (±0), 0 (±0), 7.9 (±0.14)
Entropy (Rényi, k = 2): −0.01 (±0), 0 (±0.01), 0 (±0.01), 0 (±0), 0 (±0), 8.38 (±0.14)

Table 3: ∆LogLik in 10e-2 nats, as in Tab. 1, albeit with different baseline predictors for reading time data and using BERT for surprisal estimates for acceptability judgments. Along with the predictors specified in Tab. 1, models for reading times here also contain predictors for unigram log-probability, total character length, and the interaction of the two (reading times). We see largely the same trends as in Tab. 1.

Figure 7: Same graph as in Fig. 3 for Provo, albeit using (the sum of) first fixation duration times as our reading time metric.

Note that the magnitude of ∆LogLik is smaller than when using surprisal, indicating the superior predictive power of the latter. This stands in contrast to the experimental findings of Brothers and Kuperberg (2021).
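The cross-validated ∆LogLik metric described in the Evaluation paragraph above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: it swaps in ordinary least-squares regression with a Gaussian likelihood instead of mixed-effects models, and all function names and the data layout are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, ys):
    """Return least-squares weights (intercept first) and residual variance."""
    xs = [[1.0] + list(r) for r in rows]
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
    w = solve(xtx, xty)
    resid = [y - sum(wi * xi for wi, xi in zip(w, x)) for x, y in zip(xs, ys)]
    return w, sum(r * r for r in resid) / len(ys)


def log_lik(w, var, row, y):
    """Gaussian log-likelihood of one held-out response."""
    pred = w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))
    return -0.5 * math.log(2 * math.pi * var) - (y - pred) ** 2 / (2 * var)


def delta_loglik(base_rows, full_rows, ys, n_folds=10, seed=0):
    """Mean held-out log-likelihood gain of the full model over the baseline."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)              # random folds
    folds = [idx[f::n_folds] for f in range(n_folds)]
    deltas = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]  # same split for both models
        base = fit_ols([base_rows[i] for i in train], [ys[i] for i in train])
        full = fit_ols([full_rows[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            deltas.append(log_lik(*full, full_rows[i], ys[i])
                          - log_lik(*base, base_rows[i], ys[i]))
    return sum(deltas) / len(deltas)
```

Here `base_rows` would hold the baseline predictors for each data point and `full_rows` the same predictors plus the additional one under test; a positive return value indicates that the additional predictor improves held-out fit.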