CDRNN: Discovering Complex Dynamics in Human Language Processing

The human mind is a dynamical system, yet many analysis techniques used to study it are limited in their ability to capture the complex dynamics that may characterize mental processes. This study proposes the continuous-time deconvolutional regressive neural network (CDRNN), a deep neural extension of continuous-time deconvolutional regression (Shain & Schuler, 2021) that jointly captures time-varying, non-linear, and delayed influences of predictors (e.g. word surprisal) on the response (e.g. reading time). Despite this flexibility, CDRNN is interpretable and able to illuminate patterns in human cognition that are otherwise difficult to study. Behavioral and fMRI experiments reveal detailed and plausible estimates of human language processing dynamics that generalize better than CDR and other baselines, supporting a potential role for CDRNN in studying human language processing.


Introduction
Central questions in psycholinguistics concern the mental processes involved in incremental human language understanding: which representations are computed when, by what mental algorithms (Frazier and Fodor, 1978; Just and Carpenter, 1980; Abney and Johnson, 1991; Tanenhaus et al., 1995; Almor, 1999; Gibson, 2000; Coltheart et al., 2001; Hale, 2001; Lewis and Vasishth, 2005; Levy, 2008, inter alia)? Such questions are often studied by cashing out a theory of language processing in an experimental stimulus, collecting human responses, and fitting a regression model to test whether measures show the expected effects (e.g. Grodner and Gibson, 2005). Regression techniques have grown in sophistication, from ANOVA (e.g. Pickering and Branigan, 1998) to newer linear mixed-effects approaches (LME, Bates et al., 2015) that enable direct word-by-word analysis of effects in naturalistic human language processing (e.g. Demberg and Keller, 2008; Frank and Bod, 2011). However, these methods struggle to account for delayed effects. Because the human mind operates in real time and experiences computational bottlenecks of various kinds (Bouma and De Voogd, 1974; Just and Carpenter, 1980; Ehrlich and Rayner, 1981; Mollica and Piantadosi, 2017), delayed effects may be pervasive, and, if left uncontrolled, can yield misleading results (Shain and Schuler, 2018).
Continuous-time deconvolutional regression (CDR) is a recently proposed technique to address delayed effects in measures of human cognition (Shain and Schuler, 2018, 2021). CDR fits parametric continuous-time impulse response functions (IRFs) that mediate between word features and response measures. An IRF maps the time elapsed between a stimulus and a response to a weight describing the expected influence of the stimulus on the response. CDR models the response as an IRF-weighted sum of preceding stimuli, thus directly accounting for effect latencies. Empirically, CDR reveals fine-grained processing dynamics and generalizes better to human reading and fMRI responses than established alternatives. However, CDR retains a number of simplifying assumptions (e.g. that the IRF is fixed over time) that may not hold of the human language processing system.
Deep neural networks (DNNs), widely used in natural language processing (NLP), can relax these strict assumptions. Indeed, psycholinguistic regression analyses and NLP systems share a common structure: both fit a function from word features to some quantity of interest. However, psycholinguistic regression models face an additional constraint: they must be interpretable enough to allow researchers to study relationships between variables in the model. This requirement may be one reason why black box DNNs are not generally used to analyze psycholinguistic data, despite the tremendous gains DNNs have enabled in natural language tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020, inter alia), in part by better approximating the complex dynamics of human cognition as encoded in natural language (Linzen et al., 2016; Gulordava et al., 2018; Tenney et al., 2019; Hewitt and Manning, 2019; Wilcox et al., 2019; Schrimpf et al., 2020).
This study is an attempt to leverage the flexibility of DNNs for psycholinguistic data analysis. The continuous-time deconvolutional regressive neural network (CDRNN) is an extension of CDR that reimplements the impulse response function as a DNN describing the expected influence of preceding events (e.g. words) on future responses (e.g. reading times) as a function of their properties and timing. CDRNN retains the deconvolutional design of CDR while relaxing many of its simplifying assumptions (linearity, additivity, homoskedasticity, stationarity, and context-independence, see Section 2), resulting in a highly flexible model. Nevertheless, CDRNN is interpretable and can shed light on the underlying data generating process. Results on reading and fMRI measures show substantial generalization improvements from CDRNN over baselines, along with detailed insights about the underlying dynamics that cannot easily be obtained from existing methods. 1

Background
Psycholinguists have been aware for decades that processing effects may lag behind the words that trigger them (Morton, 1964; Bouma and De Voogd, 1974; Rayner, 1977; Ehrlich and Rayner, 1983; Mitchell, 1984; Rayner, 1998; Vasishth and Lewis, 2006; Smith and Levy, 2013), possibly because cognitive "buffers" may exist to allow higher-level information processing to catch up with the input (Bouma and De Voogd, 1974; Baddeley et al., 1975; Just and Carpenter, 1980; Ehrlich and Rayner, 1981; Mollica and Piantadosi, 2017). They have also recognized the potential for non-linear, interactive, and/or time-varying relationships between word features and language processing (Smith and Levy, 2013; Baayen et al., 2017, 2018). No prior regression method can jointly address these concerns in non-uniform time series (e.g. words with variable duration) like naturalistic psycholinguistic experiments. Discrete-time methods (e.g. lagged/spillover regression, Sims, 1971; Ehrlich and Rayner, 1983; Mitchell, 1984) ignore potentially meaningful variation in event duration, even if some (e.g. generalized additive models, or GAMs, Hastie and Tibshirani, 1986; Wood, 2006) permit non-linear and non-stationary (time-varying) feature interactions (Baayen et al., 2017). CDR (Shain and Schuler, 2018, 2021) addresses this limitation by fitting continuous-time IRFs, but assumes that the IRF is stationary (time invariant), that features scale linearly and combine additively, and that the response variance is constant (homoskedastic). By implementing the IRF as a time-varying neural network, CDRNN relaxes all of these assumptions, incorporating the featural flexibility of GAMs while retaining the temporal flexibility of CDR.
Previous studies have investigated latency and non-linearity in human sentence processing. For example, Smith and Levy (2013) attach theoretical significance to the functional form of the relationship between word surprisal and processing cost, using GAMs to show that this relationship is linear and arguing on this basis that language processing is highly incremental. This claim is under active debate (Brothers and Kuperberg, 2021), underlining the importance of methods that can investigate questions of functional form. Smith and Levy (2013) also investigate the timecourse of surprisal effects using spillover and find a more delayed surprisal response in self-paced reading (SPR) than in eye-tracking. Shain and Schuler (2021) support the latter finding using CDR, and in addition show evidence of strong inertia effects in SPR, such that participants who have been reading quickly in the recent past also read more quickly now. However, this outcome may be an artifact of the stationarity assumption: CDR may be exploiting its estimates of rate effects in order to capture broad non-linear negative trends (e.g. task adaptation, Prasad and Linzen, 2019) in a stationary model. Similarly, the generally null word frequency estimates reported in Shain and Schuler (2021) may be due in part to the assumption of additive effects: word frequency and surprisal are related, and they may coordinate interactively to determine processing costs (Norris, 2006). Thus, in general, prior findings on the timecourse and functional form of effects in human sentence processing may be influenced by methodological limitations: the GAM models of Smith and Levy (2013) ignore variable event duration, the CDR models of Shain and Schuler (2021) ignore non-linearity, and both approaches assume stationarity, context-independence, constant variance, and additive effects. By jointly relaxing these potentially problematic assumptions, CDRNN stands to support more reliable conclusions about human language comprehension, while also possibly enabling new insights into cognitive dynamics.

Figure 1: CDRNN model. Subscripts omitted to reduce clutter. The IRF g(τ) at an event computes the expected contribution of each feature of the event vector h^(0) to each element of the parameter vector s of the predictive distribution for a particular response value. The first layer of the IRF depends non-linearly on the properties of the event via h_in and (optionally) on context via h_RNN, which requires the recurrent connections in gray. Elements with random effects have dotted outlines. For variable definitions, see Appendix A.

Architecture
This section presents a high-level description of the model design (for formal definition, see Appendix A). The CDRNN architecture is represented schematically in Figure 1. The primary goal of estimation is to identify the deep neural IRF g(τ ) (top) that computes the influence of a preceding event on the predictive distribution over a subsequent response as a function of their distance in time τ . As shown, the IRF is a feedforward projection of τ into a matrix that defines a weighted sum over the values of input vector x, which is concatenated with a bias to capture general effects of stimulus timing (rate). This matrix multiplication determines the contribution of the stimulus event to the parameters of the predictive distribution (e.g. the mean and variance parameters of a Gaussian predictive distribution). Defining the IRF as a function of τ ensures that the model has a continuous-time definition.
To capture non-linear effects of stimulus features, the IRF projection is itself parameterized by a projection of a hidden state h. The dependence on h permits non-linear influences of the properties of the stimulus sequence on the IRF itself. To generate h, the predictors x are concatenated with their timestamps t and submitted to the model as input. Inputs are cast to a hidden state for each preceding event as the sum of three quantities: a feedforward projection h_in of each input, a forward-directional RNN projection h_RNN of the events up to and including each input, and random effects h_Z containing offsets for the relevant random effects level(s) (e.g. for each participant in an experiment). In this study, the recurrent component is treated as optional (gray arrows). Without the RNN, the model is non-stationary (via input t) but cannot capture contextual influences on the IRF.
The summation over IRF outputs at the top of the figure ensures that the model is deconvolutional: each preceding input contributes to the response in some proportion, with that proportion determined by the features, context, and relative timing of that input. Because the IRF depends on a deep neural projection of the current stimulus as well as (optionally) the entire sequence of preceding stimuli, it implicitly estimates all interactions between these variables in governing the response. Predictors may thus coordinate in a non-linear, non-additive, and time-varying manner.
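To make this architecture concrete, the following is a minimal NumPy sketch of the forward computation described above, in the feedforward (CDRNN-FF) configuration: the IRF takes the delay τ together with a nonlinear encoding h of each event and emits a weight matrix applied to the event's features plus a bias (rate) term, and contributions are summed over preceding events. This is a toy re-implementation, not the released codebase; all sizes, initializations, and names are illustrative, and the RNN and random-effects components are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random weights for a small feedforward net (illustrative scales)."""
    return [(rng.normal(0.0, 0.1, (o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, v):
    """Apply the net: tanh hidden layers, linear output layer."""
    for W, b in params[:-1]:
        v = np.tanh(W @ v + b)
    W, b = params[-1]
    return W @ v + b

K, H, S = 3, 8, 2                               # predictors, hidden units, distribution params
f_in = init_mlp([K + 1, H, H], rng)             # event (x, t) -> hidden state h
f_irf = init_mlp([1 + H, H, (K + 1) * S], rng)  # (tau; h) -> IRF weight matrix

def cdrnn_predict(X, t, t_resp):
    """Parameters of the predictive distribution for a response at time t_resp,
    computed as an IRF-weighted sum over all preceding events (deconvolution)."""
    s = np.zeros(S)
    for x, ti in zip(X, t):
        if ti > t_resp:
            continue                                  # only preceding events contribute
        h = mlp(f_in, np.concatenate([x, [ti]]))      # nonlinear, non-stationary encoding
        tau = t_resp - ti                             # stimulus-response delay
        G = mlp(f_irf, np.concatenate([[tau], h])).reshape(S, K + 1)
        s += G @ np.concatenate([x, [1.0]])           # trailing bias column = "rate"
    return s

X = rng.normal(size=(5, K))                  # 5 events with K predictors each
t = np.array([0.0, 0.3, 0.9, 1.4, 2.0])      # event onsets (s)
s = cdrnn_predict(X, t, t_resp=2.5)          # e.g. (mean, pre-constraint sd) of a Gaussian
```

Because the IRF weights depend on both τ and the event encoding h, effects of different predictors interact freely and can vary over time, which is the key relaxation relative to CDR's parametric, additive IRFs.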
The CDRNN IRF describes the influence over time of predictors on all parameters of the predictive distribution (in these experiments, the mean and variance parameters of a Gaussian predictive distribution). Such a design (i.e. modeling dependencies on the predictors of all parameters of the predictive distribution) has previously been termed distributional regression (Bürkner, 2018). Despite their flexibility and task performance (Section 5), CDRNN models used in this study have few parameters (Table A1) by current deep learning standards because they are relatively shallow and small (Supplement S1).

Objective and Regularization
Given (1) an input configuration C containing predictors X, input timestamps t, and response timestamps t′, (2) CDRNN parameter vector w, (3) output distribution p, (4) random effects vector z, and (5) response vector y, the model uses gradient descent to minimize the following objective:

−log p(y | C; w, z) + λ_z ‖z‖² + L_reg

In addition to random effects shrinkage governed by λ_z and any arbitrary additional regularization penalties L_reg (see Supplement S1), models are regularized using dropout (Srivastava et al., 2014) with drop rate d_h at the outputs of all feedforward hidden layers. Random effects are also dropped at rate d_z, which is intended to encourage the model to find population-level estimates that accurately reflect central tendency. Finally, the recurrent contribution to the CDRNN hidden state (h_RNN above) is dropped at rate d_r, which is intended to encourage accurate IRF estimation even when context is unavailable.
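As a concrete illustration, the core of such an objective, a negative log likelihood under the predictive distribution plus a shrinkage penalty on random effects, can be sketched for the Gaussian case as follows (a simplified sketch under assumed conventions, e.g. a softplus constraint on the standard deviation; function names are illustrative):

```python
import numpy as np

def softplus(x):
    """Constraint function mapping real values to positive ones."""
    return np.log1p(np.exp(x))

def cdrnn_loss(mu, sd_raw, y, z, lambda_z):
    """Negative Gaussian log likelihood of the responses y given predicted
    mean mu and unconstrained sd parameter sd_raw, plus an L2 shrinkage
    penalty on the random-effects vector z."""
    sd = softplus(sd_raw)
    nll = 0.5 * np.sum(np.log(2.0 * np.pi * sd ** 2) + ((y - mu) / sd) ** 2)
    return nll + lambda_z * np.sum(z ** 2)
```

Dropout and any additional penalties (L_reg) would apply on top of this quantity during training.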

Effect Estimation
Because it is a DNN, CDRNN lacks parameters that selectively describe the size and shape of the response to a specific predictor (unlike CDR), and indeed individual parameters (e.g. individual biases or connection weights) are not readily interpretable. Thus, from a scientific perspective, the quantity of general interest is not a distribution over parameters, but rather over the effect of a predictor on the response. The current study proposes to accomplish this using perturbation analysis (e.g. Ribeiro et al., 2016; Petsiuk et al., 2018), manipulating the input configuration and quantifying the influence of this manipulation on the predicted response. 2 For example, to obtain an estimate of rate effects (i.e. the base response or "deconvolutional intercept," see Shain and Schuler, 2021), a reference stimulus can be constructed, and the response to it can be queried at each timepoint over some interval of interest. To obtain CDR-like estimates of predictor-wise IRFs, the reference stimulus can be increased by 1 in the predictor dimension of interest (e.g. word surprisal) and requeried, taking the difference between the obtained response and the reference response to reveal the influence of an extra unit of the predictor. 3 This study uses the training set mean of x and t as a reference, since this represents the response of the system to an average stimulus. The model also supports arbitrary additional kinds of queries, including of the curvature of an effect in the IRF over time and of the interaction between two effects at a point in time. Indeed, the IRF can be queried with respect to any combination of values for predictors, t, and τ, yielding an open-ended space of queries that can be constructed as needed by the researcher.

2 Perturbation analysis is one of a growing suite of tools for black-box interpretation. It is used here because it straightforwardly links properties of the input to changes in the estimated response, providing a highly general method for querying aspects of the non-linear, non-stationary, non-additive IRF defined by the CDRNN equations.

3 Note that 1 is used here to maintain comparability of effect estimates to those generated by methods that assume linearity of effects (especially CDR), but 1 has no special meaning in the non-linear setting of CDRNN modeling, and effects can be queried at any offset from any reference. Results here show that deflections move relatively smoothly away from the reference, even at smaller steps than 1, and that IRFs queried at 1 are similar to those obtained from (linear) CDR, indicating that this method of effect estimation is reliable. Note finally that because predictors are underlyingly rescaled by their training set standard deviations (though plotted at the original scale for clarity), 1 here corresponds to 1 standard unit, as was the case with the CDR estimates discussed in Shain and Schuler (2021).
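The perturbation-based query described above can be sketched generically: hold a reference stimulus fixed, add a unit to one predictor, and take the difference in predicted response across a grid of delays. The sketch below assumes only a callable fitted model (`predict_mean` and `toy_model` are hypothetical stand-ins, not part of the released implementation):

```python
import numpy as np

def irf_query(predict_mean, x_ref, taus, k=None, delta=1.0):
    """Perturbation-based IRF estimate.

    predict_mean(x, tau) -> predicted response mean for a single event with
    feature vector x observed tau seconds before the response. With k=None
    this returns the reference ("rate") response; otherwise it returns the
    change caused by adding `delta` to predictor k.
    """
    ref = np.array([predict_mean(x_ref, tau) for tau in taus])
    if k is None:
        return ref
    x_pert = x_ref.copy()
    x_pert[k] += delta
    pert = np.array([predict_mean(x_pert, tau) for tau in taus])
    return pert - ref

# Toy stand-in for a fitted model: exponentially decaying effect of predictor 0
def toy_model(x, tau):
    return (0.5 + 2.0 * x[0]) * np.exp(-tau)

taus = np.linspace(0.0, 5.0, 50)
surprisal_irf = irf_query(toy_model, x_ref=np.zeros(2), taus=taus, k=0)
```

In the experiments reported here, the reference `x_ref` would be the training set mean of the predictors and timestamps, and `delta` a single standard unit.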
Because the estimates of interest all derive from the model's predictive distribution, uncertainty about them can be measured with Monte Carlo techniques as long as training involves a stochastic component, such as dropout (Srivastava et al., 2014) or batch normalization (Ioffe and Szegedy, 2015). This study estimates uncertainty using Monte Carlo dropout (Gal and Ghahramani, 2016), which recasts training neural networks with dropout as variational Bayesian approximation of deep Gaussian process models (Damianou and Lawrence, 2013). At inference time, an empirical distribution over responses to an input is constructed by resampling the model (i.e. sampling different dropout masks). 4 As argued by Shain and Schuler (2021) for CDR, in addition to intervals-based tests, common hypothesis tests (e.g. for the presence of an effect) can be performed in a CDRNN framework via bootstrap model comparison on held out data (e.g. of models with and without the effect of interest).
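The Monte Carlo dropout procedure can be illustrated as follows: keep dropout active at inference time, draw many stochastic forward passes, and read off empirical quantiles as a credible interval. This is a toy one-layer network with illustrative weights, not the fitted CDRNN:

```python
import numpy as np

rng_init = np.random.default_rng(0)
W1 = rng_init.normal(0.0, 0.5, (16, 3))   # illustrative "trained" weights
W2 = rng_init.normal(0.0, 0.5, (1, 16))

def predict_with_dropout(x, rng, p_drop=0.1):
    """One stochastic forward pass: a fresh dropout mask per sample."""
    h = np.tanh(W1 @ x)
    mask = rng.random(h.size) >= p_drop
    h = h * mask / (1.0 - p_drop)         # inverted dropout rescaling
    return (W2 @ h)[0]

def mc_dropout_interval(x, n_samples=2000, level=0.95, seed=1):
    """Empirical mean and credible interval over resampled dropout masks."""
    rng = np.random.default_rng(seed)
    draws = np.array([predict_with_dropout(x, rng) for _ in range(n_samples)])
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return draws.mean(), lo, hi

mean, lo, hi = mc_dropout_interval(np.array([1.0, -0.5, 0.2]))
```

Combined with the perturbation queries above, this yields the means and 95% credible intervals plotted throughout the results.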

Methods
Following Shain and Schuler (2021), CDRNN is applied to naturalistic human language processing data from three experimental modalities: the Natural Stories self-paced reading corpus (∼1M instances, Futrell et al., 2020), the Dundee eye-tracking corpus (∼200K instances, Kennedy et al., 2003), and the naturalistic fMRI dataset of Shain et al. (2020). Data partitioning and preprocessing follow Shain and Schuler (2021). Further details about datasets and preprocessing are given in Supplement S2.

4 Initial experiments also explored uncertainty quantification by implementing CDRNN as a variational Bayesian DNN. Compared to the methods advocated here, the variational approach was more prone to instability, achieved worse fit, and yielded implausibly narrow credible intervals.
For reading data, CDRNN is compared to CDR as well as lagged LME and GAM baselines equipped with four spillover positions for each predictor (values from the current word, plus three preceding words), since LME and GAM are well established analysis methods in psycholinguistics (e.g. Baayen et al., 2007; Demberg and Keller, 2008; Frank and Bod, 2011; Smith and Levy, 2013; Baayen et al., 2017; Goodkind and Bicknell, 2018, inter alia). Because the distribution of reading times is heavy-tailed (Frank et al., 2013), following Shain and Schuler (2021) models are fitted to both raw and log-transformed reading times. For fMRI data, CDRNN is compared to CDR as well as four existing techniques for analyzing naturalistic fMRI data: pre-convolution with the canonical hemodynamic response function (HRF, Brennan et al., 2012; Willems et al., 2015; Henderson et al., 2015, 2016; Lopopolo et al., 2017), linear interpolation (Shain and Schuler, 2021), binning (Wehbe et al., 2020), and Lanczos interpolation (Huth et al., 2016). Statistical model comparisons use paired permutation tests of test set error (Demšar, 2006). Models use predictors established by prior psycholinguistic research (e.g. Rayner, 1998; Demberg and Keller, 2008; van Schijndel and Schuler, 2013; Staub, 2015; Shain and Schuler, 2018, inter alia): unigram and 5-gram surprisal, word length (reading only), saccade length (eye-tracking only), and whether the previous word was fixated (eye-tracking only). Predictor definitions are given in Appendix C. The deconvolutional intercept term rate (Shain and Schuler, 2018, 2021), an estimate of the general influence of observing a stimulus at a point in time, independently of its properties, is implicit in CDRNN (unlike CDR) and is therefore reported in all results. Reading models include random effects by subject, while fMRI models include random effects by subject and by functional region of interest (fROI).
Unlike LME, where random effects capture linear differences in effect size between e.g. subjects, random effects in CDRNN capture differences in overall dynamics between subjects, including differences in size, IRF shape, functional form (e.g. linearity), contextual influences on the IRF, and interactions with other effects.
Two CDRNN variants are considered in all experiments: the full model (CDRNN-RNN) containing an RNN over the predictor sequence, and a feedforward only model (CDRNN-FF) with the RNN ablated (gray arrows removed in Figure 1). This manipulation is of interest because CDRNN-FF is both more parsimonious (fewer parameters) and faster to train, and may therefore be preferred in the absence of prior expectation that the IRF is sensitive to context. All plots show means and 95% credible intervals. Code and documentation are available at https://github.com/coryshain/cdr.
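The paired permutation tests of test set error mentioned above (Demšar, 2006) can be sketched as a generic sign-flipping procedure over per-item differences in error between two models. This is an illustrative implementation under standard assumptions (exchangeability of model labels within items), not the exact test configuration of the reported experiments:

```python
import numpy as np

def paired_permutation_test(err_a, err_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on per-item test-set errors.

    Under the null hypothesis the labels "model A" / "model B" are
    exchangeable within each item, so the sign of each pairwise error
    difference can be flipped at random to build a null distribution
    for the mean difference.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(err_a) - np.asarray(err_b)
    observed = abs(d.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((flips * d).mean(axis=1))
    # add-one correction so the p-value is never exactly zero
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```

Pooling across datasets, as in the overall comparisons reported below, amounts to concatenating the per-item error vectors before running the test.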

Results
Since CDRNN is designed for scientific modeling, the principal output of interest is the IRF itself and the light it might shed on questions of cognitive dynamics, rather than performance in some task (predicting reading latencies or fMRI measures is not a widely targeted engineering goal). However, predictive performance can help establish the trustworthiness of the IRF estimates. To this end, as a sanity check, this section first evaluates predictive performance on human data relative to existing regression techniques. While results may resemble "bake-off" comparisons familiar from machine learning (and indeed CDRNN does outperform all baselines), their primary purpose is to establish that the CDRNN estimates are trustworthy, since they describe the phenomenon of interest in a way that generalizes accurately to an unseen sample. Baseline models, including CDR, are as reported in Shain and Schuler (2021). 5 Although CDR struggles against GAM baselines on Dundee, CDRNN has closed the gap. This is noteworthy in light of speculation in Shain and Schuler (2021) that CDR's poorer performance on Dundee might be due in part to non-linear effects, which GAM can estimate but CDR cannot. CDRNN performance supports this conjecture: once the model can account for non-linearities, it overtakes GAMs.

Model Validation: Baseline Comparisons
Results from fMRI are shown in Table 2, where both CDRNN variants yield substantial improvements to training, dev, and test set error. These results indicate that the relaxed assumptions afforded by CDRNN are beneficial for describing the fMRI response, which is known to saturate over time (Friston et al., 2000; Wager et al., 2005; Vazquez et al., 2006; Lindquist et al., 2009). Overall comparisons between CDRNN and each baseline use a paired permutation test that pools across all datasets covered by a given baseline (reading data for LME and GAM, fMRI data for canonical HRF, linearly interpolated, averaged, and Lanczos interpolated, and both for CDR). 7 Results are given in Table 3. As shown, both variants of CDRNN significantly improve over all baselines, and CDRNN-RNN significantly improves over CDRNN-FF. Notwithstanding, CDRNN-FF may be preferred in applications: it is simpler, faster to train, better at recovering synthetic models (Supplement S3), more reliable in noisy domains like fMRI, and close in performance to CDRNN-RNN. Results overall support the reliability of patterns revealed by CDRNN's estimated IRF, which is now used to explore and visualize sentence processing dynamics.

Effect Latencies in CDRNN vs. CDR
CDR-like IRF estimates can be obtained by increasing a predictor by 1 (standard deviation) relative to the reference and observing the change in the response over time. Visualizations using this approach are presented in Figure 2 alongside CDR estimates from Shain and Schuler (2021). In general, CDRNN finds similar patterns to CDR. This suggests both (1) that CDRNN is capable of recovering estimates from a preceding state-of-the-art deconvolutional model for these domains, and (2) that those prior estimates are largely robust to the relaxation of CDR's simplifying assumptions. One difference is that CDRNN, more plausibly, shows greater uncertainty in its estimates for the noisier fMRI data.
As noted in Section 2, Shain and Schuler (2021) report negative rate effects in reading, i.e., a local decrease in subsequent reading time at each word, especially in SPR. This was interpreted as an inertia effect (faster recent reading engenders faster current reading), but it might also be an artifact of non-linear decreases in latency over time (due to task habituation, e.g. Baayen et al., 2017; Harrington Stack et al., 2018; Prasad and Linzen, 2019) that CDR cannot model. CDRNN estimates nonetheless support the prior interpretation of rate effects as inertia, at least in SPR: a model that can flexibly adapt to non-linear habituation trends finds SPR rate estimates that are similar in shape and magnitude to those estimated by CDR.
In addition, CDRNN finds a slower response to word surprisal in self-paced reading than in eye-tracking. This result converges with word-discretized timecourses reported in Smith and Levy (2013), who find more extensive spillover of surprisal effects in SPR than in eye-tracking. Results thus reveal important hidden dynamics in the reading response (inertia effects), continuous-time delays in processing effects, and influences of modality on the continuous dynamics of sentence processing, all of which are difficult to estimate using existing regression techniques. Greater response latency and more pronounced inertia effects in self-paced reading may be due to the fact that a gross motor task (paging via button presses) is overlaid on the sentence comprehension task. While the motor task is not generally of interest to psycholinguistic theories, controlling for its effects is crucial when using self-paced reading to study sentence comprehension (Mitchell, 1984).

Linearity of Surprisal Effects
CDRNN also allows the analyst to explore other aspects of the IRF, such as functional curvature at a point in time. For example, in the context of reading, Smith and Levy (2013) argue for a linear increase in processing cost as a function of word surprisal. The present study allows this claim to be assessed across modalities by checking the curvature of the 5-gram surprisal response (in raw ms) at a timepoint of interest (0ms for reading and ∼5s for fMRI). As shown in the top row of Figure 3, reading estimates are consistent with a linear response (the credible interval contains a straight line), as predicted, but are highly non-linear in fMRI, with a rapid peak above the mean (zero-crossing) followed by a sharp dip and plateau, and even an estimated increased response at values below the mean (though estimates at the extremes have high uncertainty). This may be due in part to ceiling effects: blood oxygen levels measured by fMRI are bounded, but reading times are not. While this is again a property of experimental modality rather than sentence comprehension itself, understanding such influences is important for drawing scientific conclusions from experimental data. For example, due to the possibility of saturation, fMRI may not be an ideal modality for testing scientific claims about the functional form of effects, and the linearity assumptions of e.g. CDR and LME may be particularly constraining.
The curvature of effects can also be queried over time. If an effect is temporally diffuse but linear, its curvature should be roughly linear at any delay of interest. The second row of Figure 3 shows visualizations to this effect. These plots in fact subsume the kinds of univariate plots shown above: univariate IRFs to 5-gram surprisal like those plotted in Figure 2 are simply slices taken at a predictor value (1 sample standard deviation above the mean), whereas curvature estimates in the first row of Figure 3 are simply slices taken at a time value (0s for reading and 5s for fMRI). Plots are consistent with the linearity hypothesis for reading, but again show strong non-linearities in the fMRI domain that are consistent with saturation effects as discussed above.
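A curvature query of this kind is just another perturbation analysis: hold the delay fixed and sweep one predictor over a range of values, plotting the change in predicted response relative to the reference. A minimal sketch, again assuming only a callable fitted model (`predict_mean` and the toy lambda are hypothetical stand-ins):

```python
import numpy as np

def curvature_query(predict_mean, x_ref, tau, k, values):
    """Predicted response as a function of predictor k at a fixed delay tau,
    expressed as a change relative to the reference response."""
    ref = predict_mean(x_ref, tau)
    out = []
    for v in values:
        x = x_ref.copy()
        x[k] = v
        out.append(predict_mean(x, tau) - ref)
    return np.array(out)

# Toy stand-in for a fitted model: linear in predictor 0, decaying in time
toy = lambda x, tau: 1.5 * x[0] * np.exp(-tau)
vals = np.linspace(-2.0, 2.0, 9)
curve = curvature_query(toy, np.zeros(2), tau=0.0, k=0, values=vals)
```

A linear effect (as hypothesized for surprisal in reading) yields a straight line in such a sweep; saturation (as in fMRI) appears as a flattening at the extremes.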

Effect Interactions
In addition to exploring multivariate relationships of a predictor with time, relationships between predictors can also be studied. Such relationships constitute "interactions" in a CDRNN model, though they are not constrained (cf. interactions in linear models) to be strictly multiplicative; indeed, a major advantage of CDRNN is that interactions come "for free," along with estimates of their functional form. To explore effect interactions, a CDRNN-FF version of the full model in Shain et al. (2020) is fitted to the fMRI dataset. The model contains more predictors than the models considered above, including surprisal computed from a probabilistic context-free grammar (PCFG surprisal, see Appendix C for details). Univariate IRFs are shown in the top left panel of Figure 4, and pairwise interaction surfaces at a delay of 5s (near the peak response) are shown in the remaining panels. Plots show that the response at any value of the other predictors is roughly flat as a function of sound power (i.e. signal power of the auditory stimulus, middle row). This accords with prior arguments that the cortical language system, whose activity is measured here, does not strongly register low-level perceptual effects (Fedorenko et al., 2010; Braze et al., 2011).

The estimate for unigram surprisal (middle left) shows an unexpected non-linearity: although activity increases with higher surprisal (lower frequency words), it also increases at lower surprisal (higher frequency words), suggesting the existence of high frequency items that nonetheless engender a large response. The interaction between PCFG surprisal and unigram surprisal possibly sheds light on this outcome, since it shows a sharper increase in the PCFG surprisal response in higher frequency (lower unigram surprisal) regions. This may be because the most frequent words in English tend to be function words that play an outsized role in syntactic structure building (e.g. prepositional phrase attachment decisions).
In addition, 5-gram surprisal interacts with PCFG surprisal, showing a non-linear increase in response for words that are high on both measures. This is consistent with a unitary predictive mechanism that experiences strong error signals when both string-level (5-gram) and structural (PCFG) cues are poor. All these interactions should be interpreted with caution, since the uncertainty interval covers much weaker degrees of interaction.

IRFs of the Response Variance
As discussed in Section 3, CDRNN implements distributional regression and thus also contains an IRF describing the influence of predictors on the variance of the predictive distribution as a function of time. IRFs of the variance can be visualized identically to IRFs of the mean.
For example, Figure 5 shows the estimated change in the standard deviation of the predictive distribution over time from observing a stimulus. 8 Estimates show stimulus-dependent changes in variance across datasets whose shapes are not straightforwardly related to that of the IRFs of the mean (Figure 2). In particular, both reading datasets (left and center) generally show mean and standard deviation traveling together, with increases in the mean corresponding to increases in standard deviation. In Dundee, the shapes of these changes resemble each other strongly, whereas in Natural Stories the IRFs of the standard deviation (especially rate) differ substantially from the IRFs of the mean. By contrast, in fMRI (right), the IRFs of the standard deviation look roughly like inverted HRFs (especially for rate and 5-gram surprisal), indicating that BOLD variance tends to decrease with larger values of the predictors. While detailed interpretation of these patterns is left to future work, these results demonstrate the utility of CDRNN for analyzing a range of links between predictors and response that are otherwise difficult to study.

8 Because standard deviation is a bounded variable and the IRF applies before the constraint function (softplus), the relationship between the standard deviation and the y axis of the plots is not straightforward. Estimates nonetheless clearly indicate the shape and relative contribution of each effect to the response variance.

Conclusion
This study proposed and evaluated CDRNN, a deep neural extension of continuous-time deconvolutional regression that relaxes implausible simplifying assumptions made by widely used regression techniques in psycholinguistics. In so doing, CDRNN provides detailed estimates of human language processing dynamics that are difficult to obtain with existing methods. Results showed plausible estimates from human data that generalize better than those of alternative models and can illuminate hitherto understudied properties of the human sentence processing response. This outcome suggests that CDRNN may play a valuable role in analyzing human experimental data.

A Mathematical Definition
This appendix formally defines the CDRNN model. CDRNN assumes the following quantities as input:9
• X ∈ N: Number of predictor observations (e.g. word exposures)
• Y ∈ N: Number of response observations (e.g. fMRI scans)
• Z ∈ N: Number of random grouping factor levels (e.g. distinct participants)
• K ∈ N: Number of predictors
• X ∈ R^(X×K): Design matrix of X predictor observations of K dimensions each
• y ∈ R^Y: Vector of Y response observations
• Z ∈ {0,1}^(Y×Z): Boolean matrix indicating the random grouping factor levels associated with each response observation
• t ∈ R^X: Vector of timestamps associated with each observation in X
• t′ ∈ R^Y: Vector of timestamps associated with each observation in y
• S ∈ N: Number of parameters of the predictive distribution (e.g. 2 for a normal distribution: mean and variance)
For simplicity of exposition, X and y are assumed to contain data from a single time series (e.g. a single participant performing a single experiment).
9 Throughout these definitions, vectors and matrices are notated in bold lowercase and uppercase, respectively (e.g. u, U). Objects with indexed names are designated using subscripts (e.g. v_r). Vector and matrix indexing operations are notated using subscript square brackets, and slice operations are notated using * (e.g. X_[*,k] denotes the kth column of matrix X). Hadamard (pointwise) products are notated using ⊙. The notations 0 and 1 designate conformable column vectors of 0's and 1's, respectively. Superscripts are used for indexation and do not denote exponentiation.
The definition below can be applied without loss of generality to data containing multiple time series by concatenating the output of the model as applied to multiple X, y pairs. X, y and their associated satellite data Z, t, t must be temporally sorted.
Given these inputs, CDRNN estimates a latent impulse response function that relates timestamped predictors to all parameters of the assumed predictive distribution. For example, assuming a univariate normally distributed response, CDRNN learns an IRF with two output dimensions, one for the predictive mean, and one for the predictive variance. Regressing all parameters of the predictive distribution in this way has previously been called distributional regression (Bürkner, 2018).
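As an illustration of distributional regression under a univariate normal response, the following sketch (illustrative values only, not the fitted CDRNN model) treats the model output as one dimension per distribution parameter, with a softplus constraint keeping the variance positive:

```python
import numpy as np

def gaussian_nll(y, mu, sigma2):
    """Negative log likelihood of y under N(mu, sigma2).

    In distributional regression, the model predicts every parameter of
    the response distribution (here mean and variance), not just the mean.
    """
    return 0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

# Toy model output: two dimensions per observation, one per parameter
# of the predictive distribution (values are hypothetical).
out = np.array([[0.3, 0.1],    # observation 1: mean dim, variance dim (pre-constraint)
                [-0.2, 0.4]])  # observation 2
mu = out[:, 0]                            # identity constraint on the mean
sigma2 = np.log1p(np.exp(out[:, 1]))      # softplus keeps the variance positive
y = np.array([0.25, 0.0])
nll = gaussian_nll(y, mu, sigma2).sum()   # training objective to minimize
```

Fitting both output dimensions against the likelihood allows stimulus-dependent changes in response variance, not just response mean, to be estimated.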
CDRNN contains a recurrent neural network (RNN), neural projections that map inputs and RNN states to a hidden state for each preceding event, and neural projections that map the hidden states to predictions about (1) the influence of each event on the response (the IRF) and (2) the parameters of the predictive distribution.
The following values are deterministically assigned:
• D_IRF(L_IRF) = S(K + 1) (the IRF generates a convolution weight for every predictor dimension, plus the timestamp, for each parameter of the predictive distribution)
• D_in(0) = K + 1 (input is predictors + time)
In these definitions, integers x, y respectively refer to row indices of X, y. Let z_y be the vector Z_[y,*] of random effects associated with the response at y. Let W^(h,Z) ∈ R^(D_h × Z), W^(IRF(1),Z) ∈ R^(2 D_IRF(1) × Z), and W^(s,Z) ∈ R^(S × Z) be embedding matrices for z_y. Random effects offsets at response step y for the hidden state (h^Z_y), for the weights and biases of the first layer of the IRF (w^(IRF(1),Z)_y, b^(IRF(1),Z)_y), and for the parameters of the predictive distribution (s^Z_y, i.e. random intercepts and variance parameters) are generated as follows:

h^Z_y = W^(h,Z) z_y
[w^(IRF(1),Z)_y ; b^(IRF(1),Z)_y] = W^(IRF(1),Z) z_y
s^Z_y = W^(s,Z) z_y

Following prior work in mixed effects models (Bates et al., 2015), to ensure that population-level estimates reliably encode central tendency, each output dimension of W^(h,Z), W^(IRF(1),Z), and W^(s,Z) is constrained to have mean 0 across the levels of each random grouping factor (e.g. across participants in the study). The neural IRF is applied to a temporal offset τ representing the delay at which to query the response to an input (e.g. τ = 1 queries the response to an input 1s after the input occurred). The output g^(ℓ)_x,y(τ) ∈ R^(D_IRF(ℓ)) of the neural IRF applied to τ at layer ℓ is defined as:

g^(ℓ)_x,y(τ) = s^(IRF(ℓ))(W^(IRF(ℓ))_x,y g^(ℓ−1)_x,y(τ) + b^(IRF(ℓ))_x,y)

where W^(IRF(ℓ))_x,y, b^(IRF(ℓ))_x,y, and s^(IRF(ℓ)) are respectively the ℓth IRF layer's weight matrix at predictor timestep x and response timestep y, bias vector at x, y, and squashing function, and g^(0)_x,y(τ) = τ.
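The zero-mean constraint on random effects can be sketched as follows (hypothetical dimensions; `W_hZ` stands in for one of the embedding matrices above):

```python
import numpy as np

# Hypothetical random-effects embedding: D_h = 4 hidden-state offsets
# for each of Z = 3 participants.
rng = np.random.default_rng(1)
W_hZ = rng.normal(size=(4, 3))

# Constrain each output dimension to have mean 0 across the levels of
# the grouping factor, so population-level estimates encode central tendency.
W_hZ_centered = W_hZ - W_hZ.mean(axis=1, keepdims=True)

z_y = np.array([0.0, 1.0, 0.0])    # one-hot row of Z: response from participant 2
h_Z = W_hZ_centered @ z_y          # participant-specific hidden-state offset
```

Because each row of the centered matrix sums to zero across participants, the offsets average out at the population level, mirroring the role of zero-mean random effects in LME models.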
w^(IRF(1)), b^(IRF(1)) are respectively globally applied initial weight and bias vectors for the first layer of the IRF (which transforms scalar τ), each of which is shifted by its corresponding random effects. W_Δ^(IRF(1)), B_Δ^(IRF(1)) are respectively weight matrices used to compute additive modifications to the first IRF layer's weights and biases from the CDRNN hidden state h_x,y, similar in spirit to a residual network (He et al., 2016):

w^(IRF(1))_x,y = w^(IRF(1)) + w^(IRF(1),Z)_y + W_Δ^(IRF(1)) h_x,y
b^(IRF(1))_x,y = b^(IRF(1)) + b^(IRF(1),Z)_y + B_Δ^(IRF(1)) h_x,y

Non-initial IRF layers are treated as stationary (i.e. their parameters are independent of x, y). The final output of the IRF is given by:

g_x,y(τ) = reshape(g^(L_IRF)_x,y(τ), (S, K + 1)) (9)

The hidden state h_x,y is computed as the squashed sum of several quantities: a global bias h^bias, random effects h^Z_y, a neural projection h^in_x,y of the inputs at x, y, and a neural projection h^RNN_x,y of the hidden state of an RNN over the sequence of predictors up to and including timestep x:

h_x,y = s^h(h^bias + h^Z_y + h^in_x,y + h^RNN_x,y)

The IRF g_x,y is therefore feature-dependent via the neural projection h^in_x,y of the input at x, y and context-dependent via the neural projection h^RNN_x,y of an RNN over the input up to x for the response at y. This design relaxes stationarity assumptions while also sharing structure across timepoints. The definitions of h^in_x,y and h^RNN_x,y are given below. Let t_x be the element t_[x] and x_x be the xth predictor vector X_[x,*]. The inputs h^(0)_x,y to the CDRNN model are defined as the vertical concatenation of the predictors x_x and the event timestamp t_x:

h^(0)_x,y = [x_x ; t_x]

The output of the input projection at layer ℓ and time x, y is defined as:

h^(in(ℓ))_x,y = s^(in(ℓ))(W^(in(ℓ)) h^(in(ℓ−1))_x,y + b^(in(ℓ)))

where h^(in(0))_x,y = h^(0)_x,y. At the final layer, s^(in(L_in)) is identity and b^(in(L_in)) = 0, since h_x,y already has a bias. The final output of the input projection is given by:

h^in_x,y = h^(in(L_in))_x,y

Note that h^in_x,y is already non-stationary by virtue of its dependence on the event timestamp t_[x], which allows the IRF to differ between timepoints (see e.g. Baayen et al., 2017, for development of a similar idea using generalized additive models).
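A minimal NumPy sketch of the neural IRF idea (hypothetical layer sizes; only the first layer is modulated by the hidden state, and random effects offsets are omitted for brevity):

```python
import numpy as np

# Hypothetical dimensions: S = 2 distribution parameters, K = 2 predictors,
# hidden size D = 8; the IRF's final layer thus has S*(K+1) = 6 outputs.
S, K, D = 2, 2, 8
rng = np.random.default_rng(0)

# Global first-layer parameters (map scalar tau -> D units) and the
# matrices deriving additive modifications from the hidden state h.
w1, b1 = rng.normal(size=D), rng.normal(size=D)
W_delta = rng.normal(size=(D, D)) * 0.1
B_delta = rng.normal(size=(D, D)) * 0.1
W2, b2 = rng.normal(size=(S * (K + 1), D)) * 0.1, np.zeros(S * (K + 1))

def irf(tau, h):
    """Query the response at delay tau, given hidden state h.

    Only the first layer is non-stationary: its weights and biases are
    shifted by projections of h, so the IRF shape can vary with input
    and context; the output layer is stationary.
    """
    w = w1 + W_delta @ h           # event-specific first-layer weights
    b = b1 + B_delta @ h           # event-specific first-layer biases
    g = np.tanh(w * tau + b)       # layer 1 (tanh as the squashing function)
    out = W2 @ g + b2              # stationary output layer
    return out.reshape(S, K + 1)   # one weight per (distribution param, input dim)

h = rng.normal(size=D)
weights = irf(1.0, h)              # convolution weights at a 1s delay
```

Different hidden states yield different IRF shapes, which is the mechanism by which CDRNN relaxes the stationarity assumption.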
While this model of non-stationarity can be complex and non-linear, it is still limited by context-independence. That is, the change in the IRF over time depends only on the amount of time elapsed since the start of the time series, independently of which events preceded. However, it is possible that the contents of the events in a time series may influence the IRF, above and beyond any deterministic change in the response over time (for example, if several difficult preceding words have already taxed the processing buffer, additional processing costs may become larger). To account for this possibility, an RNN is built into the CDRNN design.10 Any variant of RNN can be used (this study uses a long short-term memory network, or LSTM, Hochreiter and Schmidhuber, 1997). The ℓth RNN hidden state at x, y is designated h^(RNN(ℓ))_x,y. To account for the possibility of random variation in sensitivity to context, the initial hidden and cell states h^(RNN(ℓ))_0,y, c^(RNN(ℓ))_0,y depend on the random effects z_y. The hidden state of the final RNN layer is linearly projected to the dimensionality of the CDRNN hidden state:

h^RNN_x,y = W^RNN h^(RNN(L_RNN))_x,y + b^RNN

To apply the CDRNN model to data, a mask F ∈ {0,1}^(Y×X) admits only those observations in X that precede each y_[y]:

F_[y,x] = 1 if t_[x] ≤ t′_[y], and 0 otherwise

Let τ_x,y denote the temporal offset between the predictors at x and the response at y, i.e. τ_x,y def= t′_[y] − t_[x]. A total of S(K + 1) sparse convolution matrices G^(s,k) ∈ R^(Y×X) are defined to contain the predicted response to each preceding event for the kth dimension of h^(0)_x,y and the sth parameter of the predictive distribution, masked by F:

G^(s,k)_[y,x] = F_[y,x] ⊙ g_x,y(τ_x,y)_[s,k]

The convolved design matrix X^(s) ∈ R^(Y×(K+1)) for the sth parameter of the predictive distribution is then computed as:

X^(s)_[*,k] = G^(s,k) X̃_[*,k]

where X̃ def= [X, t] is X with the timestamp vector t appended as a final column. Vector s ∈ R^S contains global, population-level estimates of the parameters of the predictive distribution. Under the univariate normal predictive distribution assumed in this study, s contains the predictive mean (µ, i.e. the intercept) and variance (σ²):

s = [µ ; σ²]

Matrix S^Z ∈ R^(Y×S) contains random predictive distribution parameter estimates, with its yth row containing s^Z_y for the yth response:

S^Z_[y,*] = (s^Z_y)^⊤

The vector of values for each response y for the sth predictive distribution parameter is given by summing the population value, random effects values, and convolved response values:

S_[*,s] def= f^(constraint(s))(X^(s) 1 + S^Z_[*,s] + s_[s]) (23)

where f^(constraint(s)) enforces any required constraints on the sth parameter of the predictive distribution. In the Gaussian predictive distribution assumed here, f^(constraint(1)) (the constraint function for the mean) is identity and f^(constraint(2)) (the constraint function for the variance) is the softplus bijection:

softplus(x) def= ln(e^x + 1)

Given an assumed distributional family F (here assumed to be univariate normal), the response in the CDRNN model is distributed as:

y_[y] ∼ F(S_[y,1], …, S_[y,S])
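The masking and convolution steps above can be sketched numerically as follows (toy data; a fixed exponential decay stands in for the learned neural IRF):

```python
import numpy as np

# Toy time series: 4 predictor events, 2 responses, K = 1 predictor.
t = np.array([0.0, 0.5, 1.0, 1.5])     # predictor timestamps
t_resp = np.array([1.2, 2.0])          # response timestamps
X = np.array([[0.7], [0.1], [0.4], [0.9]])
X_aug = np.hstack([X, t[:, None]])     # predictors + timestamp column (K+1 dims)

# Mask F admitting only events at or before each response.
F = (t[None, :] <= t_resp[:, None]).astype(float)   # shape (Y, X)

# Hypothetical stand-in IRF: one exponential decay shared by every
# (s, k) pair, in place of CDRNN's learned, event-specific neural IRF.
def g(tau):
    return np.exp(-2.0 * np.clip(tau, 0.0, None))

tau = t_resp[:, None] - t[None, :]     # temporal offsets tau_{x,y}
G = F * g(tau)                         # one masked convolution matrix per (s, k)

# Convolved design matrix for one distribution parameter: column k applies
# G to input dimension k; summing columns then yields the convolved response.
X_conv = np.stack([G @ X_aug[:, k] for k in range(X_aug.shape[1])], axis=1)
```

Each row of `X_conv` aggregates the weighted influence of all preceding events on one response observation, which is the continuous-time analogue of a discrete convolution.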

B Asynchronously Measured Predictor Dimensions
As discussed in Shain and Schuler (2018, 2021), CDR applies straightforwardly to time series with asynchronous predictor vectors and response values (i.e. measured at different times, such as word onsets that do not align with fMRI scan times). The CDR implementation of Shain and Schuler (2021) also supports asynchronously measured dimensions of the predictor matrix, simply by providing each predictor dimension with its own vector of timestamps. This allows e.g. Shain et al. (2020) to regress linguistic features (which are word-aligned) and sound power (which in their definition is measured at regular 100ms intervals) in the same model. Supporting asynchronously measured predictor dimensions is more challenging in CDRNN, especially if the RNN component is used. The solution used in CDR is not available because it would (1) arbitrarily group together input dimensions that do not align in time and (2) erroneously treat these groupings as steps in the RNN input sequence. A more principled solution is to interleave the predictors in time order and pad irrelevant dimensions with zeros. For example, in a model with predictors A and B that are sampled at different times, the values of A and B are temporally sorted together into a single time series, with the B value of A events set to zero and the A value of B events set to zero. This approach carries a computational cost: unlike CDR, the number of inputs to the convolution scales linearly with the number of asynchronously measured sets of predictors in the model.
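The interleaving scheme can be sketched as follows (illustrative timestamps and values for two hypothetical predictors A and B):

```python
import numpy as np

# Two asynchronously measured predictors: A sampled at word onsets,
# B sampled at regular intervals (all values are illustrative).
t_a, a = np.array([0.0, 0.6, 1.1]), np.array([2.0, 3.0, 1.0])
t_b, b = np.array([0.0, 0.5, 1.0]), np.array([0.2, 0.4, 0.6])

# Interleave events in time order, zero-padding the dimension that was
# not measured at each event, yielding a single (X, K=2) design matrix.
rows = [(t, va, 0.0) for t, va in zip(t_a, a)] + \
       [(t, 0.0, vb) for t, vb in zip(t_b, b)]
rows.sort(key=lambda r: r[0])
t_all = np.array([r[0] for r in rows])
X = np.array([[r[1], r[2]] for r in rows])
```

The resulting sequence has one row per measurement event, so the number of convolution inputs grows linearly with the number of asynchronously measured predictor sets, as noted above.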

C Predictors
The following predictors are common to all models presented here:
• Rate (CDR/NN only): The deconvolutional intercept, i.e. the base response to a stimulus, independent of its features. In CDR, rate is estimated explicitly by fitting an IRF to an intercept vector (Shain and Schuler, 2021) (i.e., implicitly, the response when all predictors are 0). In CDRNN, rate is a reference response, computed by taking the response to an average stimulus (since the zero vector may be unlikely under a given input distribution, using it as a reference may not reliably reflect the model's behavior on realistic inputs). In this study, all other IRF queries subtract out rate in order to show deviation from this reference.
• Unigram surprisal: The negative log of the smoothed context-independent probability of a word according to a unigram KenLM model (Heafield et al., 2013) trained on Gigaword 3 (Graff et al., 2007). While this quantity is typically treated on a frequency or log probability scale in psycholinguistics, it is treated here on