Discourse Context Predictability Effects in Hindi Word Order

We test the hypothesis that discourse predictability influences Hindi syntactic choice. While prior work has shown that a number of factors (e.g., information status, dependency length, and syntactic surprisal) influence Hindi word order preferences, the role of discourse predictability remains underexplored. Inspired by prior work on syntactic priming, we investigate how the words and syntactic structures in a sentence influence the word order of the following sentences. Specifically, we extract sentences from the Hindi-Urdu Treebank corpus (HUTB), permute the preverbal constituents of those sentences, and build a classifier to distinguish the sentences that actually occurred in the corpus from artificially generated distractors. The classifier makes its predictions using a number of discourse-based and cognitive features, including dependency length, surprisal, and information status. We find that information status and LSTM-based discourse predictability influence word order choices, especially for non-canonical object-fronted orders. We conclude by situating our results within the broader syntactic priming literature.


Introduction
Grammars of natural languages have evolved over time to factor in cognitive pressures related to production (Hawkins, 1994, 2000) and comprehension (Hawkins, 2004, 2014), learnability (Christiansen and Chater, 2008), and communicative efficiency (Jaeger and Tily, 2011; Gibson et al., 2019). In this work, we test the hypothesis that maximization of discourse predictability (quantified using lexical repetition surprisal and adaptive LSTM surprisal) is a significant predictor of Hindi syntactic choice, when controlling for information status, dependency length, and surprisal measures estimated from n-gram, LSTM, and incremental constituency parsing models.
Our hypothesis is inspired by a solid body of evidence from studies based on dependency treebanks of typologically diverse languages which show that grammars of languages tend to order words by minimizing dependency length (Liu, 2008; Futrell et al., 2015) and maximizing their trigram predictability (Gildea and Jaeger, 2015). Parallel to this line of work on sentence-level word order, another strand of work has focused on discourse-level estimates of entropy starting from the Constant Entropy Rate hypothesis (CER; Genzel and Charniak, 2002). To overcome the major difficulty of estimating sentence probabilities conditioned on the previous discourse context, Qian and Jaeger (2012) approximated discourse-level entropy using lexical cues from the previous context. In contrast, we leverage modern computational psycholinguistic neural techniques to obtain word- and sentence-level estimates of inter-sentential discourse predictability and study the impact of these measures on Hindi word order choices. We conclude that discourse-level priming influences Hindi word order decisions and interpret our findings in the light of the factors outlined by Reitter et al. (2011).
Hindi (an Indo-Aryan language of the Indo-European family) has a rich case-marking system and flexible word order, though it mainly follows SOV word order (Kachru, 2006), as exemplified below.
(1) a. amar ujala-ko   yah  sukravar-ko  daak-se    prapt    hua
       Amar Ujala-ACC  it   friday-on    post-INST  receive  be.PST.SG
       'Amar Ujala received it by post on Friday.' (Reference)
    b. sukravar-ko yah amar ujala-ko daak-se prapt hua (Variant 1)

To test ordering preferences, we generated meaning-equivalent grammatical variants (Examples 1b and 1c above) of reference sentences (Example 1a) from the Hindi-Urdu Treebank corpus of written text (HUTB; Bhatt et al., 2009) by permuting their preverbal constituent ordering. Subsequently, we used a logistic regression model to distinguish the original reference sentences from the plausible variants based on a set of cognitive predictors. We test whether fine-tuning a neural language model on preceding sentences improves predictions of preverbal Hindi constituent order in later sentences over other cognitive control measures. The motivation for our fine-tuning method is that, during reading, encountering a syntactic structure eases the comprehension of subsequent sentences with similar syntactic structures, as attested in a wide variety of languages (Arai et al., 2007; Tooley and Traxler, 2010), including Hindi (Husain and Yadav, 2020). Our cognitive control factors are motivated by recent work showing that Hindi optimizes processing efficiency by minimizing lexical and syntactic surprisal (Ranjan et al., 2019) and dependency length (Ranjan et al., 2022a) at the sentence level.
Our results indicate that reference sentences maximize discourse predictability relative to alternative orderings, suggesting that discourse predictability influences Hindi word-order preferences. This finding corroborates previous findings of adaptation/priming in comprehension (Fine et al., 2013; Fine and Jaeger, 2016) and production (Gries, 2005; Bock, 1986). Generally, this effect is influenced by lexical priming, but we also find that certain object-fronted constructions prime subsequent object-fronting, providing evidence for self-priming of larger syntactic configurations. With the introduction of neural model surprisal scores, the dependency length minimization effects reported to influence Hindi word order choices in previous work (Ranjan et al., 2022a) disappear except in the case of direct object fronting, which we interpret as evidence for the Information Locality Hypothesis (Futrell et al., 2020). Finally, we discuss the implications of our findings for syntactic priming in both comprehension and production.
Our main contribution is that we show the impact of discourse predictability on word order choices using modern computational methods and naturally occurring data (as opposed to carefully controlled stimuli in behavioural experiments). Cross-linguistic evidence is imperative to validate theories of language processing (Jaeger and Norcliffe, 2009), and in this work we extend existing theories of how humans prioritize word order decisions to Hindi.

Surprisal Theory
Surprisal Theory (Hale, 2001; Levy, 2008) posits that comprehenders construct probabilistic interpretations of sentences based on previously encountered structures. Mathematically, the surprisal of the k-th word, w_k, is defined as the negative log probability of w_k given the preceding context:

S(w_k) = −log P(w_k | w_1 … w_{k−1})    (1)

These probabilities can be computed either over word sequences or syntactic configurations and reflect the information load (or predictability) of w_k. High surprisal is correlated with longer reading times (Levy, 2008; Demberg and Keller, 2008; Staub, 2015) as well as longer spontaneous spoken word durations (Demberg et al., 2012; Dammalapati et al., 2021). Lexical predictability estimated using n-gram language models is one of the strongest determinants of word-order preferences in both English (Rajkumar et al., 2016) and Hindi (Ranjan et al., 2019, 2022a; Jain et al., 2018).
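Concretely, given per-word conditional probabilities from any language model, word- and sentence-level surprisal can be computed as below (a minimal sketch; in the paper the probabilities themselves come from the n-gram, PCFG, and LSTM models described later):

```python
import math

def surprisal(prob):
    """Word surprisal in bits: S(w_k) = -log2 P(w_k | w_1 .. w_{k-1})."""
    return -math.log2(prob)

def sentence_surprisal(word_probs):
    """Sentence-level surprisal: the sum of word-level surprisals."""
    return sum(surprisal(p) for p in word_probs)

# A word with conditional probability 0.25 carries 2 bits of surprisal.
print(surprisal(0.25))  # 2.0
```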

Dependency Locality Theory
Dependency locality theory (Gibson, 2000) has been shown to be effective at predicting the comprehension difficulty of a sequence, with shorter dependencies generally being easier to process than longer ones (Temperley, 2007; Futrell et al., 2015; Liu et al., 2017; cf. Demberg and Keller, 2008).

Data and Models
Our dataset comprises 1996 reference sentences containing well-defined subject and object constituents from the HUTB corpus of dependency trees (Bhatt et al., 2009). The HUTB corpus, which belongs to the newswire domain and contains written text in a natural discourse context, is a human-annotated, multi-representational, multi-layered treebank. Its dependency trees assume Panini's grammatical model, in which each sentence is represented as a series of modifier-modified elements (Bharati et al., 2002; Sangal et al., 1995). Each tree in the HUTB corpus represents the words of the sentence as nodes, with head words (modified) linked to dependent words (modifiers) via labelled links denoting the grammatical relationship between word pairs.
For each reference sentence in the HUTB corpus, we created artificial variants by permuting the preverbal constituents whose heads were linked to the root node in the dependency tree. Inspired by grammar rules proposed in the NLG literature (Rajkumar and White, 2014), ungrammatical variants were automatically filtered out by detecting dependency relation sequences not attested in the original HUTB corpus. After filtering, we had 72833 variant sentences for our classification task. Figure 1 in Appendix A displays the dependency tree for Example sentence 1a and explains our variant generation procedure in more detail.
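The generate-and-filter procedure can be sketched as follows. The constituent strings, relation labels, and attested sequences below are illustrative stand-ins, not the actual HUTB inventory:

```python
from itertools import permutations

def generate_variants(preverbal, verb, attested):
    """preverbal: list of (constituent, dependency_relation) pairs.
    Returns permuted orders whose relation sequence is attested
    in the corpus (the grammar filter); others are discarded."""
    variants = []
    for order in permutations(preverbal):
        rels = tuple(rel for _, rel in order)
        if rels in attested:  # keep only attested relation sequences
            variants.append(" ".join(c for c, _ in order) + " " + verb)
    return variants

# Hypothetical constituents from Example 1a with assumed relation labels
preverbal = [("amar ujala-ko", "k4"), ("yah", "k1"), ("sukravar-ko", "k7t")]
attested = {("k4", "k1", "k7t"), ("k7t", "k1", "k4")}  # assumed attested orders
print(generate_variants(preverbal, "prapt hua", attested))
```

Of the six possible permutations, only the two whose relation sequences pass the filter survive, mirroring how ungrammatical orders are pruned.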
To determine whether the original word order (i.e., the reference sentence) is preferred over the permuted word orders (i.e., the variant sentences), we conducted a targeted human evaluation via a forced-choice rating task, collecting judgments from 12 Hindi native speakers for 167 randomly selected reference-variant pairs in our data set. Participants were first shown the preceding sentence and then asked to select the best continuation, either the reference or the variant. We found that 89.92% of the reference sentences which originally appeared in the HUTB corpus were also preferred by native speakers over the artificially generated grammatical variants expressing similar meaning (further details are provided in Appendix G). Therefore, in our analyses we treat the HUTB reference sentences as human-preferred gold orderings relative to the other automatically generated constituent orderings.

Models
We set up a binary classification task to separate the original HUTB reference sentences from the variants using the cognitive metrics described in Section 2. To alleviate the data imbalance between the two classes (1996 references vs. 72833 variants), we transformed our data set using the approach described in Joachims (2002). This technique converts a binary classification problem into a pairwise ranking task by training the classifier on the difference of the feature vectors of each reference and its corresponding variants (see Equations 2 and 3). Equation 2 displays the objective of a standard binary classifier: the classifier must learn a feature weight vector w such that the dot product of w with the reference feature vector φ(reference) exceeds the dot product of w with the variant feature vector φ(variant):

w · φ(reference) > w · φ(variant)    (2)

This objective can be rewritten as Equation 3, such that the dot product of w with the difference of the feature vectors is greater than zero:

w · (φ(reference) − φ(variant)) > 0    (3)
Every variant sentence in our dataset was paired with its corresponding reference sentence, with order balanced across these pairings (e.g., Example 1 would yield (1a, 1b) and (1c, 1a)). Thereafter, their feature vectors were subtracted (e.g., 1a-1b and 1c-1a), and binary labels were assigned to each transformed data point: Reference-Variant pairs were coded as "1" and Variant-Reference pairs were coded as "0". The alternating pair order thus re-balanced our previously severely imbalanced classification task. Table 5 in Appendix D illustrates the original and transformed values of the independent variables.
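A minimal sketch of this pairing-and-differencing step (the feature vectors and the alternation scheme below are illustrative, not the paper's exact implementation):

```python
def joachims_pairs(ref_vec, variant_vecs):
    """Turn one reference and its variants into balanced pairwise data points:
    delta = phi(first) - phi(second); label 1 for a (ref, var) pair and
    0 for a (var, ref) pair, alternating the order to balance the labels."""
    data = []
    for i, var in enumerate(variant_vecs):
        if i % 2 == 0:
            delta = [r - v for r, v in zip(ref_vec, var)]
            data.append((delta, 1))   # Reference-Variant pair
        else:
            delta = [v - r for r, v in zip(ref_vec, var)]
            data.append((delta, 0))   # Variant-Reference pair
    return data

ref = [3.0, 1.0]                      # hypothetical feature vector
variants = [[2.0, 2.0], [4.0, 0.5]]   # two hypothetical variants
print(joachims_pairs(ref, variants))
# [([1.0, -1.0], 1), ([1.0, -0.5], 0)]
```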
For each reference sentence, our objective was to model the possible syntactic choices entertained by the speaker. In each instance, the author chose to generate the reference order over the variant, implicitly demonstrating an order preference. If the cognitive factors in our study influenced that decision, a logistic regression model should be able to use those factors to predict which syntactic choice was ultimately chosen by the author. Using the transformed feature dataset labelled with 1 (denoting a preference for the reference order) and 0 (denoting a preference for the variant order), we trained a logistic regression model to predict each reference sentence (see Equation 4). We report our classification results using 10-fold cross-validation. The regression results are reported on the entire transformed test data for the respective experiments. All experiments were done with the Generalized Linear Model (GLM) package in R. Here, choice is encoded by the binary dependent variable discussed above (1: reference preference; 0: variant preference). To obtain sentence-level surprisal measures, we summed the word-level surprisals of all the words in each sentence. The values of the independent variables were calculated as follows.
1. Dependency length: We computed a sentence-level dependency length measure by summing the head-dependent distances (measured as the number of intervening words) in the HUTB reference and variant dependency trees.

2. Trigram surprisal: For each word in a sentence, we estimated its local predictability using a 3-gram language model (LM) trained with the SRILM toolkit (Stolcke, 2002) using Good-Turing discounting on the written section of the EMILLE Hindi Corpus (Baker et al., 2002), which consists of 1 million mixed-genre sentences.
3. PCFG surprisal: The syntactic predictability of each word in a sentence was estimated using the Berkeley latent-variable PCFG parser (Petrov et al., 2006). To train the parser, 12000 phrase structure trees were created by converting Bhatt et al.'s HUTB dependency trees into constituency trees using the approach described in Yadav et al. (2017). The sentence-level log-likelihood of each test sentence was estimated by training a PCFG LM on four folds of the phrase structure trees and then testing on a fifth held-out fold.
4. Information status (IS) score: We automatically annotated whether each sentence exhibited given-new ordering. The subject and object constituents in a sentence were assigned a Given tag if their head was a pronoun or any content word within them was mentioned in the preceding sentence. All other phrases were tagged as New. For each sentence, the IS score was computed as follows: a) Given-New order = +1; b) New-Given order = -1; c) Given-Given and New-New = 0. For an illustration of givenness coding, see Example 3 in Appendix A and the description in Appendix B.
5. Lexical repetition surprisal: For each word in a sentence, we accounted for lexical priming by interpolating a 3-gram language model with a unigram cache LM based on a history of words (H = 100) containing only the preceding sentence. We used the original implementation provided in the SRILM toolkit with the default interpolation weight parameter (μ = 0.05; see Equations 5 and 6), based on the approach described by Kuhn and De Mori (1990). The idea is to keep a count of recently occurring words in the sentence history and then boost their probability within the trigram language model. Words that have occurred recently in the text are likely to re-occur in subsequent sentences (Kuhn and De Mori, 1990; Clarkson and Robinson, 1997).
6. LSTM surprisal: We estimated the predictability of each word given its full sentence prefix using a long short-term memory language model (LSTM; Hochreiter and Schmidhuber, 1997) trained on the 1 million written sentences of the EMILLE Hindi corpus (Baker et al., 2002). We used the LSTM implementation provided in the Neural Complexity toolkit (van Schijndel and Linzen, 2018) with default hyper-parameter settings to estimate surprisal using the neural context within each sentence. Their method takes a pre-trained LSTM LM and, after generating surprisals for a test sentence, updates the parameters of the LM based on the cross-entropy loss for that sentence. The revised LM weights are then used to predict the next test sentence. This continuous fine-tuning approach effectively modulates a sentence-level LSTM through discourse priming. In our work, for each test sentence, we took our base LSTM LM, adapted it to the immediately preceding context sentence, and then used it to generate (discourse-sensitive) surprisal values for the target sentence. We used an adaptive learning rate of 2, as it minimized the perplexity of the validation data set (see Table 1).
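The adapt-then-test protocol can be illustrated with a toy stand-in model: here an add-one-smoothed unigram LM plays the role of the LSTM, and updating its counts after each context sentence stands in for the gradient step on the cross-entropy loss that the real method performs. All tokens and the vocabulary below are hypothetical:

```python
import math
from collections import Counter

class ToyAdaptiveLM:
    """Stand-in for the adaptive LSTM: a smoothed unigram model whose
    counts are updated after each context sentence, mimicking the
    adapt-then-test protocol (the real model updates LSTM weights)."""

    def __init__(self, training_tokens, vocab):
        self.counts = Counter(training_tokens)
        self.total = len(training_tokens)
        self.vocab_size = len(vocab)

    def surprisal(self, sentence):
        """Sentence surprisal in bits under add-one smoothing."""
        return sum(
            -math.log2((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in sentence
        )

    def adapt(self, sentence):
        """Update the model on a context sentence before scoring the next one."""
        self.counts.update(sentence)
        self.total += len(sentence)

vocab = {"amar", "ujala", "yah", "daak", "prapt", "hua"}
lm = ToyAdaptiveLM(["yah", "daak"], vocab)
before = lm.surprisal(["amar", "ujala"])
lm.adapt(["amar", "ujala"])            # adapt to the preceding context sentence
after = lm.surprisal(["amar", "ujala"])
print(before > after)  # True: adaptation lowers surprisal for repeated material
```

The design point is the same as in the full model: material repeated from the preceding discourse becomes cheaper after adaptation, which is what makes the measure discourse-sensitive.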

Experiments and Results
We tested the hypothesis that discourse predictability (estimated from adaptive LSTM and lexical repetition surprisal) influences constituent ordering in Hindi over other baseline cognitive controls, including dependency length, information status, and trigram and non-adaptive LSTM surprisal. Adaptive LSTM surprisal correlates highly with the other surprisal measures but only weakly with dependency length and information status score (see Figure 2 in Appendix C). We report the results of regression and prediction experiments on the full data set as well as on subsets of the data consisting of two types of non-canonical constructions.

Regression Analysis
Our regression results over the entire data set (Table 2) show that all of our measures are significant predictors for the task of classifying reference and variant sentences. The negative regression coefficients for our surprisal metrics (including adaptive LSTM surprisal) indicate that surprisal is consistently lower in the reference sentences than in the competing variants. Moreover, adding adaptive discourse LSTM surprisal to a model containing all other predictors significantly improved the fit of our regression model (χ2 = 66.81; p < 0.001). These results thus support our core hypothesis that word order choices maximize discourse predictability compared with possible alternative productions. The positive regression coefficient for information status (IS) score indicates that reference sentences adhere to given-new ordering. Similarly, adding IS score to a model containing all other predictors significantly improved the fit of our regression model (χ2 = 127.94; p < 0.001). However, the positive regression coefficient for dependency length suggests that reference sentences exhibit longer dependencies than their variant counterparts, violating locality considerations. Dependency length might thus be in conflict with (and/or overridden by) other factors like discourse priming or information locality (see Section 6 for more discussion of this idea).

We also examined the contribution of each predictor on two non-canonical syntactic configurations, direct object (DO) fronted and indirect object (IO) fronted constructions, which have been studied extensively in the sentence comprehension literature. Prior work has shown that salient objects tend to occur early in the sentence, thus leading to fronting (Wierzba and Fanselow, 2020; Kaiser and Trueswell, 2004). In the specific context of Hindi, Vasishth (2004) examined the role of locality effects in processing these non-canonical word orders in salient as well as non-salient contexts. He showed that the increased distance to the verb in DO-fronted sentences leads to longer self-paced reading times at the innermost verb compared to its canonical counterpart, in both salient and non-salient conditions. However, in IO-fronted constructions, he found that salient contexts alleviated the processing difficulty caused by the increased distance. Based on these findings, we predict that adaptive surprisal should be more effective in IO-fronted than in DO-fronted constructions.
To test this hypothesis, we isolated reference sentences where the direct object precedes the subject (for a DO-fronted test set) and reference sentences where the indirect object precedes the subject (for an IO-fronted test set), along with their context sentences. We compared both sets to paired variants that exhibited canonical order (i.e., where the subject preceded both objects). Tables 3a and 3b present regression results for DO- and IO-fronted constructions, respectively. These subsets constitute a very small fraction of our dataset due to the infrequency of these constructions in Hindi. The regression coefficient for adaptive LSTM surprisal was significantly negative for both subsets, indicating that the non-canonical structures are more common in the context of similarly non-canonical structures. This pattern is more robust for IO-fronted reference sentences (χ2 = 90.90; p < 0.001) than for DO-fronted reference sentences (χ2 = 4.03; p = 0.04), validating our prediction about these constructions. Turning to the efficacy of IS scores over these two non-canonical constructions, givenness is effective only for DO-fronted reference sentences (χ2 = 49.06; p < 0.001). Furthermore, in contrast to the IO-fronted subset, the regression coefficient for dependency length in DO-fronted items is significantly negative, suggesting that locality considerations are limited to constructions involving a large dependency length difference between reference and variants, a finding similar to that reported in Ranjan et al. (2022a) on the same task.

Prediction Accuracy
While the previous section explored how predictors contribute to Hindi ordering preferences across all of the data in aggregate, in this section we frame our model as a classification task on held-out data to determine how many sentences are affected by each predictor. This enables us to examine the relative performance of different predictors in identifying Hindi reference sentences amidst artificially generated grammatical variants and to conduct a more detailed error analysis of our results. We used 10-fold cross-validation to evaluate model classification accuracy, i.e., the percentage of data points where a model correctly predicted the reference sentence over a paired variant, for different subsets of predictors (see Table 4).
Non-adaptive LSTM surprisal (94.01% accuracy) and adaptive LSTM surprisal (94.06%) yielded the best classification accuracies when no other predictors were included. Over a baseline model comprising every other feature except lexical repetition surprisal (see base2 in Table 4), adaptive LSTM surprisal induced a small but significant increase of 0.03% in accuracy (p = 0.04 using McNemar's two-tailed test). When we included lexical repetition surprisal in the baseline model (see base1 in Table 4), adaptive LSTM surprisal ceased to be a significant predictor. This suggests that, in the general case, the maximization of discourse predictability is driven by localized lexical priming captured by our trigram cache model. Beyond content words, adaptive LSTM surprisal also accounts for the re-occurrence of function words (e.g., case markers), which have been shown to modulate syntactic priming and drive parsing processes (Husain and Yadav, 2020).
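McNemar's test, used above for the paired model comparisons, needs only the two discordant counts between the models being compared. A sketch with hypothetical counts (not the paper's actual contingency table), using the continuity-corrected chi-square form of the statistic:

```python
def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square from the discordant counts:
    b = items only model A classified correctly,
    c = items only model B classified correctly."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical counts: with tens of thousands of paired items, even a tiny
# accuracy gap can be significant if the discordant pairs favour one model.
stat = mcnemar_statistic(40, 18)
print(stat > 3.84)  # True: exceeds the chi-square(1) cutoff at alpha = 0.05
```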
To study prediction accuracy on non-canonical constructions, we restricted our analyses to IO- and DO-fronted items in the test partition (still training the classifier on the full training partition for each fold). In contrast to the DO-fronted subset, adaptive surprisal was a significant predictor of IO-fronted syntactic choice, even in the presence of lexical repetition surprisal, as is evident from the significant increase of 0.6% in accuracy (p = 0.02 using McNemar's two-tailed test; see the rightmost IO column in Table 4). This result indicates that discourse predictability is effective in predicting IO-fronting in sentences that follow other IO-fronted sentences, suggesting the presence of syntactic priming effects. In this work, we treat adaptive LSTM LM surprisal (i.e., updating LM weights on successive sentences at test time) as indicative of syntactic priming, but not vanilla LSTM LM surprisal. We present a more nuanced discussion of this theme in Section 6.
Both our regression and classification results demonstrate that discourse adaptation is more effective in IO-fronted than DO-fronted constructions, mirroring findings in Hindi sentence comprehension, where Vasishth (2004) showed that discourse context could compensate for the processing difficulty induced by indirect object fronting. The computational modelling results reported in Table 4 are further validated by the agreement accuracy of our human evaluation study described in Section 3: participants preferred the reference order more often for IO-fronted constructions (80%) than for DO-fronted constructions (65%), as shown in Table 9 of Appendix G.

Qualitative Analysis: Success of Adaptive LSTM Surprisal
Further linguistic analysis of IO-fronted constructions revealed that LSTM adaptation also captured the priming of given-given items, potentially modeling the preferred ordering of multiple given items, a case not captured by IS score or lexical repetition surprisal. Reference sentence 1a is correctly predicted by the model containing adaptive LSTM surprisal and all other features (i.e., base1+g in Table 4), but a model without adaptive LSTM surprisal (i.e., base1) predicts the variant, Example 1b. Table 6 in Appendix E presents the exact scores of the different predictors for the reference-variant pair (1a and 1b). All predictors but LSTM and adaptive LSTM surprisal assign a high score to the reference sentence relative to its paired variant. Adaptive LSTM surprisal assigns a low per-word surprisal to the phrase amar ujala when it occurs in first position in the reference sentence (1a) compared to second position in the variant (1b), potentially modeling givenness, as this word also occurred in the previous sentence (Example 2 in Appendix E). See Figure 3 in Appendix E for the information profiles of the reference-variant pairs.

What causes priming?
In the priming literature, there is debate as to whether priming is driven by residual neural activation (short-lived effects) or by humans learning and updating their language expectations (long-lived effects). Bock and Griffin (2000) showed that syntactic priming in humans persisted even when prime and target sentences were separated by 10 intervening sentences, supporting the implicit learning (long-lived) hypothesis of syntactic priming. In order to test this effect on constituent ordering choice, we repeated our adaptation experiment, adapting to additional context sentences from the preceding discourse. Adaptive LSTM surprisal and lexical repetition surprisal were estimated by adapting the base LSTM LM and trigram LM, respectively, to five preceding context sentences, rather than the single sentence used in our other analyses. We found that for non-canonical IO/DO-fronted constructions, additional context sentences do not improve the adaptive LSTM LM's word order predictions, suggesting that priming may be driven by short-term residual activation (see Table 8 in Appendix F).

Variance Inflation Factor
In this section, we evaluate our regression models for multicollinearity using variance inflation factor (VIF) scores. As Figure 2 in Appendix C shows, the adaptive LSTM surprisal measure is highly correlated with all other surprisal predictors, which raises the suspicion that the effect estimates in our regression model might be unreliable.

The IO-fronted construction is very rare (0.76% of our data) compared to DO-fronted non-canonical sentences (1% of our data) in the HUTB corpus of 13274 sentences. We find strong priming effects in IO-fronted constructions but weak priming in DO-fronted constructions, providing evidence for an inverse frequency interaction (Scheepers, 2003; Jaeger and Snider, 2007).
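VIF for each predictor can be computed by regressing it on the remaining predictors; a generic sketch with simulated predictors (not our actual feature matrix):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X:
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly collinear with a
c = rng.normal(size=200)             # independent predictor
vifs = vif(np.column_stack([a, b, c]))
# The collinear pair inflates its VIFs; the independent column does not.
print(vifs[0] > 10 and vifs[2] < 2)
```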
Our finding that priming is not aided by long-term contexts indicates a decay effect in priming, which supports the residual activation (short-lived) hypothesis of priming in comprehension (Pickering and Branigan, 1998). Nevertheless, there is also evidence for implicit learning effects in comprehension (Luka and Barsalou, 2005; Wells et al., 2009). More recently, Ranjan et al. (2022b), using a setup similar to our current work, argued for the existence of both accounts, viz. residual activation and implicit learning, and demonstrated the role of dual-mechanism priming effects (Tooley and Traxler, 2010) in Hindi word order.
Previous work suggests that lexical overlap between prime and target sentences enhances syntactic priming (Pickering and Branigan, 1998; Gries, 2005). The repeated lexical items become cues during sentence planning and bias the speaker to produce structures in which those repeated lexical items tend to occur. Overall, we find that lexical repetition influences Hindi syntactic choice; however, syntactic priming is observed over and above lexical repetition in non-canonical constructions. Notably, comparable results have been reported for English dialogue corpora (Healey et al., 2014; Green and Sun, 2021). We plan to conduct a systematic investigation of Hindi spoken data in future work.
Finally, with regard to the cumulativity of priming, Jaeger and Snider (2007) showed in their corpus study of passives and that-insertion/omission that the effect of priming increases with the number of preceding primes. Our work does not investigate this specifically, and more controlled experiments would be required.
The success of LSTM-based surprisal estimates over and above dependency length can also be interpreted in light of Futrell's (2019) point about the limitation of Surprisal Theory with respect to word order. Futrell modified Surprisal Theory by positing that per-word processing difficulty is proportional to surprisal given a lossy memory representation of the preceding context. Moreover, Futrell et al. (2020) proposed the Information Locality Hypothesis (ILH), which states that all pairs of words with high mutual information (not merely syntactically related words) tend to be located close to one another. The long window offered by LSTM surprisal thus models relationships between words at varying distances (over and above conventional trigram models). The success of these surprisal estimates for the task of reference sentence prediction provides preliminary evidence for the ILH in the case of word order.
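The ILH's key quantity, the mutual information between word pairs, can be estimated pointwise from corpus counts; a generic sketch in which all counts are hypothetical:

```python
import math

def pmi(pair_count, x_count, y_count, n_pairs, n_words):
    """Pointwise mutual information of a word pair:
    pmi(x, y) = log2( p(x, y) / (p(x) * p(y)) ),
    with p(x, y) estimated from co-occurrence counts and
    p(x), p(y) from unigram counts."""
    p_xy = pair_count / n_pairs
    p_x = x_count / n_words
    p_y = y_count / n_words
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: a pair that co-occurs far more often than chance
# has high PMI and, under the ILH, should tend to stay close together.
print(pmi(10, 10, 10, 100, 100))  # log2(10) ~ 3.32
```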
Future work needs to tease apart the priming effects of vanilla and adaptive LSTM surprisal in light of recent studies. In this work, sentences are treated as independent when estimating their surprisal with the vanilla LSTM LM, so the vanilla LSTM cannot exhibit syntactic priming across sentences. However, Misra et al. (2020) demonstrated that BERT exhibits a "priming effect": the BERT LM predicted a word with greater probability when the context included a related word rather than an unrelated one. The effect decreased as the amount of information provided by the context increased. In other words, under high contextual constraint the related prime started acting as a distractor, actively demoting the target word in the probability distribution, thus exhibiting a "mispriming effect" (Kassner and Schütze, 2020). This could be due to stylistic avoidance of repeated structures/words in adjacent sentences. Future work also needs to investigate whether word-order preferences can be jointly optimized using multiple factors (Gildea and Jaeger, 2015). In particular, the relationship between the drive to minimize surprisal (as found in this work) and the tendency to make information profiles uniform (Jaeger, 2010) needs to be explored more thoroughly in light of recent findings (Meister et al., 2021).
Overall, our results demonstrate that Hindi word order preferences are influenced by discourse predictability maximization considerations. The actual mechanisms of these discourse effects are plausibly lexical and syntactic priming.

Limitations
The 'levels' problem discussed in Levy (2018), which posits two levels of linguistic optimisation, is germane when evaluating our work. Our results are restricted to the level of syntactic choices made by individual speakers of a given language over a lifetime (and not the level of entire grammars and evolutionary timescales). Our experiments were conducted on written text and would need to be repeated on spoken data in order to make claims about priming in language production.

A Variant Generation

(3) a. amar ujala-ko yah sukravar-ko daak-se prapt hua
       'Amar Ujala received it by post on Friday.' (Reference)
    c. sukravar-ko yah amar ujala-ko daak-se prapt hua [New-Given = -1] (Variant 2)

This work uses sentences from the Hindi-Urdu Treebank (HUTB) corpus of dependency trees (Bhatt et al., 2009) containing well-defined subject and object constituents. Figure 1 displays the dependency tree (and a glossary of relation labels) for reference sentence 3a. The grammatical variants were created using an algorithm that took as input the dependency tree corresponding to each HUTB reference sentence. The reordering algorithm permuted the preverbal dependents of the root verb and linearized the resulting tree to obtain variant sentences. (Hindi is not a strictly verb-final language, but the majority of constituents in the HUTB corpus are preverbal: our analysis of the 13274 sentences in HUTB finds 20,750 pairs of preverbal constituents and 2599 pairs of postverbal constituents. Our variant generation and subsequent experiments therefore focus on word-order variation in the preverbal domain, treating it as the locus of word-order variation; only preverbal constituents are permuted to generate grammatical variants, and postverbal constituents are left in place in both reference and variant sentences.) For example, corresponding to reference sentence 3a and its root verb "hai" (see Figure 1a), the preverbal constituents with parents "ujala", "yah", "sukravar", "daak", and "prapt" were permuted to
generate the artificial variants (3b and 3c).The ungrammatical variants were automatically filtered out using dependency relation sequences (denoting grammar rules) attested in the gold standard corpus of HUTB trees.In the dependency tree 1a, "k4-k1", "k7t-k1", "k3-k7t", and "pof-k3" are dependency relation sequences.In cases where the total number of variants exceeded 100 (a random cutoff),8 we chose 99 non-reference variants randomly along with the reference sentence.
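The reordering step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it permutes already-linearized constituent strings instead of dependency subtrees, and it omits the grammaticality filter over dependency relation sequences.

```python
from itertools import permutations

def generate_variants(preverbal, verb, postverbal, max_variants=100):
    """Permute the preverbal dependents of the root verb and linearize.

    `preverbal` and `postverbal` are lists of constituent strings (each a
    linearized subtree); postverbal material is left untouched, mirroring
    the restriction of variation to the preverbal domain.
    """
    variants = []
    for order in permutations(preverbal):
        variants.append(" ".join(list(order) + [verb] + postverbal))
    return variants[:max_variants]

# Toy usage with placeholder constituents: 3 preverbal constituents
# yield 6 candidate orderings (including the original one).
candidates = generate_variants(["yah", "amar ujala-ko", "sukravar-ko"],
                               "prapt hua", [])
```

In the actual pipeline, each candidate ordering would then be checked against the attested dependency relation sequences before being kept as a grammatical variant.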

B Information Status Annotation
The subject and object constituents in a sentence were assigned a Given tag if any content word within them was mentioned in the preceding sentence or if the head of the phrase was a pronoun. All other phrases were tagged as New. Example 3 illustrates the proposed annotation scheme.
• Example 3a follows Given-Given ordering: the object "Amar Ujala" is mentioned in the preceding context sentence 2, so it is annotated as Given. The subject "yah" is a pronoun, so it is also tagged as Given under the annotation scheme.
• Example 3c follows New-Given ordering: the object "sukravar" is tagged as New because it is not mentioned in the preceding context sentence 2, while the subsequent pronoun "yah", the subject of the sentence, is tagged as Given under the annotation scheme.
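The annotation heuristic above can be sketched as follows. This is a minimal illustration: the pronoun list, the tokenization, and the content-word set are hypothetical stand-ins for the actual resources used in the paper.

```python
# Illustrative, not exhaustive: a few Hindi pronouns for the head test.
PRONOUNS = {"yah", "vah", "ve", "yeh"}

def info_status(constituent_tokens, head, prev_sentence_tokens, content_words):
    """Tag a constituent Given if its head is a pronoun or if any of its
    content words occurred in the preceding sentence; otherwise New."""
    if head in PRONOUNS:
        return "Given"
    if any(tok in prev_sentence_tokens and tok in content_words
           for tok in constituent_tokens):
        return "Given"
    return "New"
```

Given such per-constituent tags, an information-status score for a subject-object pair can be derived by comparing their order (e.g. the [New-Given = -1] annotation in example 3c).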

C Correlation Plot
Pearson's correlation coefficients between the different predictors are displayed in Figure 2. Adaptive LSTM surprisal correlates highly with all the other surprisal features, and only weakly with dependency length and the information-status score.
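A predictor-by-predictor Pearson correlation matrix of this kind can be computed as below; the feature values here are invented placeholders, not the paper's data.

```python
import numpy as np

# Hypothetical predictor columns (rows = sentences); in the paper these
# would be dependency length, the surprisal variants, and the
# information-status score.
features = np.array([[1.0, 2.0, 0.5],
                     [2.0, 4.1, 0.4],
                     [3.0, 5.9, 0.6]])

# rowvar=False treats columns as variables, giving a predictor-by-predictor
# correlation matrix with ones on the diagonal.
corr = np.corrcoef(features, rowvar=False)
```

High off-diagonal values in such a matrix motivate the variance-inflation-factor analysis reported in Table 10.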

D Joachims Transformation
This technique converts a binary classification problem into a pairwise ranking task over the feature vectors of a reference sentence and each of its variants. Table 5 displays Joachims' transformation. The delta (δ) refers to the difference between the feature vectors of the reference sentence and its paired variant. The overall goal is to model a two-alternative choice for each reference sentence, in which the speaker generates the reference sentence after rejecting a potential grammatical variant: the reference sentence appeared in the corpus because of its properties (viz., dependency length, discourse context, accessibility, and surprisal), and the variant sentences are less likely to be produced for the same reasons.
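A minimal sketch of the transformation, assuming each sentence is represented by a NumPy feature vector. Emitting both signed deltas is one common way to balance the training signal for a linear classifier; the paper's exact setup may differ.

```python
import numpy as np

def joachims_pairs(ref_vec, variant_vecs):
    """Convert reference-vs-variant classification into pairwise deltas.

    For each variant, emit (ref - variant) with label 1 and the mirrored
    (variant - ref) with label 0, so a linear classifier trained on the
    deltas acts as a pairwise ranker preferring the reference sentence.
    """
    X, y = [], []
    for v in variant_vecs:
        X.append(ref_vec - v)
        y.append(1)
        X.append(v - ref_vec)
        y.append(0)
    return np.array(X), np.array(y)
```

With this encoding, the learned weight vector scores whole sentences, and the reference is predicted correctly whenever its score exceeds the variant's.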

E Information Profile for IO-fronted Example
The LSTM LM, when adapted to the previous sentence (2) in the discourse, assigns a lower surprisal score to the given item "amar ujala" when it occurs in first position in the following sentence (3a) than when it appears in second position (3b).

F Contextual Adaptation on One Vs. Multiple Sentences for DO/IO Constructions
We investigated whether adapting the LSTM LM to the five preceding context sentences, rather than a single context sentence, better predicts word-ordering patterns for IO/DO constructions.

G Human Evaluation
To determine whether the permuted word order (variant) is dispreferred relative to the original word order (reference), we conducted a targeted human evaluation via a forced-choice task, collecting judgments from 12 native Hindi speakers for 167 randomly selected reference-variant pairs in our data set. Participants were first shown the preceding sentence and then asked to choose which member of the reference-variant pair was the more likely following sentence. Each sentence was assigned a human label of "1" if more than 50% of participants voted it the best choice, and "0" otherwise. The stimuli belonged to two construction types: either the reference sentence (Ref) has canonical ordering and the variant (Var) has non-canonical ordering (DO-fronted or IO-fronted), or vice versa. Table 9 presents the results. On the full set of 167 reference-variant pairs, 89.92% of the reference sentences originally appearing in the HUTB corpus were also preferred by native speakers over the artificially generated grammatical variants expressing similar meanings (agreement accuracy). Moreover, as initially hypothesized, the Hindi participants were more prone to prefer IO-fronted constructions (80%) than DO-fronted constructions (65%), as captured by the agreement accuracy, validating the findings reported in Table 4. Overall, the full model containing all the features, including adaptive LSTM surprisal, predicted human preferences (76.65%) much better than corpus choice labels (74.85%).
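The labeling and agreement computation can be sketched as follows; this is a minimal illustration, and the vote counts in the usage comment are hypothetical.

```python
def human_label(votes_for_sentence, n_participants):
    """Assign label 1 if a strict majority (>50%) preferred the sentence."""
    return 1 if votes_for_sentence / n_participants > 0.5 else 0

def agreement_accuracy(reference_labels):
    """Share of items where the majority human label matches the corpus
    choice; since the reference is always the corpus choice, this is the
    proportion of references labeled 1."""
    return sum(reference_labels) / len(reference_labels)

# Hypothetical usage: per-item vote counts for the reference out of 12.
labels = [human_label(v, 12) for v in (10, 11, 5, 9)]
acc = agreement_accuracy(labels)
```

Note that with 12 participants an exact 6-6 split fails the strict-majority test, so the reference receives label 0 in that case.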

Table 1 :
Learning rate influence on adaptive LSTM validation perplexity (N = 13274 sentences; the initial non-adaptive model uses a learning rate of 0) and training epoch with early stopping. All other parameters were kept at their default settings.

Discourse LSTM surprisal: We estimated the discourse predictability of each word in the sentence using the ADAPT function of the neural-complexity toolkit. van Schijndel and Linzen (2018) proposed a simple way to continuously adapt a neural LM to each successive test sentence, and found that adaptive surprisal predicts human reading times significantly better than non-adaptive surprisal.
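The score-then-adapt protocol can be illustrated schematically. Here a Laplace-smoothed unigram model stands in for the LSTM (the real toolkit adapts an LSTM by gradient updates after scoring each sentence), so only the adaptation loop itself, not the model, is faithful to the original method.

```python
import math
from collections import Counter

class AdaptiveUnigramLM:
    """Toy stand-in for the ADAPT loop: score each sentence under the
    current model, then update the model on that sentence, so later
    sentences benefit from the preceding discourse."""

    def __init__(self, vocab_size=10000):
        self.counts = Counter()
        self.total = 0
        self.V = vocab_size  # assumed vocabulary size for smoothing

    def surprisal(self, sentence):
        # Laplace-smoothed per-word surprisal in bits, summed over the sentence.
        return sum(-math.log2((self.counts[w] + 1) / (self.total + self.V))
                   for w in sentence)

    def adapt(self, sentence):
        # Update the model on the sentence just scored.
        self.counts.update(sentence)
        self.total += len(sentence)

lm = AdaptiveUnigramLM()
context = ["amar", "ujala"]
before = lm.surprisal(context)
lm.adapt(context)                 # adapt on the context sentence
after = lm.surprisal(context)     # repeated material is now more predictable
```

After adaptation, words repeated from the context receive lower surprisal, which is exactly the lexical-repetition effect exploited by the discourse predictability features.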

Table 3 :
Discourse adaptation regression model on DO/IO fronted cases (all significant predictors denoted by |t|>2)

Table 4 :
Prediction performances (Full data set (72833 points); Direct object (DO; 1663 points) and Indirect object (IO; 1353 points) fronted cases; each row refers to a distinct model; *** denotes McNemar's two-tailed significance compared to the model in the previous row)

Table 8 :
Prediction performance (Direct objects (DO: 1663 points), Indirect objects (IO: 1353 points)); Baseline denotes base1+g shown in Table 4; bold denotes McNemar's two-tailed significance compared to the baseline model in the same row.

Table 10a displays the VIF scores for each predictor in the different regression models. The VIF scores for the regression models without the correlated features, such as trigram surprisal and vanilla LSTM surprisal, are documented in Table 10b. Table 11 reports the results of the regression experiment when the model did not contain these highly correlated features.
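VIF can be computed directly from the design matrix; below is a minimal sketch assuming a NumPy array whose columns are the predictors (not the paper's actual implementation).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X:
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept)."""
    n, p = X.shape
    scores = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return scores
```

Uncorrelated predictors yield VIF near 1, while near-collinear predictors (such as the correlated surprisal features) push VIF well past the conventional 5-10 threshold.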

Table 9 :
Targeted human evaluation - Agreement human/corpus: percentage of times the human judgment matches the corpus reference choice; Model corpus: percentage of corpus choices correctly predicted by the classifier containing all the predictors (base1 + g as per Table 4); Model human: percentage of human labels correctly predicted by the classifier containing all the predictors (base1 + g as per Table 4)

Table 10 :
Variance inflation factor analysis on different regression models containing: (a) all predictors; (b) all predictors minus the correlated features; each column denotes an individual model on a given dataset with a different set of predictors; VIF larger than 5 or 10 indicates that the model has problems estimating the coefficients of the variables

Table 11 :
Regression model on full data set after removing the correlated features (N = 72833; all significant predictors denoted by |t|>2)