Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars

In computational linguistics, it has been shown that hierarchical structures make language models (LMs) more human-like. However, the previous literature has been agnostic about the parsing strategy of the hierarchical models. In this paper, we investigated whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. In order to address this question, we evaluated three LMs against human reading times in Japanese with head-final left-branching structures: Long Short-Term Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars (RNNGs) with top-down and left-corner parsing strategies as hierarchical models. Our computational modeling demonstrated that left-corner RNNGs outperformed top-down RNNGs and LSTM, suggesting that hierarchical and left-corner architectures are more cognitively plausible than top-down or sequential architectures. In addition, we discuss the relationships between cognitive plausibility and (i) perplexity, (ii) parsing, and (iii) beam size.


Introduction
It has been debated in computational linguistics whether language models (LMs) become more human-like by explicitly modeling hierarchical structures of natural languages. Previous work has revealed that while sequential models such as recurrent neural networks (RNNs) can successfully predict human reading times (Frank and Bod, 2011), there is an advantage in explicitly modeling hierarchical structures (Fossum and Levy, 2012). More recently, RNNs that explicitly model hierarchical structures, namely Recurrent Neural Network Grammars (RNNGs, Dyer et al., 2016), have attracted considerable attention, effectively capturing grammatical dependencies (e.g., subject-verb agreement) much better than RNNs in targeted syntactic evaluations (Kuncoro et al., 2018; Wilcox et al., 2019). In addition, Hale et al. (2018) showed that RNNGs can successfully predict human brain activities, and recommended RNNGs as "a mechanistic model of the syntactic processing that occurs during normal human sentence processing." However, this debate has focused almost exclusively on the dichotomy between the hierarchical and sequential models, without reference to the parsing strategies among the hierarchical models. Specifically, although Dyer et al. (2016) and Hale et al. (2018) adopted the vanilla RNNG with a top-down parsing strategy for English with head-initial right-branching structures, Abney and Johnson (1991) and Resnik (1992) suggested that the top-down parsing strategy is not optimal for head-final left-branching structures, and alternatively proposed the left-corner parsing strategy as a more human-like parsing strategy.
In this paper, we investigate whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. In order to address this question, we evaluate three LMs against human reading times in Japanese with head-final left-branching structures: Long Short-Term Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars (RNNGs) with top-down and left-corner parsing strategies as hierarchical models.

Linking hypothesis
It is well established in psycholinguistics that humans predict next segments during sentence processing, and the less predictable the segment is, the more surprising that segment is. Surprisal theory (Hale, 2001; Levy, 2008) quantifies this predictability of the segment as −log p(segment|context), an information-theoretic complexity metric known as surprisal. In line with the previous literature (e.g., Smith and Levy, 2013), we employed this metric to logarithmically link probabilities estimated from LMs with cognitive efforts measured from humans. Intuitively, the cognitively plausible LMs will compute surprisals with similar trends as human cognitive efforts. Computational models of human sentence processing have been explored by comparing surprisals from various LMs with reading times (e.g., Frank and Bod, 2011) and brain activities (e.g., Frank et al., 2015).
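As a minimal illustration of the surprisal metric described above, the following sketch computes −log p for a few made-up conditional probabilities. The function name `surprisal` and the choice of base-2 logarithms (bits) are assumptions made here for illustration; the text does not fix a logarithm base, and the probabilities are not taken from any of the LMs evaluated in this paper.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal -log2(p), in bits, of an event with probability p."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical next-segment probabilities (illustrative values, not real data).
for p in (0.5, 0.125, 0.01):
    print(f"p = {p:<5} surprisal = {surprisal(p):.3f} bits")
```

Less predictable segments receive larger values, which is the property that links LM probabilities to cognitive effort under surprisal theory.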

Language models
Long Short-Term Memory (LSTM): LSTMs are sequential models that do not model hierarchical structures. We used a 2-layer LSTM with 256 hidden and input dimensions. The implementation by Gulordava et al. (2018) was employed.

Data sets
Training data: All LMs were trained on the National Institute for Japanese Language and Linguistics Parsed Corpus of Modern Japanese (NPCMJ), which comprises 67,018 sentences annotated with tree structures. The sentences were split into subwords by byte-pair encoding (Sennrich et al., 2016). LSTM used only terminal subwords, while RNNGs used terminal subwords and tree structures. Both were trained at the sentence level for 40 epochs, three times with different random seeds.
1 Resnik (1992) suggested that an arc-eager left-corner parsing strategy is cognitively plausible. Jin and Schuler (2020) implemented an incremental neural parser that builds tree structures with the arc-eager left-corner parsing strategy, but it requires an extremely large beam size to achieve reasonable parsing accuracy. Thus, in this paper, we employed arc-standard left-corner RNNGs as an approximation to the arc-eager left-corner parsing strategy that delayed attachments (Kuncoro et al., 2018).
2 k means the action beam size. We set the word beam size to k/10 and the fast-track to k/100 (Stern et al., 2017).
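The beam settings stated in the footnote above (word beam size k/10 and fast-track k/100 for an action beam size k) can be sketched as follows. The names `BeamConfig` and `beam_config` are hypothetical, the example values of k are illustrative, and the use of integer division is an assumption, since the text gives only the ratios.

```python
from typing import NamedTuple


class BeamConfig(NamedTuple):
    action_beam: int
    word_beam: int
    fast_track: int


def beam_config(k: int) -> BeamConfig:
    """Derive the word beam (k/10) and fast-track (k/100) sizes from k."""
    # Integer division is an assumption; the text only states the ratios.
    return BeamConfig(action_beam=k, word_beam=k // 10, fast_track=k // 100)


# Hypothetical action beam sizes, chosen only for illustration.
for k in (100, 1000):
    print(beam_config(k))
```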
Reading time data: All LMs were evaluated against first-pass reading times from BCCWJ-EyeTrack (Asahara et al., 2016), which comprises 218 sentences annotated with eye-tracking reading times at each phrasal unit. Following Asahara et al. (2016), the data points (i) not corresponding to the main text or (ii) not fixated were removed. In addition, following Fossum and Levy (2012), the data points that contained subwords "unknown" to the LMs were also removed. Consequently, we included 12,114 data points in the statistical analyses out of 19,717 data points in total.

Evaluation metrics
Psychometric predictive power: We evaluated how well surprisal (−log p(segment|context)) from each LM could predict human reading times. LMs process the sentences subword-by-subword, while reading times are annotated phrase-by-phrase. Thus, following Wilcox et al. (2020), the phrasal surprisal $I(p)$ was computed as the cumulative sum of surprisals of its constituent subwords $w_l, w_{l+1}, \dots, w_m$:

\[ I(p) = \sum_{i=l}^{m} I(w_i) \tag{1} \]

where $I(w)$ is the surprisal of subword $w$:

\[ I(w) = -\log p(w \mid \text{context}) \tag{2} \]

For the statistical analyses, we first trained a baseline regression model with several predictors that are known to affect reading times. Then, we added surprisal estimated from each LM as a predictor and evaluated the decrease in deviance ($\Delta D(LM)$) as psychometric predictive power:

\[ \Delta D(LM) = D_B - D_{LM} \tag{3} \]

where $D_B$ and $D_{LM}$ are the deviance of the baseline regression model and of the regression model with surprisal, respectively. The details of our regression models are shown in Appendix A.
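The two quantities defined above, the phrasal surprisal as a sum over subwords and the decrease in deviance, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, natural logarithms are assumed, and the probabilities and deviance values are made up rather than taken from our experiments.

```python
import math
from typing import Sequence


def phrasal_surprisal(subword_probs: Sequence[float]) -> float:
    """Sum the surprisals -log p(w_i | context) of a phrase's subwords."""
    return sum(-math.log(p) for p in subword_probs)


def delta_deviance(baseline_deviance: float, lm_deviance: float) -> float:
    """Decrease in deviance when surprisal is added to the baseline model."""
    return baseline_deviance - lm_deviance


# Hypothetical conditional probabilities of three subwords in one phrase.
phrase = [0.2, 0.5, 0.1]
print(phrasal_surprisal(phrase))  # equals -log(0.2 * 0.5 * 0.1)

# Hypothetical deviances of the baseline model and the model with surprisal.
print(delta_deviance(1000.0, 950.0))
```

Because logarithms turn products into sums, the phrasal surprisal equals the negative log of the joint probability of the phrase's subwords given the preceding context.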
Perplexity and parsing accuracy: Goodkind and Bicknell (2018) demonstrated that perplexity of LMs and their psychometric predictive power are highly correlated. In order to investigate whether this correlation can be observed, perplexities of LMs were calculated based on the sentences in BCCWJ-EyeTrack. In addition, given that RNNGs also serve as a parser, the correlation between parsing accuracy and psychometric predictive power was also investigated. The evaluation metric of parsing accuracy was the labeled bracketing F1. For this purpose, we used the sentences in NPCMJ because the sentences in BCCWJ-EyeTrack are not annotated with tree structures. Parsing accuracies of RNNGs were calculated based on the tree structures at the top of the final beam in word-synchronous beam search.
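The two metrics named above can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used in our experiments: the function names, the `(label, start, end)` bracket representation, and the toy inputs are all assumptions, and standard bracketing evaluation tools may apply additional conventions that are omitted here.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple

Bracket = Tuple[str, int, int]  # (label, start index, end index)


def perplexity(token_probs: Sequence[float]) -> float:
    """Exponential of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))


def labeled_f1(gold: Iterable[Bracket], predicted: Iterable[Bracket]) -> float:
    """Harmonic mean of labeled bracket precision and recall."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted tree gets one of three brackets wrong.
gold = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
pred = [("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)]
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(labeled_f1(gold, pred))
```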

Results and discussion
The result of our computational modeling is summarized in Figure 1: psychometric predictive power (the vertical axis) is plotted against perplexity (the horizontal axis). In this section, we first analyze psychometric predictive power itself, and then discuss its relationships with (i) perplexity, (ii) parsing, and (iii) beam size.

Psychometric predictive power
Figure 1 demonstrates that the hierarchical models (top-down/left-corner RNNGs) achieved higher psychometric predictive power than the sequential model (LSTM) and, among the hierarchical models, left-corner RNNGs achieved higher psychometric predictive power than top-down RNNGs. In order to confirm that these differences are statistically meaningful, we performed nested model comparisons. The results of the nested model comparisons are summarized in Table 1, where the best result from each LM was compared. The significance threshold of α = 0.0056 was imposed by the Bonferroni correction for 9 tests (0.05/9). First, nested model comparisons revealed that psychometric predictive power was significant for all LMs relative to the baseline regression model. The point here is that surprisals computed by LMs do explain human reading times in Japanese, generalizing the previous results in English.
Second, the hierarchical models (top-down/left-corner RNNGs) significantly outperformed the sequential model (LSTM), and the sequential model [...]. Finally, among the hierarchical models, left-corner RNNGs significantly outperformed top-down RNNGs, and top-down RNNGs did not account for unique variances that left-corner RNNGs cannot explain. This result corroborates Abney and Johnson (1991) from an information-theoretic perspective: the left-corner parsing strategy is more cognitively plausible than the top-down and bottom-up parsing strategies.
Here we can conclude from these results that LMs become more human-like by explicitly modeling hierarchical structures and, most importantly, the left-corner parsing strategy was more cognitively plausible than the top-down parsing strategy.
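The Bonferroni-corrected significance threshold used in the nested model comparisons above is simple arithmetic, sketched here for concreteness. The function names and the example p-values are hypothetical; only the family-wise level of 0.05 and the 9 tests come from the text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """Check a single p-value against the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.05, 9)
print(round(threshold, 4))

# Hypothetical p-values, not results from our experiments.
print(is_significant(0.001, 0.05, 9), is_significant(0.01, 0.05, 9))
```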

Perplexity
In this subsection, we discuss the relationship between perplexity and psychometric predictive power. First, Figure 1 indicates that, among the hierarchical models, left-corner RNNGs, which achieved higher psychometric predictive power, also achieved lower perplexity than top-down RNNGs. Overall, the correlation between perplexity and psychometric predictive power of the hierarchical models was robust: the lower perplexity RNNGs have, the higher psychometric predictive power they also have. In sharp contrast, the correlation did not hold for the sequential model, where LSTMs achieved better perplexity, but worse psychometric predictive power than RNNGs with similar or even worse perplexity, corroborating Goodkind and Bicknell (2018) that LSTM stands out as an outlier of the correlation between perplexity and psychometric predictive power.
Kuribayashi et al. (2021) recently showed that the correlation between perplexity and psychometric predictive power cannot be generalized to Japanese. They proposed that LMs are trained to flatten information density and thus satisfy the Uniform Information Density (UID) assumption (Genzel and Charniak, 2002; Levy, 2005; Jaeger and Levy, 2007), but information density in Japanese turned out not to be empirically uniform and far from the idealized UID assumption. At first, this proposal appears to be inconsistent with our results, but notice that Kuribayashi et al. (2021) only tested sequential models. Here we would like to suggest that, unlike sequential models, hierarchical models can be trained to be human-like, even in languages far from the idealized UID assumption.

Parsing
In this subsection, we discuss the relationship between parsing accuracy and psychometric predictive power, which is summarized in Appendix B, where psychometric predictive power (the vertical axis) is plotted against parsing accuracy (the horizontal axis). Interestingly, just like perplexity, left-corner RNNGs, which achieved higher psychometric predictive power, also achieved higher parsing accuracy than top-down RNNGs. Here again, the correlation between parsing accuracy and psychometric predictive power of the hierarchical models was robust: the higher parsing accuracy RNNGs have, the higher psychometric predictive power they also have.

Beam size
Finally, we discuss the relationship between beam size and psychometric predictive power. The important generalization here is that, although top-down RNNGs improved in psychometric predictive power, perplexity, and parsing accuracy only when the beam size increased, left-corner RNNGs consistently performed well even with a small beam size. We interpret this generalization as demonstrating that left-corner RNNGs may be more human-like than top-down RNNGs in that they can infer the correct tree structure even with a small beam size comparable to humans. In order to reinforce this reasoning, we discuss (i) why left-corner RNNGs can infer the correct tree structure with a small beam size, and (ii) why inference with a small beam size is comparable to humans.
First, we show why left-corner RNNGs can infer the correct tree structure with a small beam size. Consider the following left-branching structures: The structures (a) and (b) represent the order in which nodes are computed by the top-down and the left-corner parsing strategies, respectively. In the top-down parsing strategy, all the ancestor nodes of a must be enumerated before processing a. Only when the beam size increases is it possible to assume various depths and nodes, and thus to retain the correct tree structure during sentence processing.
On the other hand, in the left-corner parsing strategy, where the mother node is enumerated after its leftmost child, it is not necessary to enumerate the ancestor nodes before processing a. Thus, the correct tree structure can be inferred with a small beam size via the left-corner parsing strategy.
Second, we show why inference with a small beam size is comparable to humans. In fact, Jurafsky (1996) suggested that full parallel processing and serial processing are not appropriate for human sentence processing, and instead partial parallel processing via beam search that prunes low probability structures is cognitively plausible. From this perspective, we may argue that inference with a small beam size is comparable to humans, and left-corner RNNGs, which can infer the correct tree structure with a small beam size, may be more human-like than top-down RNNGs.
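The contrast between the two node enumeration orders discussed above can be made concrete with a small sketch. It illustrates only the order in which nodes are enumerated, not the RNNG action sequences themselves. The tree, its labels X, Y, and Z, and all function names are hypothetical.

```python
from typing import List, Set, Tuple, Union

Tree = Union[str, Tuple]  # a terminal string, or (label, child, child, ...)


def top_down_order(tree: Tree) -> List[str]:
    """Pre-order: a node is enumerated before all of its children."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    order = [label]
    for child in children:
        order.extend(top_down_order(child))
    return order


def left_corner_order(tree: Tree) -> List[str]:
    """A node is enumerated right after its leftmost child, before the rest."""
    if isinstance(tree, str):
        return [tree]
    label, first, *rest = tree
    order = left_corner_order(first) + [label]
    for child in rest:
        order.extend(left_corner_order(child))
    return order


def nodes_before_first_terminal(order: List[str], terminals: Set[str]) -> int:
    """Count the nonterminals enumerated before the first terminal."""
    for position, symbol in enumerate(order):
        if symbol in terminals:
            return position
    return len(order)


# Hypothetical left-branching tree: [X [Y [Z a b] c] d]
tree = ("X", ("Y", ("Z", "a", "b"), "c"), "d")
terminals = {"a", "b", "c", "d"}

for name, fn in (("top-down", top_down_order), ("left-corner", left_corner_order)):
    order = fn(tree)
    print(name, order, nodes_before_first_terminal(order, terminals))
```

In this toy tree, the top-down order must commit to three ancestor nodes before reaching the first terminal a, and this number grows with the depth of left-branching. The left-corner order reaches a without enumerating any ancestor.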
As an anonymous reviewer correctly pointed out, however, given that Jurafsky (1996) proposed beam search that only keeps structures with a probability within a multiple of 3.8 to 5.6 of the probability of the most probable structure, the number of structures within such a relative beam could be extremely large, especially if the probabilities of the structures within the beam are similar. In order to address this point, we computed an empirical beam size comparable to humans. Specifically, we calculated the number of structures with a probability more than 1/3.8 and 1/5.6 of the probability of the most probable structure, among structures within the word beam which corresponds to the beam defined in Jurafsky (1996). The results showed that, even with the largest word beam size of 100, the average number of structures through derivations within the proposed relative beam turned out to be empirically small: between 3.05 and 4.14 for top-down RNNGs and between 4.08 and 5.68 for left-corner RNNGs. The details of the results are shown in Appendix C. We believe that these results do not affect our argument that inference with a small beam size is comparable to humans.
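The counting procedure described above can be sketched as follows. The function name `relative_beam_count` and the probabilities are hypothetical and chosen only to illustrate the two thresholds. An actual implementation would more likely operate on log-probabilities from the beam, which is an assumption not detailed in the text.

```python
from typing import Sequence


def relative_beam_count(probs: Sequence[float], ratio: float) -> int:
    """Count structures whose probability exceeds best_prob / ratio."""
    best = max(probs)
    return sum(1 for p in probs if p > best / ratio)


# Hypothetical probabilities of the structures in one word beam.
beam = [0.40, 0.20, 0.09, 0.05, 0.01]
print(relative_beam_count(beam, 3.8))  # structures above 1/3.8 of the best
print(relative_beam_count(beam, 5.6))  # structures above 1/5.6 of the best
```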
Taking these discussions together, we could still argue that left-corner RNNGs, which can infer the correct tree structure with a small beam size, may be more human-like than top-down RNNGs. In addition, given that larger beam sizes make LMs more computationally expensive, these results also suggest that left-corner RNNGs are more efficient.

Limitations and future work
Interestingly, Wilcox et al. (2020) demonstrated that top-down RNNGs underperformed LSTMs in predicting human reading times in English, which appears to be contradictory to our results in Japanese. We would like to suggest that this discrepancy can be attributed to the difference in the languages tested in these papers. In fact, Kuribayashi et al. (2021) have shown that several established results in English cannot be straightforwardly generalized to Japanese.
In addition, Wilcox et al. (2020) found that n-gram language models outperformed various neural language models, while Merkx and Frank (2021) observed that Transformers (Vaswani et al., 2017) outperformed LSTMs in modeling self-paced reading times and N400 brain activities, but not in predicting eye-tracking reading times.
In order to address these limitations, we plan to conduct detailed comparisons between English (Dundee Corpus, Kennedy et al., 2003) and Japanese (BCCWJ-EyeTrack, Asahara et al., 2016) with RNNGs, incorporating n-gram language models and Transformers as baselines in future work.

Conclusion
In this paper, we investigated whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. Our computational modeling demonstrated that left-corner RNNGs outperformed top-down RNNGs and LSTM, suggesting that hierarchical and left-corner architectures are more cognitively plausible than top-down or sequential architectures. Moreover, lower perplexities and higher parsing accuracies of the hierarchical models were strongly correlated with higher psychometric predictive power, but the correlation did not hold for the sequential model. In addition, left-corner RNNGs may be more human-like than top-down RNNGs in that they can infer the correct tree structure with a small beam size comparable to humans.

Roger Levy. 2005. Probabilistic models of word order and syntactic discontinuity. Stanford University.

A Details of our regression model
The logarithmic reading time (log(RT)) was modeled using the following linear mixed-effects model as a baseline regression model: (4) Table 2 shows descriptions for the factors used in our experiments. We included the predictors and random intercepts that were used in Asahara et al. (2016), and added a predictor related to frequency following the previous literature (e.g., Frank and Bod, 2011; Fossum and Levy, 2012). Frequencies were estimated using the larger National Institute for Japanese Language and Linguistics Web Japanese Corpus (NWJC, Asahara et al., 2014). To capture spillover effects, the length and frequency of the previous segments were also added as predictors (Smith and Levy, 2013). All numeric factors were centered, and the predictors that were not significant (p > 0.05) for modeling reading times were excluded. We removed 27 data points that were beyond three standard deviations. This left 12,087 data points as final statistical analysis targets.

Figure 3: The relationship between perplexity on the NPCMJ test set and psychometric predictive power: psychometric predictive power (the vertical axis) is plotted against perplexity on the NPCMJ test set (the horizontal axis).

B Relationship between parsing accuracy and psychometric predictive power
The relationship between parsing accuracy and psychometric predictive power is summarized in Figure 2: psychometric predictive power is plotted against parsing accuracy (F1). Just like perplexity, left-corner RNNGs, which achieved higher psychometric predictive power, also achieved higher parsing accuracy than top-down RNNGs. Overall, the correlation between parsing accuracy and psychometric predictive power of the hierarchical models was robust: the higher parsing accuracy RNNGs have, the higher psychometric predictive power they also have.

We computed an empirical beam size comparable to humans, based on the beam search proposed by Jurafsky (1996). Specifically, we calculated the number of structures with a probability more than 1/3.8 and 1/5.6 of the probability of the most probable structure, among structures within the word beam which corresponds to the beam defined in Jurafsky (1996). The average number of structures within the proposed relative beam turned out to be empirically small.

E Dataset split ratio
Sentences in the training data, NPCMJ, are from 14 sources. We used 90% of sentences in each source as a training set, and 5% of sentences as a validation set. The remaining 5% were used as a test set to calculate parsing accuracies of RNNGs in Section 4 and perplexities of LMs in Appendix F.
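The per-source 90/5/5 split described above can be sketched as follows. The function name `split_by_source`, the toy data, the shuffling step, and the random seed are all assumptions made for illustration; the text does not state whether sentences were shuffled before splitting.

```python
import random
from typing import Dict, List, Tuple


def split_by_source(
    sentences_by_source: Dict[str, List[str]], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Split each source into 90% train, 5% validation, and the rest as test."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for source in sorted(sentences_by_source):
        sentences = list(sentences_by_source[source])
        rng.shuffle(sentences)  # assumption: shuffle within each source
        n_train = len(sentences) * 90 // 100
        n_valid = len(sentences) * 5 // 100
        train.extend(sentences[:n_train])
        valid.extend(sentences[n_train:n_train + n_valid])
        test.extend(sentences[n_train + n_valid:])
    return train, valid, test


# Toy data with two hypothetical sources.
data = {
    "source_a": [f"a{i}" for i in range(100)],
    "source_b": [f"b{i}" for i in range(40)],
}
train, valid, test = split_by_source(data)
print(len(train), len(valid), len(test))
```

Splitting within each source, rather than over the pooled corpus, keeps the proportion of each source the same across the training, validation, and test sets.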

F Relationship between perplexity on the NPCMJ test set and psychometric predictive power
We additionally investigated the relationship between perplexity calculated based on the sentences in the NPCMJ test set and psychometric predictive power. The result is shown in Figure 3: psychometric predictive power is plotted against perplexity on the NPCMJ test set. Although the perplexities of all LMs were overall lower, there was no substantial difference with the result shown in Figure 1. The difference in the corpus domain may cause the overall difference in perplexity.

Name            Type    Description
length          int     Number of characters in the segment
prev_length     int     Number of characters in the previous segment
freq            num     Logarithm of the geometric mean of the word frequencies in the segment
prev_freq       num     Logarithm of the geometric mean of the word frequencies in the previous segment
is_first        factor  Whether the segment is the first on a line
is_last         factor  Whether the segment is the last on a line
is_second_last  factor  Whether the segment is the second to last on a line
screenN         int     Screen display order
lineN           int     Line display order
segmentN        int     Segment display order
article         factor  Article ID
subj            factor  Participant ID

Table 2: Descriptions of the factors used in our experiments.

Table 3: The average number of structures with a probability more than 1/3.8 and 1/5.6 of the probability of the most probable structure, among structures within the word beam which corresponds to the beam defined in Jurafsky (1996).