William Schuler

2026

Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly
Yi-Chien Lin | William Schuler
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

There has been considerable interest in using surprisal from Transformer-based language models (LMs) as predictors of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers’ per-word estimated probability and the predictive power of their surprisal estimates on reading times, showing that LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data have not shown a trend in any direction when using a relatively small set of LMs, leaving open the possibility that the inverse scaling phenomenon is constrained to latency data. This study therefore conducted a more comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models’ per-word estimated probability and model fit on both datasets still obtains, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.

2025

Linear Recency Bias During Training Improves Transformers’ Fit to Reading Times
Christian Clark | Byung-Doh Oh | William Schuler
Proceedings of the 31st International Conference on Computational Linguistics

Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates from a Transformer that includes ALiBi during training and inference show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi’s mixture of slopes—which determine the rate of memory decay in each attention head—may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
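
The recency bias described in this abstract can be illustrated with a minimal sketch of the ALiBi formulation from Press et al. (2022): each head adds a penalty to its attention scores that grows linearly with the distance between query and key, scaled by a head-specific slope. The function names `alibi_slopes` and `alibi_bias` are hypothetical, the slope schedule assumes a power-of-two number of heads, and this is not the authors' training code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two, as in the basic ALiBi schedule.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to one head's attention scores before the softmax.

    bias[i][j] = -slope * (i - j) for keys at or before the query position;
    future positions are masked with -inf (causal attention).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])  # steepest slope: fastest decay with distance
```

Heads with larger slopes discount distant context more strongly, while heads with small slopes stay close to standard attention, which is the "mixture of slopes" the abstract refers to.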

How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?
Christian Clark | Byung-Doh Oh | William Schuler
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
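
The underestimation mentioned in this abstract can be seen in a small worked example. The sketch below uses a made-up next-word distribution in which each word is a tuple of subword tokens (the words, tokens, and probabilities are all hypothetical), and compares the exact entropy over whole words with the entropy over first tokens only. It computes entropy exactly on a toy distribution and is not the Monte Carlo estimator used in the paper.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-word distribution; keys are subword token sequences.
word_probs = {
    ("read", "ing"): 0.25,
    ("read", "er"): 0.25,
    ("book",): 0.5,
}

# Collapse words that share a first token, as a first-token estimate does.
first_token_probs = defaultdict(float)
for tokens, p in word_probs.items():
    first_token_probs[tokens[0]] += p

word_entropy = entropy(word_probs.values())
first_token_entropy = entropy(first_token_probs.values())
```

Because words sharing a first token are merged into one outcome, the first-token entropy can never exceed the word entropy; here the two values differ by half a bit.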

The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage
Byung-Doh Oh | Hongao Zhu | William Schuler
Findings of the Association for Computational Linguistics: ACL 2025

In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token n-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on ‘leakage-free’ data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.
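
As a rough illustration of what token n-gram overlap between a reading time corpus and a pre-training dataset means, the sketch below computes the fraction of distinct stimulus n-grams that also occur in a corpus. The function names, the whitespace tokenization, and the toy sentences are assumptions for illustration; the paper's actual procedure and scale are not reproduced here.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(stimulus_tokens, corpus_tokens, n):
    """Fraction of distinct stimulus n-grams that also appear in the corpus."""
    stimulus_ngrams = ngrams(stimulus_tokens, n)
    if not stimulus_ngrams:
        return 0.0
    shared = stimulus_ngrams & ngrams(corpus_tokens, n)
    return len(shared) / len(stimulus_ngrams)


# Toy data with whitespace tokenization (hypothetical).
stimulus = "the cat sat on the mat".split()
corpus = "a cat sat on a chair".split()
bigram_overlap = overlap_fraction(stimulus, corpus, 2)
```

Longer shared n-grams are stronger evidence that a stimulus passage appeared verbatim in the training data, so in practice one would vary `n` rather than rely on a single value.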

The Impact of Token Granularity on the Predictive Power of Language Model Surprisal
Byung-Doh Oh | William Schuler
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word-by-word language model surprisal is often used to model the incremental processing of human readers, which raises questions about how various choices in language modeling influence its predictive power. One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal, with tokens defined by a vocabulary size of 8,000 resulting in surprisal that is most predictive. In contrast, on garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions, suggesting a greater sensitivity to garden-path effects than previously reported. Taken together, these results suggest a large role of token granularity on the quality of language model surprisal for cognitive modeling.

Are Larger Language Models Better at Disambiguation?
Ziyuan Cao | William Schuler
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Humans deal with temporary syntactic ambiguity all the time in incremental sentence processing. Sentences with temporary ambiguity that causes processing difficulty, often reflected by an increase in reading time, are referred to as garden-path sentences. Garden-path theories of sentence processing attribute the increase in reading time to the reanalysis of the previously ambiguous syntactic structure to make it consistent with the new disambiguating text. It is unknown whether Transformer-based language models successfully resolve the temporary ambiguity after encountering the disambiguating text. We investigated this question by analyzing completions generated from language models for a type of garden-path sentence with ambiguity between a complement clause interpretation and a relative clause interpretation. We found that larger language models are worse at resolving such ambiguity.

2024

Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times
Byung-Doh Oh | Shisen Yue | William Schuler
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window size and stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models’ surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.

Categorial Grammar Induction with Stochastic Category Selection
Christian Clark | William Schuler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Grammar induction, the task of learning a set of syntactic rules from minimally annotated training data, provides a means of exploring the longstanding question of whether humans rely on innate knowledge to acquire language. Of the various formalisms available for grammar induction, categorial grammars provide an appealing option due to their transparent interface between syntax and semantics. However, to obtain competitive results, previous categorial grammar inducers have relied on shortcuts such as part-of-speech annotations or an ad hoc bias term in the objective function to ensure desirable branching behavior. We present a categorial grammar inducer that eliminates both shortcuts: it learns from raw data, and does not rely on a biased objective function. This improvement is achieved through a novel stochastic process used to select the set of available syntactic categories. On a corpus of English child-directed speech, the model attains a recall-homogeneity of 0.48, a large improvement over previous categorial grammar inducers.
