Eye-tracking based classification of Mandarin Chinese readers with and without dyslexia using neural sequence models

Eye movements are known to reflect cognitive processes in reading, and psychological reading research has shown that eye gaze patterns differ between readers with and without dyslexia. In recent years, researchers have attempted to classify readers with dyslexia based on their eye movements using Support Vector Machines (SVMs). However, these approaches (i) are based on highly aggregated features averaged over all words read by a participant, thus disregarding the sequential nature of the eye movements, and (ii) do not consider the linguistic stimulus and its interaction with the reader's eye movements. In the present work, we propose two simple sequence models that process eye movements on the entire stimulus without the need to aggregate features across the sentence. Additionally, we incorporate the linguistic stimulus into the model in two ways: contextualized word embeddings and manually extracted linguistic features. The models are evaluated on a Mandarin Chinese dataset containing eye movements from children with and without dyslexia. Our results show that (i) even for a logographic script such as Chinese, sequence models are able to classify dyslexia from eye gaze sequences, reaching state-of-the-art performance, and (ii) incorporating the linguistic stimulus does not help to improve classification performance.


Introduction
Reading effortlessly constitutes a key skill in modern society. Individuals suffering from developmental dyslexia are characterized by specific and persistent reading problems. Global prevalence estimates range from 3 to 7% (Landerl et al., 2013; Peterson and Pennington, 2012). Previous research has consistently shown that early diagnosis and intervention is key to mitigate the resulting long-term consequences (Vaughn et al., 2010). Psychological and clinical research on eye movement patterns has revealed that individuals with dyslexia exhibit gaze patterns that differ significantly from the patterns observed in individuals without dyslexia (Rayner, 1998; Pan et al., 2014). In particular, scanpaths of individuals with dyslexia are characterized by longer fixation durations, more fixations, decreased saccade durations, and a higher proportion of regressions. In recent years, increasing effort has been spent on utilizing these findings and applying supervised classification methods such as SVMs and Random Forests to eye movement data (see Kaisar 2020 for an overview) to infer the presence or absence of dyslexia. There are several reasons why automated approaches for assistance in dyslexia detection are desirable. Currently, paper-pencil diagnostic tools are administered by trained speech therapists. These tools are time-intensive and are typically only considered after a suspected case has been reported by observant educational staff, leaving many cases overlooked. Eye-movement-based diagnostic tools have the potential to be deployed in schools in a relatively inexpensive manner and as part of a standard procedure aimed at early and comprehensive detection of dyslexia, making an important contribution to educational equity. Although the aforementioned approaches provide promising results, they suffer from specific drawbacks: (i) the model input consists of eye movement features aggregated for each subject over the presented stimulus material (text), thus disregarding the sequential nature of the eye movements; (ii) neither the linguistic stimulus nor its interaction with the reader's eye movements is considered. For classification purposes, this does not pose a problem per se. However, it does not allow us to investigate questions such as: Which words (or, more specifically, what linguistic properties of the stimulus) are particularly informative to discriminate between individuals with and without dyslexia? In the present work, we propose two neural sequence models, depicted in Figure 1, that process the eye movements on the entire stimulus without the necessity of feature aggregation over the sentence. To incorporate the linguistic stimulus into the model, we use pre-trained contextualized word embeddings. We evaluate our model on an eye-tracking-while-reading dataset from children with and without dyslexia reading Mandarin Chinese sentences by Pan et al. (2014).

Figure 1: Proposed approach. Each eye-movement reading measure vector is concatenated with contextualized word embeddings and used as input for the sequence models to infer whether a reader suffers from dyslexia.

ML-based detection of dyslexia
To date, various data types and signals have been utilized to solve the task of automated detection of dyslexia, such as text, MRI scans (Cui et al., 2016), EEG recordings (Frid and Breznitz, 2012), student engagement data (Abdul Hamid et al., 2018), as well as eye-tracking data (Rello and Ballesteros, 2015; Raatikainen et al., 2021; Benfatto et al., 2016). Benfatto et al. (2016) train a Support Vector Machine with recursive feature elimination (SVM-RFE) on 168 eye-tracking features obtained from an eye-tracking-while-reading dataset from 185 Swedish children (aged 9-10 years). Their best SVM-RFE model selected 48 features and achieved an accuracy score of 95.6% ± 4.5% (sic!) on a balanced dataset. We reimplement this method and use it as a reference method (cf. Section 4.1). Jothi Prabha and Bhargavi (2020), using the same dataset as Benfatto et al. (2016), experiment with various feature selection algorithms and machine learning models. They find that feature selection via Principal Component Analysis (PCA) in combination with a Particle Swarm Optimization based Hybrid Kernel SVM classifier yields the best accuracy. Raatikainen et al. (2021) combine a Random Forest classifier for feature selection with an SVM, achieving an accuracy of 89.7%. They expand their feature space with transition matrices that represent the number of transitions between the different segments (question, answer selection) in a trial as well as the number of gaze shifts within one segment.
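The recursive feature elimination idea behind the SVM-RFE baseline can be sketched in a few lines: fit a linear model, discard the feature with the smallest absolute weight, and repeat until the desired number of features remains. A minimal, dependency-free sketch follows; an ordinary least-squares fit stands in for the linear SVM to keep it self-contained, and the data are synthetic:

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Toy recursive feature elimination: repeatedly fit a linear model
    and drop the feature with the smallest absolute weight.
    (Benfatto et al. use a linear SVM; least squares stands in here.)"""
    keep = list(range(X.shape[1]))          # surviving original column indices
    while len(keep) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(w)))) # eliminate the weakest feature
    return keep

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
# labels depend only on features 1 and 4, plus a little noise
y = np.sign(2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=100))
print(sorted(rfe_linear(X, y, n_keep=2)))  # features 1 and 4 should survive
```

The real SVM-RFE additionally retrains with cross-validated hyperparameters at each elimination step; this sketch only shows the ranking-and-elimination loop.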

Modeling eye-tracking data with deep neural sequence models
Eye movement data for task inference. Deep neural sequence models have been deployed to solve inference tasks based on eye movements, such as reader identification (Jäger et al., 2019), viewer identification (Lohr et al., 2020; Makowski et al., 2020, 2021), ADHD detection (Deng et al., 2022), as well as the prediction of reading comprehension (Reich et al., 2022).
Integrating the linguistic stimulus. There has been growing interest in combining language and eye movement models to predict gaze patterns during naturalistic reading (Hollenstein et al., 2021; Merkx and Frank, 2021; Hollenstein et al., 2022). Wiechmann et al. (2022) investigate the role of general text features and their interaction with eye movement patterns in predicting human reading behavior and find that models incorporating the linguistic stimulus improve prediction accuracy.

Problem Setting
We investigate the two closely related tasks of classifying (i) whether a given eye gaze sequence on one sentence is from a reader with or without dyslexia and (ii) whether a given eye gaze sequence on a set of sentences is from a reader with or without dyslexia. Formally, our training data can be represented as a set D = {(W_11, y_1), ..., (W_NM, y_N)}, where W_ij = w_ij1 ... w_ijK_j is a sequence of reading measure vectors for each word k ∈ 1 ... K_j obtained from subject i reading sentence j, where N is the number of participants, M is the number of stimulus sentences read by each of the participants, and K_j is the number of words in a given sentence j. Each reading measure vector consists of R reading measures, i.e., w_ijk = (r_ijk1, ..., r_ijkR). The binary target label y_i denotes whether participant i is a reader with or without dyslexia. For (i), our goal is to train a binary classifier g_θ such that

ŷ_ij = 1 if g_θ(W_ij) ≥ δ, and ŷ_ij = 0 otherwise,

where δ denotes the decision threshold and θ the model parameters. Accordingly, for (ii), the subject-level prediction thresholds the average score over all M sentences read by subject i:

ŷ_i = 1 if (1/M) Σ_{j=1}^{M} g_θ(W_ij) ≥ δ, and ŷ_i = 0 otherwise.

The performance of a binary model can be characterized by a false-positive and a true-positive rate. By altering the decision threshold δ, a receiver operating characteristic (ROC) curve can be derived, with the area under the curve providing an aggregated measure over all possible values of δ.
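The threshold sweep that produces the ROC curve and its AUC can be made concrete with a small numpy sketch (the labels and scores below are toy values, not the paper's data):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Area under the ROC curve, obtained by sweeping the decision
    threshold over the observed scores. Ties are ignored for simplicity."""
    order = np.argsort(-np.asarray(scores))            # descending by score
    y = np.asarray(y_true)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.concatenate([[0.0], np.cumsum(y) / P])    # true-positive rates
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / N])# false-positive rates
    # trapezoidal integration of tpr over fpr
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(y_true, scores))  # 8/9 ≈ 0.889
```

Equivalently, the AUC is the probability that a randomly drawn positive instance receives a higher score than a randomly drawn negative one (here 8 of the 9 positive/negative pairs are ranked correctly).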

Reference method
As a baseline method, we train an SVM-RFE, following the procedure described by Benfatto et al. (2016). We use the scikit-learn implementation (Pedregosa et al., 2011) of the SVM-RFE with a linear kernel. In the subject-prediction setting, we use eye movement features from each subject, aggregated (mean and standard deviation) across trials and sentences, as input vectors. In the sentence-prediction setting, we use aggregates of each sentence over all trials, yielding 2 × 12 = 24 features per instance in both settings.
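The aggregation used by the reference method, i.e., the mean and standard deviation of each reading measure pooled over all words a subject read, might look as follows (the R = 12 measures and the toy data are stand-ins):

```python
import numpy as np

def aggregate_features(trials):
    """Collapse a subject's per-word reading measures into a fixed-size
    vector: mean and standard deviation of each of the R measures over
    all words of all sentences (R = 12 gives 2 x 12 = 24 features)."""
    stacked = np.vstack(trials)   # (total_words, R) across all sentences
    return np.concatenate([stacked.mean(axis=0), stacked.std(axis=0)])

rng = np.random.default_rng(1)
# three toy sentences of different lengths, 12 reading measures per word
trials = [rng.normal(size=(n_words, 12)) for n_words in (7, 9, 5)]
print(aggregate_features(trials).shape)  # (24,)
```

This is exactly the aggregation step that the proposed sequence models avoid: they consume the per-word vectors directly instead of the 24-dimensional summary.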

Proposed neural sequence models
Both models take as input an enriched reading measure vector r_ij (cf. Section 4.2.1) of a sentence j read by participant i, normalized for each train/test set separately, and predict a label y_i. We tune both models using random search.
LSTM. We implement a bidirectional recurrent neural network with LSTM cells. The mean of the hidden states is fed into a linear layer projecting it down to a single sigmoid output to represent the label prediction. Optimized hyperparameters and search space are reported in Appendix 2.
CNN. We implement a CNN that convolves the input across the word-sequence axis. It consists of two convolutional layers, each followed by a pooling layer, two dense layers, and a sigmoid output unit. Hyperparameters are listed in Appendix 2.
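A stripped-down forward pass illustrates how such a CNN maps a variable-length sequence of reading measure vectors to a single sigmoid score. The sketch below uses one convolutional layer with global max pooling and random weights, whereas the actual model has two convolutional layers, per-layer pooling, and tuned hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cnn_forward(x, conv_w, conv_b, dense_w, dense_b):
    """Simplified CNN classifier: 1-D convolution over the word axis,
    ReLU, global max pooling, and a sigmoid output.
    x: (K, R) sequence of K reading measure vectors.
    conv_w: (n_filters, width, R) convolution kernels."""
    n_filters, width, _ = conv_w.shape
    K = x.shape[0]
    feat = np.empty((K - width + 1, n_filters))
    for t in range(K - width + 1):          # valid convolution along words
        window = x[t:t + width]             # (width, R)
        feat[t] = np.tensordot(conv_w, window, axes=([1, 2], [0, 1])) + conv_b
    feat = np.maximum(feat, 0.0)            # ReLU
    pooled = feat.max(axis=0)               # global max pool over positions
    return sigmoid(pooled @ dense_w + dense_b)

rng = np.random.default_rng(2)
x = rng.normal(size=(15, 6))                # 15 words, 6 toy reading measures
p = cnn_forward(x,
                conv_w=rng.normal(size=(4, 3, 6)) * 0.1,
                conv_b=np.zeros(4),
                dense_w=rng.normal(size=4) * 0.1,
                dense_b=0.0)
print(p)  # a probability in (0, 1)
```

Because the pooling collapses the position axis, the same weights handle sentences of any length K ≥ the kernel width, which is what makes aggregation-free sequence input possible.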

Incorporating the linguistic stimulus
Using contextualized word embeddings. To incorporate the linguistic stimulus (the words occurring in the current sentence), we first extract 768-dimensional BERT embeddings e_jk for each word in a given sentence j, using the pre-trained BERT_BASE embeddings provided by Hugging Face (Wolf et al., 2020), and concatenate them with the reading measure vector w_ijk, resulting in an enriched reading measure vector r_ijk. Concatenating the full embedding to the feature vectors yields 768 + R dimensions, a substantial increase in the number of parameters to be estimated. Given the small amount of available training data, we test two methods of dimensionality reduction: (i) We perform PCA on the word embeddings and use the first 20 principal components. (ii) Mean-difference encoding: In order to capture domain-specific information from the word embeddings relating to differences in reading behaviour exhibited by individuals with and without dyslexia, we train a feed-forward neural network with one hidden layer of size 20 to predict, for each word, the difference between the two groups' mean values of each eye movement feature, based on the word's original embedding. The values of the hidden layer are a compressed representation of the original embedding that is optimized to encode information discriminating between children with and without dyslexia. In order to avoid train-test data leakage, the mean-difference encoder is trained from scratch on the respective training set in each fold.
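A sketch of the mean-difference encoder under stated assumptions: tanh activation, MSE loss, and plain gradient descent (none of which are pinned down above), with random toy data standing in for the BERT embeddings and the per-word group mean differences:

```python
import numpy as np

def train_mean_diff_encoder(E, D, hidden=20, lr=0.01, epochs=300, seed=0):
    """One-hidden-layer network mapping a word embedding to the per-word
    difference in group means of each eye movement feature. After
    training, the hidden layer serves as a 20-dimensional compression
    of the embedding that is tuned to group differences.
    E: (n_words, emb_dim) embeddings, D: (n_words, R) mean differences."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(E.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, D.shape[1]))
    for _ in range(epochs):
        H = np.tanh(E @ W1)                       # hidden representation
        err = H @ W2 - D                          # prediction error
        gW2 = H.T @ err / len(E)                  # gradient of MSE wrt W2
        gW1 = E.T @ ((err @ W2.T) * (1 - H**2)) / len(E)
        W2 -= lr * gW2
        W1 -= lr * gW1
    return lambda X: np.tanh(X @ W1)              # the compressed encoding

rng = np.random.default_rng(3)
E = rng.normal(size=(200, 768))                   # toy BERT-sized embeddings
D = rng.normal(size=(200, 12))                    # toy per-word mean differences
encode = train_mean_diff_encoder(E, D)
print(encode(E).shape)  # (200, 20)
```

In the evaluation pipeline this training would happen once per fold, on that fold's training set only, to avoid the leakage mentioned above.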
Using manually extracted features. As an alternative way to incorporate the linguistic stimulus, we add a range of manually extracted linguistic features for each token w_jk in sentence j: surprisal, i.e., −log p(w_jk | w_j<k), estimated with GPT-2 (Radford et al., 2019); part of speech, dependency relation type, and distance to syntactic head, extracted using spaCy (Honnibal et al., 2020); and mean character frequency and lexical frequency, extracted from SUBTLEX-CH (Cai and Brysbaert, 2010).
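Surprisal is simply the negative log-probability of a token given its left context. A toy bigram model with made-up probabilities, standing in for GPT-2, illustrates the computation:

```python
import math

# Toy bigram language model in place of GPT-2; the probabilities are
# invented purely for illustration.
bigram_p = {("<s>", "the"): 0.4, ("the", "cat"): 0.1, ("cat", "sat"): 0.5}

def surprisal(prev, word):
    """-log p(w_k | w_<k) in nats; higher values = more surprising."""
    return -math.log(bigram_p[(prev, word)])

sent = ["<s>", "the", "cat", "sat"]
surps = [surprisal(p, w) for p, w in zip(sent, sent[1:])]
print([round(s, 3) for s in surps])  # "cat" (p = 0.1) is the most surprising
```

With a real language model, p(w_k | w_<k) would come from the model's softmax over the full left context rather than a bigram table.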
Data. We employ eye-tracking-while-reading data from 62 Mandarin Chinese children (33 with dyslexia) provided by Pan et al. (2014). Participants were instructed to read 60 sentences out loud while their eye movements were recorded. 40 sentences were selected from fifth-grade textbooks and 20 additional control sentences were extracted from the Beijing Sentence Corpus (Pan et al., 2022). The dyslexia label had been assigned when a child scored at least 1.5 standard deviations below the corresponding age mean on a standard character recognition test (Shu et al., 2003).

Evaluation procedure
We evaluate our models using 10-fold nested cross-validation in two settings. In the sentence prediction setting, we predict the label from a single sentence read by a given subject. In the subject prediction setting, we average the sigmoid outputs from all sentences read by a given subject in order to obtain a subject-level prediction. In both settings, sentences are stratified over 10 folds, balanced by group. Data from the same subject is always constrained to one fold; thus, the model always makes predictions for unseen subjects.
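The subject prediction setting reduces to averaging a subject's per-sentence sigmoid outputs and thresholding the mean; the probabilities below are hypothetical:

```python
import numpy as np

def subject_prediction(sentence_probs, delta=0.5):
    """Average the per-sentence sigmoid outputs of one subject and
    threshold the mean at delta to obtain the subject-level label."""
    mean_p = float(np.mean(sentence_probs))
    return int(mean_p >= delta), mean_p

# Sigmoid outputs for one (hypothetical) subject's five sentences
label, p = subject_prediction([0.8, 0.6, 0.55, 0.3, 0.7])
print(label, p)  # 1 0.59
```

Averaging before thresholding is what makes subject-level classification more robust than sentence-level classification: single noisy sentences are smoothed out, consistent with the higher subject-level scores reported in the Results section.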
Hyperparameter tuning. For each test fold, we iterate through 9 validation folds, training 50 LSTM and 100 CNN models using randomly sampled parameter combinations for each fold. We select the highest-scoring parameter set over all 9 validation folds and train a final model using 8 training folds. We use one left-out fold for early stopping and evaluate on the test fold.

Results
For all methods, we report AUC as well as accuracy, recall, precision, and the harmonic precision-recall mean F1 for a decision threshold of 0.5 on subject- and sentence-level. As can be seen in Table 1, our proposed models reach but do not outperform state-of-the-art performance. While on subject-level the CNN architecture enriched with PCA-reduced word embeddings achieves the highest AUC, on sentence-level the best results are obtained by the LSTM that solely includes eye-movement features. Overall, we note that classification performance on subject-level is higher than on sentence-level and that adding the linguistic stimulus does not aid classification performance, neither as contextualized word embeddings nor as manually extracted features. Furthermore, as can be seen in Figure 2, performance varies considerably with respect to different test sets. We also observe that the variance in AUC for models enriched with the linguistic stimulus is larger for LSTMs compared to CNNs. Lastly, our domain-specific dimensionality reduction method (cf. Section 4.2.1) has no advantage over PCA, although the former is explicitly trained on differences between the two groups.

Discussion
Our proposed neural sequence models reach state-of-the-art performance on the task of detecting dyslexia from eye gaze sequences, investigated here for the first time for a logographic script such as Chinese. Our results suggest that, for our dataset, (i) neural architectures processing eye-movement sequences along the sentence have no advantage over the parsimonious SVM baseline where features are aggregated over the sentence, and (ii) enabling the interaction between stimulus input and eye movements does not improve classification performance. However, having shown that our approach is able to reach SOTA performance, we aim to exploit its properties to investigate the informativeness of particular sentences, words, and other linguistic sub-units for dyslexia detection in the future. Furthermore, for all investigated models, the overall performance appears to be driven by a small subset of individuals who presumably exhibit less typical reading behavior among their group and were more difficult to classify. Given that dyslexia is a spectrum disorder, not the binary condition it is often perceived to be, it is to be expected that individuals who are not located at the two extremes (clearly dyslexic or clearly not dyslexic) are more difficult to classify in a binary setting.
Our study was able to show that an SVM-based approach, previously applied to alphabetic languages such as Swedish and Spanish, also works well on a logographic script such as Chinese. In future work, we would like to test our approach on alphabetic language datasets. This is particularly interesting given the fact that young Chinese readers are faced with different challenges, e.g., the absence of orthographic word boundaries, therefore requiring word segmentation, and the much larger number of characters required to be memorized.
Limitations. It should be noted that our dataset contained very little data. Considering that the number of parameters of our sequence models exceeded that of the baseline model by orders of magnitude, it might be worth comparing the approaches again once more data is available. The problem of data scarcity might be alleviated by pre-training on domain-general eye-tracking datasets or with data augmentation methods (in a preliminary experiment, we pre-trained our models on the Beijing Sentence Corpus (Pan et al., 2022) and found that this did not increase classification performance). Furthermore, we did not have access to the raw scores of the character recognition task. While our methods did not outperform the baseline in this binary setting, it would be interesting to assess their performance on a regression task.

Conclusion
For the first time, we deploy models to detect dyslexia from eye gaze sequences on data from Mandarin Chinese readers. We propose two sequence classification approaches that (i) take as input the full, non-aggregated linguistic stimulus and (ii) model the interaction of the stimulus with the eye movements. As a comparison, we adapt a previously proposed SVM-based approach for Mandarin Chinese. We find that all models reach SOTA performance for data based on a logographic script such as Chinese. In addition, we find that incorporating the linguistic stimulus does not improve the models' performance. Given that we reach SOTA performance on a very small dataset, our approach has proven worthwhile to be pursued, expanded, and further tested (e.g., on alphabetic language datasets). It has the potential to be successfully deployed in the context of automated approaches for dyslexia detection, with the final objective being the improvement of educational equity.

Figure 2: ROC curves over all test sets for the best performing model (LSTM with no linguistic stimulus representation) on sentence-level.

Table 1: Classification results using 10-fold cross-validation on subject- and sentence-level. We report AUC, accuracy, recall, precision, and F1 [results ± standard error]. The latter four were computed for a decision threshold of 0.5.