Cross-Lingual Transfer of Cognitive Processing Complexity

When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity and show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages, despite being fine-tuned only on English data. We quantify the sensitivity of the model to structural complexity and distinguish a range of complexity characteristics. Our results indicate that the model develops a meaningful bias towards sentence length but also integrates cross-lingual differences. We conduct a control experiment with randomized word order and find that the model seems to additionally capture more complex structural information.


Introduction
Approximately 7,000 languages are currently spoken in the world, exhibiting differences at almost every level of linguistic organization (Eberhard et al., 2022). Nonetheless, psycholinguistic theories are predominantly supported by evidence from a handful of Indo-European languages (Norcliffe et al., 2015). Only recently have researchers started to explore cross-linguistic differences in the neural implementation of language, uncovering both striking similarities across languages and empirical differences that cannot be explained by a unitary account (Malik-Moraleda et al., 2022).
In natural language processing, multilingual language models are optimized for tasks such as machine translation or cross-lingual information retrieval (Conneau et al., 2020) and follow a linguistically naïve training regime. They are trained on dozens of languages simultaneously and do not account for typological differences between languages. Nevertheless, their cross-lingual transfer performance sets new records, even in zero-shot settings (Pires et al., 2019). The ability to transfer knowledge across languages has been attributed to the shared vocabulary that is used for all languages (Wu and Dredze, 2019) because it enables the reuse of common morphological roots for languages from the same family. However, recent studies indicate that vocabulary sharing is not a prerequisite for cross-lingual transfer (Artetxe et al., 2020) and that structural commonalities between languages play a more prevalent role in models (Karthikeyan et al., 2020).
Human sentence processing is sensitive to structural complexity. Eye movement data recorded during reading provide insights into cognitive processing patterns with a temporal accuracy of milliseconds (Winke, 2013). Structural processing difficulty materializes as regressions towards the complex region and an increase of fixations on that region (Clifton and Staub, 2011). For example, sentences with an object-relative structure trigger more regressions than sentences with more common subject-relative clauses (Gordon et al., 2006). A classical example of structural complexity is the garden-path sentence, which initially triggers a simplified interpretation that must be revised when reading the rest of the sentence (Bever, 1970).
On the surface level, eye movement patterns are language-specific since they are influenced by visual factors such as orthography and word length (Kliegl et al., 2004). For example, the Chinese script is much more visually dense than the alphabetic script, resulting in longer fixations and saccades that move to positions relatively close to the current word (Liversedge et al., 2016). On a deeper processing level, reading patterns seem to converge across languages. Predictability effects have been demonstrated in multiple languages (Al-Jassmi et al., 2022; Laurinavichyute et al., 2019) and sentences that are matched for content are read at a similar speed in Chinese, English, and Finnish (Liversedge et al., 2016). Sarti et al. (2021) find that the representations of an English pre-trained transformer-based language model encode structural complexity more prominently when they are fine-tuned to predict English eye-tracking patterns. Interestingly, Rama et al. (2020) claim that structural similarity between languages is only weakly represented in multilingual models. Nevertheless, Hollenstein et al. (2021) show that multilingual models are able to predict eye movement patterns of reading even for languages that are not seen during fine-tuning, which indicates a general learnability of the relationship between structural complexity and eye movement patterns. Their results are restricted to four languages (three of them are from the Germanic family), and it remains unclear which structural cues are leveraged for the cross-lingual prediction because the test sentences are not aligned across languages.

Contributions
We examine whether the multilingual model XLM-RoBERTa (henceforth XLM-R) is sensitive to the structural complexity patterns that can be found in eye-tracking data. We use data from the newly released Multilingual Eyetracking Corpus (Siegelman et al., 2022) to predict eye movement patterns for parallel texts in 13 typologically diverse languages. This allows us to specifically target the model's sensitivity towards structural information and rules out the possibility that the results are influenced by differences in semantics or dataset sizes.
We show that XLM-R can apply cross-lingual transfer to predict eye-tracking patterns for all 13 languages while being fine-tuned only on English eye-tracking data. Our results indicate that the model develops a meaningful bias towards sentence length, but also integrates cross-lingual differences. For a more detailed analysis of structural sensitivity, we probe the model's final layer for complexity features. Based on a control experiment with randomized word order, we conclude that the model seems to additionally capture more complex structural information. All our experimental code is publicly available at https://github.com/CharlottePouw/crosslingual-complexity-transfer.

Related Work
We introduce recent findings on the role of structural information for cross-lingual transfer in multilingual models and motivate the use of eye-tracking data as a proxy for cognitive processing complexity.

Cross-lingual Transfer in Multilingual Models
Massive multilingual language models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) are trained on more than a hundred languages simultaneously. Wu and Dredze (2019) show that this approach leads to surprisingly strong performances in cross-lingual transfer settings and attribute the improvements to the shared subword vocabulary. Pires et al. (2019) note that the model's ability to generalize "cannot be attributed solely to vocabulary memorization". Complementing this, Artetxe et al. (2020) and Liu et al. (2020) find that a shared vocabulary is not necessary for cross-lingual transfer. Instead, the multilingual model seems to exploit structural similarity between the training and the target language to facilitate transfer (Karthikeyan et al., 2020). Structural similarity is loosely defined as an overlap on a subset of typological characteristics, which seem to be better reflected in multilingual language models explicitly optimizing for cross-lingual transfer (Beinborn and Choenni, 2020; Choenni and Shutova, 2022). In language-agnostic models such as mBERT and XLM-R, the multilingual representations of the input can be separated into language-specific and language-neutral components (Tanti et al., 2021; Libovický et al., 2020; Gonen et al., 2020). While Rama et al. (2020) find that structural similarity between languages is only weakly represented in these models, Bjerva et al. (2019) observe that structural similarity between languages correlates most with representational similarity. Experiments with artificial languages indicate that multilingual models are sensitive to hierarchical structure (De Varda and Zamparelli, 2022) and to word order (Chai et al., 2022; Deshpande et al., 2022). Ahmad et al. (2021) show that cross-lingual transfer can be improved by explicitly encoding structural information via an auxiliary syntactic objective, and Guarasci et al. (2022) find that structural complexity knowledge can even be transferred across languages without explicit training.

Predicting Processing Complexity
Recent studies indicate that transformer-based language models are sensitive to structural characteristics of the input sentence when predicting eye-tracking patterns. Hollenstein et al. (2021) find a correlation between the Flesch reading ease score and eye-tracking prediction accuracy of pre-trained multilingual transformer models, which disappears after fine-tuning. Wiechmann et al. (2022) detect similar correlations between the prediction accuracy of English transformer models and a wider range of readability features. Finally, Hollenstein et al. (2022b) find that eye-tracking metrics predicted by multilingual transformer models correlate in a similar way with readability features as eye-tracking metrics recorded from human readers.
Sensitivity to structural complexity also seems to increase when incorporating eye-tracking data in NLP models. Learning eye movement behavior as an auxiliary task has been shown to facilitate the prediction of text complexity in English and Portuguese (González-Garduño and Søgaard, 2017; Evaldo Leal et al., 2020). Barrett et al. (2016) show that English eye-tracking features improve the performance of a French part-of-speech tagger, suggesting that information learned from monolingual eye-tracking data is transferable across languages.
In this work, we explicitly test for sensitivity to a range of structural characteristics in multilingual models and analyze if structural sensitivity increases by learning to predict eye-tracking patterns. We extend previous analyses to a much wider range of languages from five different families (Indo-European, Koreanic, Semitic, Turkic, and Uralic).

Methodology
We fine-tune a pre-trained multilingual transformer model to predict eye-tracking metrics in a setting of zero-shot cross-lingual transfer.

Data
We use the aligned multilingual eye-tracking corpus MECO for testing. As the multilingual data consists of only a few samples, we use the larger monolingual English eye-tracking dataset GECO for training. Size statistics of both corpora can be found in the appendix in Table 3.

Multilingual Eye-tracking Corpus (MECO)
The Multilingual Eye-tracking Corpus contains parallel eye-tracking data of reading in 13 different languages (Siegelman et al., 2022). The reading material consists of 12 short Wikipedia-style texts about various topics, which participants read in their native language. The texts were either directly translated or carefully matched for topic, genre, and readability. Each of the 12 texts was presented on a single screen and in the same fixed order in all languages. The number of participants ranged from 29 to 54 per language (45 on average).

Ghent Eye-tracking Corpus (GECO)
The Ghent Eye-tracking Corpus contains eye-tracking data from 14 monolingual English readers (Cop et al., 2016). They read the entire novel The Mysterious Affair at Styles by Agatha Christie, which was presented on the screen one paragraph at a time.

Experimental Setup
We use multi-task learning for predicting four sentence-level eye-tracking metrics.
Sentence-Level Eye-Tracking Metrics. Liversedge et al. (2016) find that eye movement patterns are more comparable across languages at the sentence level than at the word level. We select four sentence-level eye-tracking metrics that cover both early and late language processing, in line with Sarti et al. (2021). For each sentence s, we consider:
1. Fixation count: number of fixations on s
2. Total fixation duration: total duration of all fixations on s
3. First-pass duration: duration of the first reading pass over s
4. Regression duration: total duration of all regressions within s
Duration values are measured in milliseconds. To obtain generalized eye movement patterns, we average all eye-tracking metrics over participants and scale each eye-tracking feature to fall in the range 0-100, so that the loss can be calculated uniformly for durations and counts (Hollenstein et al., 2021).
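The preprocessing of the eye-tracking metrics can be illustrated with a small sketch. It assumes that the scaling is a simple min-max rescaling per feature, which the text does not state explicitly; the function names and the toy durations are hypothetical.

```python
from statistics import mean


def average_over_participants(per_participant):
    """Average one metric over readers.

    Each inner list holds one reader's values, one value per sentence.
    """
    return [mean(values) for values in zip(*per_participant)]


def scale_to_0_100(values):
    """Min-max scale sentence-level values to the range 0-100 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


# Hypothetical total fixation durations in ms: 3 readers x 4 sentences.
durations = [
    [1200, 3400, 2100, 5000],
    [1000, 3000, 2500, 5400],
    [1400, 3200, 2300, 5200],
]
averaged = average_over_participants(durations)
scaled = scale_to_0_100(averaged)
print(averaged)
print(scaled)
```

After this step, counts and durations live on the same scale, so their squared errors contribute comparably to a joint loss.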
The distribution of the four metrics is shown in the appendix in Figure 7.
Model. We use XLM-R (Conneau et al., 2020) as our multilingual transformer model since it achieved the best zero-shot results in the CMCL 2022 Shared Task on Multilingual and Crosslingual Prediction of Human Reading Behaviour (Srivastava, 2022; Hollenstein et al., 2022a). The model was pre-trained on 2.5TB of CommonCrawl data containing 100 languages using the masked language modelling objective and uses SentencePiece subword tokenization (Kudo and Richardson, 2018). We select the Hugging Face checkpoint xlm-roberta-base and add a linear dense layer to predict four sentence-level eye-tracking metrics.
Multi-Task Learning. We employ multi-task learning with hard parameter sharing to fine-tune the model on all eye-tracking metrics simultaneously, in line with Sarti et al. (2021). This means that all model parameters are shared except for the task-specific regression heads in the final prediction layer. More specifically, the same sentence representation is fed into each of the four regression heads, which predict their respective eye-tracking metric. The model parameters are optimized jointly for all regression tasks by summing the individual MSE losses, in line with previous work (Hollenstein et al., 2021, 2022a; Wiechmann et al., 2022).
Training Parameters. We fine-tune XLM-R for 15 epochs with early stopping after 5 epochs without an improvement in the validation accuracy. We use 10% of the training data as validation data and evaluate every 40 steps. We employ a batch size of 32 and a learning rate of 1e-5. The sentence representation is obtained by mean pooling over token representations. We train the model on the GECO data using 5-fold cross-validation and report the average over the folds for each language in MECO.
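The following is a minimal sketch of the prediction head and loss, not the released implementation. It replaces the XLM-R encoder with random toy token vectors and only illustrates mean pooling, four task-specific linear heads on a shared sentence representation, and the summed MSE loss. The task names, the hidden size, and the target values are illustrative assumptions.

```python
import random


def mean_pool(token_vectors):
    """Average token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def linear_head(sentence_vector, weights, bias):
    """One task-specific regression head: a dot product plus a bias."""
    return sum(w * x for w, x in zip(weights, sentence_vector)) + bias


def summed_mse(predictions, targets):
    """Sum the per-task mean squared errors over a batch.

    Both arguments map a task name to a list of values, one per sentence.
    """
    total = 0.0
    for task, preds in predictions.items():
        errors = [(p - t) ** 2 for p, t in zip(preds, targets[task])]
        total += sum(errors) / len(errors)
    return total


TASKS = ["fixation_count", "total_fixation_duration",
         "first_pass_duration", "regression_duration"]

rng = random.Random(0)
hidden = 8  # toy hidden size; the real encoder is much larger

# Toy stand-in for the encoder output: 2 sentences with 5 and 3 token vectors.
batch = [[[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n)]
         for n in (5, 3)]

# One (weights, bias) pair per task; the sentence vector is shared by all heads.
heads = {t: ([rng.gauss(0, 0.1) for _ in range(hidden)], 0.0) for t in TASKS}

sentence_vectors = [mean_pool(tokens) for tokens in batch]
predictions = {t: [linear_head(v, *heads[t]) for v in sentence_vectors]
               for t in TASKS}
targets = {t: [50.0, 20.0] for t in TASKS}  # hypothetical scaled gold values

loss = summed_mse(predictions, targets)
print(round(loss, 2))
```

Because the four task losses are simply added, a large error on one metric can be offset by small errors on the other three, which is relevant for interpreting the results on regression duration.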
Evaluation. We report explained variance and R-squared ($R^2$) to capture the proportion of variance in the dependent variable that can be explained by our model, in line with Sarti et al. (2021). Explained variance uses the biased variance to determine what fraction of the variance is explained. $R^2$ uses the raw sums of squares instead and provides complementary information about systematic offsets in the predictions. We report both metrics and evaluate the performance of the fine-tuned model individually for each of the four eye-tracking metrics.

Cross-Lingual Transfer Results
Figure 1 shows the explained variance and $R^2$ scores of the fine-tuned model for total fixation duration across languages. In terms of explained variance, we see that the model achieves a similar performance across languages, i.e., it captures 60 to 80 percent of the variance in the original eye-tracking signal for all languages. The $R^2$ scores, on the other hand, vary much more depending on the language. Similar results were observed for two of the other eye-tracking metrics, i.e., fixation count and first-pass duration, but the model is worse at predicting regression duration (see Figure 8 in the appendix). To better control for spurious correlations, we ran the experiment on permuted input-output pairs, i.e., we paired input sentences with eye-tracking values corresponding to another random sentence and averaged the results over 5 folds. For this random baseline setup, both explained variance and $R^2$ are always strictly negative for all languages. To better understand the varied $R^2$ scores for different languages, we show the distribution of the true and predicted values for total fixation duration for two languages with high $R^2$ (Estonian, Turkish) and two languages with low $R^2$ (English, Korean) in Figure 2. We see that the low $R^2$ for English and Korean is caused by predictions that are consistently too high. For Estonian and Turkish, the difference between true and predicted values is clearly smaller, resulting in a higher $R^2$. Nevertheless, the model is able to predict a significant amount of the variance in the eye-tracking signal of all languages, as expressed by the stable explained variance scores across languages.
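The difference between the two evaluation metrics can be made concrete with a small worked example. The sketch below uses the standard definitions (explained variance as one minus the ratio of residual variance to target variance, $R^2$ as one minus the ratio of residual to total sum of squares); the toy values are hypothetical. Adding a constant offset to the predictions leaves explained variance unchanged but lowers $R^2$, which mirrors the pattern of consistently overestimated predictions.

```python
def _mean(values):
    return sum(values) / len(values)


def _biased_variance(values):
    """Population variance (divides by n, not n - 1)."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); a constant offset cancels out."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _biased_variance(residuals) / _biased_variance(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; a constant offset inflates SS_res."""
    m = _mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical scaled eye-tracking values.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
y_shifted = [p + 15.0 for p in y_pred]  # same pattern, systematically too high

for name, pred in [("unbiased", y_pred), ("shifted", y_shifted)]:
    print(name,
          round(explained_variance(y_true, pred), 3),
          round(r_squared(y_true, pred), 3))
```

In this toy case both metrics agree for the unshifted predictions, while the shifted predictions keep the same explained variance and receive a negative $R^2$.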
Interestingly, the model performs slightly better for most zero-shot languages than for the fine-tuning language English. Recall that this performance difference cannot be attributed to cross-lingual differences in semantics, since all sentences are parallel with respect to content. On the right side of Figure 2, we analyze the predictions with respect to sentence length and find that both the model predictions and the true values for fixation duration correlate with sentence length in all languages. As sentence length is an indicator of structural complexity, we further dissect this phenomenon and conduct an analysis of a range of structural characteristics in the following section.

Sensitivity to Structural Complexity
We explore four categories of sentence-level complexity features: length, frequency, morpho-syntactic, and syntactic. Word frequencies are obtained as standardized Zipf frequencies using the Python package wordfreq (Speer et al., 2018). The package combines several frequency resources, including SUBTLEX lists (e.g., Brysbaert and New, 2009) and OpenSubtitles (Lison and Tiedemann, 2016). The morpho-syntactic and syntactic features are computed using the Profiling-UD tool (Brunato et al., 2020).
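To illustrate what a Zipf frequency expresses, the sketch below computes it from raw counts: the Zipf value of a word is the base-10 logarithm of its frequency per billion tokens. The function names, the toy counts, the corpus size, and the choice to skip unseen tokens are illustrative assumptions; in the paper, the values come from the frequency resources bundled with wordfreq rather than from a hand-made count table.

```python
import math


def zipf_from_count(count, corpus_size):
    """Zipf value: base-10 log of the word's frequency per billion tokens."""
    return math.log10(count * 1e9 / corpus_size)


def average_zipf(tokens, counts, corpus_size):
    """Mean Zipf value over tokens with a known count; unseen tokens are skipped."""
    values = [zipf_from_count(counts[t], corpus_size)
              for t in tokens if t in counts]
    return sum(values) / len(values) if values else 0.0


# Hypothetical counts in a toy corpus of one million tokens.
toy_counts = {"the": 50_000, "of": 30_000, "god": 100, "gates": 10}
tokens = "the god of gates".split()
print(round(average_zipf(tokens, toy_counts, 1_000_000), 2))
```

On this scale, a word that occurs once per million tokens has a Zipf value of 3, and very frequent function words lie around 7.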

Cross-Lingual Differences
We showcase an individual example sentence in Table 1 to compare the predicted fixation duration for English, Finnish and Turkish. We observe that the highest value is predicted for the English version. This is most likely caused by its length, as the sentence is less complex than the Finnish and Turkish versions in terms of all other linguistic features.
Interestingly, the model predicts that Finnish readers will fixate on the sentence longer than Turkish readers, even though both sentences have the same length. The Turkish sentence contains longer, less frequent words, and is lexically more dense, but the Finnish sentence contains longer dependency links. This indicates that the model is more sensitive to dependency structure than to low-level complexity (i.e., word length and frequency) when predicting eye-tracking values for sentences of the same length.

Sensitivity to Fine-Tuning Input
To analyze the model's sensitivity to the structural complexity of the fine-tuning data, we compare the performance of the fine-tuned model for in-domain data (English GECO) and cross-domain data (English MECO). Table 2 shows the explained variance and $R^2$ scores of the fine-tuned model predictions for each eye-tracking metric for both domains. We see that the model consistently yields

English: In ancient Roman religion and myth, Janus is the god of beginnings and gates.
Turkish: Antik Roma inanışlarında ve mitlerinde, Janus başlangıçların ve kapıların tanrısıdır.

To better understand why the model does not generalize well across domains for English, we visualize the Spearman correlation between complexity features and eye-tracking metrics for English GECO and MECO sentences in Figure 3. We see that the predicted values for the MECO sentences exhibit a similar correlation pattern with the complexity features as the GECO sentences. The true values of MECO are less consistent with this pattern. Literary texts contain very different words than encyclopedic texts, which might influence fixation durations and trigger regressions that cannot solely be explained by structural complexity. In addition, MECO is significantly smaller than GECO (99 vs 4,041 English sentences) and contains data from a higher number of participants (46 vs 14). The smaller number of sentences and the larger number of readers increase the effect of individual differences, which might obscure correlations between structural complexity and eye movement patterns. Directly applying the learned correlations from GECO to MECO might explain why the fine-tuned model fails to generalize across domains.
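The correlation analysis between complexity features and eye-tracking metrics relies on Spearman's rank correlation, which can be sketched as the Pearson correlation of the ranked values. The implementation below is a self-contained illustration with hypothetical sentence lengths and durations, not the analysis code of the paper.

```python
def rank(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical sentence lengths (tokens) and total fixation durations (ms).
lengths = [5, 12, 8, 20, 15]
durations = [900, 2500, 1600, 4100, 2600]
print(spearman(lengths, durations))
```

Because only ranks matter, the measure captures any monotonic relationship between a complexity feature and an eye-tracking metric, not only a linear one.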

Structural Complexity
The average sentence length is considerably higher in GECO than in MECO (21 vs 13 words, see Table 3). As the model predictions strongly correlate with sentence length, we speculate that the model overestimates eye-tracking values for sentences that are longer than the majority of fine-tuning sentences, which would explain the higher mean of the predictions in Figure 2.
Multi-Task Learning Effect. Figure 3 further shows that regression duration is only weakly correlated with the complexity metrics, in contrast to the other eye-tracking metrics. Nevertheless, the correlations between the model predictions and the complexity features are similar for all four metrics. This indicates a drawback of multi-task learning: since the loss is computed jointly over all tasks, accurate predictions for three out of four tasks already yield a small loss. The model seems to overfit to first-pass duration, total fixation duration and fixation count, which can all be predicted from similar complexity features, and does not learn the deviating patterns to predict regression duration. Further research is needed to better understand the linguistic features underlying regression duration.

Feature-Based Prediction
To further establish which complexity features are good predictors for each individual eye-tracking metric, we examine the extent to which the four eye-tracking metrics can be predicted from explicit features. Since multi-task learning seems to have a negative impact on learning the structural features underlying each individual eye-tracking metric, we train a separate feature-based model for each eye-tracking metric individually. We use support vector machines (SVM) with a linear kernel as our feature-based regression models. We employ the SVR implementation from scikit-learn (Pedregosa et al., 2011) with all default parameters and use different subsets of features from Table 1: 1) only the two length features, 2) only the two frequency features, 3) only the five structural (i.e., morpho-syntactic and syntactic) features, and 4) all nine features. As the SVM models solve a simpler problem (predicting a single eye-tracking metric), it is not surprising that they outperform the fine-tuned multi-task model with respect to the absolute predictions (as measured by $R^2$, see appendix Figure 9). More interestingly, Figure 4 shows that the multi-task model is able to capture a similar amount of variance as the length-based SVM. Furthermore, we see that the length-based SVM performs almost identically to the SVM trained on all complexity features, outperforming the SVMs trained on frequency features and structural features. This shows that length is a strong predictor for sentence-level eye-tracking metrics, and suggests that structural and frequency features do not provide much additional information. We further investigate if length is the main factor affecting the predictions of the fine-tuned model in the following section.

The Role of Sentence Length
To test whether the fine-tuned XLM-R model captures more sophisticated structural information than sentence length, we conduct two additional experiments. First, we probe the final-layer representations of the model for the complexity features from Table 1, both before and after fine-tuning on eye-tracking data. Second, we compare the performance of the fine-tuned model to a control condition: we randomize the word order within each MECO sentence to analyze the prediction performance on scrambled input.

Probing Set-up
We train regressors $g_i$ to predict a value for each of the nine latent factors of structural complexity $z_1, \dots, z_9$ using XLM-R's final-layer representation $\theta(x)$ of our input sentence $x$. The prediction accuracy of $g_i$ is an indication of how prominently the linguistic property $z_i$ is encoded in $\theta$. We analyze this both for the pre-trained and fine-tuned representations of XLM-R to quantify the relative increase of sensitivity to $z_i$ after fine-tuning on eye-tracking metrics.
We conduct the probing experiments for three typologically different languages to analyze if the structural sensitivity that was acquired from English eye-tracking data transfers to other languages. As input, we use 1,000 parallel sentences from the English, Korean and Turkish parts of the Parallel Universal Dependencies (PUD) treebanks, which were randomly selected from Wikipedia and news articles (Zeman et al., 2017). We apply a 5-fold cross-validation setting with 800 sentences for training the probing regressors for each language and the remaining 200 for testing. We use the same architecture as described in Section 3.2, but freeze the encoder model and only update the final regression layer during training. The regression layer contains nine probing heads (one for each linguistic feature) and is trained for 5 epochs.
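The 5-fold split with 800 training and 200 test sentences per fold can be sketched as follows. This is an illustrative helper with a hypothetical name and an arbitrary seed, and it assumes that the number of sentences is divisible by the number of folds; it is not taken from the released code.

```python
import random


def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train, test) index lists for each fold.

    Assumes n_items is divisible by n_folds so that all folds have equal size.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    fold_size = n_items // n_folds
    for fold in range(n_folds):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(n_items=1000, n_folds=5))
for train, test in folds:
    print(len(train), len(test))
```

Each sentence appears in exactly one test fold, so every probing score is computed on sentences that the corresponding regressor has not seen during training.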

Results
We report the results of the probing experiments and the model performance on scrambled inputs.
Probing. Figure 5 shows the relative probing performance for each complexity feature. We see that fine-tuning yields the largest improvements for probing sentence length and average dependency link length. For the other complexity features, we see that the fine-tuned representations yield little to no improvement in probing accuracy compared to the pre-trained representations. This mostly concerns the features for which sentence length is factored out, i.e., average word frequency, average word length and lexical density. Sarti et al. (2021) report similar results and show that increased probing performance for dependency features persists for sentences of the same length. This provides additional evidence that structural information is learned in addition to low-level length information.
We observe only minor differences in probing accuracy for individual complexity features of English, Korean and Turkish sentences. The general pattern is consistent for all languages: features related to the structural complexity of sentences are more easily predicted after fine-tuning on eye-tracking metrics. This indicates that the fine-tuned model is able to transfer structural complexity knowledge acquired from English eye-tracking data to other languages.
The probing results are reported for a multi-task set-up in line with Sarti et al. (2021), using the same hyperparameters as for the fine-tuning experiments but without intermediate evaluation on a development set. We also ran single-task probing as a sanity check and obtained similar results.

Influence of Word Order
We compare the performance of the fine-tuned model on sentences with normal versus scrambled word order, both in terms of explained variance and $R^2$. We measure similar explained variance scores for both input types. This indicates that the model is able to account for a large portion of the variance in our eye-tracking data by merely considering sentence length. The $R^2$ scores, on the other hand, are consistently lower for scrambled inputs, as shown for total fixation duration in Figure 6 (see appendix Figure 10 for the other eye-tracking metrics). We conclude that the model is sensitive to word order and bases its eye-tracking predictions not only on sentence length but also on more complex structural characteristics.
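A scrambled control condition of this kind can be sketched as below. The sketch assumes whitespace tokenization and leaves punctuation attached to its word, which may differ from the procedure actually used; the function name and the seed are hypothetical.

```python
import random


def scramble_word_order(sentence, rng):
    """Shuffle the whitespace-separated words of a sentence."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(42)
original = ("In ancient Roman religion and myth, Janus is the god of "
            "beginnings and gates.")
scrambled = scramble_word_order(original, rng)
print(scrambled)
```

The scrambled sentence contains the same words and therefore has the same length, word lengths and word frequencies, so any drop in prediction quality can be attributed to information carried by word order.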

Conclusion
We find that XLM-R can apply cross-lingual transfer to predict cognitive processing difficulty with similar performance across 13 typologically diverse languages, despite being fine-tuned only on English data. We conducted a range of experiments to quantify the model's sensitivity to structural complexity and find that the fine-tuned model prominently encodes sentence length, but also considers more complex structural information such as dependency structure and word order for the prediction of eye-tracking metrics.
Our analyses suggest that domain differences in training and testing data have a greater impact on model performance than language differences within the same domain. More specifically, XLM-R performs better on in-domain GECO data than cross-domain MECO data, but within MECO, XLM-R shows similar performance across languages. This aligns with the findings of Morger et al. (2022), who show that the correlation between relative importance metrics and total fixation duration is influenced by text domain. Our study highlights the significance of controlling for text domain and size, as it allows us to evaluate cross-lingual generalization that is independent of dataset characteristics.
In future work, we plan to better account for individual differences between readers (Brandl and Hollenstein, 2022) and spill-over effects across sentence boundaries (Wiechmann et al., 2022). The modeling approach for learning eye-tracking patterns also needs further exploration. We find that sentence-level prediction of eye-tracking patterns works well for learning about structural complexity, but that it is not optimal for capturing lexical complexity. Token-level measures, as predicted in Hollenstein et al. (2021), are more likely to be informative about lexical phenomena. A joint loss for sentence and token-level eye-tracking metrics might lead to sensitivity to a wider range of linguistic complexity features.

Limitations
The main limitation of our work is the use of relatively small datasets for testing our models due to limited availability of eye-tracking data in multiple languages. The dataset used for testing cross-lingual transfer (MECO) contains approximately 100 sentences per language. For probing structural complexity, we used a sample of 1,000 sentences per language.
As in related work, we averaged the eye-tracking metrics over readers to obtain a more robust indication of human reading behavior. This approach disregards the fact that reading is a highly individual process that is dependent on cognitive factors and experience. A computational model might develop a better sense of linguistic complexity when it learns about the linguistic properties that lead to variation across readers, and we are working towards methods for integrating this information.

Figure 1: Cross-lingual transfer results for predicting cognitive processing complexity (i.e., sentence-level fixation duration). Prediction performance is evaluated with explained variance and $R^2$ for each language in MECO. The results are averaged over 5 folds; error bars denote the standard deviation over folds.

Figure 2: The left plot shows the distribution of true and predicted values for total fixation duration for Estonian, Turkish, English and Korean sentences in MECO. The right plot shows the distribution of values with respect to sentence length.

Figure 4: Explained variance of the four feature-based SVM models and the fine-tuned XLM-R model. The models are trained on GECO using 5-fold cross-validation and evaluated on the English part of MECO; error bars denote the standard deviation over folds.

Figure 5: Relative improvement in $R^2$ for complexity features of English, Korean and Turkish sentences in fine-tuned XLM-R sentence representations over pre-trained representations. The results are calculated using probing regressors and averaged over 5 folds.

Figure 6: $R^2$ scores for total fixation duration for each language in MECO, both for sentences with normal and scrambled word order. The results are averaged over 5 folds; error bars denote the standard deviation.

Figure 8: Cross-lingual transfer results for predicting cognitive processing complexity (i.e., fixation count, first-pass duration and regression duration). Prediction performance is evaluated with explained variance and $R^2$ for each language in MECO. The results are averaged over 5 folds; error bars denote the standard deviation over folds.

Table 1: Predicted values for total fixation duration for the same example sentence in English, Finnish, and Turkish (top), and the respective values for the nine structural complexity features (bottom).

Figure 3: Spearman correlations between complexity features and eye-tracking metrics of GECO and the English part of MECO (predicted versus true). A darker color represents a stronger correlation. All GECO correlations are significant (p < 0.001); MECO correlations above 0.2 are significant (p < 0.01).