Relative Importance in Sentence Processing

Determining the relative importance of the elements in a sentence is a key factor for effortless natural language understanding. For human language processing, we can approximate patterns of relative importance by measuring reading fixations using eye-tracking technology. In neural language models, gradient-based saliency methods indicate the relative importance of a token for the target objective. In this work, we compare patterns of relative importance in English language processing by humans and models and analyze the underlying linguistic patterns. We find that human processing patterns in English correlate strongly with saliency-based importance in language models and not with attention-based importance. Our results indicate that saliency could be a cognitively more plausible metric for interpreting neural language models. The code is available on GitHub: https://github.com/beinborn/relative_importance.


Introduction
When children learn to read, they first focus on each word individually and gradually learn to anticipate frequent patterns (Blythe and Joseph, 2011). More experienced readers are able to completely skip words that are predictable from the context and to focus on the more relevant words of a sentence (Schroeder et al., 2015). Psycholinguistic studies aim at unraveling the characteristics that determine the relevance of a word and find that lexical factors such as word class, word frequency, and word complexity play an important role, but that the effects vary depending on the sentential context (Rayner and Duffy, 1986).
In natural language processing, the relative importance of words is usually interpreted with respect to a specific task. Emotional adjectives are most relevant in sentiment detection (Socher et al., 2013), the relative frequency of a term is an indicator for information extraction (Wu et al., 2008), the relative position of a token can be used to approximate novelty for summarization (Chopra et al., 2016), and function words play an important role in stylistic analyses such as plagiarism detection (Stamatatos, 2011). Neural language models are trained to be a good basis for any of these tasks and are thus expected to represent a more general notion of relative importance (Devlin et al., 2019).
Relative importance of the input in neural networks can be modulated by the so-called "attention" mechanism (Bahdanau et al., 2014). Analyses of image processing models indicate that attention weights reflect cognitively plausible patterns of visual saliency (Xu et al., 2015; Coco and Keller, 2012). Recent research in language processing finds that attention weights are not a good proxy for relative importance because different attention distributions can lead to the same predictions (Jain and Wallace, 2019). Gradient-based methods such as saliency scores seem to better approximate the relative importance of input words for neural processing models (Bastings and Filippova, 2020).
In this work, we compare patterns of relative importance in human and computational English language processing. We approximate relative importance for humans as the relative fixation duration in eye-tracking data collected in naturalistic language understanding scenarios. In related work, Sood et al. (2020a) measure the correlation between human fixations and attention in neural networks trained for a document-level question-answering task and find that the attention in a transformer language model deviates strongly from human fixation patterns. In this work, we instead approximate relative importance in computational models using gradient-based saliency and find that it correlates much better with human patterns.

Figure 1: Example fixations for two subjects in the ZuCo dataset for the sentence "The soundtrack alone is worth the price of admission". The numbers indicate the fixation duration and the circles represent the approximate horizontal position of the fixation (positions are simplified for better visualization). The plot at the bottom indicates the relative importance of each token averaged over all subjects.

Determining Relative Importance
The concept of relative importance of a token for sentence processing encompasses several related psycholinguistic phenomena such as relevance for understanding the sentence, difficulty and novelty of a token within the context, semantic and syntactic surprisal, or domain-specificity of a token. We take a data-driven perspective and approximate the relative importance of a token by the processing effort that can be attributed to it compared to the other tokens in the sentence.

In Human Language Processing
The sentence processing effort can be approximated indirectly using a range of metrics such as response times in reading comprehension experiments (Su and Davison, 2019), processing duration in self-paced reading (Linzen and Jaeger, 2016), and voltage changes in electroencephalography recordings (Frank et al., 2015). In this work, we approximate relative importance using eye movement recordings during reading because they provide online measurements in a comfortable experimental setup which is more similar to a normal, uncontrolled reading experience. Eye-tracking technology can measure with high accuracy how long a reader fixates each word. The fixation duration and the relative importance of a token for the reader are strongly correlated with reading comprehension (Rayner, 1977; Malmaud et al., 2020).
Language models that look ahead and take both the left and right context into account are often considered cognitively less plausible because humans process language incrementally from left to right (Merkx and Frank, 2020). However, in human reading, we frequently find regressions: humans fixate relevant parts of the left context again while already knowing what comes next (Rayner, 1998). In Figure 1, subject 1 first reads the entire sentence and then jumps back to the token "alone". Subject 2 performs several regressions to better understand the second half of the sentence. The fixation duration is a cumulative measure that sums over these repeated fixations. Absolute fixation duration can vary strongly between subjects due to differences in reading speed, but the relative fixation duration provides a good approximation for the relative importance of a token as it abstracts from individual differences. We average the relative fixation duration over all subjects to obtain a more robust signal (visualized in the plot at the bottom of Figure 1).
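The aggregation described above can be sketched as follows. This is a minimal illustration, not our released code; the fixation durations are invented toy values, not data from ZuCo or GECO:

```python
import numpy as np

def relative_fixation(durations):
    """Normalize one subject's total fixation durations (per token, in ms,
    including regressions) so that they sum to 1 for the sentence."""
    durations = np.asarray(durations, dtype=float)
    return durations / durations.sum()

def averaged_relative_importance(per_subject_durations):
    """Average the relative fixation duration over all subjects."""
    relative = [relative_fixation(d) for d in per_subject_durations]
    return np.mean(relative, axis=0)

# Toy example: two subjects reading the same five-token sentence.
subject_1 = [200, 450, 120, 80, 300]  # slower reader
subject_2 = [100, 240, 60, 40, 160]   # faster reader, similar relative pattern
importance = averaged_relative_importance([subject_1, subject_2])
print(importance)  # peaks on the second token for both subjects
```

Because each subject's durations are normalized before averaging, the two readers contribute equally despite their different reading speeds.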

In Computational Language Processing
In computational language models, the interpretation of a token depends on the tokens in its context but not all tokens are equally important. To account for varying importance, so-called attention weights regulate the information flow in neural networks (Bahdanau et al., 2014). These weights are optimized with respect to a target objective and higher attention for an input token has been interpreted as higher importance with respect to the output (Vig, 2019). Recent research indicates that complementary attention distributions can lead to the same model prediction (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) and that the removal of input tokens with large attention weights often does not lead to a change in the model's prediction (Serrano and Smith, 2019). In transformer models, the attention weights often approximate an almost uniform distribution in higher model layers (Abnar and Zuidema, 2020). Bastings and Filippova (2020) argue that saliency methods are more suitable for assigning importance weights to input tokens.
Saliency methods calculate the gradient of the output corresponding to the correct prediction with respect to an input element to identify those parts of the input that have the biggest influence on the prediction (Lipton, 2018). Saliency maps were first developed for image processing models to highlight the areas of the image that are discriminative with respect to the tested output class (Simonyan et al., 2014). Li et al. (2016) adapt this method to calculate the relative change of the output probabilities with respect to individual input tokens in text classification tasks and Ding et al. (2019) calculate saliency maps for interpreting the alignment process in machine translation models.
In general-purpose language models such as BERT (Devlin et al., 2019), the objective function tries to predict a token based on its context. A saliency vector for a masked token thus indicates the importance of each token in the context for correctly predicting the masked token (Madsen, 2019).
We iterate over each token vector x_i in our input sequence x_1, x_2, ..., x_n. Let X_i be the input matrix with x_i being masked. The saliency s_ij of input token x_j for the prediction of the correct token t_i is then calculated as the Euclidean norm of the gradient of the logit for t_i with respect to x_j:

s_ij = || ∂ logit(t_i | X_i) / ∂ x_j ||_2
The saliency vector s_i indicates the relevance of each token for the correct prediction of the masked token t_i. 1 The saliency scores are normalized by dividing by the maximum. We determine the relative importance of token x_j by summing the saliency scores s_ij over all masked positions i. For comparison, we also approximate importance using attention values from the last layer of each model, following Sood et al. (2020a).
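The post-processing steps can be sketched as follows. To keep the example self-contained, a random tensor stands in for the true gradients ∂logit(t_i)/∂x_j that backpropagation through a masked language model would return; only the norm, max-normalization, and summation follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, emb_dim = 6, 16

def saliency_for_masked_position(grads):
    """grads: (n_tokens, emb_dim) gradient of the correct token's logit
    with respect to every input embedding, for one masked position i.
    Returns the max-normalized saliency vector s_i."""
    s = np.linalg.norm(grads, axis=1)  # Euclidean norm per input token j
    return s / s.max()                 # normalize by the maximum

# Stand-in gradients: one (n_tokens, emb_dim) matrix per masked position i.
gradients = rng.normal(size=(n_tokens, n_tokens, emb_dim))

saliency = np.stack([saliency_for_masked_position(g) for g in gradients])
# Relative importance of token j: sum s_ij over all masked positions i,
# rescaled so the scores are comparable to relative fixation duration.
relative_importance = saliency.sum(axis=0)
relative_importance /= relative_importance.sum()
print(relative_importance)
```

In the real pipeline, each row of `gradients` would come from masking position i, running the model, and backpropagating the logit of the correct token t_i to the input embeddings.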

Patterns of Relative Importance
Relative importance in human processing and in computational models is sensitive to linguistic properties. Rayner (1998) provides a detailed overview of token-level features that have been found to correlate with fixation duration such as length, frequency, and word class. On the contextual level, lexical and syntactic disambiguation processes cause regressions and thus lead to longer fixation duration (Just and Carpenter, 1980; Lowder et al., 2018). Computational models are also highly susceptible to frequency effects and surprisal metrics calculated using language models can predict the human processing effort (Frank et al., 2013).
The inductive bias of language processing models can be improved using the eye-tracking signal (Barrett et al., 2018; Klerke and Plank, 2019) and the modification leads to more "human-like" output in generative tasks (Takmaz et al., 2020; Sood et al., 2020b). This indicates that patterns of relative importance in computational representations differ from human processing patterns. Previous work focused on identifying links between the eye-tracking signal and attention (Sood et al., 2020a). To our knowledge, this is the first attempt to correlate fixation duration with saliency metrics.

The eye-tracking signal represents human reading processes aimed at language understanding. In previous work, we have shown that contextualized language models can predict eye patterns associated with human reading (Hollenstein et al., 2021), which indicates that computational models and humans encode similar linguistic patterns. It remains an open debate to which extent language models are able to approximate language understanding (Bender and Koller, 2020). We are convinced that language needs to be cooperatively grounded in the real world (Beinborn et al., 2018). Purely text-based language models clearly miss important aspects of language understanding, but they can approximate human performance in an impressive range of processing tasks. We aim to gain a deeper understanding of the similarities and differences between human and computational language processing to better evaluate the capabilities of language models.

1 Our implementation adapts code from https://pypi.org/project/textualheatmap/. An alternative would be to multiply saliency and input (Alammar, 2020).

Methodology
We extract relative importance values for tokens from eye-tracking corpora and language models as described in section 2 and calculate the Spearman correlation for each sentence. 2 We first average the correlation over all sentences to analyze whether the importance patterns of humans and models are comparable and then conduct token-level analyses.
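This evaluation step can be sketched as follows; the per-token importance lists are invented toy values standing in for relative fixation duration and model saliency:

```python
import numpy as np
from scipy.stats import spearmanr

def sentence_correlations(human_importance, model_importance):
    """Spearman correlation between human and model importance,
    computed separately for every sentence."""
    correlations = []
    for h, m in zip(human_importance, model_importance):
        rho, _ = spearmanr(h, m)
        correlations.append(rho)
    return correlations

# Toy corpus of two sentences (one importance score per token).
human = [[0.1, 0.5, 0.2, 0.2], [0.3, 0.1, 0.6]]
model = [[0.2, 0.4, 0.2, 0.2], [0.25, 0.05, 0.7]]

per_sentence = sentence_correlations(human, model)
mean_correlation = np.mean(per_sentence)
print(per_sentence, mean_correlation)
```

Averaging per-sentence correlations (rather than pooling all tokens) keeps sentences of different lengths on an equal footing.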

Eye-tracking Corpora
We extract the relative fixation duration from two eye-tracking corpora and average it over all readers for each sentence. Both corpora record natural reading and the text passages were followed by multiple-choice questions to test the readers' comprehension.
GECO contains eye-tracking data from 14 native English speakers reading the entire novel The Mysterious Affair at Styles by Agatha Christie (Cop et al., 2017). The text was presented on the screen in paragraphs.
ZuCo contains eye-tracking data of 30 native English speakers reading full sentences from movie reviews and Wikipedia articles (Hollenstein et al., 2018, 2020). 3

Language Models
We compare three state-of-the-art language models trained for English: BERT, ALBERT, and DistilBERT. 4 BERT was the first widely successful transformer-based language model and remains highly influential (Devlin et al., 2019). ALBERT and DistilBERT are variants of BERT that require less training time due to a considerable reduction of the training parameters while maintaining similar performance on benchmark datasets (Lan et al., 2019; Sanh et al., 2019). 5 We analyze whether the lighter architectures have an influence on the patterns of relative importance that the models learn.

Results
The results in Table 1 show that relative fixation duration by humans strongly correlates with the saliency values of the models. In contrast, attention-based importance does not seem to be able to capture the human importance pattern. A random permutation baseline that shuffles the importance assigned by the language model yields no correlation (0.0) in all conditions. 6 As the standard deviations of the correlation across sentences are quite high (ZuCo: ∼0.22, GECO: ∼0.39), the small differences between models can be neglected (although they are consistent across corpora). For the subsequent analyses, we focus only on the BERT model, which yields the best results. The differences between the corpora might be related to the number of sentences and the differences in average sentence length (ZuCo: 924 sentences, 19.5 tokens; GECO: 4,926 sentences, 12.7 tokens).

3 We combine ZuCo 1.0 (T1, T2) and ZuCo 2.0 (T1). 4 We use the Huggingface transformers implementation (Wolf et al., 2020) and the models bert-base-uncased, albert-base-v2, and distilbert-base-uncased. 5 Reduction is achieved by parameter sharing across layers (ALBERT) and by distillation, which approximates the output distribution of the original BERT model using a smaller network (DistilBERT). See model references for details. 6 We repeat the permutation 100 times and average the correlation over all iterations.

Table 2: Spearman correlation between relative importance and word length and frequency. For the Sent condition, correlation is calculated per sentence and averaged. For Tok, importance is normalized by sentence length and correlation is calculated over all tokens.
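The permutation baseline can be sketched as follows; the importance scores are random toy values standing in for the human and model importance of a small corpus:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def permutation_baseline(human, model, n_iter=100):
    """Shuffle the model importance scores of every sentence, correlate
    with the human scores, and average over all iterations."""
    iteration_means = []
    for _ in range(n_iter):
        per_sentence = []
        for h, m in zip(human, model):
            rho, _ = spearmanr(h, rng.permutation(m))
            per_sentence.append(rho)
        iteration_means.append(np.mean(per_sentence))
    return np.mean(iteration_means)

# Toy corpus: 20 sentences of 12 tokens each.
human = [rng.random(12) for _ in range(20)]
model = [rng.random(12) for _ in range(20)]
baseline = permutation_baseline(human, model)
print(baseline)  # close to 0
```

Destroying the token order in this way removes any systematic relation to the human scores, so the averaged correlation hovers around zero.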

Length and Frequency
In eye-tracking data, word length correlates with fixation duration because it takes longer to read all characters. The correlation for frequency is inverse because high-frequency words (e.g. "the", "has") are often skipped in processing as they carry (almost) no meaning (Rayner, 1998). For English, word frequency and word length are both closely related to word complexity (Beinborn et al., 2014). Language models do not directly encode word length but they are sensitive to word frequency. Our results in Table 2 show that both token length and frequency are strongly correlated with relative importance on the sentence level. Interestingly, the correlation decreases when it is calculated directly over all tokens, indicating that the token-level relation between length and importance is more complex than the correlation might suggest.

Word Class
Figure 2 shows the average relative importance of all tokens belonging to the same word class (normalized by sentence length). We see that both humans and BERT clearly assign higher importance to content words (left) than to function words (right). Interjections such as "Oh" in Figure 3 receive the highest relevance, which is understandable because they interrupt the reading flow. When we look at individual sentences, we note that the differences in importance are more pronounced in the model saliency while human fixation duration yields a smoother distribution over the tokens.

Novelty
We extract the language model representations for each sentence separately whereas the readers processed the sentences consecutively. If tokens are mentioned repeatedly (such as "Sherlock Holmes", which also occurred in the sentence preceding the example in Figure 3), processing ease increases for the reader but not for the model. Some language models are able to process multiple sentences, but establishing semantic links across sentences remains a challenge.

Conclusion
We find that human sentence processing patterns in English correlate strongly with saliency-based importance in language models and not with attention-based importance. Our results indicate that saliency could be a cognitively more plausible metric for interpreting neural language models. In future work, it would be interesting to test the robustness of the approach with different variants for calculating saliency (Bastings and Filippova, 2020; Ding and Koehn, 2021). As we conducted our analyses only for English data, it is not yet clear whether our results generalize across languages. We will address this in future work using eye-tracking data from non-English readers (Makowski et al., 2018; Laurinavichyute et al., 2019) and comparing mono- and multilingual models (Beinborn and Choenni, 2020). We want to extend the token-level analyses to syntactic phenomena and cross-sentence effects. For example, it would be interesting to see how a language model encodes relative importance for sentences that are syntactically correct but not semantically meaningful (Gulordava et al., 2018). Previous work has shown that the inductive bias of recurrent neural networks can be modified to obtain cognitively more plausible model decisions (Bhatt et al., 2020; Shen et al., 2019). In principle, our approach can also be applied to left-to-right models such as GPT-2 (Radford et al., 2019). In this case, the tokens at the beginning of the sentence would be assigned disproportionately high importance as the following tokens cannot contribute to the prediction of preceding tokens in incremental processing. It might thus be more useful to only use the first fixation duration of the gaze signal for analyzing importance in left-to-right models. However, we think that the regressions by the readers provide valuable information about sentence processing.

Ethical Considerations
Data from human participants were taken from freely available datasets (Hollenstein et al., 2018, 2020; Cop et al., 2017). The datasets provide anonymized records in compliance with ethical board approvals and do not contain any information that can be linked to the participants.

A Additional Results
Figure 4: Relative importance of tokens with respect to word class in the GECO dataset: (a) human fixation (top), measured as relative fixation duration, and (b) model saliency (bottom), measured as relative gradient-based saliency in the BERT model. This is the same figure as Figure 2 in the paper but it includes the number of instances per word class on top of the respective bar.