A Comparative Study on Textual Saliency of Styles from Eye Tracking, Annotations, and Language Models

There is growing interest in incorporating eye-tracking data and other implicit measures of human language processing into natural language processing (NLP) pipelines. The data from human language processing contain unique insight into human linguistic understanding that could be exploited by language models. However, many unanswered questions remain about the nature of this data and how it can best be utilized in downstream NLP tasks. In this paper, we present EyeStyliency, an eye-tracking dataset for human processing of stylistic text (e.g., politeness). We develop an experimental protocol to collect these style-specific eye movements. We further investigate how this saliency data compares to both human annotation methods and model-based interpretability metrics. We find that while eye-tracking data is unique, it also intersects with both human annotations and model-based importance scores, providing a possible bridge between human- and machine-based perspectives. We propose utilizing this type of data to evaluate the cognitive plausibility of models that interpret style. Our eye-tracking data and processing code are publicly available.


Introduction
Human perception and understanding of text are critical in NLP. Typically, this understanding is leveraged in the form of ground-truth human annotations in supervised learning pipelines, or in the form of human evaluations of generated text. However, human language understanding is complex: multiple cognitive processes work together to enable reading, many of which occur automatically and unconsciously (DeVito, 1970).
Figure 1: Salient words for impoliteness from three different perspectives (BERT, human annotation, and eye tracking), shown over the text "@Delta. Are you kidding? Delayed from 7:30pm to 11:00 - only to cancel it because the pilot is overtime." We find that eye-tracking data contains some overlap between machine- and human-annotated salience.

Because of this complexity, disciplines concerned with understanding and modeling how humans read (e.g., psycholinguistics and cognitive science) heavily utilize implicit measures of the human reading experience that capture signals from these automatic processes in real time. Examples of implicit measures include event-related potentials, reaction times, and eye movements. In contrast, explicit measures include surveys and other methods that directly ask people to report their perceptions and experiences. We posit that traditional NLP pipelines, which have widely used explicit measures of human understanding, can also benefit from implicit measures. In this paper, we focus specifically on the use of eye movements as an implicit measure of textual saliency.
Recent research in NLP has demonstrated the feasibility of incorporating various types of eye movement data into NLP models in order to improve performance on a number of tasks (see Table 2 for an overview). However, this is still an underexplored area: best practices remain unclear, and it is not obvious whether some tasks are unsuitable for eye movement data, or how eye movement data should be balanced with traditional annotation data. In this work, we address two main research questions. RQ1: Does eye-tracking-based saliency meaningfully differ from simply gathering word-level human annotations, or from model-based word importance measures? RQ2: Which eye-tracking metrics and data processing methods are best suited to capturing textual saliency?
To address these questions, we conduct an eye-tracking case study in which participants read texts with linguistic styles from the Hummingbird dataset (Hayati et al., 2021). We choose this dataset because its domain (textual styles) has, to our knowledge, not been previously explored for eye-tracking applications, and because the dataset contains lexical-level human annotations indicating which words contribute to a text's style.
We conduct an experiment designed to collect style-specific eye movements over text (see Section 3 for details), and we compare this saliency information to the human annotations as well as to two large language model (LLM)-derived importance scores: integrated gradient scores from a BERT model fine-tuned on style datasets (Hayati et al., 2021), and word-surprisal scores from GPT-2 (Radford et al., 2019) (see Figure 1 for an example). Our findings indicate that eye-tracking-based saliency highlights some unique areas of the text, but it also intersects with both saliency from model-based metrics and saliency from human annotations, making a bridge of sorts between the human- and machine-based perspectives. In-context learning experiments with GPT-3, in which the model makes a style classification decision based on both the text and the salient words, do not reveal a clear "best" saliency metric, indicating that further study is required for evaluation.
Specifically, our contributions are:
• A three-way comparison between salient text obtained via annotation, via eye tracking, and via large language model importance scores, which illustrates the distinction between human annotations and human eye data.
• A first-of-its-kind eye movement dataset on style saliency, collected from 20 participants and consisting of both control readings and style-focused readings for polite, impolite, positive, and negative textual styles.
• An "eye-in-the-loop" in-context learning approach, suitable for the evaluation of small datasets, that estimates the feasibility of eye-tracking data for aiding NLP models in style-related tasks.

Related Work
Eye tracking has been a staple of psycholinguistic investigations of reading for decades (Rayner, 1978; Just and Carpenter, 1980). Eye movement data is compelling because it provides real-time information about how people process language in a natural, ecologically valid setting, i.e., there is no explicit experimental task, such as question answering, for participants to complete (Kaiser, 2013).
Eye data provides insight into cognitive processes through the eye-mind assumption, which posits that (1) our eyes fixate on whatever our brains are currently processing, and (2) as the cognitive effort to process an item increases, the amount of time that the eyes fixate on that item also increases (Just and Carpenter, 1980). Analysis of eye data under this framework has led to important insights into many unconscious phenomena in human language comprehension, e.g., the mechanisms involved in ambiguity resolution during reading (Traxler and Frazier, 2008).
Eye Tracking in NLP. Due to the eye-mind assumption, eye-tracking data is particularly well-suited to inferring patterns of reader attention, or saliency, over text. This saliency information has so far shown promising results when integrated into NLP models for question answering (e.g., Malkin et al., 2022; Sood et al., 2020a; Malmaud et al., 2020). However, this is still a developing research area: there is limited available data, and there is little consensus regarding how to effectively collect data and incorporate it into NLP pipelines. To our knowledge, there is no previous research that investigates saliency for style via eye tracking, nor any previous research that compares saliency from eye tracking to human annotations (Table 1 compares our work with prior work).
Outside of textual saliency, eye-tracking data has been leveraged for a variety of NLP tasks. Mishra et al. (2013) quantify the difficulty of sentences in machine translation tasks using eye movement data; Mishra et al. (2016) determine whether a reader understands sarcasm in text; and Søgaard (2016) evaluates the quality of word embeddings and text generations. Other work uses existing datasets, sometimes augmenting the data with a learned gaze-predictor model, and uses this eye movement data as an additional signal when training models for various NLP tasks, including named entity recognition (Hollenstein et al., 2019; Tokunaga et al., 2017), paraphrasing (Sood et al., 2020b), part-of-speech tagging (Barrett et al., 2018), and sentiment analysis (see also Mathias et al. (2020) for a review).
Saliency in Linguistic Styles. People apply styles to language in order to express attitudes, reflect interpersonal intentions or goals, or convey the social standing of the speaker or listener. The meaning expressed by these styles can be significant; in fact, there is strong evidence that effective communication requires an understanding of both style and literal semantic meaning (Hovy, 1987). Although fine-tuned BERT-based models (Devlin et al., 2018) show strong performance on style classification, recent findings indicate that there are significant differences between how BERT perceives style at the lexical level and how humans perceive it (Hayati et al., 2021).

EyeStyliency: A Dataset of Eye Data for Textual Saliency
We describe the data collection procedure for the EyeStyliency dataset, collected from 20 participants, and our methods for computing saliency scores over text.

Data Setup
Our dataset consists of items from the Hummingbird dataset (Hayati et al., 2021) in the following stylistic categories: polite, impolite, positive sentiment, and negative sentiment. We chose this subset because of the small correlation between categories (other categories, e.g., anger, disgust, and negative sentiment, are all highly correlated). (The politeness and sentiment datasets in Hummingbird are originally sourced from Danescu-Niculescu-Mizil et al. (2013) and Socher et al. (2013), respectively.)
In this initial exploratory study, we limit participants' total time commitment to one hour. To achieve this, the dataset size is 90 items across the four style categories. Most participants finished the experiment in 40-60 minutes, depending on both the individual's reading speed and the time needed to calibrate the eye tracker to the individual.

Eye-Tracking Measures
Monocular eye movement data is collected with an EyeLink 1000 Plus at a rate of 1000 Hz. We look at the following eye-tracking metrics:
• First Fixation Duration (FFD): The duration of the first fixation in an interest area.
• First Run Dwell Time (FRD): The time interval beginning with the first fixation in the interest area and ending when the eye exits the interest area (whether to the right or left).
• Go Past Time (GP): The time interval beginning with the first fixation in an interest area and ending when the eye exits the interest area to the left (i.e., to reread).
• Dwell Time (DT): The total fixation duration for all fixations in an interest area. Also known as gaze duration.
• Reread Time (RR): The total fixation duration for all fixations in an interest area after the area has already been entered and exited once.
• Pupil Size (PS): The average pupil size over all fixations in an interest area.
(Note that First Run Dwell Time + Reread Time = Dwell Time.) These measures can broadly be categorized into early measures (first fixation duration, pupil size) that reflect more low-level reading processes and late measures (go past time, dwell time, reread time) that reflect higher-level processing and meaning integration (Conklin et al., 2021). Previous eye-tracking applications for NLP have commonly used dwell time, but a variety of measures have been examined (see Table 2). In this study, we compare a wide variety of measures in order to estimate which may be best suited to capturing textual saliency. Note that to avoid redundancy, we omit fixation counts from our analysis after finding a high correlation between this measure and dwell time (Pearson's r = 0.93, p < 0.01). We also omit regression counts after finding that they were extremely sparse: only 1.8% of the dataset had a non-zero regression count.
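For concreteness, the sketch below shows one way these duration measures can be computed from a chronological fixation sequence. It is a simplified reconstruction based on the definitions above (go past time and pupil size are omitted), not the EyeLink Data Viewer implementation.

```python
from collections import defaultdict

def reading_measures(fixations):
    """fixations: chronological list of (interest_area_id, duration_ms)."""
    ffd, dt, frd = {}, defaultdict(float), defaultdict(float)
    exited = set()  # IAs whose first run has already ended
    prev = None
    for ia, dur in fixations:
        dt[ia] += dur                 # Dwell Time: all fixations in the IA
        if ia not in ffd:
            ffd[ia] = dur             # First Fixation Duration
        if ia not in exited:
            frd[ia] += dur            # First Run Dwell Time
        if prev is not None and prev != ia:
            exited.add(prev)          # leaving an IA ends its first run
        prev = ia
    rr = {ia: dt[ia] - frd[ia] for ia in dt}  # Reread Time = DT - FRD
    return ffd, dict(frd), dict(dt), rr
```

Note that the identity FRD + RR = DT holds by construction in this sketch.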

Pre-processing Eye Tracking Data
Eye data was delineated into fixations and saccades using the Data Viewer software with EyeLink's standard algorithm and default velocity and acceleration thresholds. We further cleaned the data by removing trials with significant track loss (i.e., trials with track loss in over 50% of the text area); 1.5% of trials were removed for this reason. An outlier analysis showed that 0.5% of fixations were outliers, and these were removed from our analysis.
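A rough sketch of a fixation-level outlier filter is given below. The specific criteria (fixations shorter than 80 ms or more than 3 standard deviations from the mean) are assumptions for illustration, not the exact rule used in our analysis.

```python
import numpy as np

def remove_outlier_fixations(durations, min_ms=80, z_cut=3.0):
    """Drop implausibly short fixations and extreme-duration outliers."""
    d = np.asarray(durations, dtype=float)
    z = (d - d.mean()) / d.std()
    return d[(d >= min_ms) & (np.abs(z) <= z_cut)]
```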

Calculating Saliency Scores
We divide the text into interest areas (IAs) and calculate saliency scores for each IA. We do not segment the IAs such that each IA contains a single word: in a single fixation, people can read a span of about 21 surrounding characters (Rayner, 1978), meaning that many short words are never fixated, which would complicate our desired analyses. Instead, we use the Natural Language Toolkit (NLTK) stopwords list (Bird et al., 2009) to define each IA such that stopwords share an IA with the closest non-stopword. Specifically, each stopword is combined with the closest non-stopword, with non-stopwords to the right being preferred in the case of a tie (see the sketch below). We also ensure that no IA contains a line break.
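The following is a minimal sketch of this merging rule, assuming the text has already been split into words; line-break handling is omitted for brevity.

```python
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def build_interest_areas(words):
    """Merge each stopword into the closest non-stopword's interest area."""
    content = [i for i, w in enumerate(words) if w.lower() not in STOP]
    if not content:
        return [" ".join(words)]
    # Assign each word to its nearest content word; ties prefer the right.
    owner = [min(content, key=lambda c: (abs(c - i), c < i))
             for i in range(len(words))]
    ias, current, cur_owner = [], [], owner[0]
    for w, o in zip(words, owner):
        if o != cur_owner:
            ias.append(" ".join(current))
            current, cur_owner = [], o
        current.append(w)
    ias.append(" ".join(current))
    return ias

# e.g. build_interest_areas("Thank you for your kind comment".split())
```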
We investigate two techniques for calculating eye-tracking-based saliency scores for each interest area IA_i:
• z-score: For each participant p_k, denote the eye-tracking measurement in IA_i as x_ki. We calculate the participant-specific z-score of the eye-tracking measurement from IA_i as z_k(IA_i) = (x_ki − μ_k) / σ_k, where μ_k and σ_k are the participant-specific arithmetic mean and standard deviation, respectively. Then, the saliency score for IA_i is the average of z_k(IA_i) over all participants (see the sketch below).
• raw: For each participant p_k, we aggregate the raw values of the eye-tracking measurements from each IA. The saliency score for IA_i is the average of these raw values over all participants.
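A compact sketch of the z-score technique, directly following the formula above; `data` is assumed to map each participant to an array of measurements over all interest areas.

```python
import numpy as np

def zscore_saliency(data):
    """data: dict mapping participant id -> array of measurements per IA."""
    z = []
    for x in data.values():
        x = np.asarray(x, dtype=float)
        z.append((x - x.mean()) / x.std())   # participant-specific z-scores
    return np.mean(z, axis=0)                # average over participants
```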

Experimental Procedure
The experiment follows a between-subjects, blocked design, where each block contains stimuli that share a specific style (polite, impolite, positive, or negative) and source (Twitter, IMDb, or Stack Exchange/Wikipedia forums). We occasionally present an incongruent style within a block (e.g., we may display an impolite Tweet during the polite Tweet block). We are interested in comparing the eye movements of participants who read a stimulus congruently with the eye movements of participants who read that stimulus incongruently: we expect that incongruent readings will pay more attention to style-specific aspects of the text, as these are unexpected and surprising, while the congruent reading of the text provides a control. Figure 3 shows a concrete example of these two conditions, and Figure 4 shows a visualization of the contrasted eye movements.
Figure 2 shows the overall procedure of our experiment; further details are in Appendix A. Participants complete nine blocks. At the beginning of each block, the participant is informed of the style and source and asked to pay attention to the style of the following texts. Each block contains 10 items, eight of which are congruent with the target style; the remaining two items are incongruent. Incongruent items are counterbalanced across participants. Blocks are presented in a random order, and items within the blocks are pseudorandomized to ensure adequate spacing between congruent and incongruent trials (Egner, 2007); we also include a block of context-free text as an added control. Participants are asked True/False comprehension questions pseudorandomly after 30% of the items in order to maintain motivation to read the items carefully. After the experiment concludes, participants complete the Perceived Awareness of the Research Hypothesis scale (PARH) (Rubin, 2016) to evaluate whether demand characteristics (Nichols and Maner, 2008) of the experiment may have influenced reading behavior. The study procedure was approved by the institutional review board (IRB).
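As an illustration, the following is a minimal sketch of one way to pseudorandomize a block under a spacing constraint. The rejection-sampling approach and the `min_gap` value are our assumptions; the paper does not specify the exact spacing rule.

```python
import random

def pseudorandomize(congruent, incongruent, min_gap=3):
    """Shuffle a block until incongruent items are at least min_gap apart."""
    items = [(s, True) for s in congruent] + [(s, False) for s in incongruent]
    while True:
        random.shuffle(items)
        idx = [i for i, (_, is_congruent) in enumerate(items) if not is_congruent]
        if all(b - a >= min_gap for a, b in zip(idx, idx[1:])):
            return items
```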

Figure 3: Illustrative example of congruent vs. incongruent presentation of the same stimulus. A context sentence ("The following movie reviews were written by critics who liked [disliked] the film.") precedes positive stimuli such as "A densely constructed, highly referential film, and an audacious return to form that can comfortably sit among Jean-Luc Godard's finest work." and negative stimuli such as "Watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes." We rely on expectation effects to induce participants to attend to the unexpected style (in this case, positive sentiment); in other words, we assume that the surprise regarding the style will result in longer gaze durations for words that contribute to the perception of that style - in this case, words relating to positive sentiment (relative to the gaze durations in the congruent condition).

Participants. We collect data from 20 participants (12 male, 7 female, 1 non-binary; median age 23 years), recruited from the university community and by word-of-mouth. An additional 6 participants were recruited but were unable to complete the study due to problems with eye tracker calibration. Participants were compensated with a $15 Amazon gift card.
Apparatus. Monocular eye movement data is collected with an EyeLink 1000 Plus, using the desktop mount, at a rate of 1000 Hz. Participants use a chinrest while reading in order to stabilize the head. We use the Experiment Builder software to present stimuli in a 16pt serif font with 1.5 line spacing, on a display monitor with a 508mm display area and a 1680x1050 resolution. Participants are seated with their eyes 50-60cm away from the display monitor.
[Figure 5 here: the same texts, e.g., "The movie, directed by Mick Jackson, leaves no cliche unturned, from the predictable plot to the characters straight out of central casting." (negative sentiment) and "It's one of those baseball pictures where the hero is stoic, the wife is patient, the kids are as cute as all get-out and the odds against success are long enough to intimidate, but short enough to make a dream seem possible." (positive sentiment), highlighted according to four saliency sources: human annotation, BERT integrated gradients, GPT-2 surprisal, and eye dwell time.]

Study Design Rationale. Based on the well-documented phenomenon of expectancy effects in cognition (see Schwarz et al. (2016) for further discussion), we assume that incongruent texts that subvert the stylistic expectation will lead participants to react with surprise and increased processing difficulty in response to the parts of the text associated with the unexpected style.
Alternative designs that explicitly ask participants to classify an item's style were strongly considered but rejected for two reasons: first, we are interested in observing a relatively natural reading process, and introducing a classification task runs counter to that goal; second, a style classification task could increase the saliency of not only the target style but also its opposing style, as both can be relevant to the decision. We also considered designs in which congruency is established via explicit text labels rather than implicit expectations, but decided instead to choose an experimental paradigm that adheres as closely as possible to an ecologically valid reading task.

Comparison with Other Saliency Metrics
We investigate how eye-tracking metrics compare with other existing measures of lexical-level significance, namely human annotations, integrated gradient scores, and large language model surprisal scores (see Figure 5 for a visualization of these scores):
• Surprisal scores: For the text in the i-th interest area, denoted IA_i, the surprisal is −log P(IA_i | IA_0, IA_1, ..., IA_{i−1}). We obtain this probability estimate from the pre-trained GPT-2 model (Radford et al., 2019); see the sketch following this list.
• Model gradient scores: Integrated gradient scores are obtained using the Captum codebase (Kokhlikyan et al., 2020) and the fine-tuned BERT model from Hayati et al. (2021). For IA_i, the integrated gradient score is the average of the individual token scores within IA_i.
• Human annotations: Human annotations come from the Hummingbird dataset (Hayati et al., 2021). Three annotators per item were asked to highlight words that contribute to the text's style. We averaged these binary highlighting scores over the annotators to arrive at a saliency score for each interest area.
Throughout the comparison, we answer two questions: how much do the salient words derived from each measure overlap, and how much does each measure agree on the saliency strength of each word?
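The surprisal computation can be sketched as follows using the Hugging Face transformers library. Aligning GPT-2 subword tokens to interest areas by character offsets, and summing token surprisals within an IA, are our assumptions about the aggregation; they are not specified by the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ia_surprisals(interest_areas):
    """interest_areas: list of strings; their space-joined form is the stimulus."""
    text = " ".join(interest_areas)
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Surprisal of token t is -log P(token_t | tokens_<t); token 0 has no score.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ids = enc.input_ids[0, 1:]
    surprisal = -log_probs[torch.arange(ids.size(0)), ids]
    offsets = tokenizer(text, return_offsets_mapping=True)["offset_mapping"][1:]
    scores, start = [], 0
    for ia in interest_areas:
        end = start + len(ia)
        # Sum the surprisal of tokens overlapping this IA's character span.
        scores.append(sum(s.item() for (b, e), s in zip(offsets, surprisal)
                          if b < end and e > start))
        start = end + 1  # skip the joining space
    return scores
```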
To find the overlap between salient interest areas derived from different measures, we compute a binary saliency map over the dataset for each measure, using the median as the saliency score threshold that determines whether an interest area is labeled "salient." We then compute the pairwise Jaccard similarity coefficient, i.e., the intersection over union, for each possible pairing of salient text sets (Figure 7). We find that the intersection over union of salient interest areas from eye-tracking methods and both integrated gradient scores and human annotations falls between 0.26 and 0.31. Critically, the three-way intersection over union between salient text from integrated gradients, human annotations, and eye-tracking metrics falls between 0.05 and 0.06, indicating that each metric captures a relatively unique set of text within the dataset (see Figure 6).
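A minimal sketch of this overlap computation, with score arrays as illustrative inputs:

```python
import numpy as np

def salient_set(scores):
    """Binarize at the dataset median: above-median IAs count as salient."""
    scores = np.asarray(scores, dtype=float)
    return {i for i, s in enumerate(scores) if s > np.median(scores)}

def jaccard(a, b):
    """Intersection over union of two sets of salient interest areas."""
    return len(a & b) / len(a | b)

# e.g.: eye, grad, anno = map(salient_set, (dwell_scores, ig_scores, anno_scores))
# pairwise: jaccard(eye, grad)
# three-way: len(eye & grad & anno) / len(eye | grad | anno)
```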
We also investigate what types of words are selected as salient by each method by performing part-of-speech (POS) tagging on the salient interest areas for each measure (Figure 8). While the distributions of parts of speech are similar, humans select proportionally more adjectives, whereas eye-tracking metrics select proportionally more verbs and adverbs. We hypothesize that this discrepancy may be explained by human annotators focusing on single words with high standalone style (oftentimes adjectives such as happy or gracious), while people's eyes attend to the context surrounding that word (oftentimes including verbs and adverbs). For example, in the polite phrase "Thank you for removing...," human annotators highlight only "thank you," whereas eye gaze also focuses on the gerund "removing." To measure agreement between different measures with respect to saliency strength, we compute a saliency score for each word in the dataset derived from each measure and then compute the pairwise Pearson's r correlation coefficient, finding that most coefficients are near 0 (see Appendix for exact results). In other words, while there is some agreement across human-, machine-, and eye-based methods with respect to which words are above median saliency, there is little correlation with respect to the saliency scores themselves.
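A rough sketch of these two follow-up analyses, assuming the salient interest areas and per-word score arrays from above; variable names are illustrative.

```python
from collections import Counter
import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
from scipy.stats import pearsonr

def pos_distribution(salient_texts):
    """Relative frequency of POS tags over the salient interest areas."""
    tags = Counter(tag for text in salient_texts
                   for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()}

# Agreement on saliency strength, e.g.:
# r, p = pearsonr(dwell_scores, anno_scores)
```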

Qualitative Results
We observe that eye data, and in particular dwell time, shows high attention to certain nouns, i.e., names, usernames, and movie titles.
For a qualitative visualization of saliency over the politeness style, see Figure 9. In general, human annotations tend to focus on segments of text with clear style markers: for instance, phrases such as "please" are consistently highlighted by human annotators. Our eye-tracking data indicates that these phrases do not reliably draw the reader's gaze during the real-time reading process. Instead, the eyes often focus on the object of the politeness marker rather than the politeness marker itself: for instance, in the polite text "Thank you for your kind comment," human annotators highlight only "thank you," whereas gaze data focuses on "your kind comment."

"Eye-in-the-loop" few-shot learning
It is difficult to directly evaluate the quality of the saliency scores for style-related NLP applications. Due to the small size of our dataset (90 items), we are unable to fine-tune a model on the eye-tracking data for evaluation. This is a common issue in eye-tracking applications to NLP, as data collection is resource-intensive. To circumvent this issue, we propose "eye-in-the-loop" few-shot learning in order to roughly evaluate the feasibility of the eye-tracking data for improving task performance. Few-shot (in-context) learning is a useful paradigm when faced with data scarcity.
In our case, we incorporate eye-tracking data into our prompts for a style classification task (Polite/Impolite and Negative/Positive) over our dataset by including important words in the prompt (see the sketch below). We run experiments using 0, 1, 2, and 4-shot learning, where the number of shots corresponds to the number of examples presented in the prompt. We randomly select these examples from our dataset, and we repeat each item five times, using a different set of examples in the prompt each time. We vary the "important words" section of the prompt to include the salient text as defined by each eye-tracking measure, the human annotations, and the integrated gradient scores (see Section 3.4 for details of how these important words are selected). As a baseline, we omit the "important words" section of the prompt. We also include a Hybrid Score that selects any text that is salient with respect to either human annotations or dwell time, since dwell time is the eye-tracking metric most aligned with human annotations. Results are relatively inconsistent across the four shot settings, but in most cases, including salient words does not hurt model accuracy on the style classification task. A subset of the results is shown in Figure 10; see Appendix for full results.
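The following is a sketch of how such a prompt can be assembled; the template wording and function signature are illustrative, not the verbatim prompt used in our experiments.

```python
def build_prompt(demos, text, important_words, labels=("Polite", "Impolite")):
    """demos: list of (text, important_words, label) demonstrations (k shots)."""
    prompt = ""
    for demo_text, demo_words, demo_label in demos:
        prompt += (f"Text: {demo_text}\n"
                   f"Important words: {', '.join(demo_words)}\n"
                   f"Style ({labels[0]}/{labels[1]}): {demo_label}\n\n")
    # The query item: the model completes the final "Style:" line.
    prompt += (f"Text: {text}\n"
               f"Important words: {', '.join(important_words)}\n"
               f"Style ({labels[0]}/{labels[1]}):")
    return prompt
```

For the baseline condition, the "Important words" line is simply omitted from the template.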

Key Findings and Discussion
Here we discuss the relationship between our results and our research questions. RQ1: Does eye-tracking data for saliency meaningfully differ from simply gathering word-level human annotations, or from model-based word importance measures? Our data shows a strong difference between eye-tracking-based saliency, model-based saliency, and human annotations. It is perhaps unintuitive that reading behavior would differ from self-reports after reading, but this is consistent with findings in psycholinguistics that establish strong distinctions between explicit measures (i.e., human annotations) and implicit measures (i.e., eye tracking) of human language processing. Interestingly, there is some intersection between eye-tracking-based saliency and model-based saliency that is not shared with human annotators. This suggests that some automatic aspects of human language processing, accessible through eye tracking but not necessarily through survey methods, may be shared with large language models.
RQ2: Which eye-tracking metrics and data processing methods are best suited to capturing textual saliency? We find that dwell time and reread time appear to be the strongest eye-tracking metrics for capturing textual saliency, both with respect to overlap with human- and machine-based saliency and with respect to our few-shot learning results. Using the same two criteria, we also find that using participant-level z-scores to represent the data gives the best results.
Finally, whether textual saliency through eye tracking is best obtained from comparative data (i.e., comparing an experimental and a control condition) remains an open question. In our case, there was a small trend toward comparative data outperforming a simple aggregation of both conditions in the few-shot learning experiments (see Figure 11), and comparative data had more overlap with human annotations (see Appendix). We also note that, by design, the experiment presented incongruent items rarely, so we have considerably more congruent datapoints than incongruent datapoints. It is possible that with more participants, the differences between the two conditions would become more pronounced.
Figure 4: Exemplary eye-tracking data showing saliency for polite style, with comparison to human word-level style importance highlighting. The eye-tracking data is visualized as a heat map showing gaze data from the incongruent style condition, with the gaze data from the congruent style (control) condition subtracted.
Figure 5: A comparison of salient words from various methods (manual human annotations, language model introspection, and eye tracking) for negative sentiment (a) and positive sentiment (b). Darker highlights indicate stronger saliency scores.

Figure 6: Venn diagram illustrating the intersection of sets of salient interest areas derived from Dwell Time (blue), integrated gradients (green), and human annotations (red).

Figure 7: Matrix of pairwise Jaccard similarity scores for salient text derived from each metric. (See Appendix for the correlation coefficients for saliency scores derived from each metric.)

Figure 9: Venn diagram showing interest areas salient to the polite style. For each section of the Venn diagram, the interest areas with the five highest saliency scores are shown.

Figure 10: Few-shot learning classification experiment accuracy scores, averaged over 5 rounds with randomly selected demonstrations. Error bars indicate the 95% confidence interval.

Figure 11: Few-shot learning experiment results for various eye-tracking metrics, using either all collected data or the comparison between the congruent and incongruent conditions.

Table 1: A summary of prior work applying eye-tracking methods to NLP. Most research has focused on either (a) comparing and contrasting eye movements with various models' attention mechanisms, or (b) multi-task learning, where NLP task performance can be improved by a model that jointly learns to predict eye movements in addition to the relevant NLP task. To our knowledge, there has not been prior work comparing eye-tracking-based saliency with human annotations.

Table 2: A comparison of prior works with respect to the eye-tracking metrics studied, data processing techniques, and number of participants whose eye-tracking data was collected. FFD = first fixation duration, FC = fixation count, RC = regression count, RR = reread time, PL = pupil size, N = number of participants.