Predict and Use: Harnessing Predicted Gaze to Improve Multimodal Sarcasm Detection

Sarcasm is a complex linguistic construct with incongruity at its very core. Detecting sarcasm depends on the actual content spoken and tonality, facial expressions, the context of an utterance, and personal traits like language proficiency and cognitive capabilities. In this paper, we propose the utilization of synthetic gaze data to improve the task performance for multimodal sarcasm detection in a conversational setting. We enrich an existing multimodal conversational dataset, i.e., MUStARD++, with gaze features. With the help of human participants, we collect gaze features for < 20% of data instances, and we investigate various methods for gaze feature prediction for the rest of the dataset. We perform extrinsic and intrinsic evaluations to assess the quality of the predicted gaze features. We observe a performance gain of up to 6.6% points by adding a new modality, i.e., collected gaze features. When both collected and predicted data are used, we observe a performance gain of 2.3% points on the complete dataset. Interestingly, with only predicted gaze features, too, we observe a gain in performance (1.9% points). We retain and use the feature prediction model that maximally correlates with collected gaze features. Our model trained on the combination of collected and synthetic gaze data achieves SoTA performance on the MUStARD++ dataset. To the best of our knowledge, ours is the first predict-and-use model for sarcasm detection. We publicly release¹ the code, gaze data, and our best models for further research.


Introduction
Sarcasm originates from the Greek word sarkasmós, adapted from sarkázein, which means a sneering or cutting remark. Sarcasm depends on "bitter, caustic, and other ironic expressions that are usually directed against an individual" (Gibbs, 1986). It is a complex linguistic phenomenon that gets expressed with words that mean the opposite of what the speaker intends to say; e.g., "I love being ignored" expresses the bitterness of the speaker. The roots of sarcasm lie in incongruity (Joshi et al., 2015), which makes computational sarcasm detection a challenging problem, and the NLP community has attempted to tackle this problem using innovative approaches. Sarcasm detection in text has largely been attempted by focusing on lexical indicators (Bamman and Smith, 2021), sentiment incongruity (Joshi et al., 2015), etc., in both rule-based and learning-based systems (Abulaish and Kamal, 2018). However, sarcasm is also expressed through tonal changes and/or facial expressions. Hence, researchers have started investigating modalities other than text, viz., audio and video, to help detect sarcasm (Castro et al., 2019a; Cai et al., 2019; Gupta et al., 2021; Chauhan et al., 2022; Ray et al., 2022). Mishra et al. (2017a) observed that gaze features are helpful in detecting sarcasm within short sentences without context, which is our inspiration. In a conversational setting, sarcasm often results from an earlier utterance, which is the problem we focus on in this work. To the best of our knowledge, ours is the first attempt at multimodal detection of sarcasm using gaze behaviour in a conversational setting.
¹ https://www.cfilt.iitb.ac.in/emnlp23sarcgaze

Gaze Terminology
A fixation is a relatively longer stay of gaze on an object (word), and saccades refer to quick shifting of gaze between two positions of rest (Mishra et al., 2017b). An Interest Area (IA) is a part of the screen that is of interest to us. In these areas, the text is displayed, and each word is a separate and unique IA. Forward and backward saccades are called progressions and regressions, respectively, while a scanpath is a line graph that contains fixations as nodes and saccades as edges.
We use the MUStARD++ dataset (Ray et al., 2022), which is a multimodal conversational dataset with videos annotated for sarcasm, sarcasm type and emotions. This data has several video frames as visuals linked with the utterance that is marked sarcastic or non-sarcastic. Our primary hypothesis in this work is that there are distinctive eye movement patterns when a human reader is processing sarcasm, due to the presence of incongruous words within the utterance or previously spoken sentences (Mishra et al., 2016b). Unlike previous studies, we perform the task of sarcasm detection in a conversational setting, exploiting multimodality and gaze features. Figure 1 illustrates gaze fixations (blue circles, with bigger circles for longer durations) and progressions-regressions for a sarcastic and a non-sarcastic utterance.
Figure 1: Sample images from a gaze data collection setup, showing saccadic movements (yellow lines) and fixations (blue circles) for 1) a sarcastic dialogue (left image) and 2) a non-sarcastic dialogue (right image).
Gaze features, however, are costly in terms of resources: subjects, data, time, and money. One major contribution of our work is predicting gaze features and harnessing the predicted features for sarcasm detection. We thus venture into generating and using synthetic data for sarcasm detection. Overall, our contributions are:
• A novel method for generating synthetic data from collected gaze features.
• Enriching the MUStARD++ dataset with eye-tracking/gaze features for 1155 samples collected from 5 human participants. This will be useful for research in eye-tracking-based detection of sarcasm and similar language phenomena.
• Comparing various gaze feature prediction techniques and utilizing gaze data, both collected and synthetic, to achieve SoTA performance (2.3% point gain) for multimodal sarcasm detection.

Motivation
From Figure 1, it can be observed that the non-sarcastic utterance has significantly less regressive eye movement (yellow lines) than the sarcastic utterance. The number of fixations is also lower. In the sarcastic utterance, we see a lot of regression on the part of the text containing "look up at the stars without a roof over your"; we also observe regressive movement towards the previous utterance in the context, towards "PhD in astrophysics". Such indicators can also be used to explain the origin of sarcasm from a conversational context. However, we observe that the non-sarcastic example (right) also has a few regressive paths leading to previous utterances, which will happen for any reader who wants to fully understand the context of the dialogue. We believe capturing these regressions and progressions present in gaze data can help detect sarcasm and generate similar gaze data for new samples, as fixations, movements, and regressions can be learned from them. We also believe the creation of quality synthetic eye-tracking data will be useful in reducing dependency on highly time-consuming human eye-tracking annotations.

Related Work
Existing studies demonstrate how cognitive features have been used to improve performance for various NLP tasks. User understandability of sarcasm can be evaluated with the help of gaze behaviour (Mishra et al., 2016a), where incongruity in the text induces gaze behaviour characterized by longer fixation durations, repeated regressions, and scanpath complexity (Mishra et al., 2017b). Previously, sarcasm detection based on only textual input has shown minor improvements with the help of gaze-based features (Mishra et al., 2016b, 2017a). Gaze behaviour has also been used to identify a reader's native language (Berzak et al., 2017), as well as to detect grammatical errors in compressed sentences (Klerke et al., 2015a, 2016). Klerke et al. (2015b) also show that gaze behaviour can be used to evaluate the output of Machine Translation systems better than automated metrics. Similarly, gaze-based features have also been shown to help the task of cognate and false friends' detection (Kanojia et al., 2021). Gaze behaviour has also been used to evaluate how a reader would rate the quality of a piece of text (Mathias et al., 2018). Similarly, Mathias et al. (2020b) perform the task of essay grading in a zero-shot setting using only gaze-based features and show the efficacy of gaze-based features for performing NLP tasks (Mathias et al., 2020a). However, existing research does not discuss the correlation of multimodal features (like visual and audio) with gaze-based features, and does not investigate these features for multimodal sarcasm detection in a conversational setting. In the subsection below, we discuss the literature on multimodal studies in NLP.
Lack of data has been a common problem for both sarcasm detection and cognitive NLP. Numerous efforts have been made to build gaze feature predictors in order to reduce the dependency on gold gaze data by producing high-quality synthetic gaze data. Takmaz (2022) uses adapters in a language model to match the results of a fully fine-tuned language model for predicting eye-tracking features, with a network that is highly efficient in terms of the number of parameters. Ding et al. (2022) propose a Bi-LSTM-based network that, with the help of a few psycho-linguistic features, predicts eye-tracking features. The paper states that the readability of a text, as reflected in the linguistic features, is important for predicting eye movement patterns (Scarborough et al., 2009). The creation of synthetic gaze data has also been performed in multilingual settings. In Srivastava (2022), a model trained on a completely different set of languages predicts gaze data for a completely new language.

Multimodal NLP
Existing literature on multimodal sentiment classification refers to the MOUD (Pérez-Rosas et al., 2013) and MOSI (Zadeh et al., 2016) datasets, and to the IEMOCAP dataset (Busso et al., 2008) for the task of multimodal emotion recognition. Poria et al. (2017) propose the use of a bidirectional contextual long short-term memory (bc-LSTM) architecture for both tasks and show improvements over the baseline on all three datasets. However, Majumder et al. (2018) later propose context modelling with a hierarchical fusion of multimodal features and achieve improved performance in a monologue setting. In the conversation setting, Hazarika et al. (2018) propose using a Conversational Memory Network (CMN) to leverage contextual information from the conversation history and achieve improved performance. Novel multimodal neural architectures (Wang et al., 2019; Pham et al., 2019) and multimodal fusion approaches (Liang et al., 2018; Tsai et al., 2018) have propelled the deployment of computational models. Efficient multimodal fusion approaches have also been discussed by Sahay et al. (2020), Tsai et al. (2019), and Liu et al. (2018).
For multimodal sarcasm detection, a recent survey discusses the datasets and approaches in detail (Bhat and Chauhan, 2022). The MUStARD dataset (Castro et al., 2019b) provides clips compiled from popular TV shows, including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous, annotated with sarcasm labels. Ray et al. (2022) extend this dataset by adding emotion labels and additional clips while also benchmarking for the multimodal sarcasm detection task. They call this extended dataset MUStARD++ and utilise feature fusion and a feedforward network to predict the sarcasm label. The authors report an F1-score of 70.2% using audio, text and video modalities.
Our work utilises a similar approach with the additional gaze modality and also reproduces the baseline experiments. With this work, we aim to establish how gaze-based features perform in a multimodal setting and whether they correlate well with feature sets other than textual (visual and audio). We also investigate predicting gaze-based features to save annotation time/cost for multimodal studies.

Dataset and Gaze Annotation
MUStARD++ is a multimodal dataset that consists of textual utterances with context, audio, and video from a corresponding clip. This data has been acquired from publicly available sources for five television shows: Friends, The Big Bang Theory (seasons 1-8), The Golden Girls, Burnistoun, and Silicon Valley. Each dialogue is presented as a combination of the main 'utterance' and the 'context' in which it was uttered. The dataset contains a total of 1,202 instances, out of which 601 are sarcastic and 601 are non-sarcastic. Along with sarcasm annotation, the dataset also provides additional information such as emotion class, valence, arousal, and sarcasm type. We chose this dataset for our experiments and performed gaze annotation on 231 samples, of which 129 are sarcastic and 102 are non-sarcastic. To avoid any skew, the sarcastic instances are chosen to encompass all four types of sarcasm, with a distribution similar to the one in the source data from MUStARD++. The selected instances include dialogues with short contexts (in the range of 2-5 speaker turns) as well as long contexts (6-13 speaker turns).

Datasets
For our multimodal sarcasm detection experiments, we now have three dataset variants. The first variant, D1, is the complete MUStARD++ dataset. Since we are only able to acquire gaze data for 231 of the 1,202 samples, as discussed above, the second variant, D2, consists of a total of 1,155 data instances (231 samples × 5 participants). Please note that the textual, audio and video features for the 231 samples remain the same, while the gaze features vary for each participant in this D2 variant. For the samples without manually collected gaze data, we predict the gaze features as described below (Section 4.1) and call this variant D3. This variant consists of 971 samples in total. We show the train/test split statistics in Table 2. We also try to maintain a balance between the various types of sarcasm in these instances, the distribution details of which are provided in Figure 2. We provide the details of the gaze annotation process below.
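The sizes of the three variants follow directly from the counts stated above. The short check below is only an illustrative sketch; the variable names are ours and are not taken from the released code.

```python
# Counts stated in the text; variable names are illustrative only.
TOTAL_INSTANCES = 1202   # all MUStARD++ instances (variant D1)
GAZE_ANNOTATED = 231     # instances read by human participants
PARTICIPANTS = 5         # annotators whose gaze was recorded

d1_size = TOTAL_INSTANCES
d2_size = GAZE_ANNOTATED * PARTICIPANTS      # one gaze record per participant
d3_size = TOTAL_INSTANCES - GAZE_ANNOTATED   # instances that need predicted gaze

# Fraction of the dataset with collected (not predicted) gaze features.
annotated_share = GAZE_ANNOTATED / TOTAL_INSTANCES

print(d1_size, d2_size, d3_size, f"{annotated_share:.1%}")
```

The annotated share stays below the 20% mentioned in the abstract, which is why most of the dataset relies on predicted gaze features.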

Gaze Annotation
We instruct five annotators to read the 'textual utterances with its context' on the screen and ask them to provide annotations for the implied binary sentiment in the dialogue, i.e., positive or negative. These samples are shuffled, and the experiment builder software is allowed to choose a random instance from the 231 samples to be presented next on the screen. We do not instruct the annotators to look for sarcasm, in order to avoid the priming effect, i.e., if sarcasm is expected beforehand, it becomes easier to process. Priming may also have resulted in inattentive participation by annotators (Sánchez-Casas et al., 1992). This design ensures the ecological validity of our experiment, as 1) the participant has no clue which utterance to expect, and no special attention is paid to either class of instances, and 2) it also ensures attentive participation. Our annotators are graduate students between the ages of 22 and 27 with good proficiency in the English language. Annotators were selected after ensuring that English was their medium of instruction throughout their undergraduate and ongoing post-graduate degree programs.
We ensure that they consent to the recording of their eye movement patterns for use in this research. We provide two unrecorded samples at the start of the experiment to acquaint them with the annotation process. While they annotate the 231 samples for sentiment, we provide our annotators with a short break after every 30 samples to minimise annotator fatigue, and we re-calibrate for their eye movements after each break. Head movement was minimised using a chin-rest during the annotation process. The gaze tracking device used is an SR-Research Eyelink-1000 (monocular remote mode with a sampling rate of 500 Hz) that captures the eye movement of the reader/annotator.

Annotation & Feature Validity
We compute inter-annotator agreement using a pair-wise Fleiss' kappa (Scott, 1955), which resulted in a statistically significant (p < 0.05) moderate agreement (0.41) among our annotators. To validate features for our experiment, we chose a standard gaze-based feature and a saccadic regression-based feature, i.e., average fixation duration and interest area regression path duration (Table 8), respectively. In Table 1, we show the analysis from a two-sample t-test over feature data from each participant. We observe that for each participant (P1-P5), the difference between sarcastic and non-sarcastic instances is statistically significant, which further motivates us to use these features for sarcasm detection/classification.
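For readers unfamiliar with the agreement statistic, the sketch below computes the standard multi-rater Fleiss' kappa from a table of per-item category counts. It is a minimal illustration, not the authors' script: the function name and the toy counts are hypothetical, and a pair-wise variant would apply the same function to the ratings of two annotators at a time.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean share of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if all ratings fall in one category.
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 dialogues, 5 annotators,
# columns = (positive sentiment, negative sentiment).
toy_counts = [[5, 0], [4, 1], [1, 4], [3, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

Values between 0.41 and 0.60 are conventionally read as moderate agreement, which is the band the reported 0.41 falls into.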

Our Approach
An architecture diagram for our setup is shown in Figure 3. We reproduce the multimodal sarcasm detection experiments of Ray et al. (2022) as the baseline and add gaze features as a fourth modality alongside the text, speech and visual modalities.
For textual features, we utilize the pre-trained BART language model (Lewis et al., 2019) and obtain embeddings for both the textual utterance and the context from the dialogue. BART provides a feature vector representation $x_t \in \mathbb{R}^{d_t}$ for every instance $x$. We encode the text using the BART Large model.
For visual features from the videos, we use the pool-5 layer of the pre-trained ResNet-152 (He et al., 2016) image classification model. To improve the video representation and reduce noise, we extract the key frames to be passed to ResNet-152. Key-frame extraction, which is widely used by the computer vision community, selects the frames that form the most appropriate summary of a given video (Jadon and Jasim, 2019). We use an open-source tool, Katna⁴, to perform key-frame extraction. For the final feature vectors, we average the ResNet-152 vectors of all key frames of an instance (context and utterance). The size of the final video feature representation is […].
For gaze-based features, we obtain a total of 31 features from the SR Research Experiment DataViewer software⁵, but do not use them all for our experiments. We employ the KBest feature selection method from the scikit-learn library (Pedregosa et al., 2011) to optimize the features. Given the sarcasm label along with the gaze-based features, the method selected a total of 25 correlated gaze-based features. Due to space constraints, we provide this list in Appendix C with a feature description in Table 8.
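To make the KBest step concrete, the sketch below ranks features by a one-way ANOVA F-score against the class label and keeps the top k. This is a dependency-free stand-in, not the scikit-learn implementation: we assume the F-score because it is the default scoring function of scikit-learn's `SelectKBest`, but the text does not state which scoring function was used. The function names are hypothetical, and at least two classes are assumed.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)

    # Variation explained by the class means vs. variation inside classes.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def select_k_best(rows, labels, k):
    """Return the sorted column indices of the k highest-scoring features."""
    n_features = len(rows[0])
    scores = [
        anova_f_score([row[j] for row in rows], labels) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])
```

In the setting described above, `rows` would hold the 31 exported gaze features per sample, `labels` the sarcasm labels, and `k` would be 25.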

Gaze Feature Prediction
We provide details of our gaze feature prediction model here. As a reminder, we predict 25 gaze-based features; however, to report the correlations between the collected and the predicted gaze features, we choose the three most important features, i.e., the average fixation duration, the regression path duration, and the regression count. In a sarcastic utterance, if the human eye rests on the text for a longer duration, it signifies the presence of some incongruity, which may be because of sarcasm; this phenomenon is captured by the feature named average fixation duration. Similarly, if the eye regresses back to the context again and again, this indicates that there is some difference between the surface meaning and the deep meaning of the utterance that makes it harder to understand. We capture this in the regression features. Since collecting gaze data requires human effort and is costly in terms of time and money, we try to predict these gaze-based features for our D3 variant of the MUStARD++ dataset (see Section 3.1 on datasets for an explanation of D3), containing 971 samples. We used an SVM (Cortes and Vapnik, 1995) as well as a feed-forward NN (FFNN) (Bebis and Georgiopoulos, 1994), a convolutional NN (CNN) (O'Shea and Nash, 2015), RoBERTa (Liu et al., 2019b), and adapter-infused RoBERTa (Pfeiffer et al., 2020) to compare the quality of the gaze features predicted using these techniques. We evaluate the quality of the gaze features by calculating two different correlation metrics, i.e., Pearson correlation (Freedman et al., 2007) and Spearman correlation (Spearman, 1904), between the predicted and the collected gaze feature values for a specific set of samples. These correlation coefficients are given in Tables 9, 10 and 11 in Appendix D.
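The two correlation metrics used for this intrinsic evaluation can be sketched in a few lines of standard-library Python. This is an illustration under our own naming, not the authors' evaluation script; in practice, a statistics library would typically be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the collected values of one gaze feature (for example, average fixation duration) and `y` the values predicted for the same samples. Pearson rewards a linear relationship, whereas Spearman only requires the predicted values to preserve the ordering of the collected ones.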
Our best-performing model for the gaze feature prediction task is a feed-forward neural network-based architecture. The BART encoding of the text utterances (discussed in Section 4) is used as input to the FFNN in matrix form. The architecture consists of an input layer with 1024 nodes (equal to the size of the text embedding) and 3 hidden layers. These 4 layers are followed by fully connected layers with a single output node, with the collected gaze feature values serving as the ground-truth labels for the prediction task.
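The shape of such a regressor can be sketched as a plain forward pass. The hidden-layer widths, the ReLU activation, and the weight initialisation below are our assumptions for illustration; the text only specifies a 1024-dimensional input, three hidden layers, and a single output node.

```python
import random


def init_ffnn(layer_sizes, seed=0):
    """Create (weights, biases) pairs for consecutive fully connected layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # assumed He-style initialisation
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Map one text embedding to one predicted gaze feature value."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only (assumed)
    return x[0]


# Small demo dimensions; the described setup would start from 1024 inputs,
# e.g. [1024, h1, h2, h3, 1] with hidden widths h1..h3 that are not given.
demo_layers = init_ffnn([8, 4, 4, 4, 1])
demo_prediction = forward(demo_layers, [0.1] * 8)
```

Because the output is a single unbounded value, one such network would be trained per gaze feature as a regression problem, with the collected gaze values as targets.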

Experimental Setup
To perform feature fusion, we use a collaborative gating mechanism (Liu et al., 2019a) on all the features described above. We first compute the projection $\Psi^{(i)}(V)$, where $i \in \{t, t_c, a, a_c, v, v_c, g\}$ and $t$, $a$, $v$, $c$, $g$ denote text, audio, video, the corresponding context, and gaze. This mechanism implements two tasks: 1) predicting an attention vector over the provided input vectors and 2) performing expert response modulation using the predicted attention vector. For response modulation, each modality projection is multiplied element-wise by its attention vector passed through $\sigma$, where $\sigma$ is an element-wise sigmoid activation and $\circ$ is the element-wise multiplication, i.e., the Hadamard product (Horn, 1990). These modulated projections are then concatenated and passed to fully connected linear layers with ReLU activation, followed by a softmax layer that predicts the target class probability distribution. We use the standard cross-entropy loss for sarcasm detection/classification.
For training, we perform a hyper-parameter search with dropout in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], learning rate in [0.001, 0.0001], batch size in [256, 384], shared embedding size in [2048, 1024], and projection embedding size in [1024, 256]. For benchmarking our results, we train over three iterations with different (randomly chosen) seed values of 42, 200, and 1005. The seed values ensure that the dataset samples are randomly shuffled. Our experiments were performed using a single NVIDIA RTX A6000, where each iteration takes approximately 2 hours. We report the mean ± standard deviation on the test splits in terms of macro precision, recall and F1-scores. We note that the quality of the predicted gaze features plays a key role in the performance of our multimodal sarcasm detection task, as described in the section below.
We first describe the results on the D2 variant in Table 3. This table reports the results while ablating on various feature combinations. A clear performance improvement can be observed when the gaze modality is added. Compared to the baseline, i.e., all modalities except gaze, there is a significant gain of 6.6% points when all features, including gaze, are used for sarcasm detection. We also reproduce the baseline experiments on the D1 variant and show the results in Table 4, where we observe a slight gain (0.8% points) over the best baseline score reported by the MUStARD++ authors when they used the standard video, audio and textual modalities. We compare the results on D1 from the baseline experiments with combinations of video, audio, and textual modalities against our experiments on D1 with the same feature combinations.
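The response-modulation step of the collaborative gating used for fusion can be illustrated with a small sketch. In the actual mechanism, the attention vector is predicted by learned functions; here we substitute a fixed stand-in (an element-wise sum over pairs of projections) purely so that the example runs. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attention_vector(name, projections):
    """Stand-in attention prediction for one modality (not learned).

    Sums, over every other modality, the element-wise sum of the two
    projections. The real mechanism uses trainable functions here.
    """
    own = projections[name]
    total = [0.0] * len(own)
    for other_name, other in projections.items():
        if other_name == name:
            continue
        for d in range(len(own)):
            total[d] += own[d] + other[d]
    return total


def gate_and_concatenate(projections):
    """Modulate each projection by sigmoid(attention), then concatenate."""
    fused = []
    for name in sorted(projections):  # fixed order keeps the layout stable
        attention = attention_vector(name, projections)
        # Hadamard product of the projection with its gate values.
        modulated = [p * sigmoid(a) for p, a in zip(projections[name], attention)]
        fused.extend(modulated)
    return fused


# Toy 2-dimensional projections for text ("t") and gaze ("g").
toy_fused = gate_and_concatenate({"t": [1.0, -1.0], "g": [0.0, 2.0]})
```

The fused vector would then be passed to the fully connected layers and the softmax classifier. Since each gate lies between 0 and 1, a modality can be attenuated but never amplified by the other modalities.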

Results and Discussion
We observe from Table 4 that adding predicted gaze-based features lowers the performance benchmark significantly. However, a combination of PredGaze and Gaze added to each combination still outperforms all the baseline combinations, with up to a 6.8% point gain when using only video, PredGaze, and Gaze (underlined in the table). On using all the available features, we still outperform the baseline scores by 2.3% points (shown in bold). We also observe that by using only predicted and collected gaze-based (PredGaze+Gaze) features, we are able to obtain an F1-score of 0.679, which is encouraging. It shows that our model can predict sarcasm with some certainty even without using standard modalities like video, audio, or text, only on the basis of eye movement patterns. This encouraged us to probe the efficacy of the predicted gaze features. Our experiments on the D3 variant, where only predicted gaze-based features are available, show an improvement of at most 1.9% points when video, audio and text are used along with predicted gaze features. We compare the results of sarcasm detection on this D3 dataset with and without gaze features in Table 6, with the results obtained without gaze features serving as the baseline for this experiment. The 1.9% point gain over this baseline is encouraging, as the improvement comes from completely synthetic data. We report the results of the best-performing model among the models we experimented with.

Discussion
In addition to examining the results of our feature-combination ablations, we perform a qualitative analysis of data samples, which encourages the use of gaze as a modality. We randomly choose two samples, which are shown at the top of Table 5. The first two samples (rows one and two) show the final textual utterance in the first column; they are examples of 'Propositional' and 'Illocutionary' sarcasm. Starting with a positive and ending with an implied negative sentiment has incongruity, which should be detected with the help of computational models. However, the label for this sample was incorrectly predicted by all feature combinations, and the model predicts the label correctly only after gaze-based features are introduced. We also show a representative image from the video clip during the final utterance. The image is only a single frame, but our analysis of the clip shows that the character maintains a similar expression throughout the utterance except for the final few frames, when she utters the part 'with that'. We believe that selecting key frames from the video using an automated method may not be very effective in such a case; therefore, even the visual features are unable to help. However, we have other examples which are still hard to label accurately. The last two rows in Table 5 show two examples, the first of which is a sarcastic utterance of the illocutionary type. Ironically, the word 'Sarcastic' is present within the utterance itself. Given the full context of this utterance and our observation of the clip, audio-based features should have helped in this scenario. The moderator utters a very grumpy 'Nooo.' and a high-pitched 'Sarcastic? Us?'. We believe that the low-level audio features are not able to capture these tonal changes, but the failure of the other feature sets also warrants further investigation. Also, the video clip shows the scene focused on the whole group, so there is a lower chance of detecting key frames that let the visual features capture subtle changes. Due to space constraints, we provide the full context and utterance for all these samples in another table in Appendix B, i.e., Table 7.
Similarly, in the last example, despite sufficient conversational context (see Table 7), none of the computational models is able to capture the sarcasm present in the final utterance. There is sufficient incongruity present in this sample instance of the embedded type of sarcasm. However, the final utterance, "But the fact that you're so sorry makes it all better", can come across as an emotional and caring statement.

Conclusion and Future Work
This paper discussed the use of gaze-based features for the task of sarcasm detection in a multimodal and conversational setting. We propose the use of textual, audio, and video features in combination with the gaze modality, and show a substantial improvement in performance with the addition of collected gaze-based features. We collect gaze data over a small number of samples and predict these features for a larger portion of the data, both of which we will release with the code and the best models from our experiments. With predicted gaze-based features, however, we observe only a small improvement in task performance. Our results indicate that adding collected gaze-based features improves task performance in every feature combination, supporting the efficacy of gaze-based features. Our qualitative analysis also suggests that better audio and visual features should help improve task performance.
In the future, we would like to further improve the quality of predicted gaze-based features in a multi-task setting of sarcasm detection and gaze prediction.

Limitations
Our work has certain limitations, as gaze data collection is challenging. Multimodal datasets are also scarce, and it is challenging to benchmark the performance of this approach over multiple datasets. We release the complete gaze data with annotator-provided sentiment labels, but our inter-annotator agreement is only moderate. The subjectivity of sarcasm and the cultural contexts present in humour are the key reasons for the inter-annotator agreement value being lower than expected. The understanding of sarcasm varies from person to person depending upon age, culture, context, familiarity with the characteristics present in the utterance, etc. This makes sarcasm a very hard and cognitively loaded phenomenon for even linguists to annotate. Collection of eye-tracking/gaze data is a tedious and costly process; it requires hours of human participation without any loss of concentration on the part of the annotator. Transformer-based models, in the case of video, audio, as well as text, require large amounts of data to be able to generalise and perform well. Thus, dataset contribution becomes essential to push boundaries and enable more research in the field.

Figure 3: The architecture diagram for multimodal sarcasm detection setup shows how gaze-based features are introduced in the pipeline with other features (text/audio/video) while a collaborative gating mechanism fuses these features for predicting labels via a feed-forward neural network.
For text, we use the BART Large model with $d_t = 1024$ and take the mean of the last four transformer layer representations to get a unique embedding representation for both the utterance and the context.
For audio features, like in Ray et al. (2022), we sampled the audio signal at 22.5 kHz as a preprocessing step. Since the audio has background noise and canned laughter (we deal with sitcoms), we used the vocal-separation method² to process it. We extract three low-level features: Mel Frequency Cepstral Coefficients (MFCC), Mel spectrogram (using the Librosa library (McFee et al., 2022)), and prosodic features using OpenSMILE³. We split the audio signal into equal segments of 1-second duration each to maintain a consistent feature representation in all instances. Since the audio signal length varies with utterances, this segmentation helps in keeping the vector size constant across the dataset. For each segment, we extract MFCC, Mel spectrogram and prosodic features of size $d_m$, $d_s$, and $d_p$, respectively. Then we take the average across segments to get the final feature vector. Here $d_m = 128$, $d_s = 128$, and $d_p = 35$, so our audio feature vector is of size $d_a = d_m + d_s + d_p = 291$.
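The segment-averaging and concatenation arithmetic for the audio vector can be sketched as follows. The per-segment features are assumed to be already extracted (the text uses Librosa and OpenSMILE for that step); the function names and dummy inputs are ours.

```python
# Per-segment feature sizes stated in the text.
D_MFCC, D_MEL, D_PROSODIC = 128, 128, 35


def mean_over_segments(segment_vectors):
    """Average equally sized per-segment vectors into one vector."""
    n_segments = len(segment_vectors)
    dim = len(segment_vectors[0])
    return [sum(seg[d] for seg in segment_vectors) / n_segments for d in range(dim)]


def audio_feature_vector(mfcc_segments, mel_segments, prosodic_segments):
    """Concatenate the segment-averaged MFCC, Mel and prosodic features."""
    return (
        mean_over_segments(mfcc_segments)
        + mean_over_segments(mel_segments)
        + mean_over_segments(prosodic_segments)
    )


# Dummy features for a 3-second clip, i.e. three 1-second segments.
dummy_mfcc = [[float(s)] * D_MFCC for s in range(3)]
dummy_mel = [[2.0] * D_MEL for _ in range(3)]
dummy_prosodic = [[0.5] * D_PROSODIC for _ in range(3)]
audio_vec = audio_feature_vector(dummy_mfcc, dummy_mel, dummy_prosodic)
```

Averaging over segments is what keeps the output at 291 dimensions regardless of clip length: a longer utterance contributes more segments to the mean, not more dimensions.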

Table 1: Two-sample t-test statistics for average fixation duration and interest area regression path duration for positive labels (sarcastic) and negative labels (non-sarcastic) for participants P1-P5.

Table 2: Train/test split statistics for dataset variants.

Table 3: Results obtained on the D2 dataset via experiments with Video (Vid), Audio (Aud), Textual (Text), and Gaze-based (Gaze) features, where macro-P/R/F1 are macro Precision, Recall and F1-score, respectively. These results are compared with results obtained on the same dataset without using gaze features for the sarcasm detection task.

Table 6: Results obtained on the D3 dataset via experiments […]

References
Hollis S Scarborough, Susan Neuman, and David Dickinson. 2009. Connecting early language and literacy to later reading (dis)abilities: Evidence, theory, and practice. Approaching difficulties in literacy development: Assessment, pedagogy and programmes, 10:23-38.
William A Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public opinion quarterly, pages 321-325.
C Spearman. 1904. The proof and measurement of association between two things. American J.
Harshvardhan Srivastava. 2022. Poirot at CMCL 2022 shared task: Zero shot crosslingual eye-tracking data prediction using multilingual transformer models. In CMCL Shared Task on Multilingual and crosslingual prediction of human reading behavior.
Ece Takmaz. 2022. Team DMG at CMCL 2022 shared task: Transformer adapters for the multi- and crosslingual prediction of human reading behavior. In CMCL Shared Task on Multilingual and crosslingual prediction of human reading behavior.

Min. Fixation Duration Time: Minimum time for which the eye fixated in a trial.
RT: Reaction time associated with the trial.

Table 8: Gaze features and their descriptions. These are the final set of gaze features used in the sarcasm detection experiments.