DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading

The use of visually-rich documents (VRDs) in various fields has created a demand for Document AI models that can read and comprehend documents like humans, which requires overcoming technical, linguistic, and cognitive barriers. Unfortunately, the lack of appropriate datasets has significantly hindered advancements in the field. To address this issue, we introduce \textsc{DocTrack}, a VRD dataset really aligned with human eye-movement information using eye-tracking technology. This dataset can be used to investigate the challenges mentioned above. Additionally, we explore the impact of human reading order on document understanding tasks and examine what would happen if a machine read in the same order as a human. Our results suggest that although Document AI models have made significant progress, they still have a long way to go before they can read VRDs as accurately, continuously, and flexibly as humans do. These findings have potential implications for future research and development of Document AI models. The data is available at \url{https://github.com/hint-lab/doctrack}.


Introduction
With the continuous development of information technology, our access to information is becoming increasingly diverse. Among the various formats, the proportion of visual information in daily documents, such as tables, graphs, and diagrams, is on the rise (Ceci et al., 2007; Jaume et al., 2019). Effectively utilizing such visual information has therefore become a hot research topic in the NLP community and introduces the new challenge of understanding Visually-Rich Documents (VRDs) (Liu et al., 2019; Yu et al., 2021), i.e., documents that contain substantial visual components.

OCR Output Order Human Reading Order
Figure 1: Comparison of the default serialized input order produced by the OCR engine with the real human reading order. It clearly reveals the significant differences in processing behavior between machines and humans. The red numbers indicate the ordinal numbers in the reading sequence. For best visibility, it is recommended to zoom in on the figure.
Identifying and understanding VRDs is a time-consuming and laborious task due to their diversity and complexity (Wang et al., 2021a). Textual information alone is insufficient for extracting key information from diverse document types, necessitating a multimodal approach that jointly models the consistency and correlation of multiple modalities, including text, vision, and layout. Examples of modern document AI models for VRD understanding include the LayoutLM series (Xu et al., 2020, 2021; Huang et al., 2022) and the StructText series (Li et al., 2021; Yu et al., 2023a).
While these models can obtain fine-grained multimodal document representations and achieve promising results on downstream VRD understanding tasks, they lack the ability to generate a serialized input order from a given document that fits the Transformer architecture. As a result, they typically rely on simple rules, such as left-to-right or top-to-bottom, or directly use the input order produced by the OCR tool in the previous step (Lee et al., 2021) to serialize inputs. However, these input orders differ considerably from the reading order that humans are used to (see Figure 1 for an example). The choice of input order can significantly affect the performance of document AI models on downstream document understanding tasks, which is often overlooked.
To address this issue, Wang et al. (2021b) construct the ReadingBank dataset, which derives a local priority order from the order of the XML source code of Word documents. However, it remains questionable whether this reading order is consistent with the actual human reading order and whether it actually benefits machine comprehension tasks. Consequently, state-of-the-art Document AI models lack a deep understanding of spatial relationships and document structure, resulting in limited performance on VRDs such as forms and infographics.
To this end, we propose DOCTRACK, a benchmark dataset containing various types of real-world documents aligned with human eye-movement information. Specifically, we propose a preordering pipeline to integrate human reading order into modern document AI models. The integration occurs after OCR parsing of the document contents but before feeding them into the downstream task. We also explore different approaches to generate human-like reading orders, including default OCR tools, the Z-pattern, simple rules, and AI models that utilize multimodal information. Our objective is to evaluate the impact of different reading orders on downstream document comprehension tasks and identify the most effective reading order for existing multimodal document AI models. This provides insight into the similarities and differences between human and machine reading patterns.
In summary, this paper makes three contributions:

1. We construct a benchmark dataset, namely DOCTRACK, aligned with human eye-movement information. To our knowledge, DOCTRACK is the first human-annotated benchmark dataset for research on VRD reading order generation.
2. We investigate different human-like reading order generation methods, i.e., techniques for generating machine reading orders that mimic human reading patterns in VRDs, and propose a practical preordering pipeline that leverages the generated reading orders to improve document understanding tasks.
3. We conduct both intrinsic and extrinsic evaluations to analyze the performance of the human-like reading order generation model and measure its impact on downstream tasks.
The observations suggest that human reading order may not be suitable for reading VRDs.

Human Reading Order
The human reading order plays a significant role in the comprehension of VRDs. Human reading order refers to the direction and sequence in which people scan documents as they read. Generally, people read left-to-right and top-to-bottom to comprehend the text and obtain information. This order also reflects the way text is written in most languages and scripts, including English and Modern Chinese. However, other reading order patterns exist: traditional Chinese and Japanese are usually written vertically from right to left, while Arabic is written horizontally from right to left (Rayner, 1998). As a result, people naturally scan in this order while reading (Henderson and Ferreira, 1993).
Although the reading order of text generally follows a relatively fixed pattern, VRDs usually present information in a mixture of modalities (e.g., text, images, graphics) with a complex two-dimensional layout and semantic structure. Readers typically choose an appropriate reading order by jointly considering the spatial structure of the text content, the image information, and the typographic positions within the document. This natural temporal processing helps readers better understand the content and intent of the document and establish connections between different modalities. For example, when presented with a diagram, readers typically first analyze it visually as a whole to get a general idea of what it describes. From there, they usually focus on the most salient data or labels first, and then gradually scan the rest to gain a more complete understanding of the information presented.
Eye Tracking in NLP

Basic Notions
Eye tracking is an important technique for studying eye movements during human reading. Human reading is a complex cognitive activity; eye-movement trajectories can visualize the reading process and are important for understanding how humans acquire and comprehend knowledge. Intuitively, different visual tasks result in different scan (i.e., eye-movement) patterns, and studying these patterns can help us understand the mechanisms of human cognitive processing. Numerous studies in neuroscience have established a strong association between eye-tracking data and language comprehension activity in the human brain (Henderson and Ferreira, 1993). An eye-movement trajectory is typically an irregular curve composed of two alternating actions: saccades and fixations. A saccade is a rapid jump of the eyes from one word to the next, representing a shift of attention. A fixation occurs when the eyes remain stationary on a word for a period of time during the visual task. In addition, the scan pattern and speed of the trajectory can be affected by various factors, such as the text's font size, line spacing, and contrast.
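To make the saccade/fixation distinction concrete, the following is a minimal, illustrative sketch of the classic dispersion-threshold (I-DT) fixation detection algorithm; the threshold values, function name, and data layout are assumptions for illustration, not settings used in this work:

```python
def detect_fixations(samples, max_dispersion=30.0, min_samples=5):
    """Segment a gaze trajectory into fixations with a dispersion threshold.

    samples: list of (x, y) gaze points in screen pixels, in time order.
    A fixation is a maximal run of at least `min_samples` points whose
    bounding box stays within `max_dispersion` pixels (width + height).
    Returns a list of (centroid_x, centroid_y, start_idx, end_idx).
    """
    def dispersion(pts):
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    i = 0
    while i + min_samples <= len(samples):
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the point cloud stays compact.
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, i, j - 1))
            i = j  # points between fixations belong to saccades
        else:
            i += 1
    return fixations
```

Runs of points that never become compact enough are treated as saccade samples and skipped one at a time.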

Applications
In cognitively motivated natural language processing (NLP), several studies have investigated the impact of eye-tracking data on NLP tasks, for example, designing machine learning models for tasks such as part-of-speech tagging (Barrett et al., 2016), term extraction (Yaneva et al., 2017), and syntactic parsing (Kim, 2009). Later, researchers combined eye-tracking data with word embeddings in neural models to improve NLP tasks such as sentiment analysis (Mishra et al., 2017) and named entity recognition (NER) (Hollenstein and Zhang, 2019), or to revise neural attention (Barrett et al., 2018; Sood et al., 2020; Takmaz et al., 2020). Recent studies (Bhattacharya et al., 2020; Ren and Xiong, 2021; Ding et al., 2022; Khurana et al., 2023) attempt to align text features and cognitive signals to identify their differences and commonalities.

Data Collection
Our work aims to evaluate how reading order impacts the comprehension of VRDs. To achieve this, we randomly select documents mainly from three available datasets: FUNSD (Jaume et al., 2019), SeaBill (Zhang et al., 2022), and Infographic (Mathew et al., 2022). These datasets are widely used and provide diverse examples of complex structured and graphic documents for our analysis. We reuse all document images in the FUNSD dataset (Jaume et al., 2019). Given that these documents are less structured than those in the other datasets and the eye-movement tracks primarily conform to the normal-Z reading order pattern, we rename this subset "WEAK." From the SEABILL dataset, we select a number of structured tabular documents that contain detailed information related to the shipment of goods, including the names of the consignor and consignee and the type and quantity of the shipment. This subset comprises 160 training samples and 50 testing samples, totaling 13,454 semantic entities, and is denoted "STRUCTURED." We also select 100 training samples and 30 test samples from the Infographic dataset (Mathew et al., 2022). This subset, called "INFOGRAPH," contains a large number of graphic components and is more diverse and complicated than the other two. Table 1 shows details of the dataset and its statistics.

Analysis of Eye Movement Patterns
We observe mainly four types of reading patterns among these documents. Figure 2 illustrates human reading behaviors when reading documents from the DOCTRACK dataset.

Normal-Z. In the normal-Z order, also called the Z-pattern, readers scan left-to-right and top-to-bottom, tracing a roughly Z-shaped path across the page.

Cross-modal interaction. When people read infographics that contain pies, charts, or other types of data graphs, they typically show interactive radial eye movements. During reading, the eyes follow a radial path between the graph and the associated text. For example, when people read a pie chart, they usually focus first on the pie portion and then scan along its perimeter to find the corresponding text. This back-and-forth movement results in a cross-modal interaction pattern. The unique shape and layout of pie charts contributes to this pattern, because the graphical portion often contains the most important information that people attend to first, while the text labels provide detailed information that people scan step by step. Eye tracking therefore presents a radial pattern on these types of infographics.

For contextual referencing, readers may re-read earlier sections to better understand the information's meaning. This self-correcting, backtracking eye-movement pattern reflects the reader's information processing and can aid comprehension and memory retention. Figure 3 shows the statistics of the four types of reading patterns found in DOCTRACK.

Eye-tracking Experiments
To conduct the eye-tracking experiments, we randomly divide the dataset into five parts and recruit five participants, all of whom are graduate or undergraduate students. All participants are instructed to read the data on an HP 24-inch 1080p display. We record the participants' eye trajectories while they read using a Tobii TX300 eye tracker and Tobii Studio; these devices record the visual movement trajectory of each subject and convert it into digital data. The data extracted from the eye tracker include fixation points, fixation time, frequency, saccade distance, and pupil size. These measurements can help researchers gain insight into participants' internal cognitive processes.
The high sampling frequency of an eye tracker may result in an unsmooth trajectory of recorded gaze points. In addition, gaze points may deviate or go missing during eye movements due to peripheral vision (see Appendix A.1). To improve the accuracy of the recorded reading trajectory, we proceed as follows:

1. In the case of missing gaze points (case 1), if the gaze point hits the periphery of a known OCR bounding box within a certain Euclidean distance, we use the ordinal number of the peripheral gaze point as the reading sequence index of the current bounding box.
2. In the case of missing gaze points (case 2), we use the ordinal number from the surrounding adjacent reading sequence or the ordinal number between two gaze points.
3. We delete gaze points to which the eyes repeatedly return and keep only the ordinal number of the first eye movement.
By adopting the above corrections, we ensure the accuracy of the recorded reading trajectory.
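Correction rule 1 might be sketched as follows; the `snap_to_box` helper, the distance threshold, and the data layout are illustrative assumptions, not the authors' implementation:

```python
import math

def snap_to_box(gaze_point, boxes, max_dist=25.0):
    """Assign a gaze point to the nearest OCR bounding box.

    gaze_point: (x, y); boxes: list of (x0, y0, x1, y1) rectangles.
    Returns the index of the closest box whose periphery lies within
    `max_dist` (Euclidean) of the point, or None if no box qualifies.
    """
    def dist_to_box(p, box):
        x0, y0, x1, y1 = box
        # Distance from the point to the rectangle (0 if inside).
        dx = max(x0 - p[0], 0, p[0] - x1)
        dy = max(y0 - p[1], 0, p[1] - y1)
        return math.hypot(dx, dy)

    best, best_d = None, max_dist
    for i, box in enumerate(boxes):
        d = dist_to_box(gaze_point, box)
        if d <= best_d:
            best, best_d = i, d
    return best
```

A gaze point that snaps to a box would then contribute its ordinal number to that box's reading sequence index.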

Annotation Agreement
In our final experimental setup, we assign two of the five participants to label the same subset of data. For each document, we then compare the labeling results from these two participants and conduct a voting process among the five participants to select the most appropriate document-level labeling that aligns with everyone's expectations. This chosen labeling is taken as the final data.
Reading Order Integration

Rule-based Heuristic Methods
Multimodal information extraction methods rely on accurate sequence detection in documents. However, inconsistencies across OCR engines can lead to variance in reading order. To address this issue, Li et al. (2022) introduce a position offset threshold to standardize reading order and cope with OCR instability: text boxes are sorted top to bottom, and if the distance between two boxes in the Y direction is smaller than the threshold, their order is determined by their X coordinates. Similarly, Gu et al. (2022) sort the bounding boxes by their two Y-axis coordinates and then perform an X-axis search and a right-down search that considers Y and X jointly. By changing the order in which tokens are input into the model, both approaches achieve good results. These rule-based sorting approaches conform to basic human reading cognition, and leveraging the reading orders they generate can significantly improve the accuracy of multimodal information extraction.
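The position-offset-threshold rule can be sketched roughly as follows (the threshold value, box format, and `sort_reading_order` name are illustrative assumptions, not the exact implementation of Li et al. (2022)):

```python
def sort_reading_order(boxes, y_threshold=10):
    """Sort OCR bounding boxes top-to-bottom; when two boxes are within
    `y_threshold` of each other vertically, order them left-to-right.

    boxes: list of (x0, y0, x1, y1) tuples (top-left, bottom-right).
    """
    # Initial pass: sort primarily by top edge, then by left edge.
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))
    # Bubble passes: swap adjacent boxes that lie on (roughly) the same
    # line but are out of left-to-right order.
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes) - 1):
            a, b = boxes[i], boxes[i + 1]
            if abs(a[1] - b[1]) < y_threshold and a[0] > b[0]:
                boxes[i], boxes[i + 1] = b, a
                changed = True
    return boxes
```

Boxes whose top edges differ by less than the threshold are treated as one text line, which absorbs small OCR jitter in the Y coordinate.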

Proposed Pipeline
To follow human reading behaviors, we use a process called "preordering" (reordering-as-preprocessing), a term borrowed from statistical machine translation (Xia and McCord, 2004; Collins et al., 2005; Neubig et al., 2012; Nakagawa, 2015). This process reorganizes the inputs into the order in which a human would read them. It makes it straightforward to align the multimodal sequential input features with human reading order and thus enables us to evaluate the impact of reading order on VRD understanding tasks.

Atomic Comparison Models
We obtain the basic features of the different document modalities through different encoders and imitate human reading order based on these features. We propose four models for reading order generation, each using information from one or more modalities: Box, Text, Text+Box, and Text+Box+Image.
Each model takes into account different factors that influence how humans prioritize and read elements in VRDs, including the position of the element, the text within the element, and the visual region associated with it.By utilizing these models, we can more accurately evaluate the impact of reading order on human comprehension of such documents.
Figure 5: Atomic reading-order comparison model for neighboring bounding boxes. The symbol ">" indicates that the former bounding box might be read by humans after the latter.
Box. In this model, we use only two-dimensional positions to learn the order of each bounding box. Specifically, the bounding box coordinates (x_up, y_up, x_down, y_down) (representing the top-left and bottom-right corners) generated by the OCR tool are first used to compute the centroid coordinates (x_i, y_i) of each bounding box. We combine the bounding boxes in pairs and feed the centroid coordinates of each pair directly into a multi-layer Transformer network. This enables us to predict the spatial relationship between the two bounding boxes, i.e., which one should come first in the human reading order.
b_i : b_j = Transformer(x_i, y_i, x_j, y_j)  (2)

Text. We use BERT to encode the texts; since each bounding box contains one or more tokens, we take the latent representation at the first token position as the embedding for a bounding box.
Text+Box. To jointly encode the text and 2D-position inputs, we use LayoutLM. As in the Text model, we take the first latent representation as the input to the classifier.
Text+Box+Image. We use LayoutLMv2 to jointly encode the text, image, and 2D positions within the bounding boxes. The final model can be formulated as follows:

b_i : b_j = LayoutLMv2(t_i, t_j; x_i, y_i, x_j, y_j; I_i, I_j)  (5)

where I_i and I_j denote the ROIs of the document images. We take the first latent encoding to represent each bounding box.
▷ Call the atomic comparison model to determine the precedence order.
if p < 0.5 then swap r_j with r_{j+1}; ▷ consistent with Bubble Sort
return r; ▷ new sorted sequence

Sequence Preordering Algorithm
We test the four atomic comparison models within a model-based sequence preordering algorithm as the before-and-after judgment. The preordering algorithm takes the atomic comparison models' outputs (a 0/1 sequence) as input to construct an adjacency matrix. The algorithm is a variant of the commonly used Bubble Sort: it outputs the sorted input sequence by rearranging the bounding box positions in the sequence. Refer to Algorithm 1 for details.
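A self-contained sketch of the model-based preordering (in the spirit of Algorithm 1) as a bubble-sort variant is shown below; the `prob_i_before_j` comparator here is a simple geometric stand-in for the trained atomic comparison model, which is an assumption for illustration only:

```python
def prob_i_before_j(box_i, box_j):
    """Stand-in for the atomic comparison model: probability that box_i
    should be read before box_j. A trained model would return a learned
    score in [0, 1]; this stub encodes a top-down, left-right rule on
    centroid coordinates (x, y)."""
    (xi, yi), (xj, yj) = box_i, box_j
    if yi != yj:
        return 1.0 if yi < yj else 0.0
    return 1.0 if xi <= xj else 0.0

def preorder(sequence, compare=prob_i_before_j):
    """Model-based sequence preordering (a bubble-sort variant):
    repeatedly swap adjacent elements whenever the comparator says the
    latter should be read first (p < 0.5)."""
    r = list(sequence)
    for _ in range(len(r)):
        swapped = False
        for j in range(len(r) - 1):
            if compare(r[j], r[j + 1]) < 0.5:
                r[j], r[j + 1] = r[j + 1], r[j]
                swapped = True
        if not swapped:  # already sorted under the comparator
            break
    return r
```

Plugging in one of the Box/Text/Text+Box/Text+Box+Image models as `compare` yields the corresponding human-like reading order.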

Experiments and Analysis
To gain a deeper understanding of the differences between human reading order and machine processing order, we conduct both intrinsic and extrinsic evaluations on the DOCTRACK dataset. By comparing different modal fusions, the intrinsic evaluation assesses the quality of the reading order generation models, which in turn helps us understand how strongly the information from each modality influences the reading order built by humans. To determine exactly which reading order a machine document intelligence model needs, i.e., which input order enhances machine document comprehension, we use the extrinsic evaluation to measure the downstream value of the generated human-like reading orders.

Intrinsic Evaluation
Specifically, we compare machine-generated input orders with the ground truth, i.e., human reading orders, to evaluate each model on the intermediate task of reading order generation. We measure the correlation between the machine-generated input order and the reading order of human experts by calculating Spearman's rank and Kendall's tau scores, which quantify the accuracy of the machine-generated reading order and the interaction between different modalities and document types.
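Kendall's tau between a generated order and the human order can be computed with a naive pairwise count, sketched below; the function name and data layout are illustrative, and in practice one might use `scipy.stats.kendalltau` instead of this O(n²) version:

```python
def kendall_tau(order_a, order_b):
    """Kendall's tau between two permutations of the same items.

    order_a, order_b: sequences listing the same items in two reading
    orders. tau = (concordant - discordant) / total pairs, in [-1, 1].
    """
    rank_b = {item: i for i, item in enumerate(order_b)}
    n = len(order_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # order_a already places item i before item j; check whether
            # order_b agrees.
            if rank_b[order_a[i]] < rank_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    total = n * (n - 1) / 2
    return (concordant - discordant) / total
```

A tau of 1.0 means the machine order matches the human order exactly; -1.0 means it is fully reversed.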
Table 2 shows the sequence correlation evaluation results of the four models on the three subsets. The Text+Box+Image model achieves the best average result, while the bounding-box-only model performs worst, which verifies that multimodal features have an important impact on document ranking.

Table 3: Extrinsic evaluation on the DOCTRACK dataset. DEFAULT-OCR represents the original order; EYE and EYE++ represent the original and smoothed eye-movement orders; Z-ORDER (Yu et al., 2023b) and XYLAYOUT (Gu et al., 2022) are two orders generated from expert experience; MODEL-B, MODEL-T, MODEL-T+B, and MODEL-T+B+I represent the atomic comparison models Box, Text, Text+Box, and Text+Box+Image, respectively. R/H/M refers to orders generated by rules, humans, and models.

Extrinsic Evaluation
Semantic entity recognition. This study analyzes the impact of reading order on VRD understanding using the Semantic Entity Recognition (SER) task on the WEAK and STRUCTURED document subsets. Table 3 shows the results. We find that the impact of reading order on the SER task varies depending on the document AI model used. With BERT, the simple Z-order works best, and every reordering outperforms the original order.
In the multimodal LayoutLMv2 and LayoutLMv3 models, the multimodal ordering works best, slightly better than the Z-order. These results suggest that both human reading order and machine input order strongly influence the SER task, and that different models have different degrees of sensitivity to these orders.
DQA. Document Question Answering (DQA) is a challenging VRD understanding task that requires machines to understand both the visual and textual content of a document image and answer questions about it. There are several possible reasons for this. First, most of the datasets used by existing document AI models are sorted by simple rules and are therefore better suited to orders generated by simple rules such as the Z-pattern. Additionally, individual human reading orders may be very noisy unless a large human eye-movement dataset is constructed by collecting a significant amount of eye-movement data.

Conclusion
We investigate the impact of human reading order on Document AI models for VRD understanding tasks. We propose different methods to generate human-like reading orders, along with a practical preordering pipeline that leverages the generated orders. Our observations suggest that true human reading order may not always be suitable for reading VRDs. The dataset we construct can help in designing better document AI models and human-like reading robots in the future.

Limitations
In this work, we focus on the impact of human eye-tracking order and machine reading order on VRD understanding. Due to the complexity of eye-movement characteristics, the eye-movement experiments ignored richer eye-movement information, such as the fixation time of each fixation point, regressions, and the number of fixations. Our next step is therefore to explore the impact of this additional gaze information on the understanding of VRDs. In addition, due to the high annotation cost, the annotation has not been performed by multiple independent annotators, so an inter-annotator agreement rate is not available for the current dataset.

Figure 2: Scatterplots showing eye-movement patterns while humans are reading, organized in various ways: (a) normal-Z order; (b) local priority order; (c, d, e) cross-modal interaction order; (f) visual instruction order. Note that these eye-movement data have not yet been aligned with the OCR output. Full versions of the images can be found in Appendix A.2.

Figure 3: The statistics of the four types of reading patterns.

Figure 4: The proposed preordering pipeline for utilizing human-like reading orders in VRD understanding tasks.
where b_i and b_j denote the i-th and j-th bounding boxes, respectively; r[b_i] refers to the index of the bounding box b_i in the predicted reading sequence; and r[b_i] < r[b_j] indicates that b_i appears before b_j in the reading order sequence.

Table 2: Correlation coefficient scores of Kendall's tau (τ) and Spearman's rank (ρ) between resorted input sequences generated using different preordering methods and gold human eye-movement orders. Bold indicates the best result and underline the second best.

Algorithm 1: Model-based Sequence Preordering. Input: the serialized input sequence [b_1, ..., b_l] that contains multimodal features. Output: the sorted input sequence r.

The results in the table show that when humans read VRDs with weak table structure, they still use visual features such as font and background to help model the temporal sequence; with a clear table structure, visual features are less important, and basic text and position features are enough. However, infographics with more images and visual features require more powerful models that can capture text, position, and visual features. For more details, see Appendix A.3.

As shown in Table 3, human eye-movement orders do not outperform the human-like reading order generated using a rule-based approach. This suggests that true human reading order may not be necessary to enhance existing machine document AI models.
Graham Neubig, Taro Watanabe, and Shinsuke Mori. 2012. Inducing a discriminative parser to optimize machine translation reordering. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 843-853, Jeju Island, Korea. Association for Computational Linguistics.

K. Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372-422.

Yuqi Ren and Deyi Xiong. 2021. CogAlign: Learning to align textual neural representations to cognitive language processing signals. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3758-3769. Association for Computational Linguistics.