ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.


Introduction
Visually-rich Document Understanding (VrDU) is an important research field that aims to handle various types of scanned or digital-born business documents (e.g., forms, invoices), and it has attracted great attention from both industry and academia due to its wide range of applications. Distinct from conventional natural language understanding (NLU) tasks that use only plain text, VrDU models have the opportunity to access the most primitive data features.
Herein, the diversity and complexity of document formats pose new challenges to the task: an ideal model needs to make full use of textual, layout, and even visual information to fully understand visually-rich documents like humans do.
However, existing document pre-training solutions typically fall into the trap of simply taking 2D coordinates as an extension of 1D positions to endow the model with layout awareness. Considering the characteristics of VrDU, we believe that layout-centered knowledge should be systematically mined and utilized from two aspects: (1) On the one hand, layout implicitly reflects the proper reading order of documents, while previous methods tend to perform serialization by multiplexing the results of Optical Character Recognition (OCR), which roughly arranges tokens in a top-to-bottom and left-to-right manner (Wang et al., 2021c; Gu et al., 2022). Inevitably, this is inconsistent with human reading habits for documents with complex layouts (e.g., tables, forms, multi-column templates) and leads to sub-optimal performance on downstream tasks. (2) On the other hand, layout is actually a third modality besides language and vision, while current models tend to take layout as a special position feature, such as the layout embedding in the input layer (Xu et al., 2020) or the bias item in the attention layer (Xu et al., 2021). The lack of cross-modal interaction between layout and text/image might restrict the model from learning the role of layout in semantic expression.
To achieve these goals, we propose a systematic layout knowledge enhanced pre-training approach, ERNIE-Layout, to improve the performance of document understanding tasks. First of all, we employ an off-the-shelf layout-based document parser in the serialization stage to generate an appropriate reading order for each input document, so that the input sequences received by the model are more in line with human reading habits than the rough raster-scanning order. Then, each textual/visual token is equipped with its position embedding and layout embedding, and sent to the stacked multi-modal transformer layers. To enhance cross-modal interaction, we present a spatial-aware disentangled attention mechanism, inspired by the disentangled attention of DeBERTa (He et al., 2021), in which the attention weights between tokens are computed using disentangled matrices based on their hidden states and relative positions. In the end, layout not only acts as the 2D position attribute of input tokens, but also contributes a spatial perspective to the calculation of semantic similarity.
With satisfactory serialization results, we propose a pre-training task, reading order prediction, to predict the next token for each position, which facilitates consistency within the same arranged text segment and discrimination between different segments. Furthermore, during pre-training we also adopt the classic masked visual-language modeling and text-image alignment tasks (Xu et al., 2021), and present a fine-grained multi-modal task, replaced region prediction, to learn the correlation among language, vision, and layout.
We conduct extensive experiments on three representative VrDU downstream tasks with six publicly available datasets to evaluate the pre-trained model: the key information extraction task with the FUNSD (Jain and Wigington, 2019), CORD (Park et al., 2019), SROIE (Huang et al., 2019), and Kleister-NDA (Graliński et al., 2021) datasets, the document question answering task with the DocVQA (Mathew et al., 2021) dataset, and the document image classification task with the RVL-CDIP (Harley et al., 2015) dataset. The results show that ERNIE-Layout significantly outperforms strong baselines on almost all tasks, proving the effectiveness of our two-part layout knowledge enhancement philosophy.
The contributions are summarized as follows:

• ERNIE-Layout proposes to rearrange the order of input tokens in serialization and adopts a reading order prediction task in pre-training.
To the best of our knowledge, ERNIE-Layout is the first attempt to consider the proper reading order in document pre-training.
• ERNIE-Layout incorporates a spatial-aware disentangled attention mechanism in the multi-modal transformer, and designs a replaced region prediction pre-training task, to facilitate fine-grained interaction across the textual, visual, and layout modalities.
• ERNIE-Layout refreshes the state-of-the-art of various VrDU tasks, and extensive experiments demonstrate the effectiveness of exploiting layout-centered knowledge.

Related Work
Layout-aware Pre-trained Model. Humans understand visually rich documents through many perspectives, such as language, vision, and layout.
Based on the powerful modeling ability of the Transformer (Vaswani et al., 2017), LayoutLM (Xu et al., 2020) initially embeds the 2D coordinates as layout embeddings for each token and extends the famous masked language modeling pre-training task (Devlin et al., 2019) to masked visual-language modeling, which opens the prologue of layout-aware pre-trained models. Afterwards, LayoutLMv2 (Xu et al., 2021) concatenates document image patches with textual tokens, and two pre-training tasks, text-image matching and text-image alignment, are proposed to realize cross-modal interaction. StructuralLM (Li et al., 2021a) leverages segment-level, instead of word-level, layout features to make the model aware of which words come from the same cell. DocFormer (Appalaraju et al., 2021) shares the learned spatial embeddings across modalities, making it easy for the model to correlate text with visual tokens and vice versa. TILT (Powalski et al., 2021) proposes an encoder-decoder model to generate results that are not explicitly included in the input sequence, overcoming the limitations of sequence labeling. However, these methods fail to explore the potential value of layout in depth and directly rely on raster-scanning serialization, which is contrary to human reading habits. To solve this problem, LayoutReader (Wang et al., 2021c) designs a sequence-to-sequence framework to generate an appropriate reading order for each document. Unfortunately, it is carefully designed for reading order detection and cannot directly empower various document understanding tasks. Besides, the above methods tend to regard layout as a subsidiary feature of text, following the idea of LayoutLM, but the same text with different layouts may also express different semantics. Therefore, we believe that layout should be regarded as a third modality independent of language and vision.
Knowledge-enhanced Representation. Following the BERT (Devlin et al., 2019) architecture, many efforts are devoted to pre-trained language models for learning informative representations.
Some studies show that extra knowledge, such as facts in WikiData and WordNet, can further benefit pre-trained models (Zhang et al., 2019; Liu et al., 2020; He et al., 2020; Wang et al., 2021b), but the embeddings of words in the text and entities in knowledge graphs are not in the same vector space, so a cumbersome adaptation module is required (He et al., 2020; Wang et al., 2021a). Another research line is to excavate the potential human cognitive laws of the text itself: ERNIE (Sun et al., 2019) creatively proposes entity-level masking in pre-training to incorporate human knowledge into language models. Similarly, SpanBERT (Joshi et al., 2020) modifies the masking scheme and training objectives to better represent and predict text spans. BERT-wwm (Cui et al., 2021) introduces a whole word masking strategy for Chinese language models. Outside the field of plain text, ERNIE-ViL (Yu et al., 2021) incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. Inspired by the above work, we leverage the implicit knowledge related to layout, e.g., reading order, for the understanding of visually rich documents.

Methodology
Figure 1 shows an overview of ERNIE-Layout. Given a document, ERNIE-Layout rearranges the token sequence with layout knowledge and extracts visual features from the visual encoder. The textual and layout embeddings are combined into textual features through a linear projection, and similar operations are executed for the visual embeddings. Next, these features are concatenated and fed into the stacked multi-modal transformer layers, which are equipped with the proposed spatial-aware disentangled attention mechanism. For pre-training, ERNIE-Layout adopts four pre-training tasks, including the newly proposed reading order prediction and replaced region prediction tasks, and the traditional masked visual-language modeling and text-image alignment tasks.

Serialization Module
Before feeding a visually rich document to neural networks, serialization, that is, recognizing the text and arranging it in a proper order, is a necessary step. First, an OCR tool is used to obtain the words and their coordinates in the documents. Then the traditional method arranges the identified elements from left to right and top to bottom by raster scan to generate the input sequence. Although this method is easy to implement, it cannot correctly handle documents with complex layouts. Consider the example in Figure 2: there are two tables, and the cells in these tables also contain multi-line text. Suppose we want to extract some information from them; the expected results may not be obtained under the raster-scanning order, because the words in the same cell are scattered.
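To make the baseline concrete, the raster-scanning order described above can be sketched in a few lines; the function name and the word-tuple layout are our own illustration, not from the paper.

```python
def raster_scan_order(words):
    """Arrange OCR words top-to-bottom, then left-to-right.

    Each word is a tuple (text, x0, y0, x1, y1); sorting by the top edge
    first and the left edge second yields the naive order that scatters
    multi-line cell contents across neighboring table columns.
    """
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]


# Two side-by-side table cells: "total amount" on the left,
# "due now" on the right -- raster scan interleaves the two cells.
cells = [
    ("total", 0, 0, 40, 10), ("due", 100, 0, 140, 10),
    ("amount", 0, 12, 40, 22), ("now", 100, 12, 140, 22),
]
# raster_scan_order(cells) reads across rows: total, due, amount, now
```

The interleaved output illustrates why a reader (or a model) loses the cell-level grouping under this order.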
Inspired by human reading habits, we adopt Document-Parser, an advanced document layout analysis toolkit based on Layout-Parser, to serialize these documents. As shown in Figure 1, based on the words and their boxes recognized by OCR, it first detects document elements (e.g., paragraphs, lists, tables, figures), and then uses specific algorithms, tailored to the characteristics of different elements, to obtain the logical relationship between words and thus the proper reading order.
To quantitatively analyze the benefits of layout knowledge enhanced serialization, we take perplexity (PPL), calculated by GPT-2 (Radford et al., 2019), as the evaluation metric. PPL is widely used for measuring the performance of language models. From Figure 2, we find that the input sequence serialized by Document-Parser has a lower PPL than the raster-scanning order. More implementation details and cases are provided in Appendix A.1.
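As a reminder of how the PPL metric behaves, here is a minimal sketch: it assumes per-token log-probabilities have already been produced by a language model such as GPT-2 (the helper name is ours), and simply exponentiates the average negative log-likelihood.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence from per-token natural-log
    probabilities: exp of the average negative log-likelihood.
    A better-ordered sequence receives higher token probabilities
    from the language model, hence a lower perplexity.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A sequence whose tokens each have probability 0.5 has PPL close to 2.
ppl = perplexity([math.log(0.5)] * 4)
```

This is why a well-serialized sequence, being more predictable word by word, scores a lower PPL than the raster-scanning order.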

Input Representation
The input sequence of ERNIE-Layout includes a textual part and a visual part, and the representation of each part is a combination of its modal features and layout embeddings (Xu et al., 2021).

Text Embedding. The document tokens after the serialization module are used as the text sequence. Following the pre-processing of BERT-style models (Devlin et al., 2019), two special tokens [CLS] and [SEP] are appended at the beginning and end of the text sequence, respectively. Finally, the text embedding of the token sequence T is expressed as:

$$T_i = E_{tk}(t_i) + E_{1p}(i) + E_{tp}(\mathrm{[T]})$$

where $E_{tk}$, $E_{1p}$, $E_{tp}$ respectively denote the token embedding, 1D position embedding, and token type embedding layers.
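The embedding sum above can be sketched as follows; the tables here are random stand-ins for the learned layers, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_types, d = 100, 16, 2, 8
E_tk = rng.normal(size=(vocab, d))    # token embedding table
E_1p = rng.normal(size=(max_len, d))  # 1D position embedding table
E_tp = rng.normal(size=(n_types, d))  # token type embedding table


def text_embedding(token_ids, type_id=0):
    """Sum the token, 1D-position, and token-type embeddings
    for every position of the serialized token sequence."""
    pos = np.arange(len(token_ids))
    return E_tk[token_ids] + E_1p[pos] + E_tp[type_id]


emb = text_embedding(np.array([5, 7, 9]))  # shape (3, d)
```

The same additive pattern is reused for the visual part, with the linear projection of patch features replacing the token lookup.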
Visual Embedding. To extract the visual features of documents, we employ Faster-RCNN (Ren et al., 2015) as the backbone of the visual encoder. In particular, the document image is resized to 224×224 and fed into the visual backbone, and an adaptive pooling layer converts the output into a feature map with a fixed width W and height H (here, we set both to 7). Next, we flatten the feature map into a visual sequence V, and project each visual token to the same dimension as the text embedding with a linear layer $F_{vs}(\cdot)$. Similarly, the 1D position and the token type [V] are taken into consideration for the generation of the visual embedding:

$$V_i = F_{vs}(v_i) + E_{1p}(i) + E_{tp}(\mathrm{[V]})$$

Layout Embedding. For each textual token, the OCR tool provides its 2D coordinates together with the width and height of the bounding box $(x_0, y_0, x_1, y_1, w, h)$, where $(x_0, y_0)$ denotes the upper left corner of the bounding box, $(x_1, y_1)$ denotes the bottom right corner, $w = x_1 - x_0$, $h = y_1 - y_0$, and all the coordinate values are normalized into the range [0, 1000]. For the visual tokens, similar calculations can also be performed. To look up the layout embeddings of a textual/visual token, we construct separate embedding layers in the horizontal and vertical directions:

$$L_i = E_{2x}(x_0, x_1, w) \oplus E_{2y}(y_0, y_1, h)$$

where $E_{2x}$ is the x-axis embedding layer and $E_{2y}$ denotes the y-axis embedding layer.
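The normalization of pixel coordinates into the [0, 1000] layout grid can be sketched as below; the helper name is ours.

```python
def normalize_bbox(x0, y0, x1, y1, page_w, page_h):
    """Scale pixel coordinates into the [0, 1000] layout grid used by
    the layout embedding layers, and derive the box width and height,
    giving the (x0, y0, x1, y1, w, h) layout features."""
    nx0 = int(1000 * x0 / page_w)
    ny0 = int(1000 * y0 / page_h)
    nx1 = int(1000 * x1 / page_w)
    ny1 = int(1000 * y1 / page_h)
    return (nx0, ny0, nx1, ny1, nx1 - nx0, ny1 - ny0)
```

Because the grid is fixed, boxes from pages of any resolution index into the same x-axis and y-axis embedding tables.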
To achieve the ultimate input representation H of ERNIE-Layout, we integrate the embedding of each textual and visual token with its corresponding layout embeddings. Finally, the textual and visual embeddings are combined into a long sequence of length N + HW, where N is the max length of the textual part:

$$H = [\,T_1 + L_1, \ldots, T_N + L_N,\; V_1 + L_{N+1}, \ldots, V_{HW} + L_{N+HW}\,] \quad (4)$$

Multi-modal Transformer
In the final input representation, textual and visual tokens are spliced together, and the self-attention mechanism in the transformer supports their layer-aware cross-modal interaction. However, as a unique modality, layout features should be involved in the calculation of attention weights, and the tightness between them and the contents (collectively referring to text and image) should also be taken into account explicitly. Inspired by the disentangled attention of DeBERTa (He et al., 2021), in which the attention weights among tokens are computed using disentangled matrices on their contents and relative positions, we propose spatial-aware disentangled attention for the multi-modal transformer to enable the participation of layout features. Firstly, we take the 1D position as an example to define the relative distance $\delta_{1p}(i, j)$ between tokens i and j, and the definition in the x-axis and y-axis directions of the 2D layout is the same:

$$\delta_{1p}(i, j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise} \end{cases}$$

where k is the maximum relative distance. Next, to construct relative position vectors consistent with the input dimension, we introduce three relative position embedding tables for the 1D position, 2D x-axis, and 2D y-axis. After looking up the embedding tables, a series of projection matrices map these relative position vectors, as well as the content vectors, into $Q^{\star}, K^{\star}, V^{\star}$ in the attention mechanism, where $\star \in \{ct, 1p, 2x, 2y\}$. In the process of attention calculation, we decouple the raw score into four parts to realize the in-depth exchange of 1D/2D features and contents:

$$\hat{A}_{ij} = Q^{ct}_i {K^{ct}_j}^{\top} + Q^{ct}_i {K^{1p}_{\delta_{1p}(i,j)}}^{\top} + Q^{ct}_i {K^{2x}_{\delta_{2x}(i,j)}}^{\top} + Q^{ct}_i {K^{2y}_{\delta_{2y}(i,j)}}^{\top} \quad (7)$$

Finally, all these attention scores are summed up to get the attention matrix $\hat{A}$. With the scaling and normalization operations, the output of spatial-aware disentangled attention is:

$$H^{out} = \mathrm{softmax}\!\left(\frac{\hat{A}}{\sqrt{4d}}\right) V^{ct}$$
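To make the four-way score decomposition concrete, here is a toy numpy sketch of a single attention head; the clipping constant, random initialization, and scaling choice are illustrative only, since the real model learns all projections and relative-position tables.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def rel_bucket(pos, k=4):
    """delta(i, j): clipped relative distance mapped into [0, 2k)."""
    d = pos[:, None] - pos[None, :]
    return np.clip(d + k, 0, 2 * k - 1)


def spatial_disentangled_attention(H, pos1d, pos2x, pos2y, rng, k=4):
    """Attention with scores decoupled into content, 1D-position,
    2D x-axis, and 2D y-axis parts, then summed, scaled and
    softmax-normalized (toy single-head version)."""
    n, d = H.shape
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    K1p, K2x, K2y = (rng.normal(size=(2 * k, d)) for _ in range(3))
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    score = Q @ K.T                        # content-to-content part
    for pos, Krel in ((pos1d, K1p), (pos2x, K2x), (pos2y, K2y)):
        idx = rel_bucket(pos, k)           # (n, n) bucket indices
        # content-to-position: each query attends to the key vector of
        # its relative distance to every other token
        score += np.einsum("id,ijd->ij", Q, Krel[idx])
    A = softmax(score / np.sqrt(4 * d))    # scale for the four parts
    return A @ V, A


rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))
out, A = spatial_disentangled_attention(
    H, np.arange(5), np.array([0, 0, 1, 1, 2]), np.array([0, 1, 0, 1, 2]), rng
)
```

The point of the decomposition is that two tokens can attract each other either because their contents match or because their 1D/2D relative placement is informative, and the model can weigh these signals separately.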

Pre-training Tasks
There are four pre-training tasks in ERNIE-Layout.
We design reading order prediction and replaced region prediction, and borrow masked visual-language modeling and text-image alignment from LayoutLMv2 (Xu et al., 2021), so that the model learns layout knowledge and fuses the various multi-modal information.
Reading Order Prediction. The serialization result consists of several text segments, each including a series of words and their 2D coordinates. Based on this knowledge, we organize the input words in a proper reading order. However, there is no explicit boundary between text segments in the input sequence received by the transformer. To make the model understand the relationship between layout knowledge and reading order, and still work well when receiving input in an inappropriate order, we propose Reading Order Prediction (ROP) and let the attention matrix $\hat{A}$ carry the knowledge about reading order. In this way, we give $\hat{A}_{ij}$ an additional meaning, i.e., the probability that the j-th token is the next token of the i-th token. Besides, the ground truth is a 0-1 matrix G, where 1 indicates that there is a reading order relationship between the two tokens and 0 indicates there is none; for an end position, the next token is itself. In pre-training, we calculate the loss with cross-entropy:

$$\mathcal{L}_{ROP} = -\frac{1}{N}\sum_{i}\sum_{j} G_{ij} \log \hat{A}_{ij}$$

Replaced Region Prediction. In the visual encoder, each document image is processed into a sequence with a fixed length HW. To enable the model to perceive the fine-grained correspondence between image patches and text with the help of layout knowledge, we propose Replaced Region Prediction (RRP). Specifically, 10% of the patches are randomly selected and replaced with patches from another image, and the processed image is encoded by the visual encoder and input into the multi-modal transformer.
Then, the [CLS] vector output by the transformer is used to predict which patches are replaced, so the loss of this task is:

$$\mathcal{L}_{RRP} = -\sum_{i} \left[ G_i \log P_i + (1 - G_i) \log(1 - P_i) \right]$$

where $G_i$ is the golden label of the i-th patch (1 if replaced) and $P_i$ is the normalized probability of the prediction.

Masked Visual-Language Modeling. Similar to masked language modeling (MLM), the objective of masked visual-language modeling (MVLM) is to recover the masked text tokens based on their text context and the whole set of multi-modal clues.
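The ROP supervision described above can be sketched as a toy numpy version; reading "end position" as the last token of each segment is our interpretation, and all names are ours.

```python
import numpy as np


def rop_targets(segments, n):
    """Build the 0-1 reading-order matrix G: G[i, j] = 1 iff token j
    is the next token of token i within an arranged segment; the last
    token of a segment points to itself (our reading of 'end position')."""
    G = np.zeros((n, n))
    for seg in segments:                  # each seg: token indices in order
        for a, b in zip(seg, seg[1:]):
            G[a, b] = 1.0
        G[seg[-1], seg[-1]] = 1.0
    return G


def rop_loss(A_hat, G, eps=1e-9):
    """Cross-entropy between the attention-derived next-token
    probabilities A_hat and the ground-truth matrix G."""
    return -np.sum(G * np.log(A_hat + eps)) / len(A_hat)


# Two segments over five tokens: [0, 1, 2] and [3, 4].
G = rop_targets([[0, 1, 2], [3, 4]], n=5)
```

A perfectly confident prediction matching G drives the loss to zero, while a flat attention matrix is heavily penalized.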
Text-Image Alignment. Besides the image-side cross-modal task RRP, we also adopt Text-Image Alignment (TIA) as a text-side task to help the model learn the spatial correspondence between image regions and the coordinates of bounding boxes. Here, some text lines are randomly selected, and their corresponding regions are covered on the document image. Then, a classification layer is introduced to predict whether each text token is covered.
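The patch-replacement corruption used by RRP can be sketched as follows on flattened patch vectors; the helper name, the fixed seed, and the use of same-index patches from the other image are our own simplifications.

```python
import numpy as np


def replace_regions(patches, other, ratio=0.1, rng=None):
    """RRP corruption (sketch): replace a random ~10% of this
    document's patch vectors with patches from another image, and
    return the corrupted patches plus the 0/1 replacement labels
    that the model must predict."""
    rng = rng or np.random.default_rng(0)
    n = len(patches)
    n_rep = max(1, int(ratio * n))
    idx = rng.choice(n, size=n_rep, replace=False)
    out = patches.copy()
    out[idx] = other[idx]
    labels = np.zeros(n, dtype=int)
    labels[idx] = 1
    return out, labels


# A 7x7 = 49-patch document (all zeros) corrupted with ones.
patches = np.zeros((49, 4))
other = np.ones((49, 4))
out, labels = replace_regions(patches, other)
```

Detecting the swapped patches forces the visual features to stay consistent with the surrounding text and layout, which is the fine-grained correspondence RRP targets.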
To sum up, the final pre-training objective is the sum of the four task losses:

$$\mathcal{L} = \mathcal{L}_{ROP} + \mathcal{L}_{RRP} + \mathcal{L}_{MVLM} + \mathcal{L}_{TIA}$$

Experiments

Datasets
For the fairness of the experiments, we only use layout knowledge enhanced serialization to rearrange the reading order of the pre-training data, which means that ERNIE-Layout receives the same input as the compared methods in the fine-tuning phase.

Pre-training. Following the popular choice in VrDU, we crawl data homologous to the IIT-CDIP Test Collection (Lewis et al., 2006) from the tobacco documents website, which contains over 30 million scanned document pages, and randomly select 10 million pages from them as the pre-training data.

Fine-tuning. We carry out broad experiments on various downstream VrDU tasks and datasets.

Results
Key Information Extraction. Table 3 shows the results on four datasets, where we utilize the entity-level F1 score to evaluate these sequence labeling tasks. ERNIE-Layout achieves new state-of-the-art results on FUNSD, CORD, and Kleister-NDA, and competitive performance on SROIE. It is worth mentioning that, on FUNSD, ERNIE-Layout obtains a significant and stable improvement of 7.98% (with a standard deviation of 0.0011) over the previous best results. The above phenomena are enough to verify the effectiveness of our design philosophy of mining and utilizing layout knowledge in document pre-training models.

Document Question Answering. Table 4 shows the results on DocVQA, where ERNIE-Layout_large achieves 0.9627 ANLS, surpassing StructuralLM_large (0.9608), LayoutLMv2_large (0.9564), and TILT_large (0.9552), and brings an exciting performance improvement to its text-only backbone (almost double the increase of LayoutLMv2 over its own backbone). Furthermore, we achieve top-1 on the DocVQA leaderboard with an ensemble.

Document Image Classification. Table 5 shows the classification accuracy on RVL-CDIP, which again confirms the effectiveness of ERNIE-Layout in general document understanding. Unlike the key information extraction and document question answering tasks focusing on multi-modal semantic understanding, document image classification requires a macro perception of text content and document layout. Although our pre-training tasks pay attention to fine-grained cross-modal matching, ERNIE-Layout still refreshes the best performance on this cross-grained task.

Analysis
We further conduct analysis experiments to study the effectiveness of the proposed pre-training tasks, attention mechanism, and serialization module. We select FUNSD and CORD as the evaluation datasets, keep all ablations sharing the same hyper-parameter settings, and report the average of five runs with different random seeds.
Effectiveness of Pre-training Tasks. In this experiment, we start with the basic MVLM task to implement the baseline model (#1), and integrate new tasks step by step until the final model contains all four pre-training tasks (#5). From Table 6, we observe that RRP brings an improvement of 0.95% on FUNSD, demonstrating the benefit of fine-grained cross-modal interaction. When incorporating ROP, the performance on FUNSD is further increased by 1.3%. We consider that ROP facilitates the model to learn a better representation that contains reading order knowledge.
Effectiveness of Attention Mechanisms. LayoutLMv2 (Xu et al., 2021) initially proposed spatial-aware self-attention to consider layout features in the attention calculation, and many subsequent methods follow this idea. From Table 6, we find that adopting such a mechanism can boost the performance of downstream tasks (#4 vs. #6). Meanwhile, disentangling attention into position and content parts is another efficient way to earn further performance gains (#5 vs. #6).
Effectiveness of Serialization Modules. Here we explore the impact of different serialization modules on the downstream VrDU tasks. As shown in Table 7, with the layout-knowledge-based serialization modules (#2, #3), the model achieves better performance (even without the disentangled attention). We attribute the improvement to the fact that, although the advanced serialization is not used on the fine-tuning datasets, the model has learned to understand the proper reading order of documents during pre-training.

Conclusion
In this paper, we propose ERNIE-Layout to integrate layout knowledge into document pre-training models from two aspects: serialization and attention. ERNIE-Layout rearranges the recognized words of documents, which achieves considerable improvement on downstream tasks over the original raster-scanning order. Besides, we also design a novel attention mechanism to help ERNIE-Layout build better interaction between text/image and layout features.

An important preprocessing step for document understanding is serializing the extracted document tokens. The popular method performs this serialization directly on the OCR output in raster-scanning order, and is sub-optimal though simple to implement. With the Layout-Parser and Table-Parser in the Document-Parser toolkit, the order of the tokens is further rearranged according to layout knowledge. During parsing, tables and figures are detected as spatial layouts, and free texts are processed by paragraph analysis, which combines heuristics and detection models to obtain the paragraph layout information and the upper-lower boundary relationships.
To validate the effectiveness of our method, we use an open-sourced language model, GPT-2 (Radford et al., 2019), to calculate the PPL of the token sequences serialized by the raster-scanning order and by Document-Parser, respectively. Since documents with complex layouts only account for a small proportion of all documents, in a test of 10,000 documents the average PPL only drops by about 1 point. However, for documents with complex layouts, as shown in Table 8, Document-Parser shows great advantages. An example, extracted from the third image in Table 8, is shown in Figure 4 to compare the sequences serialized by Raster-Scan and Document-Parser.

A.2 More Details about Multi-modal Transformer

Section 3.3 describes the proposed spatial-aware disentangled attention for the multi-modal transformer through formulas. To facilitate intuitive understanding, we also supplement the flow chart of the calculation in Figure 3.

A.3 More Details about Experiments
A.3.1 Fine-tuning Datasets

FUNSD (Jain and Wigington, 2019) is a dataset for form understanding on noisy scanned documents that aims at extracting values from forms; it comprises 199 real, fully annotated, scanned forms. The training set contains 149 samples, and the test set contains 50 samples. We use the official OCR annotations. Following previous methods, we adopt entity-level F1 as the evaluation metric. Like StructuralLM (Li et al., 2021a), we use the cell-level layout information when fine-tuning.

CORD (Park et al., 2019) is a consolidated dataset for receipt parsing, presented as the first step towards post-OCR parsing tasks. CORD consists of thousands of Indonesian receipts, including images, box/text annotations for OCR, and multi-level semantic labels for parsing. The training, validation, and test sets contain 800, 100, and 100 receipts, respectively. We use the official OCR annotations and entity-level F1 as the evaluation metric.
SROIE (Huang et al., 2019) is a scanned receipt OCR and key information extraction dataset, which covers important aspects related to the analysis of scanned receipts. The training and test sets contain 626 and 347 samples, respectively. This task requires the model to extract from each receipt the values of four predefined keys: company, date, address, and total. We use the official OCR annotations and entity-level F1 as the evaluation metric.
Kleister-NDA (Graliński et al., 2021) is provided for the key information extraction task and involves a mix of scanned and born-digital long formal documents. The training, validation, and test sets contain 254, 83, and 203 samples, respectively. Since the test set is not publicly available, we report the entity-level F1 score on the validation set, computed by the official evaluation tools. The task aims to extract the values of four predefined keys: date, jurisdiction, party, and term.
RVL-CDIP (Harley et al., 2015) is a document classification dataset consisting of grayscale document images. The training, validation, and test sets contain 320,000, 40,000, and 40,000 document images, respectively. The document images are categorized into 16 classes, with 25,000 images per class.

Figure 1 :
Figure 1: The architecture and pre-training objectives of ERNIE-Layout. The serialization module is introduced to correct the raster-scan order, and the visual encoder extracts the corresponding image features. With the spatial-aware disentangled attention mechanism, ERNIE-Layout is pre-trained with four tasks.

Figure 2 :
Figure 2: The effect of layout knowledge enhanced serialization compared with the vanilla raster-scanning order. By using Document-Parser, the perplexity of a document with a complex layout is significantly reduced.

Figure 3 :
Figure 3: The internal working principle of spatial-aware disentangled attention.

Figure 4 :
Figure 4: An example of a document with a complex layout. The serialization result with the raster-scanning order is "... Session Chair: Session Chair: Session Chair: Tuula Hakkarainen ...", while serialization with Document-Parser yields "... Session Chair: Tuula Hakkarainen Session Chair: Frank Markert ...", which is more consistent with human reading habits.

Table 1 :
Statistics of the datasets for downstream tasks. Table 1 shows their brief statistics; more details are included in Appendix A.3.

Table 3 :
Results (entity-level F1 score) of ERNIE-Layout and previous methods on the Key Information Extraction task (FUNSD, CORD, SROIE, Kleister-NDA). The highest and second-highest scores are bolded and underlined, respectively.

Table 4 :
Results (Average Normalized Levenshtein Similarity, ANLS) of ERNIE-Layout and previous methods on the Document Question Answering task (DocVQA). "-" means the fine-tuning set is not clearly described in the original paper. △ANLS means the ANLS difference between a multi-modal model and its corresponding text-only model, where ERNIE-Layout is initialized from RoBERTa and LayoutLMv2 is initialized from UniLMv2.

Table 5 :
Results (Accuracy) of ERNIE-Layout and previous methods on the Document Image Classification task (RVL-CDIP).
Unfortunately, UniLMv2 does not release any pre-training code or pre-trained model, and we can only use the parameters of RoBERTa to initialize our ERNIE-Layout. Nevertheless, we are surprised that ERNIE-Layout still achieves the highest ANLS among all compared methods.

Table 6 :
Performance analysis with different pre-training tasks and attention mechanisms, in which SADA refers to the spatial-aware disentangled attention in ERNIE-Layout and SASA refers to the spatial-aware self-attention proposed by LayoutLMv2. † indicates the added module is proposed in this paper.

Table 7 :
Performance analysis with different serialization modules, in which Raster-Scan means serialization with vanilla OCR results, while Layout-Parser and Document-Parser arrange the recognized words with the help of layout knowledge.
Extensive experiments demonstrate the effectiveness of ERNIE-Layout, and various analyses show the impact of different ways of utilizing layout knowledge on VrDU tasks.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1480-1489.

Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 3208-3216.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1441-1451.
A Appendix

A.1 More Details about Document-Parser

The Document-Parser assembles multiple modules such as document-specific OCR, Layout-Parser, and Table-Parser, in which the Layout-Parser and Table-Parser modules are crucial for incorporating layout knowledge into ERNIE-Layout.

Table 8 :
The PPL of the serialized token sequences with different methods. RS refers to the raster-scanning order and DP refers to the order with Document-Parser.