Spatial Dependency Parsing for Semi-Structured Document Information Extraction

Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem by classifying each recognized input token into one of the IOB (Inside, Outside, and Beginning) categories. However, this problem setup has two inherent limitations: (1) it cannot easily handle complex spatial relationships, and (2) it is not suitable for highly structured information, both of which are nevertheless frequently observed in real-world document images. To tackle these issues, we first formulate the IE task as a spatial dependency parsing problem that focuses on the relationships among text tokens in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser), which models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. We evaluate it on various kinds of documents such as receipts, name cards, forms, and invoices, and show that it achieves similar or better performance compared to strong baselines, including a BERT-based IOB tagger.


Introduction
Document information extraction (IE) is the task of mapping each document to a structured form that is consistent with the target ontology (e.g., a database schema), a task that has become increasingly important in both the research community and industry. In this paper, we are particularly interested in information extraction from real-world, semi-structured document images, such as invoices, receipts, and name cards, where we assume Optical Character Recognition (OCR, i.e., detecting the locations of the text tokens if the input is an image) has already been applied. Previous approaches to semi-structured document IE often treat the input as a one-dimensional sequence and formulate the task as an IOB (Inside Outside Beginning) tagging problem. In this setup, the tokens in the document (either obtained through an OCR engine or trivially parsed from a web page or PDF) are first serialized, and then an independent tagging model classifies each token of the flattened list into one of the pre-defined IOB categories (Ramshaw and Marcus, 1995; Palm et al., 2017). While these approaches are effective for relatively simple documents, their broader application in the real world is still challenging because (1) semi-structured documents often exhibit a complex layout for which the serialization algorithm is non-trivial, and (2) sequence tagging is inherently not effective for encoding multi-layer hierarchical information such as the menu tree in receipts (Fig. 1c).
To overcome these limitations, we propose SPADE (SPAtial DEpendency parser), an end-to-end, serializer-free model that is capable of extracting hierarchical information from complex documents. Rather than explicitly dividing the original problem into two independent subtasks of serialization and tagging, our model tackles the problem in an end-to-end manner by creating a directed relation graph of the tokens in the document (Fig. 1). In contrast to traditional dependency parsing, which parses the dependency structure in purely (one-dimensional) linguistic space, our approach leverages both linguistic and (two-dimensional) spatial information to parse the dependency.
We evaluate SPADE on eight document IE datasets created from real-world document images, including invoices, name cards, forms, and receipts, with varying complexity of information structure. On all of the datasets, our model shows similar or better accuracy than strong baselines including BERT-based IOB taggers, and it stands out particularly on documents with complex layouts (Table 3). These results demonstrate the effectiveness of our end-to-end, graph-based paradigm over the existing sequential tagging approaches. In short, our contributions are threefold. (1) We present a novel view that information extraction for semi-structured documents can be formulated as a dependency parsing problem in two-dimensional space. (2) We propose SPADE for spatial dependency parsing, which is capable of efficiently constructing a directed semantic graph of text tokens in semi-structured documents. (3) SPADE achieves similar or better accuracy than the previous state of the art or strong BERT-based baselines on eight document IE datasets.
Figure 1: Illustration of the spatial dependency parsing problem. Receipt parsing is explained in detail with three subfigures: (a) first, text tokens and their coordinates are extracted by OCR; (b) next, the relations between tokens are classified into two types: rel-s for serialization and information type (field) classification, and rel-g for inter-grouping between fields (the numbers inside the circles in (b) indicate the box numbers in (a)); (c) the final parse is generated by decoding the graph, e.g. { store_name: "DEEP COFFEE", store_tel: "29-979-2458" }, ..., { menu_name: "volcano iced coffee", count: "4", unit_price: "1,000", price: "4,000" }, { menu_name: "citron tea", count: "1", unit_price: "2,000", price: "2,000" }, .... (d) A sample name card and its spatial dependency parse. (e) Other conceptual examples showing the versatility of the spatial dependency parsing approach for document IE.

Related Work
The recent surge of interest in automatic information extraction from semi-structured documents is well reflected in the growing number of publications from both the research community and industry (Katti et al., 2018; Qian et al., 2019; Liu et al., 2019; Denk and Reisswig, 2019; Hwang et al., 2019; Jaume et al., 2019; Zhong et al., 2019; Rausch et al., 2019; Yu et al., 2020; Majumder et al., 2020; Lockard et al., 2020; Garncarek et al., 2020; Lin et al., 2020; Xu et al., 2020; Powalski et al., 2021; Wang et al., 2021; Hong et al., 2021). Below, we summarize some closely related works published before the major development of SPADE.
Serialized IE Previous semi-structured document information extraction (IE) methods often require the input text boxes (obtained from OCR) to be serialized into a single flat sequence. Hwang et al. (2019) and Denk and Reisswig (2019) combine a manually engineered text serializer, which turns the OCR text boxes into a sequence, with a Transformer-based encoder, BERT (Devlin et al., 2018), that performs IOB tagging on the sequence or semantic segmentation from images. In contrast to SPADE, these models rely on the serialization of the tokens, and thus it is difficult to apply them flexibly to documents with complex layouts such as multi-column or distorted documents. Xu et al. (2019) propose LayoutLM, which jointly embeds the image segments, text tokens, and positions of the tokens in an image to build a pretrained model for document understanding. However, LayoutLM still requires a careful serialization of the tokens, as it relies on the position embeddings of BERT. Also, it is only evaluated on classification for the downstream task.
Serializer-free IE Existing serializer-free methods mostly extract flat key-value pairs, as they still formulate the task as tagging the text tokens. They fundamentally differ from SPADE, which generates a structured output that captures the full information hierarchy represented in the document. Chargrid (Katti et al., 2018) performs semantic segmentation on invoice images to extract target key-value pairs. Although Chargrid uses additional "bounding boxes" for inter-grouping of certain fields, its application to documents that have more than two levels of information hierarchy is non-trivial. Also, when fields that belong to the same group are remotely located, the bounding boxes may need to be modified to have a more complex geometrical shape to avoid overlap between the boxes. (2020) utilize a graph convolution network to contextualize the tokens in a document and a bidirectional LSTM with CRF to predict the IOB tags. However, the range of possible parse generations is limited, as IOB tagging can be performed only within each OCR bounding box, ignoring inter-box relationships. In contrast, SPADE predicts both the intra-box and the inter-box relationships by constructing a dependency graph among the tokens. Lockard et al. (2019, 2020) also utilize a graph to extract semantic relations from semi-structured web pages. The graph is constructed based on rules from the "structured html DOM" and is mainly used for information encoding. SPADE, on the other hand, accepts "unstructured text distributed in 2D" and generates graphs as the result of decoding (in a data-driven way).
Dependency parsing Dependency parsing is the task of obtaining the syntactic or semantic structure of a sentence by defining the relationships between the words in the sentence (Zettlemoyer and Collins, 2012; Peng et al., 2017; Dozat and Manning, 2018). The relations are often expressed as directed, labeled arcs. In our work, we view the problem of information extraction for semi-structured documents as a spatial dependency parsing task in which two-dimensional spatial information is the primary consideration. This setup enables SPADE to flexibly handle documents with complicated layouts while representing the full information hierarchy.

Problem definition
In this section, we first describe the task of information extraction for semi-structured documents, and we briefly discuss how the task was approached in the past as a sequence tagging problem. Then we formulate it as a spatial dependency parsing problem. In Section 4, we show how we design our model for the newly formulated problem.

Semi-structured document IE
Document IE is often defined as the extraction of structured information (e.g., key-value pairs) from documents. For semi-structured documents, the task becomes more challenging, mainly due to two factors: (1) complex spatial layout and (2) hierarchical information structure. In the simplest case, both factors are minimally present: the text is strictly a linear sequence, and the desired output is simply a list of fields, similar to the Named Entity Recognition (NER) task. However, the problem becomes more difficult when at least one of the factors is significant. In name cards, spatial relationships can be tricky; Fig. 1d shows an example where a naïve left-to-right serialization would fail because the company name ("Physics Company") is tilted. In receipts, the hierarchical information structure complicates the problem. For example, in Fig. 1a, the words "volcano" (box 5), "iced" (box 6), and "coffee" (box 10) together form a single menu name field, and this field forms a group in the second hierarchical layer together with the count field (box 7), the unit price field (box 8), and the price field (box 9).
Other conceptual examples are shown in Fig. 1e: documents that have three information layers (left), multiple columns (middle), and a table (right).

Previous formulation: Sequence tagging
As mentioned, IOB sequence tagging is appropriate for document IE when the layout and the information structure are simple (Ramshaw and Marcus, 1995; Lample et al., 2016; Chiu and Nichols, 2016; Ma and Hovy, 2020). When one of the two complicating factors is present, however, one has to adopt an ad hoc solution to work around the inherent limitation of IOB tagging.
In the case of complex spatial relationship (e.g., name card), an advanced, dedicated serialization method can be considered. However, it may require layout-specific manual engineering, which becomes more difficult for documents such as name cards that exhibit diverse layouts.
In the case of complex information structure (e.g., receipts), one can consider augmenting each IOB tag with higher-layer information. For instance, in a typical IOB setting, the menu name field requires two tags, namely menu_name_B and menu_name_I. To model the second-layer information (inter-grouping of fields), menu_name_B can be augmented into two tags, namely B2_menu_name_B and I2_menu_name_B, where B2 and I2 indicate the beginning and the inside of the hierarchy's second layer. While effective for some applications, this method would not generalize well to an arbitrary depth, as it requires more tags for each additional layer.
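The growth of the tag set can be illustrated with a short sketch. It assumes that every field tag is prefixed with one B/I marker per higher layer (the text only shows this for the B tag of menu name); the function name and the underscore-joined tag spelling are illustrative choices, not part of the original method.

```python
from itertools import product


def augmented_iob_tags(fields, n_layers):
    """Enumerate IOB tags when every field tag is prefixed with a B/I marker
    for each grouping layer above the field layer.

    n_layers=1 gives plain IOB tagging. Prefixing *every* tag at *every*
    layer is an assumption made for this illustration.
    """
    tags = ["O"]
    # One B/I choice per higher layer, e.g. ["B2", "I2"] for n_layers=2.
    upper_layers = [[f"B{k}", f"I{k}"] for k in range(n_layers, 1, -1)]
    for prefix in product(*upper_layers):
        for field in fields:
            for bi in ("B", "I"):
                tags.append("_".join([*prefix, field, bi]))
    return tags


if __name__ == "__main__":
    for depth in (1, 2, 3):
        n_tags = len(augmented_iob_tags(["menu_name", "count"], depth))
        print(depth, n_tags)
```

Under this assumption the number of tags is 2^L * n_field + 1 for L layers, so each extra layer doubles the label space that the tagger has to learn.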

Our formulation: Spatial dependency parsing
To better model spatial relationships and hierarchical information structure in semi-structured documents, we formulate the IE problem as a "spatial dependency parsing" task by constructing a dependency graph with tokens and fields as the graph nodes (one node per token and per field type). This is demonstrated in Fig. 1, where empty blue circles are text nodes and filled blue circles are field nodes.
Although the spatial layout of semi-structured documents is diverse, it can be considered the realization of mainly two abstract properties between each pair of nodes: (1) rel-s for the ordering and grouping of tokens belonging to the same information category (blue arrows in Fig. 1b), and (2) rel-g for the inter-group relation between grouped tokens or groups (orange arrows in the same figure). Connecting a field node to a text node indicates that the text is classified into the field. For example, "volcano iced coffee" in Fig. 1a is classified as a menu name by being attached to the menu name field node with blue arrows, and it is connected with "x4", "@1,000", and "4,000" with orange arrows to indicate the hierarchical information among the groups. The dependency graphs of name cards and other conceptual examples are also shown in Fig. 1d and e.

Model
To perform the spatial dependency parsing task introduced in the previous section in an end-to-end fashion, we propose SPADE, which consists of (1) a spatial text encoder, (2) a graph generator, and (3) a graph decoder. The spatial text encoder and the graph generator are trained jointly. The graph decoder is a deterministic function (without trainable parameters) that maps the graph to a valid parse of the output structure.

Spatial text encoder
The spatial text encoder is based on a 2D Transformer architecture. Unlike the original Transformer (Vaswani et al., 2017), there is no order among the input tokens, making the model invariant under permutation of the input tokens. Inspired by Transformer-XL (Dai et al., 2019), the attention weight between each query and key vector is computed by Equation 1, where $q_i$ is the query vector of the i-th input token, $k_j$ is the key vector of the j-th input token, $r_{ij}$ is the relative spatial vector of the j-th token with respect to the i-th token, and $b^{key|rel}_i$ is a bias vector. In the original Transformer, only the first term of Equation 1 is used.
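Written out, Equation 1 plausibly takes the following form. This is a sketch that assumes the four-term decomposition of Transformer-XL with one bias vector for the key term and one for the relative term, which is what the definitions above suggest; the symbol $a_{ij}$ for the unnormalized attention weight is a name introduced here.

```latex
\[
a_{ij} \;=\; q_i^{\top} k_j
       \;+\; q_i^{\top} r_{ij}
       \;+\; \bigl(b_i^{\mathrm{key}}\bigr)^{\top} k_j
       \;+\; \bigl(b_i^{\mathrm{rel}}\bigr)^{\top} r_{ij}
\]
```

Under this reading, the first term is the usual content-based score, and the remaining three terms inject the relative spatial vector and the two biases.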
The relative spatial vector $r_{ij}$ is constructed as follows (Fig. 2c). First, the relative coordinates between each pair of tokens are computed. Next, the coordinates are quantized into integers and embedded using sin and cos functions (Vaswani et al., 2017). The physical distance and the relative angle between each pair of tokens are also embedded in a similar way. Finally, the four embedding vectors are linearly projected (with a trainable projection matrix) and concatenated at each encoder layer.
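The first steps of this construction can be sketched as follows. The sketch uses token center points as coordinates and picks a quantization step, a number of angle bins, and an embedding size arbitrarily; all of these, as well as the function names, are assumptions for illustration. The trainable projection and the concatenation are omitted.

```python
import math


def sinusoidal_embedding(value, dim=8, base=10000.0):
    """Embed an integer with interleaved sin/cos features of varying frequency."""
    emb = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        emb.extend([math.sin(value * freq), math.cos(value * freq)])
    return emb


def relative_spatial_features(center_i, center_j, step=10.0, n_angle_bins=60, dim=8):
    """Return four embeddings describing token j relative to token i:
    relative x, relative y, distance, and angle (in that order).

    step, n_angle_bins, and dim are hypothetical hyperparameters.
    """
    dx = center_j[0] - center_i[0]
    dy = center_j[1] - center_i[1]
    dist = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)  # radians in [-pi, pi]
    quantized = [
        round(dx / step),
        round(dy / step),
        round(dist / step),
        round(angle / (2 * math.pi) * n_angle_bins),
    ]
    return [sinusoidal_embedding(q, dim) for q in quantized]


if __name__ == "__main__":
    feats = relative_spatial_features((0.0, 0.0), (30.0, 40.0))
    print(len(feats), len(feats[0]))
```

Because only differences between token positions enter the features, translating the whole document leaves them unchanged.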

Graph generator
As discussed in Section 3.3 and shown in Fig. 1, every token corresponds to a node, and each pair of nodes forms one of the two relations (or no relation): (1) rel-s for serializing tokens within the same field, and (2) rel-g for inter-grouping between fields. The dependency graph can be represented by a binary matrix $M^{(r)}$ for each relation type $r$ (Fig. 2b), where $M^{(r)}_{ij} = 1$ if there exists a directed edge from the i-th node to the j-th token and 0 otherwise. Each $M^{(r)}$ consists of $n_{field} + n_{text}$ rows and $n_{text}$ columns, where $n_{field}$ and $n_{text}$ represent the number of field types and the number of tokens, respectively. The graph generation task now becomes predicting the binary matrix.
We obtain $M^{(r)}$ as follows. The probability $p^{(r)}_{ij}$ that there exists a directed edge from the i-th node to the j-th token is computed by Equation 2, where $u^{(field)}_i$ represents the trainable embedding vector of the i-th field type node (filled blue circles in Fig. 1), $\{v_i\}$ is the set of contextualized token vectors from the encoder, $W$ stands for an affine transformation, $h$ is the embedding vector of the head token, and $d$ is that of the dependent token.
$M^{(r)}_{ij}$ is obtained by binarizing $p^{(r)}_{ij}$ as follows:
$M^{(r)}_{ij} = 1$ if $p^{(r)}_{ij} \geq p_{th}$, and $M^{(r)}_{ij} = 0$ otherwise. (3)
The recall rate of edges can be controlled by varying the threshold value $p_{th}$. Here, we set $p_{th} = 0.5$.
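The binarization step and the matrix layout can be made concrete with a small sketch. The probability values are made up, the node labels (`field0`, `text0`, ...) are illustrative names, and treating the threshold as inclusive is an assumption.

```python
def binarize_edges(prob, p_th=0.5):
    """Turn an edge-probability matrix into a binary adjacency matrix."""
    return [[1 if p >= p_th else 0 for p in row] for row in prob]


def edges_from_matrix(matrix, n_field):
    """List edges as (head, tail) labels.

    The first n_field rows are field-type nodes and the remaining rows are
    text nodes; every column is a text node.
    """
    edges = []
    for i, row in enumerate(matrix):
        head = f"field{i}" if i < n_field else f"text{i - n_field}"
        edges.extend((head, f"text{j}") for j, m in enumerate(row) if m)
    return edges


if __name__ == "__main__":
    # One field type and three tokens: (1 + 3) rows, 3 columns.
    prob = [
        [0.9, 0.1, 0.2],  # field0 -> tokens
        [0.0, 0.8, 0.1],  # text0  -> tokens
        [0.0, 0.0, 0.7],  # text1  -> tokens
        [0.1, 0.0, 0.0],  # text2  -> tokens
    ]
    print(edges_from_matrix(binarize_edges(prob), n_field=1))
    print(len(edges_from_matrix(binarize_edges(prob, p_th=0.15), n_field=1)))
```

Lowering the threshold admits more edges, which is how the recall of edges is traded against precision.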
Tail collision avoidance algorithm Each node in a spatial dependency graph has a single incoming edge per relation, except in some special documents such as tables (Fig. 1e). Based on this property, we apply the following simple yet powerful tail collision avoidance algorithm: (1) at each tail node having multiple incoming edges, all edges are trimmed except the one with the highest linking probability; (2) at each head node of the trimmed edges, a new tail node is found by drawing the next most probable edge whose probability is larger than $p_{th}$ and belongs to the top three; (3) go back to Step 1 and repeat the routine until the process becomes self-consistent or the maximum iteration limit is reached (set to 20 in this paper). The algorithm prevents loops and token redundancy in parses.
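One possible reading of these three steps is sketched below. It makes simplifying assumptions that are not stated in the text: each head node keeps at most one outgoing edge, and its candidate tails are its three most probable tails with probability above $p_{th}$. The function name and the dictionary output are illustrative.

```python
def tail_collision_avoidance(prob, p_th=0.5, top_k=3, max_iter=20):
    """Resolve tails that receive several incoming edges.

    prob[i][j] is the probability of a directed edge from head i to tail j.
    Returns a dict mapping each head to its final tail (or None).
    Simplification: one outgoing edge per head.
    """
    # Candidate tails per head: top_k by probability, kept only if > p_th.
    candidates = []
    for row in prob:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        candidates.append([j for j in ranked if row[j] > p_th])
    # choice[i] indexes into candidates[i]; None means "no edge".
    choice = [0 if cands else None for cands in candidates]

    for _ in range(max_iter):
        incoming = {}
        for head, k in enumerate(choice):
            if k is not None:
                incoming.setdefault(candidates[head][k], []).append(head)
        changed = False
        for tail, heads in incoming.items():
            if len(heads) < 2:
                continue
            # Step 1: keep only the most probable incoming edge.
            best = max(heads, key=lambda h: prob[h][tail])
            for head in heads:
                if head == best:
                    continue
                # Step 2: the trimmed head moves on to its next candidate.
                nxt = choice[head] + 1
                choice[head] = nxt if nxt < len(candidates[head]) else None
                changed = True
        if not changed:  # Step 3: self-consistent, stop.
            break
    return {h: (candidates[h][k] if k is not None else None)
            for h, k in enumerate(choice)}


if __name__ == "__main__":
    prob = [[0.9, 0.6, 0.1], [0.8, 0.7, 0.2], [0.1, 0.2, 0.95]]
    print(tail_collision_avoidance(prob))
```

In the usage example, heads 0 and 1 initially compete for tail 0; head 0 wins with the higher probability, and head 1 falls back to its second candidate.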

Graph decoder
We decode the generated graph into the final parse through the following three stages: (1) SEEDING, (2) SERIALIZATION, and (3) GROUPING (Table 1). In SEEDING, field type nodes (filled circles in Fig. 1) are linked to multiple text nodes (seeds) by rel-s. In SERIALIZATION, each seed node found in the previous stage generates a directed edge (rel-s) to the next text node (i.e., serialization) recursively until there is no further node to be linked. Finally, in GROUPING, the serialized texts are grouped iteratively, constructing information layers from the top to the bottom. The total number of iterations is equal to the number of information layers minus one. To group texts using directed edges, we define a special representative field for each information layer. Then, the first token of the representative field generates directed edges to the first tokens of the other fields that belong to the same group using rel-g. For example, menu name ("volcano iced coffee") in Fig. 1a generates directed edges to the other member fields: count ("x4"), unit price ("@1,000"), and price ("4,000").
The process generates an arborescence for each field (rel-s) and group (rel-g). The resulting set of graphs has a one-to-one correspondence with the parse through detokenization. The use of beam search in SERIALIZATION does not introduce a noticeable difference in rel-s, probably due to the short decoding length of the graph (mostly less than 30). The development of a more advanced decoding algorithm that generates globally optimal multiple arborescences remains future work.
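The three stages can be sketched on the receipt example of Fig. 1. The sketch assumes the graph is already given as plain dictionaries and handles a single grouping layer only; the data layout, the function name, and the box numbers used for "DEEP" and "COFFEE" are hypothetical.

```python
def decode_parse(tokens, seeds, rel_s, rel_g):
    """Decode a spatial dependency graph into fields and groups.

    tokens: dict box id -> text
    seeds:  dict field name -> list of first-token ids (SEEDING result)
    rel_s:  dict token id -> next token id within the same field
    rel_g:  dict first-token id of a representative field -> list of
            first-token ids of the other fields in the same group
    """
    # SERIALIZATION: follow rel-s from each seed until the chain ends.
    spans = {}
    for field, firsts in seeds.items():
        for first in firsts:
            ids, cur = [], first
            while cur is not None and cur not in ids:  # guard against loops
                ids.append(cur)
                cur = rel_s.get(cur)
            spans[first] = (field, " ".join(tokens[i] for i in ids))

    # GROUPING (one layer): collect the fields linked by rel-g.
    groups, grouped = [], set()
    for head, members in rel_g.items():
        firsts = [head, *members]
        groups.append({spans[f][0]: spans[f][1] for f in firsts})
        grouped.update(firsts)
    ungrouped = [spans[f] for f in spans if f not in grouped]
    return ungrouped, groups


if __name__ == "__main__":
    tokens = {1: "DEEP", 2: "COFFEE", 5: "volcano", 6: "iced", 10: "coffee",
              7: "x4", 8: "@1,000", 9: "4,000"}
    seeds = {"store_name": [1], "menu_name": [5], "count": [7],
             "unit_price": [8], "price": [9]}
    rel_s = {1: 2, 5: 6, 6: 10}
    rel_g = {5: [7, 8, 9]}
    print(decode_parse(tokens, seeds, rel_s, rel_g))
```

Deeper hierarchies would repeat the grouping step once per additional layer, with the representative fields of one layer acting as members of the next.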
Although undirected edges can be employed for the inter-grouping of fields, the use of directed edges has the following merits: (1) an arbitrary depth of information hierarchy can be described without increasing the number of relation types (Fig. 1e) under a unified framework, and (2) a parse can be generated in a straightforward manner by iteratively selecting dependent nodes.
Table 1 (columns: Action, Input node, Graph at time t + 1): s and g stand for rel-s and rel-g respectively.

Experimental Setup

Optical character recognition
To extract the visually embedded texts from an image, we use our in-house OCR system, which consists of the CRAFT text detector (Baek et al., 2019b) and the Comb.best text recognizer (Baek et al., 2019a). The OCR models are finetuned on each of the document IE datasets. The output tokens and their spatial information on the image are used as the inputs to SPADE.

Training
We use 12 layers of the 2D Transformer encoder (Section 4.1). The parameters are initialized from bert-multilingual (Devlin et al., 2018). The ADAM optimizer (Kingma and Ba, 2015) is used with the following learning rates: 1e-5 for the encoder, 1e-4 for the graph generator, and 2e-5 for s+bert+iob2 and s_adv+bert+iob2. The decay rates are set to β1 = 0.9 and β2 = 0.999. The batch size is chosen between 4 and 12. SPADE is trained using one to eight NVIDIA V100 or P40 GPUs for two to seven days, depending on the task. The dev sets are used to pick the best model, except for the FUNSD task, in which the model is trained in two steps. First, 25 examples are sampled from the training set and used for model validation. Next, the model is further trained using the entire training set and stopped after 1000 epochs. The training dataset is augmented (1) by randomly rotating the text coordinates by -10° to +10°, (2) by distorting the whole coordinates randomly using a trigonometric function, and (3) by randomly deleting or inserting a single token with 3.3% probability each. Also, one or two random tokens from the training set are attached to the end of the text segments from an OCR bounding box with 1.7% probability each. In the namecard task, the tokens are not augmented. The identical augmentation algorithm is applied to s+bert+iob2, s_adv+bert+iob2, and SPADE.
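Two of these augmentations (rotation and single-token deletion or insertion) can be sketched as follows. Several details are assumptions because the text does not specify them: rotation is taken about the centroid of the points, the inserted token is drawn from a hypothetical `vocab` list, and it is placed at the position of a randomly chosen existing token. The trigonometric distortion and the token-attachment step are not shown.

```python
import math
import random


def rotate_points(points, degrees):
    """Rotate (x, y) points about their centroid by the given angle."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c) for x, y in points]


def augment(tokens, points, vocab, rng, max_deg=10.0, p_edit=0.033):
    """Randomly rotate a non-empty document, then delete and/or insert
    at most one token each with probability p_edit."""
    tokens = list(tokens)
    points = rotate_points(points, rng.uniform(-max_deg, max_deg))
    if len(tokens) > 1 and rng.random() < p_edit:  # delete one token
        k = rng.randrange(len(tokens))
        del tokens[k], points[k]
    if rng.random() < p_edit:  # insert one token
        k = rng.randrange(len(tokens) + 1)
        # Assumption: reuse the location of a random existing token.
        location = rng.choice(points)
        tokens.insert(k, rng.choice(vocab))
        points.insert(k, location)
    return tokens, points


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment(["a", "b", "c"], [(0, 0), (1, 0), (2, 0)], ["x"], rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.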

Evaluation metric
To evaluate the predicted parses, which consist of hierarchically organized key-value pairs (e.g., Fig. 3, and Figs. 4, 5, and 6 in the Appendix), we use the F1 score based on exact match. First, the groups of key-value pairs in the prediction and the ground truth (gt) are matched based on their string edit distance. Each key-value pair in the predicted parse is counted as a true positive if the same key-value pair exists within the corresponding group in the gt. Otherwise, it is counted as a false positive. The unmatched key-value pairs in the gt are counted as false negatives. The accuracy of dependency parsing is evaluated by computing the F1 of predicted edges. For the FUNSD dataset, entity labeling and entity linking scores are computed following the original paper (Jaume et al., 2019). See Appendix A.2 for more details.
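The counting scheme can be sketched as below. For brevity the sketch assumes that predicted and ground-truth groups have already been matched and are aligned by list index (the edit-distance-based matching step is not shown); the function name is illustrative.

```python
def parse_f1(pred_groups, gt_groups):
    """Exact-match F1 over key-value pairs.

    Each group is a dict of key -> value. Groups are assumed to be already
    matched, i.e. pred_groups[k] corresponds to gt_groups[k].
    """
    tp = fp = fn = 0
    for k in range(max(len(pred_groups), len(gt_groups))):
        pred = set(pred_groups[k].items()) if k < len(pred_groups) else set()
        gt = set(gt_groups[k].items()) if k < len(gt_groups) else set()
        tp += len(pred & gt)   # pair present in both
        fp += len(pred - gt)   # predicted but not in gt
        fn += len(gt - pred)   # in gt but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [{"menu_name": "citron tea", "count": "1", "price": "2,000"}]
    gt = [{"menu_name": "citron tea", "count": "1",
           "unit_price": "2,000", "price": "2,000"}]
    print(parse_f1(pred, gt))
```

In the usage example the prediction misses the unit price, giving three true positives, no false positives, and one false negative.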

Data statistics
We summarize the data statistics in Tables 2 and 6. The properties of each dataset and their collection processes are described in Appendix A.1.

Experimental Results
The main focus of SPADE is to handle the two challenging factors of semi-structured document information extraction, namely complex spatial relationships and highly structured information, in a generalizable way. We first show that our model can handle hierarchical structure in documents by evaluating the model on two datasets, CORD and Receipt-idn, that consist of (Indonesian) receipt images. We then show that SPADE can perform well on tasks that require modeling the complex spatial relationships in documents by reporting the performance on name card IE, where the spatial layout is more complex than in receipts. Then the evaluation on the invoice dataset shows the advantage of SPADE when both challenging factors are simultaneously present. Finally, we show that SPADE can handle even more types of documents by evaluating the model on a form understanding dataset, FUNSD (Jaume et al., 2019). Table 3 summarizes the performance of several baseline models and SPADE on various semi-structured document information extraction tasks.
Handling hierarchical structure in documents CORD consists of receipt images without creases or warping. SPADE initially achieves 91.5% and 87.4% F1 with and without the oracle (ground truth OCR results), respectively (Table 3, 1st row, co). The corresponding dependency parsing scores are also shown in Table 7 in the Appendix (1st panel, co). To push the performance further, we note that individual text nodes have a single incoming edge for each relation, except in special documents like tables (Fig. 1). Using this property, we integrate the Tail Collision Avoidance algorithm (tca), which iteratively trims the tail-sharing edges and generates new edges until the process becomes self-consistent (Section 4.2). F1 increases by 1.0% and 0.8% with and without the oracle upon the integration (2nd row, co).
Importance of generating hierarchical structure in receipt IE In the receipt IE task, the inter-grouping of fields is critical because the same field types, such as menu name and price, appear multiple times (Fig. 3a). Without the field grouping, the maximum achievable score is 58.1 F1 (Table 3, 6th row, UB-flat). Generating hierarchical parses from semi-structured documents is relatively new, and thus a direct comparison to previous state-of-the-art methods is not feasible without considerable modification. General confidentiality issues related to industrial documents and the multilingual properties of our task also hinder the comparison. We therefore build our own baselines consisting of a manually engineered serializer and BERT-based double IOB taggers (s+bert+iob2; see footnote 5).

BERT-tagger
The serializer generates pseudo-1D text from the input tokens distributed in 2D and groups them line by line based on their height differences. BERT+iob2 predicts the boundaries between the fields and between the groups of fields (see Section 3.2 for details). On CORD, s+bert+iob2 shows comparable performance to SPADE with the oracle (-0.1 F1) but shows +1.9 F1 on the test set (2nd and 3rd rows, co). The relatively lower score of SPADE on the test set may originate from the small size of the training set (800, Table 2), as SPADE needs to handle the text serialization in a data-driven way. Indeed, when both models are trained using Receipt-idn, which consists of 9508 training examples, SPADE outperforms the baseline by +1.0 F1 on the test set (2nd and 3rd rows, Receipt-idn).
Inflexibility of tagging model in handling complex spatial relationships Next, we prepare CORD+ and CORD++, which are more challenging setups where the images are warped or tilted, as often seen in real-world applications (Fig. 3). SPADE significantly outperforms s+bert+iob2 (+13.4% F1 in CORD+, +31.1% F1 in CORD++). This is due to serialization failures in s+bert+iob2 that result in line-mixing (Fig. 3b, c and Figs. 5 and 6 in the Appendix). To understand how much improvement can be achieved through further manual engineering, we prepare s_adv+bert+iob2, which is equipped with an advanced serializer in which polynomial fitting is employed to group tokens placed on curved lines. The results show that, although there is a large improvement on the CORD+ and CORD++ tasks compared to s+bert+iob2, SPADE still performs better (+2.0% in CORD+, +18.3% in CORD++, 1st and 4th rows). This shows the limitation of serializer-based methods: they cannot be easily generalized to handle document images in the wild, and their performance can be bottlenecked by the serialization step regardless of how advanced the tagging models are. The competent performance of SPADE on CORD-M, a dataset generated by concatenating two receipt images from CORD into a single image (Fig. 4 in the Appendix), further highlights the flexibility of SPADE.
Footnote 5: S stands for the serializer.

Handling documents having complex layout
We further evaluate SPADE on the name card IE task. Unlike receipts, no inter-grouping between fields is necessary for name card IE. However, name cards often have a complex layout, such as non-horizontal alignment of text or multiple columns, even without tilting and warping (Fig. 1d). Our model achieves +1.1% F1 compared to s_adv+bert+iob2 on the test set (Table 3, nc).
Handling documents having both hierarchical structure and complex layout To fully explore the capability of SPADE, we further evaluate the model on the invoice IE task. Typical invoices have a hierarchical structure where some fields need to be grouped together, such as the item name, count, and price that correspond to the same item. In addition, invoices also have a relatively complex layout, with multiple tables or columns. SPADE achieves +1.9 F1 compared to s_adv+bert+iob2 (Table 3, inv).
Handling general documents To see whether SPADE can handle more general kinds of documents, we use the FUNSD form understanding dataset (Jaume et al., 2019), where document IE is performed under a more abstract setting by finding general key-value pairs and their inter-grouping (Section A.1.6). The performance is measured on two OCR-independent subtasks (Jaume et al., 2019): (1) "entity labeling (ELB)", which predicts the information category of the serialized words, and (2) "entity linking (ELK)", which measures the score for key-value pair link prediction. The evaluation reveals that SPADE achieves the state of the art on ELK, outperforming the previous baseline by 37.3% F1 (Table 4, rightmost column). In ELB, SPADE achieves an 11.5% absolute F1 improvement over the BERT-Base tagger. Both models use BERT-Base as a backbone. Although the F1 scores of LayoutLM are higher than those of our model, its contributions are orthogonal to ours since it focuses on building a better pretrained model. Also, it cannot perform ELK. We emphasize that SPADE solves the three subtasks (ELB, ELK, and word serialization) simultaneously, while other tagger models need perfectly serialized input text and solve only entity labeling. The stable performance of SPADE on randomly rotated documents (ELB-R) or shuffled tokens (ELB-S) supports this, highlighting the merit of the serializer-free architecture.
Table 4: F1 scores for two FUNSD subtasks: entity labeling (ELB, ELB-R, and ELB-S) and entity linking (ELK). "Need S" means the input tokens should be serialized. "# of D" indicates the number of documents used for layout pretraining.
Ablation study We probe the role of each component of SPADE via an ablation study (Table 5). The performance drops dramatically upon the removal of the relative coordinate information of tokens in the self-attention layer, highlighting its importance in the serializer-free encoder (2nd row). When the absolute coordinates are used in the input instead of the relative coordinates, F1 drops by 6.9% (3rd row). Finally, a 2.6% drop in F1 is observed upon the removal of the data augmentation during training (4th row).
Table 5: Ablation study.
Model                                                   F1
SPADE†                                                  84.5
(-) relative coordinate                                 10.5 (-74.0)
(-) relative coordinate (+) absolute coordinate         78.6 (-6.9)
(-) data augmentation                                   81.9 (-2.6)
† Five encoder layers are used for computational efficiency.

Conclusion
We present SPADE, a spatial dependency parser that can extract highly structured information from documents that have complex layouts. By formulating document IE as a spatial dependency graph construction problem, we provide a powerful unified framework that can extract hierarchical information without feature engineering. We empirically demonstrate the effectiveness of our model on various real-world documents (receipts, name cards, and invoices) and in a popular form understanding task.

A Appendices
A.1 Dataset

A.1.1 Dataset collection
The internal datasets Receipt-idn, namecard, and Invoice are annotated by crowd workers through an in-house web application, following Hwang et al. (2019). First, each text segment is labeled (bounding box and the characters inside) for the OCR task. The text segments are further grouped according to their field types by the crowd workers. For Receipt-idn and Invoice, additional group-ids are annotated for each field to represent their inter-grouping. The text segments placed on the same line are also annotated through row-ids. For quality assurance, the labeled documents are cross-inspected by the crowd workers.

A.1.2 CORD, CORD+, CORD++, and CORD-M for receipt IE
CORD and its variants consist of 30 information categories such as menu name, count, unit price, price, and total price (Table 6). The fields are further grouped and form the information layer at a higher level.

A.1.3 Receipt-idn for receipt IE
Receipt-idn is similar to CORD but includes a more diverse set of information categories (50), such as store name, store address, and payment time (Table 6).

A.1.4 namecard for name card IE
namecard consists of 12 field types, including name, company name, position, and address (Table 6). The task requires grouping and ordering of tokens for each field. Although there is only a single information layer (field), careful handling of complex spatial relations is required due to the large degree of freedom in the layout.

A.1.5 Invoice for invoice IE
Invoice consists of 62 information categories such as item name, count, price with tax, item price without tax, total price, invoice number, invoice date, vendor name, and vendor address (Table 6). Similar to receipts, their hierarchical information is represented via inter-field grouping.

A.1.6 FUNSD for general form understanding
The FUNSD form understanding task consists of two subtasks: entity labeling (ELB) and entity linking (ELK). In ELB, tokens are classified into one of four fields (header, question, answer, and other) while the tokens within each field are serialized. Both subtasks assume that the input tokens are perfectly serialized with no OCR error. To emphasize the importance of correct serialization in the real world, we prepare two variants of the ELB task: ELB-R and ELB-S. In ELB-R, the whole documents are randomly rotated by an angle between -20° and +20°, and the input tokens are serialized using the rotated y-coordinates. In the ELB-S task, the input tokens are randomly shuffled. In both tasks, the relative order of the input tokens within each field remains unchanged. In the ELK task, tokens are linked based on their key-value relations (inter-grouping between fields). For example, each "header" is linked to the corresponding "question", and each "question" is paired with the corresponding "answer".

A.2 Evaluation metric
During the calculation of F1 for parses, differences between the prediction and the ground truth are not counted for the store name, menu name, and item name fields in receipts and invoices when the edit distance (ED) is less than 2 or when ED/gt-string-length ≤ 0.4. Also, in Japanese documents, white spaces are ignored.
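This tolerance rule can be sketched with a plain Levenshtein distance. The text does not say which edit distance variant is used, so unit-cost insertions, deletions, and substitutions are an assumption here, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def tolerant_match(pred, gt, max_ed=2, max_ratio=0.4):
    """Treat pred as matching gt if ED < max_ed or ED / len(gt) <= max_ratio."""
    ed = edit_distance(pred, gt)
    if ed < max_ed:
        return True
    return bool(gt) and ed / len(gt) <= max_ratio


if __name__ == "__main__":
    print(tolerant_match("DEEP COFFE", "DEEP COFFEE"))
    print(tolerant_match("tea", "coffee"))
```

The first condition forgives single-character OCR slips in short strings, while the ratio condition scales the tolerance with the length of the ground-truth string.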
In the FUNSD form understanding task, we measure entity labeling (ELB) and entity linking (ELK) scores following Jaume et al. (2019). ELB measures the field classification accuracy of already "perfectly" serialized tokens of each field (word group), whereas ELK measures the inter-grouping accuracy between word groups. As SPADE performs both the serialization of the fields and the grouping between fields simultaneously, we do not feed the serialized tokens into SPADE but only use the oracle information to indicate the first text node of each field in the predicted graph. These text nodes effectively represent entire fields and are used for the evaluation.
A.3 The score for the dependency relation prediction