ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction

Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) on modeling the spatial layout patterns of documents, they have limited ability to capture the reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to capture the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word given word-level graph connectivity. We study two fundamental document entity extraction tasks, word labeling and word grouping, on the public FUNSD dataset and a large-scale payment dataset, and show that ROPE consistently improves existing GCNs by up to 8.4% in F1-score.


Introduction
Key information extraction from form-like documents is one of the fundamental tasks of document understanding and has many real-world applications. However, the major challenge of the task lies in modeling the varied template layouts and formats of documents. For example, a single document may contain multiple columns, tables, and non-aligned blocks of text (e.g., Figure 1).
The task has been studied from rule-based models (Lebourgeois et al., 1992) to learning-based approaches (Palm et al., 2017; Tata et al., 2021). Inspired by the success of sequence tagging in NLP (Sutskever et al., 2014; Vaswani et al., 2017; Devlin et al., 2019), a natural extension is to apply these methods to linearly serialized 2D documents (Palm et al., 2017; Aggarwal et al., 2020). Nevertheless, scattered columns, tables, and text blocks in documents make the serialization extremely difficult, largely limiting the performance of sequence models. Katti et al. (2018) explore working directly in the 2D document space using grid-like convolutional models to better preserve spatial context during learning, but the performance is restricted by the resolution of the grids. Recently, Qian et al. (2019); Davis et al. (2019); Liu et al. (2019) propose to represent documents using graphs, where nodes define word tokens and edges describe the spatial patterns of words, showing state-of-the-art performance of Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015) on document understanding.

* Work done while an intern at Google Research.
Although GCNs capture the relative spatial relationships between words through edges, the specific word ordering information is lost during the graph aggregation operation, similar to average pooling in Convolutional Neural Networks (CNNs). However, we believe reading order is a strong prior for comprehending language. In this work, we propose a simple yet effective Reading Order Equivariant Positional Encoding (ROPE) that embeds relative reading order context into graphs, bridging the gap between sequence and graph models for robust document understanding. Specifically, for every word in a constructed graph, ROPE generates unique reading order codes for its neighboring words based on the graph connectivity. The codes are then fed into GCNs with self-attention aggregation functions for effective relative reading order encoding. We study two fundamental entity extraction tasks, word labeling and word grouping, on the public FUNSD dataset and a large-scale payment dataset. We observe that by explicitly encoding relative reading orders, ROPE brings improvements on par with or greater than those from the spatial relationship features in existing GCNs, and the two are complementary.

Other Related Work
Attention models show state-of-the-art results on graph learning (Veličković et al., 2018) and NLP benchmarks (Vaswani et al., 2017). As attention models with positional encodings are proven to be universal approximators of sequence-to-sequence functions (Yun et al., 2020), encoding positions or orderings is an important research topic. For sequences, learned positional embeddings (Gehring et al., 2017; Devlin et al., 2019; Shaw et al., 2018) and sinusoidal functions and their extensions (Liu et al., 2020) have been studied. Beyond that, positional encodings have been explored for graphs (You et al., 2019), 2D images (Parmar et al., 2018), and 3D structures (Fuchs et al., 2020). Lastly, graph modeling is also applied to other document understanding tasks, including document classification (Yao et al., 2019) and summarization (Yasunaga et al., 2017).

Method
We follow recent advances in using GCNs for document information extraction, which relax the serialization assumptions made by sequence modeling. GCNs take inputs (word tokens in this case) of arbitrary number, size, shape, and location, and encode the underlying spatial layout patterns of documents through direct message passing and gradient updates between input embeddings in the 2D space.
Node definition. Given a document D with N tokens denoted by T = {t_1, t_2, ..., t_N}, we refer to t_i as the i-th token in the linearly serialized text sequence returned by the Optical Character Recognition (OCR) engine. The OCR engine generates the bounding box size and location for each token, as well as the text within each box. We define the node input representation for all tokens T as vertices V, where v_i concatenates the quantifiable attributes available for t_i. In our design, we use two common input modalities: (a) word embeddings from an off-the-shelf pre-trained BERT model (Devlin et al., 2019), and (b) spatial embeddings from the normalized bounding box height, width, and Cartesian coordinates of the four corners.
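As an illustrative sketch (not our exact implementation, and the feature ordering is an assumption), the spatial part of a node representation can be computed from the normalized box geometry and concatenated with a precomputed word embedding; `node_spatial_features` and `node_representation` are hypothetical helper names:

```python
def node_spatial_features(box, page_w, page_h):
    """Normalized spatial features for one token bounding box.

    box is (x0, y0, x1, y1) in pixels, top-left and bottom-right corners.
    Returns height, width, and all four corner coordinates, each
    normalized by the page size.
    """
    x0, y0, x1, y1 = box
    h = (y1 - y0) / page_h
    w = (x1 - x0) / page_w
    corners = [x0 / page_w, y0 / page_h, x1 / page_w, y0 / page_h,
               x0 / page_w, y1 / page_h, x1 / page_w, y1 / page_h]
    return [h, w] + corners


def node_representation(word_embedding, box, page_w, page_h):
    # v_i concatenates the text modality (e.g. a BERT embedding,
    # precomputed elsewhere) with the spatial modality.
    return list(word_embedding) + node_spatial_features(box, page_w, page_h)
```

The word embedding is treated as an opaque vector here; in practice it would come from the frozen BERT encoder.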
Edge definition. While the vertices V represent tokens in a document, the edges characterize the relationships between the vertices. Precisely, we define a set of directional edges E, where each edge e_ij connects two vertices v_i and v_j and concatenates quantifiable edge attributes. In our design, we use two input modalities for an edge e_ij: (a) spatial embeddings from the horizontal and vertical normalized relative distances between the centers, top-left corners, and bottom-right corners of the bounding boxes, together with the height-width aspect ratios of v_i and v_j and the relative aspect ratio between v_i and v_j; (b) visual embeddings that utilize an ImageNet pre-trained MobileNetV3 (Howard et al., 2019) to capture visual cues such as colors, fonts, and separating symbols or lines between the two token bounding boxes (through their union bounding box). We refer to the spatial embedding in (a) as the edge geometric (EdgeGeo) feature in the experimental section.
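The EdgeGeo feature in (a) can be sketched as follows; the exact feature order and normalization in our implementation may differ, so `edge_geo_features` below is only an illustrative layout:

```python
def edge_geo_features(box_i, box_j, page_w, page_h):
    """Sketch of the EdgeGeo feature for an edge v_i -> v_j.

    Boxes are (x0, y0, x1, y1) in pixels. Produces normalized relative
    offsets between centers, top-left and bottom-right corners, plus
    aspect ratios of the two boxes and their relative aspect ratio.
    """
    def center(b):
        return (b[0] + b[2]) / 2, (b[1] + b[3]) / 2

    def aspect(b):
        return (b[3] - b[1]) / (b[2] - b[0])  # height / width

    (cxi, cyi), (cxj, cyj) = center(box_i), center(box_j)
    feats = [
        (cxj - cxi) / page_w, (cyj - cyi) / page_h,                      # centers
        (box_j[0] - box_i[0]) / page_w, (box_j[1] - box_i[1]) / page_h,  # top-left
        (box_j[2] - box_i[2]) / page_w, (box_j[3] - box_i[3]) / page_h,  # bottom-right
    ]
    feats += [aspect(box_i), aspect(box_j), aspect(box_i) / aspect(box_j)]
    return feats
```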
Graph construction. Our implementation is based on the β-skeleton graph (Kirkpatrick and Radke, 1985) with β = 1 for graph construction. By using the "ball-of-sight" strategy, β-skeleton graph offers high connectivity between word vertices for necessary message passing while being much sparser than fully-connected graphs for efficient forward and backward computations (Wang et al., 2021). A β-skeleton graph example can be found in Figure 2, and more can be found in Figure 5 in the Appendix.
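With β = 1, the β-skeleton coincides with the Gabriel graph: two vertices are connected iff no third vertex lies inside the circle whose diameter is the segment between them (the "ball of sight"). A minimal brute-force sketch over box centers (our implementation may be more efficient):

```python
def gabriel_edges(points):
    """β-skeleton with β = 1 (Gabriel graph) over 2D points.

    Connect i and j iff no other point falls strictly inside the circle
    whose diameter is the segment from points[i] to points[j].
    Brute force, O(n^3); fine for illustration.
    """
    edges = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            (xi, yi), (xj, yj) = points[i], points[j]
            cx, cy = (xi + xj) / 2, (yi + yj) / 2            # circle center
            r2 = ((xi - xj) ** 2 + (yi - yj) ** 2) / 4       # squared radius
            if all((px - cx) ** 2 + (py - cy) ** 2 >= r2
                   for k, (px, py) in enumerate(points) if k not in (i, j)):
                edges.append((i, j))
    return edges
```

In a document graph, `points` would be the token bounding-box centers, and each undirected edge would typically be expanded into two directional edges.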
Aggregation function. Inspired by Graph Attention Networks (Veličković et al., 2018) and the Transformer (Vaswani et al., 2017), we use a multi-head self-attention module as our GCN aggregation (pooling) function. It calculates the importance of each individual message coming from a node's neighbors to generate the new aggregated output.
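A simplified single-head version of this pooling (the actual model uses 4 heads and learned projections, omitted here for brevity) can be sketched as:

```python
import math


def attention_aggregate(query, messages):
    """Single-head dot-product attention pooling over neighbor messages.

    Each incoming message is weighted by the softmax-normalized
    similarity between the target node's query vector and the message,
    then the messages are averaged with those weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, msg)) / scale for msg in messages]
    mx = max(scores)                                  # stable softmax
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(messages[0])
    return [sum(w * msg[d] for w, msg in zip(weights, messages))
            for d in range(dim)]
```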

Reading Order Equivariant Positional Encoding (ROPE)
Positional encoding (Gehring et al., 2017) in sequence models assumes that the input is perfectly serialized. However, as illustrated in Figure 1, form-like documents often contain multiple columns or sections. The simple left-to-right, top-to-bottom serialization commonly provided by OCR engines does not produce an accurate sequential presentation of words: two consecutive words in the same sentence might receive drastically different reading order indexes under naive serialization. Instead of assigning absolute reading order indexes to the entire document at the beginning, we propose to encode the relative reading order context of neighboring words w.r.t. the target word based on the given graph connectivity. Figure 3 demonstrates the process: ROPE iterates through the neighboring word vertices in the original reading order and assigns new ROPE codes p ∈ N (red numbers) to the neighbors, starting from zero. The generated codes are then appended to the corresponding incoming messages during graph message passing. Hence, ROPE provides a relative reading order context of the neighborhood for order-aware self-attention pooling.
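The code assignment above amounts to ranking a target word's neighbors by their original serialized positions; a minimal sketch (with `rope_codes` as a hypothetical helper name):

```python
def rope_codes(neighbors, reading_order):
    """Assign ROPE codes 0, 1, 2, ... to a target node's neighbors.

    neighbors: node ids of the target's graph neighbors.
    reading_order: dict mapping node id -> position in the OCR-provided
    serialization. Codes are the ranks of the neighbors within that
    serialization, so they depend only on relative order.
    """
    ranked = sorted(neighbors, key=lambda n: reading_order[n])
    return {n: code for code, n in enumerate(ranked)}
```

Because only the relative ranks among the neighbors matter, shifting every serialized index by a constant leaves the codes unchanged, which is the equivariance property discussed next.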
Note that the generated ROPE codes remain unchanged if the neighbors and the target shift equally in the document with the same relative order, therefore being equivariant. Additionally, ROPE provides robust sequential output that is consistent even when the neighborhood crosses multiple columns or sections in a document.
Finally, we also explore sinusoidal encoding matrix (Vaswani et al., 2017) besides the index-based encoding. Our ablation study in Section 4 shows that using both results in the best performance.
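A sinusoidal encoding of an integer ROPE code, in the spirit of Vaswani et al. (2017), can be sketched as follows; the 3 base frequencies match the ablation in Section 4, but the exact frequency schedule here is an assumption:

```python
import math


def sinusoidal_code(p, num_freqs=3):
    """Encode an integer ROPE code p with num_freqs sine/cosine pairs."""
    feats = []
    for k in range(num_freqs):
        freq = 1.0 / (10000 ** (k / num_freqs))  # geometric frequency schedule
        feats += [math.sin(p * freq), math.cos(p * freq)]
    return feats
```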

Experiments
We evaluate how reading order impacts the overall performance of graph-based information extraction from form-like documents. Following Jaume et al. (2019), we adopt two form understanding tasks: word labeling and word grouping. Word labeling is the task of assigning each word a label from a set of predefined entity categories, realized by node classification. Word grouping is the task of aggregating words that belong to the same entity, realized by edge classification. These two fundamental entity extraction tasks do not rely on perfect entity word groupings provided by the dataset, and therefore help isolate the modeling capability contributed by the proposed ROPE in practice. They also directly reflect the quality of the node and edge embeddings of the proposed graph architecture, decoupling any performance gain from the sophisticated Conditional Random Field (CRF) decoders often used on top of such models.

Datasets
Payment. We follow Majumder et al. (2020) to prepare a large-scale payment document collection consisting of around 18K single-page payments. The data come from different vendors with different layout templates. For both the word labeling and word grouping experiments, we use an 80-20 split of the corpus for the training and test sets.
We use a public OCR service (cloud.google.com/vision) to extract words from the payment documents. The service generates the text of each word along with its corresponding 2D bounding box. The word boxes are roughly arranged in order from left to right and from top to bottom. We then ask human annotators to label the words with 13 semantic entity types. Each ground-truth entity is described by an entity type and a list of words generated by the OCR engine, resulting in over 3M word-level annotations. Labelers are instructed to label all instances of a field in a document, so our GCNs are trained to predict all instances of a field as well.
FUNSD. FUNSD (Jaume et al., 2019) is a public dataset for form understanding in noisy scanned documents, containing a collection of research, marketing, and advertising documents that vary widely in their structure and appearance. The dataset consists of 199 annotated forms with 9,707 entities and 31,485 word-level annotations for 4 entity types: header, question, answer, and other. For both word labeling and word grouping experiments, we use the official 75-25 split for the training and test sets.

Experimental Setup
All GCN variants used in the experiments have the same architecture: the node update function is a 2-layer Multi-Layer Perceptron (MLP) with 128 hidden nodes. The aggregation function uses a 3-layer multi-head self-attention pooling with 4 heads and a head size of 32. The number of hops in the GCN is set to 7 for the payment dataset and 2 for the FUNSD dataset, due to the complexity and scale of the former. We use cross-entropy loss for both the multi-class word labeling and binary word grouping tasks. We train the models from scratch using the Adam optimizer with a batch size of 1. The learning rate is set to 0.0001 with a warm-up proportion of 0.01. Training is conducted on 8 Tesla P100 GPUs for approximately 1 day on the largest corpus.

Results
We train the GCNs from scratch on all datasets. For word labeling we use the multi-class node classification F1-score as the metric; for word grouping we use the binary edge classification F1-score, with the corresponding precision and recall values.
Importance of reading order. Positional encoding mechanisms are key to exploiting the layout patterns of words: Answer entities are usually next to or below Question entities. Existing GCN approaches rely on edge geometric (EdgeGeo) features to capture such spatial relationships between words in 2D space. Here we evaluate the importance of the proposed reading order encoding ROPE in various combinations with EdgeGeo over the baseline GCN (Qian et al., 2019), as summarized in Table 1. Without any positional encoding, word labeling F1 drops by 13.75 points and word grouping F1 drops by 2.84 points on the payment dataset. When we pass ROPE to the incoming messages, the drop is reduced to 6.38 points on word labeling and 0.76 points on word grouping. A similar trend can be observed on FUNSD. Surprisingly, ROPE reduces the performance drop more effectively than EdgeGeo on the larger payment dataset. Given these ablations, we conclude that reading order information is at least as important as geometric features, and the two bring orthogonal improvements to overall performance.
Reading order encoding function. In practice, each target word usually has fewer than 8 neighboring words in a constructed β-skeleton graph. A natural approach to assigning relative reading orders is therefore to simply use the ROPE-encoded indexes. In Table 2 we observe that simple index encoding immediately improves the GCN without ROPE by 6.32 points on word labeling and 1.59 points on word grouping on the payment corpus. Next we explore the popular sinusoidal function (with 3 base frequencies) for reading order encoding. It improves the GCN without ROPE by 4.85 points on word labeling and 0.72 points on word grouping. Interestingly, the sinusoidal function provides on-par performance but does not outperform index encoding, likely because the β-skeleton graph does not generate an extremely large number of neighbors, so simple index encoding is sufficient.
Sensitivity to OCR reading order. We investigate the robustness of ROPE to the quality of the input reading order. We shuffle the reading order provided by the OCR engine for a varying percentage of words before feeding it into ROPE. Figure 4 shows the results. For both word labeling and word grouping, ROPE provides a performance improvement when less than 30% of the word order is shuffled on the large payment corpus. With 30% or more of the word order shuffled, we observe less performance degradation on word labeling, suggesting that word grouping is more sensitive to the original OCR reading order.

Conclusion
We present a simple and intuitive reading order encoding method, ROPE, that is equivariant to relative reading order shifts. It brings the effective positional encoding of sequence models into graphs while leveraging their existing spatial layout modeling capability. We foresee that the proposed ROPE is immediately applicable to other document understanding tasks.

Figure 5: β-skeleton examples of documents from FUNSD. By using the "ball-of-sight" strategy, the β-skeleton graph offers high connectivity between word vertices for necessary message passing while being much sparser than fully-connected graphs for efficient forward and backward computation.

Figure 6: Sample output of the word grouping task on FUNSD with a few failure cases.