Merge and Recognize: A Geometry and 2D Context Aware Graph Model for Named Entity Recognition from Visual Documents

Named entity recognition (NER) from visual documents, such as invoices, receipts or business cards, is a critical task for visual document understanding. Most classical approaches use a sequence-based model (typically the BiLSTM-CRF framework) without considering document structure. Recent graph-based models that use graph convolutional networks to encode visual and textual features have achieved promising performance on the task. However, few attempts take the geometry information of text segments (text in a bounding box) in visual documents into account. Moreover, existing methods do not consider that, in many real-world situations, related text segments need to be merged to form a complete entity. In this paper, we present GraphNEMR, a graph-based model that uses graph convolutional networks to jointly merge text segments and recognize named entities. By incorporating geometry information from visual documents into our model, richer 2D context information is generated to improve document representations. To merge text segments, we introduce a novel mechanism that captures both geometry information and semantic information based on a pre-trained language model. Experimental results show that the proposed GraphNEMR model outperforms both sequence-based and graph-based SOTA methods significantly.


Introduction
With the rapid progress in natural language processing and computer vision, visual documents have become a mainstream medium for expressing abundant information. Extracting the named entities from these visual documents is a critical step for further understanding. From the perspective of traditional natural language processing, the layout and format information of documents is discarded: researchers focus only on plain text consisting of sequential words, from which entities are extracted. However, visual documents consist of many discrete text segments with a large variety of layouts and formats, as Figure 1 shows. The combination of different text segments at different positions may represent different semantic information. Without the layout structure and the 2D semantic context it carries, named entity recognition in visual documents becomes much harder.
Named entity recognition in visual documents is a newly proposed and under-researched task. Recently, in an attempt to make use of the structure information of visual documents, many works (Palm et al., 2017; Yang et al., 2017; Katti et al., 2018; Liu et al., 2018; Qian et al., 2019; Denk and Reisswig, 2019; Zhao et al., 2019) have designed NLP-based, CV-based or joint NLP/CV-based methods for visual-document-related tasks. These approaches mostly use the coordinates of text segments to build features or learn embeddings. However, most of these methods do not consider the following two problems: 1. Existing models often ignore the geometric information between text segments, which is crucial for constructing the 2D context needed to extract named entities from visual documents. It is hard to analyze the semantic meaning of a text segment only through the plain text inside its bounding box and its coordinates. 2. Because of the layout design, a complete named entity may be split into several segments, none of which represents its full meaning. Moreover, some text segments may lose the semantic information of the segments that precede them and receive incorrect tags. It is therefore necessary to merge text segments into a complete named entity.
Specifically, for the first problem, as illustrated in Figure 1, it is hard to tell whether "会锦" ("HuiJin") in text segment 1 is a named entity from its own plain text alone. However, from some of the nearest text segments, "秘书长" ("Secretary General") in text segment 2 and "创始人" ("The Founder") in text segment 3, a human can infer that text segment 1 is a person entity.
For the second problem, as Figure 1 shows, neither text segment 4 nor text segment 6 is a complete named entity. However, the two segments together represent a complete location entity, namely 6/F, Middle Block, Huijin World Trade Center, at the intersection of Fengqi Road and Yan'an Road, Xihu District, Hangzhou, Zhejiang, China. Clearly, the ability to merge text segments into a complete entity is important for NER from visual documents.
To solve the above two problems, in this paper we propose GraphNEMR, an end-to-end joint graph neural model that recognizes named entities and merges tokens into named entities in visual documents. GraphNEMR combines geometric information with semantics to automatically extract non-sequential, context-aware hidden features for each text segment in a visual document. Specifically, we regard a visual document as a graph whose nodes are the text segments in it. The geometric information is represented by the adjacency matrix of the graph. Within each text segment, a BiLSTM is used to sequentially encode tokens into semantic features. Then a graph convolutional network (GCN) (Kipf and Welling, 2017) encoder integrates information between neighboring nodes to learn the final representation of each text segment. These representations are used as the inputs of our proposed merge module, which decides the relation between text segments. After the GCN encoder and the merge module, an LSTM+CRF decoder is used to produce the named entity tags. The main contributions of this paper are as follows:
• To address the visual text merge problem, we propose a general method that captures both geometry information and semantic information. To the best of our knowledge, our approach is the first work to merge tokens in different text segments into complete named entities in visual documents.
• We propose the 8-geometry neighbors relation for each text bounding box in a visual document to represent geometry information in the merge layer. In addition, we design a geometry-distance-related adjacency matrix for graph representation with the GCN.
• We propose a loss called $Loss_{nsp/sop}$ for semantically supervising the merging of text segments. Furthermore, the merging results give us the correct prefix semantic information for each text segment.
Extensive experiments show that our model outperforms both the strong sequence-based baseline model (BiLSTM-CRF) and the SOTA graph-based model (GraphIE) on the visual document named entity recognition task.

Related Work
Our model builds on recent research on information extraction in visual documents and on nested named entity recognition.
Recently, there has been a lot of interest in the task of information extraction in visual documents. Most of these works combine approaches from NLP, CV and document analysis. Lample et al. (2016) propose the BiLSTM-CRF model, which is a strong and widely used baseline for NER. Many researchers apply BiLSTM-CRF directly to visual information extraction without considering structure information. Palm et al. (2017) use a sequential recurrent neural network (RNN) to extract key-value information from invoices. Their work shows the ability of neural network approaches to extract information from visual documents. However, their RNN model also treats documents as sequential text. Yang et al. (2017) consider document information extraction as a pixel-wise segmentation task and apply an end-to-end multimodal network to visual documents. Their experiments show that textual features help layout segmentation. Katti et al. (2018) try to preserve the 2D layout of visual documents by incorporating the coordinates of characters for information extraction from invoices. Zhao et al. (2019) also find that spatial information plays an intrinsic role in key information extraction from documents. Denk and Reisswig (2019) extend the work of Katti et al. (2018) to incorporate contextualized embeddings from the BERT language model, and Xu et al. (2019) propose the pre-trained LayoutLM, which takes the text and coordinates of text segments as inputs. They show the effectiveness of applying a pre-trained language model to invoice information extraction. Based on graph convolutional networks (GCN), Qian et al. (2019) and Liu et al. (2018) introduce graph-based models that integrate textual and position attributes (i.e., coordinates, font size) for the visual information extraction task. They show that graph-based models outperform the sequence-based baseline BiLSTM-CRF and confirm the benefits of using layout structure in visual information extraction.
The task of nested NER (Finkel and Manning, 2009) focuses on recognizing entities that can be nested within each other. It can be considered a problem related to ours of merging text segments in visual documents. Recently, a number of approaches have been proposed for nested NER (Ju et al., 2018; Wang and Lu, 2018; Fisher and Vlachos, 2019). Specifically, Fisher and Vlachos (2019) decompose nested NER into two stages that first merge tokens into entities and then perform recognition. However, in contrast to our token merging problem, their approach is applied to serialized 1D text instead of visual documents.
As we can see, 2D layout features are crucial for most existing work on information extraction from visual documents. However, these models simply add position features (i.e., absolute/relative coordinates) and ignore the geometry of the neighbourhood and geometry distance information. Simply combining position features with neural models may help little with the 2D semantic context, as the layout varies a lot across documents. Moreover, in many cases text segments should be merged to represent a complete named entity. We are thus motivated to look into the relative geometry of the neighbourhood, to explore how to integrate geometry distance to build a better 2D semantic context for each text segment, and to study how to merge text segments into complete named entities in visual documents.

Overview
The text segments of a visual document (the characters in each text segment and the bounding box coordinates of the text segment) are acquired by an OCR engine. Mathematically, let a visual document be $D = (t_0, t_1, ..., t_n)$, where $t_i$ stands for a text segment and $n$ is the number of text segments in the visual document $D$. An overview of our proposed model is illustrated in Figure 2. Firstly, we model a visual document $D$ as a weighted fully connected graph, where each text segment $t_i$ is a node of the graph, and encode it with a graph convolutional network encoder into geometry- and 2D-context-aware hidden representations. Secondly, given these hidden representations, a merge layer is applied to infer the merge decisions for text segments. Lastly, we combine the graph hidden representations with merge information to reconstruct the

Geometry&2D Context Feature Encoder
$h^{(t_i)}_{1:k_i}$ denotes the hidden states. We then use the last hidden state $h^{(t_i)}_{k_i}$ to represent the text sequence. To encode basic 2D information, relative coordinates and relative text segment size are concatenated to the text embeddings. In the resulting hidden representation of node $t_i$, $x^{(t_i)}$ and $y^{(t_i)}$ are the x-coordinate and y-coordinate of the text segment respectively, $x_{min}$ and $y_{min}$ are the minimum x- and y-coordinates of the circumscribed square of all text segments' bounding boxes, and $s$ is the side length of the circumscribed square. Then, a graph convolution is applied to capture 2D context and geometry features from the input embeddings, which contain text and position information. Intuitively, from the perspective of 2D context, the closer two text segments are, the more relevant the information they carry for each other. Unlike existing GCN-based methods that use mean/max aggregation, we use the distance between text segments to weight the aggregation, in order to build a 2D context that takes geometry into account. For node $t_i$, our model computes new node features from the hidden features $g^{l}_{t_j} \in \mathbb{R}^{f}$ of the nodes $t_j$ at layer $l$, where $W^{l+1}$ and $b^{l+1}$ are learnable weights. Our model aggregates information from the neighbors of each node with the weight $\alpha_{ij}$, which depends on $d_{ij} = |t_i, t_j|$, the geometry distance between $t_i$ and $t_j$. Intuitively, the closer $t_i$ and $t_j$ are, the greater the value of $\alpha_{ij}$. Thus, with this weighted aggregation, the more relevant information of closer text segments can be encoded.
Our GCN layer propagates information between all connected nodes with distance-based weights to construct the 2D context, and the node embeddings consist of semantic and geometry information. After $L$ layers of graph convolution, each node $t_i$ therefore captures both 2D context and geometry information. So, given document $D$, after obtaining the hidden representation of each node from our encoder, we get a tensor $D_g$ of shape $[n, f]$.
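The distance-weighted aggregation described above can be illustrated with a minimal pure-Python sketch. The exact form of $\alpha_{ij}$ is not recoverable from the text here, so the sketch assumes a row-normalised exponential decay of the box-to-box distance; the function names, the ReLU activation, the inclusion of each node in its own neighbourhood and the `scale` normalisation are all assumptions made for illustration, not the model's actual definition.

```python
import math


def box_distance(a, b):
    """Smallest gap between two axis-aligned boxes (x0, y0, x1, y1).

    Returns 0.0 when the boxes touch or overlap.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def distance_weights(boxes):
    """Assumed weighting: alpha[i][j] is proportional to exp(-d_ij / scale).

    Each row sums to 1, and closer segments receive larger weights.
    """
    n = len(boxes)
    dist = [[box_distance(boxes[i], boxes[j]) for j in range(n)] for i in range(n)]
    scale = max(max(row) for row in dist) or 1.0  # avoid division by zero
    alpha = []
    for i in range(n):
        scores = [math.exp(-dist[i][j] / scale) for j in range(n)]
        total = sum(scores)
        alpha.append([s / total for s in scores])
    return alpha


def gcn_layer(features, alpha, weight, bias):
    """One layer: g_i' = ReLU(W * sum_j alpha_ij * g_j + b).

    `weight` has shape [f_out][f_in]; `bias` has length f_out.
    """
    n, f_in, f_out = len(features), len(features[0]), len(bias)
    out = []
    for i in range(n):
        # Distance-weighted sum over all nodes (fully connected graph).
        agg = [sum(alpha[i][j] * features[j][c] for j in range(n)) for c in range(f_in)]
        out.append([
            max(0.0, sum(weight[r][c] * agg[c] for c in range(f_in)) + bias[r])
            for r in range(f_out)
        ])
    return out


# Hypothetical boxes: two adjacent lines and one distant line.
boxes = [(0, 0, 10, 2), (0, 3, 10, 5), (0, 40, 10, 42)]
alpha = distance_weights(boxes)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(features, alpha, identity, [0.0, 0.0])
```

With these toy boxes, the first segment weights its adjacent neighbour more heavily than the distant one, which is the behaviour the weighted aggregation is meant to provide.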

Merge Layer
Figure 3: 8-geometry neighbors relation from text segment $i$ ($t_i$) to text segment $j$ ($t_j$)
The merge layer is responsible for merging text segments into a complete entity. Intuitively, a text segment can only be merged with the nearest text segments in its 8-geometry neighborhood. We obtain the relative position between two text segments from the 8-geometry neighbors. Given a text segment $t_i$ with bounding box area $p_{t_i}$, its 8-geometry neighbor areas are defined as a set $P_{t_i}$ whose elements represent the left-up, up, right-up, right, right-down, down, left-down and left areas of $t_i$ respectively in visual document $D$. Given another text segment $t_j$ with bounding box area $p_{t_j}$, the geometry position relation $p_{t_i:t_j}$ can be denoted by a 9-dimensional one-hot vector, where the first eight dimensions stand for the 8-geometry neighbors and the last dimension represents the self-area of the given text segment. Since $p_{t_j}$ may intersect with more than one area in $P_{t_i}$, we choose the area that has the largest intersection with $p_{t_j}$. For example, in Figure 3, the intersection between text segment $t_j$ and the down area of $t_i$ is larger than the others, so the relation from $t_i$ to $t_j$ is down.
Following the encoder, by tiling and expanding the dimensions of $D_g$ and adding $p_{t_i:t_j}$, we obtain $D_M \in \mathbb{R}^{n \times n \times (2f+9)}$, which denotes all node pair features, where $m_{ij} \in \mathbb{R}^{2f+9}$ is the concatenation of $g^{L}_{t_i}$, $p_{t_i:t_j}$ and $g^{L}_{t_j}$. Then, a fully-connected network with a sigmoid activation function is applied to learn a merge matrix $M_f \in \mathbb{R}^{n \times n}$, which represents whether two text segments should be merged and which text segment is the front segment. The merge decisions are trained using a cross entropy (CE) loss, where $M_f^{label}$ is the label of $M_f$ with binary values of 0 or 1. $M_k$ is also a binary matrix, where $M_k[i, j] = 1$ means that $t_j$ is one of the top $k$ nearest text segments in one of $t_i$'s 8-geometry neighbor areas.
By applying · (the dot product operation) between $M_f$ and $M_k$, only the top $k$ nearest text segments in each of the 8-geometry neighbor areas can be merged. During inference, $M_f[i, j] = 1$ means that text segment $t_i$ should be merged with $t_j$ and that $t_i$ is in front of $t_j$, whereas $M_f[i, j] = 0$ means that $t_i$ and $t_j$ should not be merged.
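The 8-geometry relation and the top-$k$ candidate mask can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: boxes are axis-aligned `(x0, y0, x1, y1)` tuples in image coordinates (y grows downward), the eight neighbor areas extend without bound from the edges of $t_i$'s box, distance is the gap between boxes, and segments whose largest overlap is the self-area are excluded from the mask. The names `relation_one_hot`, `top_k_mask` and `RELATIONS` are hypothetical.

```python
import math

RELATIONS = ["left-up", "up", "right-up", "right",
             "right-down", "down", "left-down", "left", "self"]


def _band_overlaps(lo, hi, a, b):
    """Lengths of interval [a, b] lying before lo, inside [lo, hi], after hi."""
    before = max(0.0, min(b, lo) - a)
    inside = max(0.0, min(b, hi) - max(a, lo))
    after = max(0.0, b - max(a, hi))
    return before, inside, after


def relation_one_hot(box_i, box_j):
    """9-dim one-hot relation from box_i to box_j (largest intersection wins)."""
    xl, xm, xr = _band_overlaps(box_i[0], box_i[2], box_j[0], box_j[2])
    yu, ym, yd = _band_overlaps(box_i[1], box_i[3], box_j[1], box_j[3])
    # Intersection area of box_j with each region, in RELATIONS order.
    areas = [xl * yu, xm * yu, xr * yu, xr * ym,
             xr * yd, xm * yd, xl * yd, xl * ym, xm * ym]
    best = max(range(9), key=lambda r: areas[r])
    return [1 if r == best else 0 for r in range(9)]


def box_distance(a, b):
    """Smallest gap between two axis-aligned boxes; 0.0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def top_k_mask(boxes, k):
    """mask[i][j] = 1 if j is among the k nearest segments in one of i's 8 areas."""
    n = len(boxes)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        by_region = {}
        for j in range(n):
            if j == i:
                continue
            region = relation_one_hot(boxes[i], boxes[j]).index(1)
            if region == 8:  # assumption: skip boxes that mostly overlap box i
                continue
            by_region.setdefault(region, []).append((box_distance(boxes[i], boxes[j]), j))
        for candidates in by_region.values():
            for _, j in sorted(candidates)[:k]:
                mask[i][j] = 1
    return mask


def apply_mask(merge_scores, mask):
    """Element-wise product that keeps only the allowed merge candidates."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(merge_scores, mask)]


# Hypothetical layout: two lines below segment 0 and one segment to its right.
boxes = [(10, 10, 50, 20), (12, 25, 60, 35), (12, 40, 60, 50), (70, 10, 90, 20)]
relation = RELATIONS[relation_one_hot(boxes[0], boxes[1]).index(1)]
mask = top_k_mask(boxes, k=1)
```

In this toy layout, segment 1 falls in the "down" area of segment 0, and with `k=1` only the nearest segment in each area remains a merge candidate.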
To leverage sequential language semantics, inspired by the next sentence prediction (NSP) training in BERT (Devlin et al., 2019) and the sentence-order prediction (SOP) training in ALBERT (Lan et al., 2019), we propose a loss called $Loss_{nsp/sop}$ for semantically supervising the merging of text segments, where $LM_{nsp/sop}$ is the pre-trained language model trained with NSP or SOP and $\times$ represents matrix multiplication. The language model's parameters are fixed during training. By performing matrix multiplication between $D$ and $M_f$, we obtain the pairs that our model expects to be merged in equation (8). The BERT model then takes these pairs as input, and by maximizing $Loss_{nsp/sop}$, the parameters of our model are updated so that the merge decisions agree with the language model's perspective.

NER Decoder
The final NER decoder performs named entity tagging. Its structure is a standard LSTM+CRF. However, unlike previous works that use LSTM+CRF for tagging, for every node in $D$ we use the front text segment information obtained from the merge layer as an initial state $Init^{(t_i)}$ for the LSTM+CRF, where $h^{t_i}_{lstm}$ is the hidden state of the LSTM and $|$ is the concatenation operation. $Init^{(t_i)}$ is the initial state of the LSTM; the front text segment needed to compute it is obtained by matrix multiplication between $D_g$ and $M_f$. Then, a conditional random field (CRF) is applied to perform entity tagging, and the objective function is optimized over the whole model. In this way, the geometry information and 2D context are encoded into our model's hidden layers, and with the merge layer and the final decoder, our model performs merging and recognition in visual documents.
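As a rough illustration of how the front segment can be looked up with a matrix product, the sketch below assumes a binary merge matrix in which `merge[i][j] == 1` means that segment `i` is in front of segment `j`, so the front-segment state of `j` is row `j` of the transposed merge matrix multiplied by the node states. The use of the transpose, the function name and the toy values are assumptions made for this example; the text only states that a matrix multiplication between $D_g$ and $M_f$ is used.

```python
def front_segment_states(merge, node_states):
    """Return, for each segment j, the state of the segment merged in front of it.

    Computes (merge^T x node_states)[j] = sum_i merge[i][j] * node_states[i].
    Segments without a front segment receive an all-zero vector.
    """
    n = len(node_states)
    f = len(node_states[0])
    return [
        [sum(merge[i][j] * node_states[i][c] for i in range(n)) for c in range(f)]
        for j in range(n)
    ]


# Hypothetical document with three segments; segment 0 precedes segment 1.
merge = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
node_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
init_states = front_segment_states(merge, node_states)
```

Here `init_states[1]` equals the state of segment 0, while segments 0 and 2, which have no front segment, start from zeros.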

Experiments
We first introduce the datasets for evaluating our proposed model. Then we describe baselines we compared with, the evaluation metrics and briefly explain our implementation details. Next, we show the results for two datasets. Finally, we demonstrate the improved effect of our model via ablation study.

Dataset
The aim of ICDAR 2019 SROIE Task 3 is to extract the text of several keys, namely company, address, date and total, from given receipts. The SROIE dataset consists of 1,000 scanned receipt images. Since the annotation of this task is incomplete and not well suited to our problem, we relabeled all the named entities in this dataset and obtained the text and corresponding bounding boxes from the ground truth OCR annotations of SROIE. We call the resulting dataset SROIE-VNER. Our goal is to extract all the named entities (location, organization, date) in SROIE's receipt images. We split the relabeled SROIE-VNER dataset into 70% for training and 30% for testing.
The BCD dataset consists of 13,498 real-world business card images, which makes it much larger than the SROIE-VNER dataset. The images were collected from user uploads. We obtain the text and corresponding bounding boxes with Alibaba's OCR API. Each character in the text is manually labelled with B/I/E named entity tags. Our goal is to extract all named entities (person, organization, location) in business card images. Business card styles usually differ between companies and show large layout variability, so the layouts of the images in the BCD dataset are more diverse than those in the SROIE-VNER dataset. 80% of the images in the BCD dataset are used as training data, and the remaining 20% are used for testing.

Baselines and Evaluation Metrics
We implement BiLSTM-CRF as a sequential tagger baseline, as many researchers do for information extraction from invoice/receipt images. Following Palm et al. (2017) and Liu et al. (2018), the text segments in a document are concatenated from left to right and from top to bottom, and the BiLSTM-CRF model is then applied to the concatenated document. We also compare our model to the graph-based tagging model GraphIE (Qian et al., 2019), which is probably the SOTA graph-based model in visual information extraction.
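The left-to-right, top-to-bottom concatenation used by the sequential baseline can be sketched as follows. This is a minimal illustration, not the baselines' actual preprocessing: the row grouping by vertical centre, the `row_tolerance` parameter and the function name `reading_order` are assumptions introduced here.

```python
def reading_order(segments, row_tolerance=5.0):
    """Order (text, (x0, y0, x1, y1)) segments top-to-bottom, then left-to-right.

    Segments whose vertical centres lie within `row_tolerance` of the first
    segment of a row are treated as belonging to that row.
    """
    def y_mid(seg):
        return (seg[1][1] + seg[1][3]) / 2

    rows = []  # list of (row anchor y, segments in that row)
    for seg in sorted(segments, key=y_mid):
        if rows and abs(y_mid(seg) - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(seg)
        else:
            rows.append((y_mid(seg), [seg]))
    return [seg[0] for _, row in rows for seg in sorted(row, key=lambda s: s[1][0])]


# Hypothetical OCR output: two segments on the first line, one on the second.
segments = [
    ("Road", (60, 10, 100, 20)),
    ("Fengqi", (10, 11, 55, 21)),
    ("Hangzhou", (10, 30, 80, 40)),
]
ordered_text = reading_order(segments)
```

Such a fixed ordering works for simple layouts but has no way to recover an entity whose segments are not adjacent in this reading order.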

Figure 4: Example text segments in a visual document
The evaluation metrics are the standard named entity recognition precision, recall and F1 score. However, even if the tags of every text segment are completely correct, the complete entity cannot be extracted correctly unless the order of the text segments is also correct. The traditional CoNLL NER evaluation method does not cover this problem. For example, in Figure 4, assume that the entity tags in text segments 1-4 are correct. It is still difficult to determine whether the order (1, 2, 3, 4) or the order (1, 3, 2, 4) is right, and without the right order we cannot extract the right complete named entity. To address this problem, we also evaluate the precision, recall and F1 score on complete entity recognition.
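A minimal sketch of entity-level scoring on complete entities is given below. It assumes that a predicted entity counts as correct only if both its type and its full merged text match a gold entity exactly; the helper names and the placeholder segment texts are hypothetical and chosen to mirror the segment-order example above.

```python
def entity_from_chain(segment_texts, chain, sep=" "):
    """Join segment texts in the predicted merge order into one entity string."""
    return sep.join(segment_texts[i] for i in chain)


def entity_prf(gold, predicted):
    """Precision, recall and F1 over exact (type, text) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder texts for four segments that together form one location entity.
texts = ["seg1", "seg2", "seg3", "seg4"]
gold = [("LOC", "seg1 seg2 seg3 seg4")]

right_order = [("LOC", entity_from_chain(texts, [0, 1, 2, 3]))]
wrong_order = [("LOC", entity_from_chain(texts, [0, 2, 1, 3]))]

scores_right = entity_prf(gold, right_order)
scores_wrong = entity_prf(gold, wrong_order)
```

Even though every segment would carry the correct tag in both cases, only the prediction with the right segment order is credited under this metric.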

Implementation Details
We calculate the distance between text segments and determine whether two regions intersect with the Shapely Python package. For the LSTM in our model, the dimension of the hidden state is 256. The 300-dimensional pre-trained fastText English word embeddings are used in the SROIE-VNER experiments and 300-dimensional pre-trained fastText Chinese character embeddings are used in the BCD experiments. We use a one-layer GCN, the same as GraphIE, with a hidden size of 256. For the language model used for supervision, we utilize the sentence-order prediction (SOP) of ALBERT.
Table 1 shows the precision, recall and F1 score of the tagging results on the SROIE-VNER and BCD datasets for BiLSTM-CRF, GraphIE and GraphNEMR. On the SROIE-VNER dataset, both the graph-based model GraphIE and our GraphNEMR achieve an improvement of over 5.7% compared to the sequence-based BiLSTM-CRF model. However, in terms of named entity character tagging results, GraphNEMR* does not show a significant improvement over GraphIE. On the BCD dataset, which has a large diversity of layouts, GraphNEMR further surpasses GraphIE by 5.34% and yields a 10.60% improvement over BiLSTM-CRF.
Table 2 presents the comparison of our model with the sequence-based model and the graph-based model on complete named entity recognition. Intuitively, this is a more suitable evaluation method for tasks related to visual documents. Since the GraphIE model cannot merge text segments, its LOC result drops significantly on the SROIE-VNER dataset under this evaluation method. The sequence-based model BiLSTM-CRF merges text from left to right and from top to bottom, whereas our model GraphNEMR* learns to merge. GraphNEMR* significantly outperforms the sequence-based model BiLSTM-CRF by 6.02% and the graph-based model GraphIE by 21.52%, respectively. On the BCD dataset, GraphNEMR also obtains significant improvements over BiLSTM-CRF (9.94%) and GraphIE (8.91%).

Analysis
From the dataset perspective, the layout of the SROIE-VNER dataset is relatively simple, but many text segments need to be merged. The BCD dataset is quite different from SROIE-VNER. Since the style of each business card image is quite different, the BCD dataset has a large variety of layouts as well as many text segments that need to be merged.
The sequence-based BiLSTM-CRF concatenates the text segments in a document in a common order that many real-world layouts follow. So on the SROIE-VNER dataset, whose layout is relatively simple, the sequence-based BiLSTM-CRF achieves relatively stable and comparable results regardless of the evaluation method. However, the left-to-right, top-to-bottom order is not guaranteed to hold. The sequence order of entities such as locations and organizations, which often span multiple lines or multiple text segments, is broken by such concatenation. So on the BCD dataset, which has a large variety of layouts, the sequence-based model suffers significant performance degradation. Entities that usually have a short text length and in most cases appear in left-to-right order within a single text segment, e.g. persons, are not affected by concatenation and keep a relatively stable performance across different layouts.
Table 4: Ablation study ("-" means removing the sub-component from GraphNEMR).
In the BCD dataset, visual documents have diverse layouts and many text segments need to be merged. Because graph-based models take the visual layout into account, both GraphIE and GraphNEMR* achieve better results than the sequence-based model BiLSTM-CRF on visual documents with diverse layouts. Under these circumstances, the gap between our proposed GraphNEMR* and GraphIE mainly comes from the geometry features. The visual documents in SROIE-VNER have simple layouts, but many text segments need to be merged; here both GraphNEMR* and GraphIE perform better than BiLSTM-CRF in character tagging results. Since GraphIE cannot merge segments and BiLSTM-CRF uses a naive merging strategy, GraphNEMR* performs much better than the other models on complete entity recognition.
We also evaluate the impact of the number $k$ of nearest text segments taken from the 8-geometry neighbors. In theory, a text segment would not be merged with a distant text segment, but it may also be merged with segments other than its single nearest neighbor. Table 3 presents the results of using different values of $k$. As we can see, using all text segments as candidates does not help the merging task, while restricting the merging candidates to only the nearest neighbors hurts model performance.

Ablation Study
To better understand the contribution of each sub-component of GraphNEMR, we perform ablation studies on the BCD dataset. Table 4 presents the results. In each study, we exclude, respectively, the relative location, the geometry information in the merge layer, and the use of the pre-trained ALBERT language model in $Loss_{nsp/sop}$ to supervise sentence order. As we can see, the geometry information plays a more important role than the other components. The semantic $Loss_{nsp/sop}$ is also very helpful for recognizing complete named entities. Intuitively, with the 8-geometry neighbors information considered, richer 2D context and layout information is provided to better merge and recognize named entities.