Entity Relation Extraction as Dependency Parsing in Visually Rich Documents

Previous works on key information extraction from visually rich documents (VRDs) mainly focus on labeling the text within each bounding box (i.e., semantic entity), while the relations in-between are largely unexplored. In this paper, we adapt the popular dependency parsing model, the biaffine parser, to this entity relation extraction task. Unlike the original dependency parsing model, which recognizes dependency relations between words, we identify relations between groups of words with layout information. We compare different representations of the semantic entity, different VRD encoders, and different relation decoders. The results demonstrate that our proposed model achieves a 65.96% F1 score on the FUNSD dataset. As for real-world application, our model has been applied to in-house customs data, achieving reliable performance in the production setting.


Introduction
In real-life scenarios, there are many types of visually rich documents (VRDs), such as invoices, questionnaire forms, declaration materials, and so on. These documents contain abundant layout information which helps us understand the content when texts alone are not enough. In recent years, many works focus on how to extract key information from VRDs based on the results of OCR (Optical Character Recognition), which recognizes bounding boxes and the texts within them (Liu et al., 2019a; Yu et al., 2020b). Each bounding box contains 1) a group of words that belong together from a semantic and spatial standpoint and 2) visual features such as layout, tabular structure, and font size of the boxes in the document. We call such bounding boxes and the texts within them semantic entities, and each entity contains a word group and layout coordinates.
Key information extraction (KIE) is the task of analyzing visually rich documents, and it usually contains two steps: entity labeling and entity relation extraction. Similar to named entity recognition (NER) and relation extraction in traditional natural language processing (NLP), entity labeling aims to assign predefined labels to the semantic entities in VRDs (G. Jaume and Thiran, 2019; Liu et al., 2019a; Yu et al., 2020b), and entity relation extraction predicts relations between these semantic entities. Compared with NER and relation extraction in traditional NLP, KIE from VRDs is a more challenging task. First, a normal (named) entity in plain text does not contain layout information as the semantic entities in VRDs do. Second, normal relation extraction predicts the relation between two given mentions, while relation extraction in VRDs needs to predict the relation between any two semantic entities in the document. As Figure 1 (left) illustrates, the entity labeling task tags "533" with the label "Answer" and "Registration No." with the label "Question". However, which question is answered by "533" remains unknown without entity relation extraction. Compared with labeling, the entity relation extraction task is less explored, but its benefits include at least: 1) providing additional structural information closer to human comprehension of the VRDs, and 2) being easier to transfer to other domains when the predefined label set changes. Therefore, in this paper, we concentrate on the task of semantic entity relation extraction, which discovers the relation between two groups of words with layout information, as the yellow links in Figure 1 (left) show.

Figure 1: The left part shows one visually rich document from the FUNSD data. A group of words within one bounding box forms one semantic entity, and we number each entity as B_i. Different box colors indicate different entity labels, as the legend under the example lists. Relation links between semantic entities always point from key entities to value entities. We convert the entity relations in the VRD to a tree and add a pseudo root node, following the similar setting in dependency trees, and then link zero-head entities to the pseudo node, as the pseudo links in the right part show.
As another task similar to entity relation extraction in VRDs, dependency parsing aims to find syntactic relations between the words of an input sentence, and has been widely studied for decades. Both tasks capture pairwise relationships between basic units of the input data. Due to this similarity, we adapt the popular biaffine dependency parser, which utilizes biaffine attention to compute scores between words (Dozat and Manning, 2017), to the entity relation extraction task. Since visual features play an important role in VRDs, we introduce abundant layout information into different layers of the model to enhance the original text-only biaffine model:

• At the entity representation layer, we use LayoutLM (Xu et al., 2020b) to encode both the word group and coordinates.
• At the document encoder layer, we utilize graph convolutional networks (GCN) to combine textual and visual information in VRDs by mapping layout information into graph edge representation between entities (Liu et al., 2019a;Yu et al., 2020b).
• At the relation scorer layer, we extract relative position features between entities according to their coordinates.
Apart from the above, inspired by the joint POS tagging and dependency parsing model (Nguyen and Verspoor, 2018), we propose multi-task learning of entity labeling and relation extraction to further improve performance.
Extensive experiments are conducted to verify our approach of applying the biaffine dependency parser to the semantic entity relation extraction task in VRDs. Our proposed relation extraction model achieves a 65.96% F1 score on the FUNSD dataset, demonstrating its effectiveness. As for real-world application scenarios, our model has also been applied to in-house customs data, achieving reliable performance in the production setting.
The contributions of this paper are as follows: • We adapt the biaffine model used in dependency parsing to the entity relation extraction task and achieve 65.96% F1 score on the FUNSD dataset.
• We conduct detailed experiments to compare different representations of the semantic entity, different VRD encoders, and different relation decoders to better understand this task.
• We apply our model to the real-world customs data with different layouts and achieve high performance in the production setting.

Related Work
Visually rich document understanding includes many tasks, such as layout recognition (Zhong et al., 2019b; Li et al., 2020), table detection and recognition (Li et al., 2019a; Zhong et al., 2019a), and key information extraction (Graliński et al., 2020; Guo et al., 2019; Huang et al., 2019; G. Jaume and Thiran, 2019; Majumder et al., 2020). Our paper focuses on the key information extraction task, which contains two subtasks, entity labeling and relation extraction. The former tags entities with predefined labels, such as Task 3 on the SROIE data released by Huang et al. (2019), while the latter discovers relations between entities, such as Subtask C(3) on the FUNSD data (G. Jaume and Thiran, 2019).
To encode the semantic entity in VRDs, Yu et al. (2020b), among others, replace the BiLSTM (Bi-directional Long Short-Term Memory) used by Liu et al. (2019a) with BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019b). Xu et al. (2020b) propose LayoutLM, which adds a 2-D position embedding into a BERT-based language model and pretrains it on large-scale scanned document images with additional visually-related loss functions. Experiments verify that encoding the word group and layout coordinates at the same time is more effective for VRD understanding. LayoutLMv2 additionally introduces visual embeddings into the input layer and integrates a spatial-aware self-attention mechanism into the Transformer architecture (Xu et al., 2020a), and it performs better than LayoutLM on downstream VRD understanding tasks.
While encoding VRDs, earlier works treat the entity labeling task as sequence labeling and re-implement the named entity recognition (NER) framework (Lample et al., 2016), ignoring layout information. Later, many works introduce a GCN-based module to encode layout information and combine textual and visual information (Liu et al., 2019a; Yu et al., 2020b; Carbonell et al., 2021). In the GCN module, Liu et al. (2019a) and Yu et al. (2020b) take layout features between entities b_i and b_j as edge embeddings to update entity representations, while others prune irrelevant nodes in the graph according to shared x-axis or y-axis coordinates to obtain the adjacency matrix.
To predict relations between entities, G. Jaume and Thiran (2019) provide a simple approach which concatenates the representations of two entities and uses a multi-layer perceptron (MLP) to obtain the relation score. Carbonell et al. (2021) also use the MLP scorer but take a GNN as the document encoder instead of BERT and perform better. In the field of dependency parsing, Dozat and Manning (2017) propose the biaffine attention mechanism to compute scores between words, achieving better performance than the MLP mechanism used by Kiperwasser and Goldberg (2016). As biaffine attention is widely used in other tasks such as NER (Yu et al., 2020a) and semantic role labeling (Li et al., 2019b), we propose to use it for the entity relation extraction task in this work.

Entity Relation Extraction as Dependency Parsing
Both the semantic entity relation extraction and dependency parsing tasks aim to decide whether a relation exists between two entities/words, and both assume that links point from the key/head unit to the value/modifier unit, as shown in Figure 1. Therefore, we can draw lessons from dependency parsing research, which has been studied for several decades and achieved great progress. The biaffine parser, a strong model in dependency parsing, achieves competitive performance and is widely used in different scenarios and tasks. This section introduces how to apply the biaffine parser to our relation extraction task according to their similarities and differences.

Task Definition
Each scanned visually rich document is composed of a list of semantic entities, and each entity is composed of a group of words and the coordinates of its bounding box, defined as b_i = (w_1, ..., w_k; x_i^1, y_i^1, x_i^2, y_i^2), where x_i^1/x_i^2 are the left/right x-coordinates and y_i^1/y_i^2 are the top/bottom y-coordinates of the box. Documents in our datasets are annotated with the label of each entity and the relations between entities. We represent each annotated document as D = {(b_i, l_i, b_{h_i})}, where l_i ∈ L is the label of each entity and L is the predefined entity label set. (b_i, b_{h_i}) denotes the relation between entities b_i and b_{h_i}, with the link pointing from b_{h_i} to b_i. Notably, an entity may have relations with more than one entity, or with no other entity at all.
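The definition above can be sketched as a small data structure; the field names here are our own illustration, not an API fixed by the paper.

```python
# A minimal sketch of the task's data structures; field names are
# illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SemanticEntity:
    words: List[str]               # word group inside the bounding box
    x1: float                      # left x-coordinate
    y1: float                      # top y-coordinate
    x2: float                      # right x-coordinate
    y2: float                      # bottom y-coordinate
    label: Optional[str] = None    # e.g. "Question" or "Answer"
    head: Optional[int] = None     # index of the head entity; None if zero-head

@dataclass
class Document:
    entities: List[SemanticEntity] = field(default_factory=list)

doc = Document(entities=[
    SemanticEntity(["Registration", "No."], 10, 10, 120, 30, "Question"),
    SemanticEntity(["533"], 130, 10, 170, 30, "Answer", head=0),
])
```

Here entity 1 ("533") takes entity 0 ("Registration No.") as its head, mirroring the key-to-value link in Figure 1.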

Biaffine Parser
As Figure 2 (right) shows, the biaffine dependency parser takes word and POS-tag embeddings as the word representation and uses a multi-layer BiLSTM to encode the input sentence. Then, two MLP modules strip away information not relevant to the current link decision. Finally, the biaffine attention mechanism computes the score of the dependency link between words.
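The biaffine scoring step can be illustrated with a toy numpy sketch; all dimensions and weights below are our own illustration under the style of Dozat and Manning (2017), not the paper's implementation.

```python
# Toy sketch of a biaffine arc scorer; weights are random and
# dimensions illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                          # 4 units, hidden size 8
H_key = rng.normal(size=(n, d))      # MLP outputs for the "key/head" role
H_val = rng.normal(size=(n, d))      # MLP outputs for the "value/dependent" role
U = rng.normal(size=(d, d))          # bilinear weight matrix
u = rng.normal(size=(d,))            # bias vector over key representations

# scores[i, j] rates unit j as the head (key) of unit i (value).
scores = H_val @ U @ H_key.T + H_key @ u   # (n, n); bias broadcasts over rows
pred_heads = scores.argmax(axis=1)         # single-head decoding
```

The bilinear term captures the pairwise interaction, while the linear term scores each candidate head on its own.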
We explore various aspects of applying the biaffine parser to relation extraction in VRDs due to their similarity. In particular, compared to regular dependency parsing, we take layout information into consideration besides the text itself. In our proposed entity relation extraction model, we exploit layout information at different processing levels, including the entity encoder, document encoder, and relation scorer, as Figure 2 (left) shows. We name our proposed model SERA (Semantic Entity Relation extraction As dependency parsing). Details of SERA are discussed in the following subsections.

Entity Representation
At the input layer, in order to obtain better entity representations, we compare different ways to encode the information of a semantic entity, containing the word group and the layout features. In this work, we take advantage of widely used pretrained models, including context-free word vectors from word2vec (Mikolov et al., 2013) and contextualized representations from BERT and LayoutLM. In particular, LayoutLM introduces coordinate information from bounding boxes during pretraining, which makes it well suited to our scenario.
In addition, we make use of the label of each semantic entity, such as "Question" and "Answer" in the FUNSD label set, as Figure 1 shows. We map the entity labels into label embeddings, analogous to POS-tag embeddings in dependency parsing. Then, we concatenate the entity representation and label embedding as the input of the document encoder for each semantic entity, as the following equation shows:

e_i = [b_i ; l_i]

where l_i is the entity label embedding and b_i is the representation of the semantic entity, which can be obtained from any of the three above-mentioned pretrained models, i.e., word2vec, BERT, and LayoutLM.
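Concretely, the concatenation amounts to the following sketch; the 768/100 dimensions are taken from the Parameters section of this paper, the rest is illustration.

```python
# Sketch of concatenating the entity representation with its label
# embedding to form the document-encoder input.
import numpy as np

b_i = np.ones(768)                  # entity representation (e.g. a LayoutLM output)
l_i = np.full(100, 0.5)             # entity-label embedding
e_i = np.concatenate([b_i, l_i])    # input to the document encoder
```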

Document Encoder
We compare different document encoders, including the transformer, BiLSTM, and GCN, for better encoding the semantic entities in VRDs. Specifically, we feed the representation of each entity into the document encoder and take the encoder output as the contextual representation of the entity. Details of the BiLSTM and transformer can be found in Lample et al. (2016) and Vaswani et al. (2017), respectively. For the GCN encoder, the initial entity representation in the graph is computed as subsection 3.3 shows. While updating the representations of entities and edges, we follow the computation of Liu et al. (2019a). The edge embedding consists of two layout features, as the following equation shows:

r_ij = [x_ij ; y_ij]

where x_ij and y_ij are the horizontal and vertical distances between the two entity boxes, respectively. For entity b_i, we extract features h_ij of each neighbour b_j by concatenating the representations of the two entities and their corresponding edge:

h_ij = [e_i ; r_ij ; e_j]
Then, we update the representations of entities and edges so that each entity can extract relevant information from other entities according to the document layout, as the following equation shows:

e_i' = σ( Σ_j α_ij h_ij W )

where α_ij is the attention weight, computed as follows:

α_ij = exp(w^T h_ij) / Σ_{j'} exp(w^T h_ij')

where j' ranges over the n entities in the document.
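A toy numpy sketch of such a layout-aware update follows; the exact feature and attention forms here (concatenated features with a softmax over neighbours) are our assumptions in the spirit of Liu et al. (2019a), with random weights and illustrative dimensions.

```python
# Toy GCN-style update over layout edge features.
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 4
E = rng.normal(size=(n, d))                                 # entity representations
corners = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 20.0]])  # box (x, y) origins

# Edge features: horizontal and vertical distances between boxes.
R = corners[:, None, :] - corners[None, :, :]               # R[i, j] = (x_ij, y_ij)

# h_ij = [e_i ; r_ij ; e_j]
H = np.concatenate([
    np.broadcast_to(E[:, None, :], (n, n, d)),
    R,
    np.broadcast_to(E[None, :, :], (n, n, d)),
], axis=-1)                                                 # (n, n, 2d + 2)

w = rng.normal(size=(2 * d + 2,))                           # attention projection
logits = H @ w                                              # (n, n)
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

W = rng.normal(size=(2 * d + 2, d))                         # update projection
E_new = np.tanh((alpha[..., None] * (H @ W)).sum(axis=1))   # updated entities
```

Each entity's new representation is an attention-weighted sum over its neighbours, where the weights depend on both textual and layout features.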

Relation Scorer
Following the biaffine parser, we first apply MLP modules to drop trivial information unrelated to the current relation decision. Two MLPs generate different representations for the key and value roles in each relation link, which indicate the direction of the arc in Figure 1:

h_i^key = F(W^key e_i + b^key),  h_i^value = F(W^value e_i + b^value)

where F is an activation function. Then, we use biaffine attention to compute the score between two semantic entities as follows:

s_ij^bia = (h_i^value)^T U h_j^key + (h_j^key)^T u

Such a biaffine mechanism captures pairwise relationships between entities better. We also use layout information r_ij as external features to help the model predict relations between entities. These layout features indicate the positional relationship between entities b_i and b_j: left-to-right or top-to-down. Empirically, we observe that entities in left-to-right or top-to-down order are more likely to have relations. We use an MLP to compute the layout feature score as follows:

s_ij^layout = MLP(r_ij)

Lastly, we add the biaffine score and the layout feature score together as the score of the relation between entities b_i and b_j:

s_ij = s_ij^bia + s_ij^layout
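A toy version of the combined scoring for one entity pair follows; the indicator-style layout features and the tiny MLP sizes are our own assumptions for illustration.

```python
# Sketch: a layout feature score added to a biaffine score for a pair
# of boxes given as (x1, y1, x2, y2); weights are random and illustrative.
import numpy as np

rng = np.random.default_rng(2)

def layout_features(box_i, box_j):
    left_to_right = float(box_i[2] <= box_j[0])  # box i entirely left of box j
    top_to_down = float(box_i[3] <= box_j[1])    # box i entirely above box j
    return np.array([left_to_right, top_to_down])

W1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4,))

def layout_score(box_i, box_j):
    h = np.tanh(layout_features(box_i, box_j) @ W1)  # tiny MLP over the features
    return float(h @ w2)

s_biaffine = 1.3                                 # stand-in biaffine score
s = s_biaffine + layout_score((0, 0, 40, 20), (50, 0, 90, 20))
```

In this example, the first box lies entirely to the left of the second, so the left-to-right feature fires and nudges the pair's score.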

Relation Decoder
Based on the relation scores between entities, two different relation decoding methods lead to different loss functions in our training objective.
The first method judges whether a relation exists between any two entities in each VRD, which is similar to semantic role labeling (SRL). In this setting, we treat relation prediction as a binary classification task and use the binary cross-entropy loss, as in G. Jaume and Thiran (2019).
The second method chooses one head entity from all entities in the VRD for the current entity, which is similar to the decoder in dependency parsing. This method requires that each entity have exactly one head entity, namely the single-head constraint. Relation prediction is then seen as a multi-class classification task and uses the softmax cross-entropy loss, as in Dozat and Manning (2017).
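The two decoding schemes can be contrasted on a toy score matrix; the numbers are made up for illustration.

```python
# scores[i, j]: score of entity j being the head of entity i.
import numpy as np

scores = np.array([[0.2, 2.0, -1.0],
                   [1.5, 0.1,  3.0],
                   [0.3, -0.5, 0.2]])

# 1) Binary decoding (no single-head constraint): threshold each pair,
#    so an entity may receive zero or several heads.
probs = 1.0 / (1.0 + np.exp(-scores))
binary_links = probs > 0.5

# 2) Single-head decoding (dependency-parsing style): argmax per row,
#    so every entity receives exactly one head.
single_heads = scores.argmax(axis=1)
```

Note how the binary decoder predicts seven links here while the single-head decoder commits to exactly one head per entity.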

Datasets
We conduct experiments on the FUNSD data, published by G. Jaume and Thiran (2019) for the form understanding task. Moreover, to verify our proposed model, we also collect a real-world dataset from the customs scenario.
FUNSD is composed of 199 fully annotated, scanned forms with comprehensive annotations addressing form understanding tasks, including entity labeling and relation extraction. We follow the data split of G. Jaume and Thiran (2019), and detailed data statistics, including the entity/relation distribution in FUNSD, are listed in Table 1.
Customs Data consists of about 1,600 customs declaration documents that we collected, in different layouts and languages. There are four types of documents: packing list, invoice, sales contract, and customs declaration form, and each kind of document provides different information useful for customs. Figure 3 gives one invoice example, providing unit price, quantity, and other details. Customs documents may be in Chinese or English, and their format may be Word, Excel, PDF, or image. We parse these documents with a self-developed OCR tool to obtain the semantic entities in each VRD. We organize crowd-sourcing to annotate the labels of entities given the predefined label set, which contains 48 kinds of labels important for the customs information extraction system. We obtain the key entities via a mapping dictionary from each predefined entity label to all its possible names in VRDs, since these names are enumerable. Then, we link entities from keys to values with the same labels. We thus obtain customs data annotated with entity labels and relations between entities. The scale of our collected customs data is much larger than that of the FUNSD data, as Table 1 shows.

Data Preprocessing
Multi-Head & Zero-Head Entities. In dependency parsing, one word must have one and only one head (the single-head constraint). However, zero-head entities that have no relations with any other entities and multi-head entities that have multiple heads do appear in our datasets. For zero-head entities, we add a pseudo root entity and link them to it, as Figure 1 shows. For multi-head entities, we randomly keep one head entity and delete the others under the single-head constraint. In FUNSD, there are 324 (of 4,236) and 16 (of 1,048) multi-head entities in the train/test data, respectively, accounting for a small fraction of all data.
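This preprocessing can be sketched as follows, with relations given as (head, dependent) index pairs and index 0 reserved for the pseudo root; the helper name is our own.

```python
# Map every entity to exactly one head: zero-head entities attach to the
# pseudo root (index 0); multi-head entities keep one head at random.
import random

def to_single_head(n_entities, relations, seed=0):
    heads_of = {i: [] for i in range(1, n_entities + 1)}
    for head, dep in relations:
        heads_of[dep].append(head)
    rng = random.Random(seed)
    return {
        dep: (rng.choice(heads) if heads else 0)
        for dep, heads in heads_of.items()
    }

heads = to_single_head(3, [(1, 2), (3, 2)])  # entity 2 is multi-head; 1 and 3 are zero-head
```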
Auto Labels. Intuitively, type-tagged entities ease the prediction of relations between semantic entities. We employ an effective entity labeling model consisting of two modules: an entity encoder and a label scorer. We take LayoutLM to encode the document and obtain the entity representation in a similar way to our relation extraction model. We also introduce three layout features w_i, h_i, c_i into the entity representation and map them into 10-dim embeddings: w_i and h_i are the width and height of the bounding box, and c_i is the character length of the word group of each semantic entity. After concatenating the feature embeddings and the LayoutLM output, we pass them into the MLP scorer to compute the score of each candidate label. In this way, we obtain the auto labels of our relation extraction data.
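The three features can be computed directly from the box and word group; the function name and the (x1, y1, x2, y2) box format below are our illustration.

```python
# w/h: width and height of the bounding box; c: character length of the
# word group.
def labeling_features(box, words):
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    c = sum(len(word) for word in words)
    return w, h, c

feats = labeling_features((10, 10, 120, 30), ["Registration", "No."])
```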

Evaluation Metrics
To evaluate our semantic entity linking model, we use entity-level precision, recall, and F1 score as evaluation metrics. Under the single-head constraint, we ignore links pointing from the pseudo root entity in both the gold and predicted results, for fair comparison with other works.
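A sketch of this metric over sets of (head, dependent) link pairs, dropping pseudo-root links (head index 0) before scoring:

```python
# Entity-level precision/recall/F1 over relation links, ignoring links
# that point from the pseudo root (index 0).
def link_f1(gold_links, pred_links):
    gold = {(h, d) for h, d in gold_links if h != 0}
    pred = {(h, d) for h, d in pred_links if h != 0}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = link_f1([(1, 2), (3, 4), (0, 5)], [(1, 2), (0, 5), (2, 4)])
```

In this toy case, one of two non-root predictions is correct, giving 0.5 precision, recall, and F1.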

Parameters
We investigate several pretrained language models for obtaining the entity representation, i.e., word2vec, BERT, and LayoutLM. For word2vec, we obtain the entity representation by averaging the embeddings of the words contained in an entity; for BERT/LayoutLM, we use the base model and take the hidden state of the first subword of the word group as the whole entity representation. Therefore, the representation dimension of the words within a bounding box is 100 when using word2vec and 768 when using BERT or LayoutLM. We use a 100-dim embedding to represent the entity labels, so our entity representation is 200- or 868-dim.
To encode the whole VRD, we investigate a 1-layer BiLSTM, a 1-layer transformer, and a 2-layer GCN encoder. The hidden state dimension of the BiLSTM and transformer is 300, and the dimension of the edge and entity representations output by the GCN is 100.
The learning rate for BERT and LayoutLM is set to 1e-5 and that of the other modules to 1e-2. The model is trained for 50 iterations on the FUNSD data and 100 iterations on the customs data, and in each iteration we traverse the whole training data under all settings.

5 We train the labeling model on the whole training data and predict the auto labels of the test data. For the training data, we split it into 5 folds and train the model on 4 folds to generate automatic labels for the remaining fold.

6 When using the transformer, it is difficult to set the number of heads in multi-head self-attention if the dimension of the entity representation is 868. Here, we use a 96-dim label embedding instead.

Table 2: Performance on the entity relation extraction task on the FUNSD test data for previous works and our model under different settings. We re-implement previous works for the entity relation extraction task, except for Carbonell et al. (2021), for which we report the results published in their paper.

Overall Results
We adapt the biaffine model from the dependency parsing task to our entity relation extraction task and conduct detailed experiments on the FUNSD dataset. Experimental results are shown in Table 2.

Previous works. First, we train the original biaffine model (Dozat and Manning, 2017) after replacing each word and its POS-tag with the word group and entity label of each entity, but achieve poor performance. Then, we re-implement the entity relation extraction model proposed by G. Jaume and Thiran (2019), which consists of BERT as the entity encoder and an MLP as the relation scorer; our re-implemented results are much higher than the performance reported in their paper (0.04% F1). We replace BERT with LayoutLM, keeping the other parts unchanged, to encode the layout coordinate information, and model performance improves slightly. Carbonell et al. (2021) also utilize the MLP link scorer but encode documents with a k-layer GNN instead of BERT or LayoutLM, and they achieve higher performance than the other previous works.

7 We train the model for 50/100 iterations and then predict the test set with the trained model.

Table 3: Performance of entity relation extraction on the FUNSD test data. We compare different pretrained language models used to encode entities, keeping the other modules of SERA unchanged.
Our Models. We propose our semantic entity relation extraction model based on the architecture of the biaffine parser, as Section 3 describes. We apply LayoutLM/GCN as our entity/document encoder and optimize our models with the softmax cross-entropy loss under the single-head constraint. Results show that our proposed SERA outperforms previous works by a large margin. This improvement demonstrates that layout information plays an important role in the entity relation extraction task. Two ablation experiments verify the effectiveness of the layout feature scorer and the auto labels of entities. Our SERA model can further improve performance with two training strategies: data augmentation and multi-task learning of entity labeling and relation extraction.
Detailed experiments about our different explorations for entity relation extraction task are discussed in the following subsections.

Entity Representation
When encoding semantic entities, we employ three different pretrained models; comparison results are listed in Table 3. Encoding entities with LayoutLM performs best because it introduces layout information into its transformer encoder and, compared with BERT, has been pretrained on a large scale of VRDs. Word2vec achieves much poorer performance than the other two due to the missing context-aware information inside semantic entities. Table 2 demonstrates that taking the entity label embedding as part of the entity representation improves model performance. From our analysis of the FUNSD data, we find that many entity relation links point from entities labeled "Question" to entities labeled "Answer", and almost no relations hold between answers or between questions. Therefore, entity labels help the extraction model prune unreasonable relations and enrich the entity representation. The gap between models with gold labels and auto labels is about 10% F1, which indicates that the room for improvement is still large if we can obtain better auto entity labels.

Document Encoder
We investigate three popular encoders, the transformer, BiLSTM, and GCN, to encode VRDs. Experimental results in Table 4 show that the GCN encoder performs much better than the other two on our task. The GCN we apply can be seen as an improvement over the self-attention mechanism because it introduces layout information into the document encoder, as Formula 4 shows. Such layout features indicate the positional relation between entities according to x-axis or y-axis coordinates; they help the document encoder gather more information from adjacent entities that are more relevant to the current entity, while the transformer updates entity representations according to textual information alone. The BiLSTM encodes the entities of a document in plain sequential order. However, sequential order is not suitable for VRD understanding; for example, many key-value pairs in tables are arranged top-to-down.

Relation Decoder
Decoding relation links between entities with or without the single-head constraint leads to a large performance gap, as Table 5 shows. With different relation scorers, the trends of the two decoders are opposite. The MLP scorer performs worse than the biaffine scorer under the single-head constraint; this is because the biaffine scorer is more suitable for the single-head setting, as proved by Dozat and Manning (2017). Without this constraint, the task is similar to SRL, and most previous SRL works prefer the MLP scorer.

9 In the FUNSD training data, relations between answers or between questions account for only about 1%.

Table 6: Performance of entity relation extraction on the FUNSD test data. We train our SERA with two training strategies: data augmentation and multi-task learning.
Our error analysis finds that, without the single-head constraint, the biaffine scorer leads the model to prefer predicting multi-head links for entities, which is inconsistent with the data distribution.
Due to the large gap between the biaffine scorer with the single-head constraint and the MLP scorer without it, we finally choose the biaffine scorer with the single-head constraint in our experiments.

Training Strategies
To achieve better performance on the entity relation extraction task, we apply two training strategies. First, we treat entity labeling and relation extraction as multi-task learning (MTL) (Collobert and Weston, 2008; Nguyen and Verspoor, 2018): the two tasks share the pretrained language model in the entity encoder and fine-tune the shared parameters together during training. As Table 6 shows, the relation extraction task improves by about 0.86% F1 while the labeling model's performance drops slightly. This improvement demonstrates that MTL is highly effective at alleviating error propagation from the entity labeling task.
Second, we augment our training data due to the small number of training documents in the FUNSD data (Krizhevsky et al., 2017). We randomly drop words with ratio 0.2 from the word group of each entity to obtain more pseudo documents. We combine these pseudo training data with the gold training data and keep the test data unchanged. Models trained on the augmented training data improve performance by about 1.1% F1 with auto labels and 2.9% F1 with gold labels. The gap between the two improvements indicates that the relation extraction model is sensitive to the accuracy of entity labels.

Table 7: Performance of entity relation extraction on the customs data using SERA. We compare pretrained language models in different languages.
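The word-dropout augmentation can be sketched as follows; the exact sampling details are our assumption.

```python
# Randomly drop words from an entity's word group with ratio 0.2 to
# create pseudo training documents.
import random

def drop_words(words, ratio=0.2, seed=0):
    rng = random.Random(seed)
    kept = [w for w in words if rng.random() >= ratio]
    return kept if kept else words[:1]   # never empty the word group

aug = drop_words(["Total", "Amount", "Due", ":", "533"])
```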
The combination of these two strategies performs best under the auto entity label setting.

Customs Data
We apply our SERA with its best configuration to our collected customs data. Since documents in the customs data may be in Chinese or English, a pretrained Chinese or English language model cannot perfectly cover the words in the documents with its vocabulary. We conduct experiments with pretrained models in different languages to study this problem in depth.
As Table 7 shows, our proposed model works well on the customs data, whose scale is much larger than FUNSD's. The customs data contain more layout information, such as the tables shown in Figure 3. We observe that Chinese BERT is better than English BERT on our language-mixed data; analyzing their vocabularies, we find that the Chinese vocabulary covers more of the words in the documents. Even though Chinese BERT performs better, English LayoutLM still achieves the best results among the three pretrained models, which indicates that encoding layout information into the language model makes a difference.

Conclusion
This paper focuses on the largely unexplored entity relation extraction task in VRDs. We take advantage of previous work on semantic entity labeling and dependency parsing and propose our relation extraction model SERA. Our entity relation extraction model achieves a 64.60% F1 score on the FUNSD data, outperforming previous baselines by a large margin, and we employ two simple but effective training strategies, i.e., multi-task learning with entity labeling and data augmentation, to further improve the performance to 65.96%. In addition, we verify the effectiveness of our model on real-world customs data with different layouts in the production setting. In the future, we plan to incorporate more visual features into the relation extraction model and extend it to more domains and business scenarios.