Cost-effective End-to-end Information Extraction for Semi-structured Document Images

A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplifies the entire process. However, such a generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system, focusing on the practical challenges associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.


Introduction
Information extraction (IE) for semi-structured documents is an important first step towards automated document processing. One way of building such an IE system is to develop multiple separate modules, each specialized in a sub-task. For example, our currently deployed IE system for name card and receipt images, POT (Hwang et al., 2019),1 consists of three manually engineered modules and one data-driven module (Fig. 1a). This system first accepts text segments and their 2D coordinates (we dub this "2D text") from an OCR system and generates pseudo-1D-text using a serializer. The text is then IOB-tagged and mapped to a raw, structured parse. Finally, the raw parse is normalized by trimming and reformatting with regular expressions.

* Most work done while these authors were at NAVER.
1 In October 2020, the system received approximately 350k name card and 650k receipt queries per day.

Figure 1: The scheme of (a) our tagging-based IE system and (b) the end-to-end IE system proposed in this study. Each IE system maps document images to parses, e.g. { store name: "WYVERN STORE", store tel: "13-806-4852" }, ..., { menu name: "Polyjuice portion", count: "2", unit price: "1,000", price: "2,000" }, { menu name: "Cheap wand", count: "1", ... }.

Although POT has shown satisfactory performance to its customers, its long pipeline increases the development cost: (1) each module must be engineered and fine-tuned by experts per domain, and (2) each text segment must be annotated manually for training. Because the entire structural information of a document must be recovered from a set of individual tags, the difficulty of the annotation process increases rapidly with layout complexity.
One can consider replacing the entire pipeline with a single end-to-end model, but such a model is known to show relatively low accuracy in structured prediction tasks without careful modeling (Dyer et al., 2016; Dong and Lapata, 2018; Yin and Neubig, 2018; Fernandez Astudillo et al., 2020). Here, we present our recent effort on such a transition to replace our product POT with an end-to-end model. In the new IE system, which we call WYVERN,2 we formulate the IE task as a sequence generation task equipped with a tree-generating transition module. By directly generating a sequence of actions that has a one-to-one correspondence with the final output (parse), the system can be trained without annotating intermediate text segments or developing and maintaining the four different IE modules in POT (orange texts in Fig. 1a). This also leads to a dramatic reduction in cost (Tbl. 6).
To achieve service-quality accuracy and stability with a sequence generation model while minimizing domain-specific engineering, we start from a 2D Transformer baseline and experiment with the following components: (1) a copying mechanism, (2) a transition module, and (3) preexisting weak-label data. We observe that WYVERN achieves performance comparable to POT when it is trained with a similar amount of low-cost data, and achieves higher accuracy by leveraging a large amount of preexisting weak-supervision data that cannot be utilized by the pipeline-based system. This indicates that turning a long machine learning pipeline into an end-to-end model is worth considering even in a real production environment.

Related Work
Most previous works formulate the semi-structured document IE task as a text tagging problem (Palm et al., 2017; Katti et al., 2018; Zhao et al., 2019; Xu et al., 2019; Denk and Reisswig, 2019; Majumder et al., 2020; Liu et al., 2019; Yu et al., 2020; Qian et al., 2019; Hwang et al., 2019). Hwang et al. (2020) formulate the task as spatial dependency parsing, which is essentially another tagging approach in which the relations between text segments are tagged. Although all previous studies have shown the competency of the proposed methods on their own tasks, they share the following limitation: individual text segments must be labeled with appropriate tags for model training. Since the tags require domain- and layout-dependent modification for each task, appropriate annotation tools must be developed as well. The difficulty of annotation rapidly increases when documents show multiple levels of information hierarchy, necessitating grouping between fields (for example, see the name and price fields in Fig. 2a). In contrast, we formulate the IE task as a sequence generation problem in which only parses are required for training.
2 WEAKLY SUPERVISED GENERATIVE DOCUMENT PARSER

Model
WYVERN consists of the following three major modules: (1) the Transformer encoder that accepts 2D text, (2) the decoder for sequence generation, and (3) the transition module that converts the generated sequence into a parse tree.
2D Transformer Encoder. We use the Transformer (Vaswani et al., 2017) with the following modifications for encoding 2D text (Hwang et al., 2020). The input vectors are generated from five features: token, x-coordinate, y-coordinate, character height, and text orientation. As in BERT, each feature is represented as an integer and mapped to a trainable embedding vector. Each coordinate is quantized into 120 integers, character height into 6, and text orientation into 2. The resulting embeddings are summed into a single input vector. Unlike the original Transformer, the position embedding for word ordering is omitted.
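The construction of the encoder input can be illustrated with a minimal pure-Python sketch. Only the five features and the bin counts (120, 120, 6, 2) come from the text; the embedding width, the vocabulary size, the uniform quantization against page size, and all function names are hypothetical choices made for illustration.

```python
import random

EMB_DIM = 8        # hypothetical embedding width
VOCAB_SIZE = 1000  # hypothetical token vocabulary size
# Number of integer ids per feature; the last four values follow the text.
FEATURE_BINS = {"token": VOCAB_SIZE, "x": 120, "y": 120, "height": 6, "orientation": 2}


def init_tables(seed=0):
    """Create one randomly initialized (trainable, in a real model) table per feature."""
    rng = random.Random(seed)
    return {
        name: [[rng.gauss(0.0, 0.02) for _ in range(EMB_DIM)] for _ in range(n)]
        for name, n in FEATURE_BINS.items()
    }


def quantize(value, max_value, n_bins):
    """Uniformly map a value in [0, max_value] to an integer in [0, n_bins - 1] (assumed scheme)."""
    ratio = min(max(value / max_value, 0.0), 1.0)
    return min(int(ratio * n_bins), n_bins - 1)


def embed_token(tables, token_id, x, y, height, is_rotated, page_w, page_h, max_height):
    """Sum the five feature embeddings; no word-order position embedding is added."""
    ids = {
        "token": token_id,
        "x": quantize(x, page_w, FEATURE_BINS["x"]),
        "y": quantize(y, page_h, FEATURE_BINS["y"]),
        "height": quantize(height, max_height, FEATURE_BINS["height"]),
        "orientation": int(is_rotated),
    }
    vec = [0.0] * EMB_DIM
    for name, idx in ids.items():
        vec = [a + b for a, b in zip(vec, tables[name][idx])]
    return vec, ids


tables = init_tables()
vec, ids = embed_token(tables, 42, x=300, y=50, height=12, is_rotated=False,
                       page_w=600, page_h=800, max_height=48)
```

Because the order of text segments is carried only by the coordinate embeddings, shuffling the input segments would leave each input vector unchanged in this sketch.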
Decoder. We use a Transformer decoder equipped with a gated copying mechanism (Gu et al., 2016; See et al., 2017). At each time step t, the probability of copying each input token is calculated via the inner product between the contextualized representations from the last Transformer layers of the encoder ({h_e}) and the decoder (h_d(t)). The resulting probability is added to the generation probability of the corresponding tokens, gated by the probability p_gate, which is calculated by linearly projecting the concatenation of h_d(t) and the sum of {h_e}, each weighted by h_d(t).
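A single decoding step of such a gated copy mechanism might look like the following sketch. Several details are assumptions rather than statements from the text: the copy scores are normalized with a softmax, the gate uses a sigmoid, and the two distributions are mixed as (1 - p_gate) * generation + p_gate * copy. The names `gated_copy_distribution`, `w_gate`, and `b_gate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_copy_distribution(h_enc, h_dec, input_token_ids, p_gen, w_gate, b_gate=0.0):
    """Mix a generation distribution with a copy distribution over input tokens.

    h_enc: encoder states, one vector per input token ({h_e}).
    h_dec: decoder state at the current step (h_d(t)).
    input_token_ids: vocabulary id of each input token.
    p_gen: generation probabilities over the vocabulary.
    w_gate, b_gate: parameters of the linear gate (hypothetical names).
    """
    # Copy probability of each input position from inner products (softmax is assumed).
    copy_probs = softmax([dot(h, h_dec) for h in h_enc])
    # Sum of encoder states, each weighted by its score against the decoder state.
    dim = len(h_dec)
    context = [sum(p * h[i] for p, h in zip(copy_probs, h_enc)) for i in range(dim)]
    # Linear projection of the concatenation [h_d(t); context], squashed by a sigmoid (assumed).
    p_gate = 1.0 / (1.0 + math.exp(-(dot(w_gate, h_dec + context) + b_gate)))
    final = [(1.0 - p_gate) * p for p in p_gen]
    for p, token_id in zip(copy_probs, input_token_ids):
        final[token_id] += p_gate * p
    return final, p_gate


final, p_gate = gated_copy_distribution(
    h_enc=[[1.0, 0.0], [0.0, 1.0]],
    h_dec=[2.0, 0.0],
    input_token_ids=[3, 1],
    p_gen=[0.25, 0.25, 0.25, 0.25],
    w_gate=[0.0, 0.0, 0.0, 0.0],
)
```

With this mixture the output remains a valid probability distribution, and tokens that appear in the input (here ids 3 and 1) receive extra mass in proportion to their copy scores.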
Transition module. All parses are uniformly formatted as JSON. However, direct generation of JSON strings requires unnecessarily many steps because (1) all syntactic tokens ({, :, }) are generated separately, and (2) a "key" often consists of multiple tokens. To minimize the generation length while constraining the search space, we propose to convert JSON-formatted parses into corresponding abstract syntax trees (ASTs) (Fig. 2a). Under this formulation, the sequence of generated tokens is interpreted as a sequence of AST-generating actions. The actions are later converted into a JSON-formatted parse using a push-down automaton. Our transition module consists of three types of actions, NT(key), GEN(token), and REDUCE, based on previous studies (Yin and Neubig, 2018; Dyer et al., 2016). NT(key) generates a nonterminal node representing a key of the JSON. The special tokens representing the type of individual fields are interpreted as the corresponding actions. GEN(token) generates the corresponding token. REDUCE indicates the completion of generation at the current level, and the system moves to a higher level. The process is demonstrated with an example in Tbl. 1. A sequence of actions can be uniquely determined for a given AST by traversing the tree in depth-first, left-to-right order.
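The correspondence between parses and action sequences can be sketched as follows. This is an illustration under stated assumptions, not the deployed module: a parse is represented as a list of (key, value) pairs so that repeated keys are allowed, values are split into tokens on whitespace, and the function names and the example keys are hypothetical.

```python
def parse_to_actions(fields):
    """Depth-first, left-to-right traversal of a parse into NT/GEN/REDUCE actions.

    fields: list of (key, value); value is a string or a nested list of (key, value).
    """
    actions = []
    for key, value in fields:
        actions.append(("NT", key))
        if isinstance(value, str):
            actions.extend(("GEN", token) for token in value.split())
        else:
            actions.extend(parse_to_actions(value))
        actions.append(("REDUCE", None))
    return actions


def actions_to_parse(actions):
    """Rebuild the parse with an explicit stack, in the spirit of a push-down automaton."""
    root = []
    stack = [(None, root, [])]  # frames of (key, child fields, generated tokens)
    for kind, arg in actions:
        if kind == "NT":
            stack.append((arg, [], []))
        elif kind == "GEN":
            if len(stack) == 1:
                raise ValueError("GEN outside of any field")
            stack[-1][2].append(arg)
        elif kind == "REDUCE":
            if len(stack) == 1:
                raise ValueError("REDUCE without an open field")
            key, children, tokens = stack.pop()
            # A field with child fields is a sub-parse; otherwise it is a text value.
            stack[-1][1].append((key, children if children else " ".join(tokens)))
        else:
            raise ValueError(f"unknown action: {kind}")
    if len(stack) != 1:
        raise ValueError("unclosed field at end of sequence")
    return root


def to_json_object(fields):
    """Convert to nested dicts; every key maps to a list so repeated keys survive (assumed layout)."""
    obj = {}
    for key, value in fields:
        obj.setdefault(key, []).append(value if isinstance(value, str) else to_json_object(value))
    return obj


receipt = [
    ("store", [("name", "WYVERN STORE"), ("tel", "13-806-4852")]),
    ("item", [("name", "Cheap wand"), ("count", "1")]),
]
actions = parse_to_actions(receipt)
restored = actions_to_parse(actions)
```

In this sketch a multi-token key costs a single NT action and no action is spent on braces or colons, which is the source of the shorter generation length described above.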

Datasets and Setups
Datasets. Due to confidentiality issues related to industrial documents, we use four large-scale internal datasets: three strong-label datasets (NJ, RK, RJ) and one weak-label dataset (NJ-W). Their properties are summarized in Tbl. 2.
Baseline. We compare WYVERN to POT, which consists of four separate modules (Fig. 1a, Sec. 1). The technical details are described in Sec. A.1.
Evaluation. The parses consist of hierarchically grouped key-value pairs, where a key indicates a field type such as menu, and a value is either a corresponding text sequence or a sub-parse forming a recursive structure (e.g., menu under item in Fig. 2a). We conduct extensive evaluation with three different metrics: F1, nTED, and A/B test. F1 is calculated by counting the number of exactly matched key-value pairs. Since F1 ignores partially correct predictions (even a single character difference is counted as wrong), we use another metric, nTED, a tree edit distance normalized by the number of nodes. nTED considers both lexical and structural differences between parse trees (Fig. 2b).
[...], 2015) with a learning rate of 2e-5 or 3e-5. The decay rates are set to β1 = 0.9, β2 = 0.999. After the initial rapid learning phase, which typically takes 1-2 weeks on 2-8 NVIDIA P40 GPUs, the learning rate is set to 1e-5 for stability and the training continues for up to one month. The batch size is set to 16-32. During inference, beam search is employed. In the receipt IE task, the training examples are randomly sampled from three tasks (Japanese name card, Korean receipt, and Japanese receipt) while sharing the weights. We found that this multi-domain setting leads to faster convergence.

Results
WYVERN shows competent performance
We first validate WYVERN on the Japanese name card IE task (NJ). The comparable scores of WYVERN and POT show the effectiveness of our end-to-end approach (Tbl. 3, 1st row vs 3rd row). Note that a naive application of the Transformer encoder-decoder model shows dramatic performance degradation (2nd row), highlighting the importance of careful modeling. Here, the F1 score is based on exact match, and WYVERN generates accurate parses even for non-semantic text like telephone numbers (Tbl. 7, last three columns, in Appendix).
Higher precision can be achieved by controlling the recall rate (Fig. 3 in Appendix). Our generative approach also enables automatic text normalization and OCR error correction. Without the transition module, the F1 error increases by 1.3%.

Utilizing large weak-label data significantly enhances accuracy
Weak-label data (pairs of documents and parses) often already exists because it is what gets stored in databases. In particular, when a human-driven data collection pipeline has been in place, the amount of accumulated data can be huge. To leverage the preexisting weak-label data in the database, we first train WYVERN using NJ-W, which consists of 6M weakly labeled Japanese name cards, and fine-tune it using NJ. The fine-tuning step is required because parses in NJ-W and NJ have multiple distinct properties (Sec. A.4). The results show that WYVERN outperforms POT by -2.4 F1 error and -2.29 nTED (Tbl. 3, bottom row). Note that, unlike WYVERN, POT requires strong-label data that must be annotated from scratch.

WYVERN can generate complex parse trees
We further validate WYVERN on the Korean (RK) and Japanese (RJ) receipt IE tasks, where parses have a more complex structure (Fig. 2, Tbl. 2). WYVERN shows higher performance than POT in RK (Tbl. 4, 1st row vs 3rd row), even for numeric fields (Tbl. 8, last ten columns, in Appendix). In RJ, however, it shows lower performance (2nd row vs 4th row), which may be attributed to the complex parse tree of RJ, consisting of a total of 39 fields (Tbl. 2).

WYVERN is preferred in A/B-test
We conduct three A/B tests between POT (P) and WYVERN (W) on the Japanese name card and Korean receipt IE tasks with varying training sets. WYVERN achieves performance comparable to POT (Tbl. 5, 1st panel, neutral rate ∼ 50%) and better performance with the use of preexisting weak-label data (final row). When name cards have complex layouts, WYVERN is always favoured (Sec. A.5).

WYVERN is cost-effective
Training POT requires strong-label data (tagging of individual text segments). The tags should convey information on field type, intra-field grouping (collecting text segments that belong to the same field), and inter-field grouping (e.g. name, count, and price in Fig. 2b form a group). On the other hand, WYVERN is trained using weak-label data, i.e. parses (structured text), which can be conceived more easily and typed directly by annotators. This fundamental difference in the labels brings various advantages to WYVERN (W) compared to POT (P). Here we focus on the cost and perform a semi-quantitative analysis based on our own experience. We split the cost into five categories: annotation cost (Annot.) for preparing the training dataset, communication cost (Comm.) for teaching data annotators, annotation tool development cost (Tool dev.), maintenance cost (Maint.) that involves data denoising and post collection, and inference cost related to serving. The result is summarized in Tbl. 6. The details of the cost estimation process are presented in Sec. A.6.

A.1 POT
Here, we explain each module of POT (Hwang et al., 2019). The serializer accepts 2D text from the OCR module and converts it into a single pseudo-1D-text. To group text segments line by line, the segments are merged based on their height differences. When the text segments lie on a curved line due to physical distortion of the document, as is often observed in receipt images, a polynomial fit is used. The tagging model performs IOB-tagging on the pseudo-1D-text. The model is based on BERT (Devlin et al., 2018) except that 2D coordinate embeddings are added to the input vectors. The embeddings are prepared in the same way as in WYVERN (Sec. 2). The output consists of IOB-tags of multiple fields. For inter-field grouping (e.g. name, count, and price under item in Fig. 2a), additional IOB-tags are introduced. The tagged text is structured into raw parses by the tag2parse module. Finally, the raw parses are normalized using regular expressions and various domain-specific rules. For example, a unit price "@2,000" is converted into "2000", and Chinese numbers in postal codes are converted into English numerals.
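The kind of rule used in the final normalization step can be illustrated with a short sketch. The two rules below are hypothetical reconstructions of the examples mentioned above, not the rules actually deployed in POT; in particular, the digit-by-digit mapping of Chinese numerals is an assumption.

```python
import re

# Digit-by-digit mapping of Chinese numerals (assumed rule for postal codes).
CHINESE_DIGITS = str.maketrans("〇一二三四五六七八九", "0123456789")


def normalize_unit_price(raw):
    """Drop the '@' marker, thousands separators, and whitespace: '@2,000' -> '2000'."""
    return re.sub(r"[@,\s]", "", raw)


def normalize_postal_code(raw):
    """Replace each Chinese numeral character with the corresponding digit."""
    return raw.translate(CHINESE_DIGITS)


price = normalize_unit_price("@2,000")
postal_code = normalize_postal_code("一〇〇-〇〇〇一")
```

Rules of this kind have to be written and maintained per domain, which is part of the engineering cost that the end-to-end formulation avoids by learning normalized outputs directly.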

A.2 Evaluation methods
F1 score. To calculate F1, a group of key-value pairs from the ground truth (gt) is first matched with a group from the predicted parse based on their character-level similarity. Each predicted key-value pair is counted as a true positive if an exactly equal gt key-value pair exists in the matched group; otherwise, it is counted as a false positive. Unmatched key-value pairs in the ground truth are counted as false negatives.
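The counting step can be sketched as below. The sketch assumes that the group matching has already been done and receives the matched groups as input; the character-level matching itself is not shown, and the function name is hypothetical. Duplicated pairs are handled with multiset counts, which is also an assumption.

```python
from collections import Counter


def f1_from_matched_groups(matched_groups):
    """Compute precision, recall, and F1 from already matched groups.

    matched_groups: list of (gt_pairs, pred_pairs), each a list of (key, value).
    A gt group without a predicted counterpart is passed with an empty pred_pairs.
    """
    tp = fp = fn = 0
    for gt_pairs, pred_pairs in matched_groups:
        # Exact matches only: a single differing character makes a pair wrong.
        hits = sum((Counter(gt_pairs) & Counter(pred_pairs)).values())
        tp += hits
        fp += len(pred_pairs) - hits
        fn += len(gt_pairs) - hits
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gt_group = [("name", "Cheap wand"), ("count", "1"), ("price", "2,000")]
pred_group = [("name", "Cheap wand"), ("count", "7"), ("price", "2,000")]
precision, recall, f1 = f1_from_matched_groups([(gt_group, pred_group)])
```

In this example the wrong count produces both a false positive and a false negative, so precision, recall, and F1 are all 2/3.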
Tree edit distance. Although F1 can show model performance for individual fields, the group matching algorithm requires non-trivial modification per domain due to structural changes in parses. Hence, we use another metric, nTED, based on tree edit distance (TED) (Zhang and Shasha, 1989) 3 that can be used for any documents represented as trees.
nTED = TED(gt, pr) / TED(gt, φ)
Here, gt, pr, and φ stand for the ground truth, predicted, and empty trees, respectively. The process is depicted in Fig. 2b. To account for the permutation symmetry, the nodes at each level are sorted before the calculation using their labels and their children's labels. A similar score has recently been suggested by Zhong et al. (2020) for a table recognition task.
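A compact way to compute this score is sketched below. It uses a plain memoized recursion over ordered forests with unit edit costs instead of the dynamic program of Zhang and Shasha; both compute the ordered tree edit distance, but this version is only suitable for small trees. The tree representation (label, children) and the function names are assumptions for illustration.

```python
from functools import lru_cache


def canonical(tree):
    """Sort the children at every level so that sibling order does not matter."""
    label, children = tree
    return (label, tuple(sorted(canonical(child) for child in children)))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests with unit costs."""
    if not f1:
        return size(f2)  # insert every remaining node
    if not f2:
        return size(f1)  # delete every remaining node
    (label1, children1), (label2, children2) = f1[-1], f2[-1]
    delete = forest_distance(f1[:-1] + children1, f2) + 1
    insert = forest_distance(f1, f2[:-1] + children2) + 1
    relabel = (forest_distance(f1[:-1], f2[:-1])
               + forest_distance(children1, children2)
               + (label1 != label2))
    return min(delete, insert, relabel)


def nted(gt, pr):
    """TED(gt, pr) / TED(gt, empty tree); pr=None denotes the empty tree."""
    gt_forest = (canonical(gt),)
    pr_forest = () if pr is None else (canonical(pr),)
    return forest_distance(gt_forest, pr_forest) / size(gt_forest)


gt = ("item", (("name", (("Cheap wand", ()),)), ("count", (("1", ()),))))
reordered = ("item", (("count", (("1", ()),)), ("name", (("Cheap wand", ()),))))
wrong_count = ("item", (("name", (("Cheap wand", ()),)), ("count", (("2", ()),))))
scores = (nted(gt, reordered), nted(gt, wrong_count), nted(gt, None))
```

Since TED(gt, φ) equals the number of nodes in the ground truth tree, a single wrong value in this five-node example yields a score of 1/5, whereas the exact-match F1 would count the whole key-value pair as wrong.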
Human Evaluation via A/B test. While predefined metrics are useful for automated evaluation, their scores cannot fully reflect the overall performance. Hence, we conduct an accompanying human evaluation via A/B test. In the test, randomly selected outputs of WYVERN and POT are presented to human subjects together with the corresponding document image. The subjects are then asked to choose one of three options: A is better, B is better, or neutral. The results of the two models are randomly shown as either A or B.

A.3 1-F1 of individual fields

Annotation cost. The annotation cost is quantified by the number of documents that can be labeled by a single annotator per hour in the name card IE task. Strong-label data requires about five times longer annotation time than weak-label data (3rd column).
Communication cost. The tag annotators should be trained by an expert (1) to understand the connection between tagged texts and the corresponding parses and (2) to become accustomed to the annotation process and the tag annotation tool. In our receipt annotation task, five tag annotators were trained by one expert for five working days. The expert needed one full working day. By counting 20 working days of a single annotator as 1 Person Month (PM) and those of an expert as 3 PM, the communication (teaching) cost is calculated as 1-2 PM 4. On the other hand, the parse annotators just need to look at the images and type human-readable parses. The process is similar to summarizing documents in notes, with which they are already familiar. This minimizes the communication cost.
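One possible reading of this estimate is worked through below. The conversion rates (20 working days per month, 1 PM per annotator month, 3 PM per expert month) and the head counts come from the text; how they are combined is an assumption, and the variable names are hypothetical.

```python
WORKING_DAYS_PER_MONTH = 20
ANNOTATOR_PM_PER_MONTH = 1.0  # 20 annotator working days count as 1 PM
EXPERT_PM_PER_MONTH = 3.0     # 20 expert working days count as 3 PM


def person_months(n_people, days_each, pm_per_month):
    """Convert working days into person months at the given rate."""
    return n_people * days_each / WORKING_DAYS_PER_MONTH * pm_per_month


# Five tag annotators trained for five working days each.
annotator_cost = person_months(5, 5, ANNOTATOR_PM_PER_MONTH)
# One expert spending one full working day on the training.
expert_cost = person_months(1, 1, EXPERT_PM_PER_MONTH)
total_cost = annotator_cost + expert_cost
```

Under this reading the annotators account for 1.25 PM and the expert for about 0.15 PM, roughly 1.4 PM in total, which falls within the stated range of 1-2 PM.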
Annotation tool development cost. The tag annotation tool should be prepared per ontology. In our own experience, the tool modification takes approximately two working days of one expert per domain as the format of parses is already capable of expressing arbitrarily complex layouts (for example, the JSON format can be utilized).
Maintenance cost. Strong-label data can be modified only by people trained in converting tagged text segments into parses. There is no such restriction for weak-label data.
Inference cost. We compare the inference cost of the two IE systems by measuring the inference time. In the name card IE task, POT requires 0.4 s per document on average whereas WYVERN requires 1.4 s. In the receipt IE task, POT and WYVERN take 1.6 s and 2.3 s, respectively. In POT, the serializer accounts for most of the inference time (Fig. 1a). Although WYVERN takes slightly more time for inference, the OCR module requires a few seconds, so the overall difference between the two IE systems is not significant. The time is measured on a computer equipped with an Intel Xeon CPU (2.20 GHz) and an NVIDIA P40 GPU with a single batch.