Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents

Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-Former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities' attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches in terms of the number of trainable parameters. Despite this, experiments across various datasets show our method outperforms baselines in most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset.


Introduction
Structured document information extraction (IE) attracts increasing research interest due to the surging demand for automatic document processing, with practical applications such as receipt digitization, workflow automation, and identity verification.
Recent state-of-the-art methods for processing documents with complex layouts extensively exploit layout information, such as position, relative distance, and angle, with transformer-based models. Spatial modelling is a key contributing factor to the success of these methods (Xu et al., 2020; Appalaraju et al., 2021; Xu et al., 2021; Hwang et al., 2021). However, absolute coordinates, pair-wise relative Euclidean distance, and angle are insufficient to capture the spatial relationships in complex layouts. Two document entity pairs could carry different importance despite having the same position and distance, due to the presence or absence of other entities positioned between the pairs. We believe that spatial information can be better exploited for document entity classification.
We propose KNN-Former (https://github.com/miafei/knn-former), a parameter-efficient transformer-based model that extracts information from structured documents with combinatorial properties. In addition to relative Euclidean distance and angle embeddings as inductive biases (Hwang et al., 2021), we introduce a new form of spatial inductive bias based on the K-Nearest Neighbour (KNN) graph constructed from the document entities and integrate it directly into the attention mechanism. Specifically, we first construct a KNN graph based on the relative Euclidean distance between document entities. Then we incorporate the hop distance between entities, defined as the shortest path between two entities on the KNN graph, when computing their pair-wise attention weights. For entity pairs with the same Euclidean distance but different hop distances, the difference in hop distance still contributes to different attention weights. We further limit an entity's attention calculation to its local radius of neighborhood defined by the KNN graph, which also strengthens the inductive bias, as reflected by our experiment results.
Furthermore, many real-world document information extraction tasks come with combinatorial properties, such as a one-to-one mapping between field categories and values. Such combinatorial properties can be leveraged as inductive biases to improve extraction accuracy, but are underexplored because existing datasets do not cover such documents. Current methods that do not address the combinatorial constraints suffer from suboptimal performance on these types of documents. We further leverage this inductive bias by treating the entity classification task as a set prediction problem and using combinatorial matching to post-process model predictions (Kuhn, 1955; Carion et al., 2020; Stewart et al., 2016).
In addition, KNN-Former is parameter-efficient. Recent baseline models are initialized with parameters of pre-trained language models (Xu et al., 2020, 2021; Hwang et al., 2021; Hong et al., 2022), making their model size larger than or at least comparable to the language models. KNN-Former does not use initialized parameters of existing language models and is therefore free from this parameter-size floor. It is designed to be 100x smaller in trainable parameters than prevailing baselines. KNN-Former's parameter efficiency makes it energy-efficient, speeds up training, fine-tuning, and inference, and makes mobile deployment feasible.
To encourage the progress of IE research on complex structured documents with combinatorial mapping properties, we release an ID document dataset (named POI). While the existing ID document dataset has only 10 templates (Bulatov et al., 2021), POI exhibits better template and linguistic diversity. It also has a special mapping constraint where one field category has only one corresponding entity. In compliance with privacy regulations, the documents in the POI dataset are specimens and do not contain information about real persons.
We conduct extensive experiments to evaluate the effectiveness of our proposed method. KNN-Former outperforms baselines on most field categories across various datasets, despite having a significantly smaller model size. Extensive ablation studies show the importance of the KNN-based inductive bias and combinatorial matching.
To summarize, our contributions include: (1) a highly parameter-efficient transformer-based model that (2) incorporates KNN-based graph information in sparsified local attention and (3) uses combinatorial matching to address the one-to-one mapping constraint; and (4) a new ID document dataset with good template diversity, complex layouts, and a combinatorial mapping constraint.

Related Work
Researchers have tried multiple approaches for document information extraction (Jaume et al., 2019; Mathew et al., 2021; Stanisławek et al., 2021). However, these works do not have spatial cues, such as the position of the information in the original document. To address this shortcoming, a number of works introduce the modality of layout information as additional input features. Majumder et al. (2020) adopt positional information as input to their method for extracting information from receipt documents. LayoutLM (Xu et al., 2020) adds 1-D and 2-D absolute position encodings to text embeddings before passing them to the transformer. Hong et al. (2021) propose to train a language model from unlabeled documents with area masking, encoding relative positions of texts. StructuralLM (Li et al., 2021) assigns the bounding box cell position as the position coordinates for each word contained in it. DocFormer (Appalaraju et al., 2021) encodes 2D spatial coordinates of bounding boxes for visual and language features. LayoutLMv2 (Xu et al., 2021) uses learnable pair-wise relative positional embeddings as attention bias.
A few works propose to use graphs to represent spatial entity relationships in documents. SPADE (Hwang et al., 2021) uses a three-step graph decoder and formulates the information extraction task as a dependency parsing problem. FormNet (Lee et al., 2022) constructs a k-nearest-neighbor graph and applies a 12-layer graph convolutional network (GCN) to get the entity embeddings before feeding them into a transformer network. However, there are limitations in using a GCN to obtain embeddings. It is well established that message-passing GCNs are limited in their expressive power (Xu et al., 2018; Arvind et al., 2020; Morris et al., 2019; Chen et al., 2020; Loukas, 2019; Dehmamy et al., 2019). In addition, FormNet does not use the hop distance between nodes, which could serve as a strong inductive bias to capture the spatial relationships between document entities.
Datasets with positional information such as FUNSD (Jaume et al., 2019), CORD (Park et al., 2019), and SROIE (Huang et al., 2019) have been released to facilitate research in document understanding. However, they do not contain documents with combinatorial properties, which are common in real-world applications. MIDV500 (Arlazarov et al., 2018) and MIDV2020 (Bulatov et al., 2021) are two synthetic ID datasets with combinatorial properties, but are unsuitable for document information extraction tasks due to incomplete annotations. They also lack template diversity.

Methodology
In this section, we discuss the methodology for our model. We formulate the problem in Sec. 3.1 and explain our overall model architecture and the details of each component in Sec. 3.2.

Problem Formulation
A document D consists of multiple entities {e_i, . . ., e_j}, with bounding box coordinates and texts {x_i, . . ., x_j} detected by an Optical Character Recognition (OCR) tool. We measure the relative distance and angle between two entities e_i and e_j as σ_(i,j) based on the coordinates of their bounding boxes. Our task is to map each entity e_i in document D to its field category y_i, which is one of the predefined labels. For each field category y_i, there is only one corresponding entity e_i.
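As a rough illustration, the pairwise geometry σ_(i,j) can be derived from the OCR bounding boxes as in the sketch below (function and variable names are hypothetical; the exact normalization used in KNN-Former may differ).

```python
# Sketch only: pairwise relative Euclidean distance and angle between box centers.
import numpy as np

def pairwise_geometry(boxes):
    """boxes: (N, 4) array of [x_min, y_min, x_max, y_max] for the N entities.
    Returns an (N, N) distance matrix and an (N, N) angle matrix."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                        (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)
    diff = centers[None, :, :] - centers[:, None, :]   # vector from entity i to entity j
    dist = np.linalg.norm(diff, axis=-1)               # relative Euclidean distance
    angle = np.arctan2(diff[..., 1], diff[..., 0])     # relative angle in radians
    return dist, angle
```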

Model Architecture
We propose KNN-Former, a transformer-based model for document entity classification. The architecture of KNN-Former is shown in Fig. 1. KNN-Former uses K-Nearest Neighbours Hop Attention, which incorporates a new inductive bias into attention computation. KNN-Former also treats document entity classification as a set prediction problem and uses combinatorial assignment to address the one-to-one correspondence between entities and fields. KNN-Former is highly parameter-efficient compared to baselines. Details of model size can be found in Tab. 4.

K-Nearest Neighbors Hop Attention
One key contribution of KNN-Former is the proposed attention mechanism. Following Lee et al. (2022), we first construct a KNN graph based on the Euclidean distance between each pair of entities. We represent entities as nodes and then connect edges between each entity and its K nearest neighboring entities. We also add a self-loop to each entity to improve performance (Kipf and Welling, 2016). While previous works focus on leveraging pair-wise relative Euclidean distance (Xu et al., 2021; Hwang et al., 2021), we propose to incorporate pair-wise relative hop distance, defined as the shortest path between two entities on the KNN graph. Two entities could be in proximity in terms of Euclidean distance but not in terms of hop distance. For example, in documents with complex layouts, it is common to have two entities that are close to each other in Euclidean space but separated by a third entity positioned in between. This type of entity pair should be treated differently from pairs that are close to each other in both Euclidean and hop distance. In this case, a spatial attention mechanism based solely on the relative Euclidean distances between entity pairs is insufficient since it neglects this structural information. We argue that the KNN graph structure is an effective way of capturing this structural information and propose to incorporate it as an inductive bias into the attention computation.
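The sketch below illustrates one way to build the KNN graph and the pairwise hop-distance matrix (the SciPy-based shortest-path routine and all names are our own choices for illustration, not necessarily those of the released implementation).

```python
# Sketch only: KNN graph over entities and pairwise hop distances (shortest paths).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def knn_hop_distances(dist, k=4, max_hops=3):
    """dist: (N, N) Euclidean distances between entity centers.
    Returns integer hop distances on the undirected KNN graph; pairs beyond
    max_hops (including disconnected ones) are clipped to max_hops + 1 and
    can later be masked out by the local attention."""
    n = dist.shape[0]
    k = min(k, n - 1)
    # connect each entity to its K nearest neighbours; the self-loop comes for
    # free because each entity is its own nearest neighbour (distance 0)
    neighbours = np.argsort(dist, axis=1)[:, :k + 1]
    rows = np.repeat(np.arange(n), k + 1)
    adj = csr_matrix((np.ones(n * (k + 1)), (rows, neighbours.ravel())), shape=(n, n))
    adj = adj.maximum(adj.T)                         # make the graph undirected
    hops = shortest_path(adj, method="D", directed=False, unweighted=True)
    hops[np.isinf(hops)] = max_hops + 1              # disconnected pairs
    return np.minimum(hops, max_hops + 1).astype(int)
```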
Intuitively, different hop distances should carry different weights in calculating pairwise attention. We use φ_(i,j) to denote the hop distance between entities i and j, and H to denote a learnable embedding lookup table indexed by the hop distance φ_(i,j). Inspired by DeBERTa (He et al., 2020) and Transformer-XL (Dai et al., 2019), we integrate the hop distance bias into attention as described in Eqn. 1:

$$e_{ij} = \frac{(x_i W_Q)(x_j W_K)^{\top} + (x_i W_Q)\,(\sigma_{(i,j)} R)^{\top} + (x_j W_K)\,(H_{\phi_{(i,j)}})^{\top}}{\sqrt{d}} \qquad (1)$$

where x_i is the representation of entity i, W_Q and W_K are the query and key projections, σ_(i,j) is a concatenation of the relative Euclidean distance and angle between entities i and j, and R could be a learnable matrix or a lookup table that maps σ_(i,j) to learnable embeddings. e_ij is the attention weight between entities i and j, and a_ij is the weight of exp(e_ij) in the exponential sum over all e_ik, as described in Eqn. 3:

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})} \qquad (3)$$

Similar to how pair-wise relative Euclidean distance is added to attention, we add pair-wise hop distance through three learnable weight matrices: two multiply with the query and key vectors respectively, while the remaining one is added to the value vector. We further limit an entity's attention only to its local radius of neighborhood defined by the KNN graph. Specifically, we do not calculate e_ij if the hop distance between entities i and j exceeds a certain threshold. This also strengthens the inductive bias, as supported by our experiment results.
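A simplified single-head sketch of this attention is given below; it reflects our reading of Eqns. 1 and 3 rather than the released code, omits the Euclidean distance and angle term for brevity, and uses hypothetical module and variable names.

```python
# Sketch only: single-head attention with a hop-distance bias and local-radius mask.
import torch
import torch.nn as nn

class KNNHopAttention(nn.Module):
    def __init__(self, d_model, max_hops=3):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # one learnable embedding per hop distance (0 .. max_hops, plus "out of radius")
        self.hop_key = nn.Embedding(max_hops + 2, d_model)
        self.hop_value = nn.Embedding(max_hops + 2, d_model)
        self.max_hops = max_hops
        self.scale = d_model ** -0.5

    def forward(self, x, hops):
        """x: (N, d_model) entity embeddings; hops: (N, N) LongTensor of hop distances."""
        q, k, v = self.q(x), self.k(x), self.v(x)
        content = q @ k.t()                                            # content-to-content
        hop_bias = torch.einsum("id,ijd->ij", q, self.hop_key(hops))   # content-to-hop
        e = (content + hop_bias) * self.scale
        # local attention: drop pairs whose hop distance exceeds the threshold
        e = e.masked_fill(hops > self.max_hops, float("-inf"))
        a = torch.softmax(e, dim=-1)
        # value update with an additive hop-distance term
        return a @ v + torch.einsum("ij,ijd->id", a, self.hop_value(hops))
```

Because each entity carries a self-loop (hop distance 0), the diagonal entry always survives the mask, so the softmax remains well defined even for isolated entities.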

Combinatorial Matching
We hypothesize that combinatorial properties between field categories and entities can be leveraged as inductive biases to improve extraction performance. Different from existing methods that treat the classification of each entity independently (Xu et al., 2021; Hwang et al., 2021; Lee et al., 2022), we propose to treat the entity classification task as a set prediction problem to exploit the one-to-one mapping constraint, where one field has one and only one corresponding entity. The combinatorial assignment is described in Eqn. 4:

$$\hat{\tau} = \underset{\tau \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} L_{match}\big(y_i, \hat{y}_{\tau(i)}\big) \qquad (4)$$

where τ is an assignment (a permutation of the N elements), L_match is the matching cost, and N is the number of entities in a document.
In practice, N is often much larger than the number of entities of interest. Therefore, we pad the ground truths to N in order to perform a one-to-one combinatorial assignment. The optimal assignment can be computed with the Hungarian algorithm in polynomial time (Kuhn, 1955; Carion et al., 2020; Stewart et al., 2016).
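At inference time, the same idea can be used to post-process the per-entity class probabilities so that each field of interest receives exactly one entity. The sketch below uses SciPy's Hungarian solver; the function name and tensor shapes are hypothetical.

```python
# Sketch only: one-to-one assignment of entities to field categories.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_fields(log_probs, field_ids):
    """log_probs: (N, C) per-entity log-probabilities over C classes.
    field_ids: class indices that must each receive exactly one entity.
    Returns a dict mapping field class index -> entity index."""
    cost = -log_probs[:, field_ids]                  # cost of giving entity i to field c
    entity_idx, field_idx = linear_sum_assignment(cost)
    return {field_ids[f]: int(e) for e, f in zip(entity_idx, field_idx)}
```

Unlike an independent per-entity argmax, this guarantees that two entities are never assigned to the same field, matching the one-to-one constraint of the datasets considered here.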

Datasets
Many real-world documents exhibit combinatorial properties, such as a one-to-one mapping between fields and entities. However, existing public datasets do not cover documents with such properties (Jaume et al., 2019; Park et al., 2019; Huang et al., 2019). To fill the gap, we release a new ID document dataset, POI, and enhanced annotations of MIDV2020. We also verify our method on a private dataset, PRV. All three datasets exhibit combinatorial properties.
In addition, we design the POI dataset to be template-rich with diverse languages. We also design the enhanced MIDV2020 with a difficult split such that templates in testing are unseen during training. BERT alone, without spatial information, can achieve above 90% F1 on some existing datasets (Hong et al., 2022; Park et al., 2019; Huang et al., 2019), indicating that leveraging text information alone is relatively sufficient there. Yet in many real-world use cases, using text alone is insufficient. This motivates us to work on more challenging datasets where the exploitation of spatial information is important. Dataset statistics are summarized in Tab. 1 and Tab. 2. More details are as follows.

In real-world applications, it is common to extract a set of entities from documents that have combinatorial properties between their fields and entities. ID document information extraction is one such use case, where we only expect to extract one entity for each field category of interest. This one-to-one correspondence can be leveraged to improve classification performance. Despite being a common task setting, we notice a lack of method exploration and innovation in this direction, due to the unavailability of such properties among existing popular document datasets. More details about the dataset can be found in the Appendix.

Experiments
In this section, we conduct extensive experiments to evaluate our proposed KNN-Former on the aforementioned datasets. We first compare our results with several baselines in Sec. 5.1. Then in Sec. 5.2, we evaluate the generalization ability of our method on unseen templates. We then conduct ablation studies in Sec. 5.3 and Sec. 5.4 to assess the effects of each component in KNN-Former and the impact of different K in the KNN graph.

Comparison with Baselines on Multiple Datasets
We first evaluate the performance of KNN-Former against multiple competitive methods. We choose base models for most of the baselines, because these are closest to KNN-Former in terms of the number of parameters. Brief descriptions of the baseline models as well as the implementation details of all models can be found in Sec. A.1. We do not have results for StructuralLM on the POI dataset because of an OOV error. Tab. 4 shows the entity-level classification performance. The results show that our method outperforms the baselines on most entity types across various datasets. In particular, KNN-Former outperforms LayoutLMv2 BASE, a state-of-the-art model that uses additional image features. We also observe that BERT performs poorly on these datasets, indicating the importance of exploiting spatial information.
Secondly, as shown in the Trainable Param column of Tab. 4, KNN-Former is highly parameter-efficient. All baselines except GCN have more than 100 million trainable parameters, while KNN-Former has only 0.5 million and is orders of magnitude smaller than competing methods. Even after adding the sentence transformer, KNN-Former has only 23.2 million parameters, still 5x smaller than baselines. The parameter efficiency has four benefits. First, it contributes to learning and inference time efficiency, with details in Sec. 5.5. Second, it allows for faster fine-tuning on new datasets and domains, especially in real-world use cases where training datasets are big and re-training is frequent. Third, smaller model size and faster inference time make mobile deployment more feasible. Fourth, training, fine-tuning, and running inference with smaller models reduce power consumption and carbon footprint. Despite the smaller model size, KNN-Former achieves comparable or better performance across datasets.
Thirdly, we observe that KNN-Former underperforms both LayoutLM BASE and LayoutLMv2 BASE on name-related entities in both the POI and PRV datasets. The robustness of the two baselines in predicting names could be attributed to their extensive pre-training: they learn common names during pre-training, enabling them to predict names correctly regardless of context. However, despite having no such pre-training, KNN-Former still outperforms BROS and StructuralLM, which are also pre-trained on 11 million documents.
Fourthly, we observe that all methods suffer performance degradation on MIDV2020 compared to the other two datasets. This is because in MIDV2020 training and testing documents are split by country, so templates in testing are not seen during training. In addition, MIDV2020 has only 6 templates in the training data, which easily leads to overfitting. A detailed discussion of generalization ability can be found in Sec. 5.2. We find that BERT outperforms several baselines with spatial modelling on names; this may be due to overfitting to the limited number of training templates. We also notice that our method does not perform well on the ID number entity. We conducted manual inspection of several error cases and find that in many documents there exist two different types of ID numbers (see Fig. 3(b)), but only one of them is labeled as ID number according to the provided annotations. Our model sometimes predicts the other one as the ID number. This also explains the poor performance on ID number for some other baselines.
Lastly, we notice that on the PRV dataset, KNN-Former performs poorly on the DoB field, underperforming even GCN. KNN-Former's performance on DoB drops after combinatorial matching, despite an overall increase in macro average F1. This could be due to the presence of noise in the ground truth, since this dataset is annotated by automatic fuzzy labeling logic. Manual examination of a few documents confirms our hypothesis.

Evaluation of generalization ability on unseen templates
To assess the generalization capability of our model, we test and compare it with other competitive baselines on the MIDV2020 dataset using two train/test settings: random split and split by country. The country split is a more difficult setting, as the templates in testing are unseen during training. Intuitively, we would expect a decline in performance compared to the random split setting. Fig. 2 shows the macro average F1 scores of KNN-Former and multiple baselines under both the random split and the country split.
We observe across-the-board performance degradation for all methods after switching from the random split to the country split. However, the drop is least significant for KNN-Former, enabling it to achieve 10% higher F1 than the best baseline. These experiments indicate that our method is more robust and generalizes better to unseen templates than existing baseline models. This is helpful in real-world applications where models frequently encounter new types of documents. To better understand how KNN-Former works, we ablatively study the effects of each component and report the results in Tab. 5. Entity-level detailed results can be found in the Appendix.

Effects of each component in KNN-Former
Firstly, we observe a 2.43% drop in performance with the removal of KNN hop attention and an even bigger 5.09% drop when local attention is removed together with KNN hop attention. This demonstrates that the KNN graph-based inductive bias is effective in capturing the structural information between document entities. It also shows that local attention, the practice of masking out attention weights when the hop distance between two entities exceeds a pre-defined threshold, further strengthens the inductive bias.
Secondly, we observe that the commonly used spatial inductive bias based on the pairwise relative Euclidean distance and angle also plays an important role. When both relative Euclidean distance attention and KNN hop attention are absent, there is a 4.09% drop in performance, an additional decrease of 1.66% compared to when only KNN hop attention is ablated (2.43%). The overlap in performance drop suggests some information is captured by both Euclidean distance and hop distance, as some pairs are similarly close or far from each other as measured in both distances. However, each distance also complements the other by capturing additional information. For example, two pairs could carry different importance despite having the same Euclidean distance, due to the presence or absence of other entities positioned between the pairs, signifying the importance of hop distance.
Thirdly, we notice that the F1 score drops drastically by 4.76% when combinatorial matching is ablated. This demonstrates the important contribution of combinatorial matching, as the datasets we experiment on are all subject to a special one-to-one mapping constraint between fields and entities. Combinatorial matching enables our method to treat entity classification as a set prediction problem, instead of predicting each entity's class independently, which enhances our model's robustness.
Lastly, we observe a 4.43% drop in performance when absolute positional encoding is added. Previous work (Hwang et al., 2021) has similar findings that adding absolute positional encoding is not helpful, especially when the test set contains a diverse set of unseen templates. In our experiments, adding absolute positional encoding improves performance in training but generalizes poorly in testing.

Impact of different K in the KNN graph
To further study how the hyperparameter K of the KNN graph affects performance, we conduct experiments with different values of K on the POI dataset. As shown in Tab. 6, the

Runtime Comparison
In addition to performance evaluation, we also evaluate the runtime of our model against competitive baselines. For a fair comparison, we report the total runtime of the sentence transformer plus KNN-Former, since KNN-Former uses the sentence transformer for text embeddings. In fact, the sentence transformer takes up half of the time in our pipeline. We first measure the runtime to process a single document for each method. As shown in Tab. 7, the time taken for the sentence encoder plus KNN-Former is comparable to LayoutLM and BROS, and faster than SPADE and LayoutLMv2. We run StructuralLM (written in TensorFlow 1.14) on CPU due to a CUDA version mismatch, hence there is no speed measurement for it. Moreover, our method allows for significantly larger batch sizes because of the smaller model size. Therefore, runtime for documents processed in batches is significantly faster than the baselines. Running with the maximum possible batch size for each model on a 16GB V100 GPU, KNN-Former is significantly faster than the rest, as shown in Tab. 7. This experiment demonstrates that our model is advantageous when faster execution time is desirable, which can be attributed to its lightweight design.

Conclusion
We propose KNN-Former, a parameter-efficient transformer-based model for document entity classification. KNN-Former uses KNN Hop Attention, a new attention mechanism that leverages a KNN graph-based inductive bias to capture structural information between document entities. KNN-Former utilizes combinatorial matching to perform set prediction. We also release POI, a template-rich ID document dataset subject to combinatorial constraints. Experiments show that KNN-Former outperforms baselines in entity classification across various datasets.

Limitations
We identify the following limitations in this work. First, the robust performance of baseline methods that leverage image features (Appalaraju et al., 2021) testifies to the importance of visual cues. The inclusion of image features in KNN-Former might contribute to better performance. Second, unlike models that perform extensive pre-training (Xu et al., 2020, 2021), KNN-Former might lack generic domain knowledge. Third, KNN-Former uses a vanilla sentence transformer to get the text embedding inputs. The sentence transformer model is pre-trained and not fine-tuned on the new datasets. An end-to-end training pipeline that jointly trains the text encoding model and KNN-Former could lead to better results. Fourth, there are many design choices we did not explore, such as applying attention directly at the token level and pooling representations at the end. Lastly, KNN-Former, along with all baselines used in this work, is subject to OCR failure. All models consume OCR outputs such as bounding box coordinates and texts. In the case of OCR failure, where one bounding box is detected as two or two boxes are merged into one, models that consume OCR results are less likely to make correct predictions.

Ethics Statement
This work has obtained clearance from the authors' institutional review board. The annotators for POI and MIDV2020 are all paid full-time interns and researchers hired by our institute, whose compensation is determined based on the salary guidelines of our institute. Among the datasets and annotations released, POI only contains specimens with dummy values, while MIDV is a synthetic dataset. External data are accessed and used in compliance with fair use clauses. We conduct experiments on the private dataset PRV in a secure data zone with strict access control, using auto-labeling scripts for annotations.

A Appendix
A.1 Implementation details

We briefly describe the baseline models as well as the implementation details of all models in this section.
• BERT BASE (Devlin et al., 2019): We use the pre-trained BERT base model for token classification.
• GCN (Kipf and Welling, 2016): We use a sentence transformer (Reimers and Gurevych, 2019) to get the embeddings of the text inputs and use them as the node features of the constructed KNN graph. Then we train a 2-layer graph convolutional network to classify the nodes/entities. A minimal sketch of this baseline is given after the list below.

• LayoutLMv2 BASE (Xu et al., 2021): In addition to LayoutLM, LayoutLMv2 adds a new multi-modal task during pre-training to take in visual cues and incorporates a novel spatial-aware self-attention mechanism.
• StructuralLM LARGE (Li et al., 2021): On top of LayoutLM, StructuralLM uses the cell position for each word, and introduces a new pre-training task that predicts the cell position.
It is also pre-trained on the IIT-CDIP dataset.
• SPADE (Hwang et al., 2021): SPADE builds a directed graph of document entities and extracts and parses the spatial dependency using both linguistic and spatial information.
• BROS (Hong et al., 2022): Similar to LayoutLM, BROS is also pre-trained on the IIT-CDIP dataset, but with a different area-masking pre-training task and a different method to encode the 2D positions of bounding boxes.
• DocFormer (Appalaraju et al., 2021): DocFormer is a multi-modal transformer that takes in both text and visual cues. It proposes a multi-modal attention mechanism and is pre-trained with several tasks involving both text and image input.
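As referenced in the GCN bullet above, that baseline can be sketched roughly as follows; the use of PyTorch Geometric and all names here are our own assumptions for illustration.

```python
# Sketch only: 2-layer GCN baseline over the KNN graph with text-embedding node features.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNBaseline(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        # x: (N, in_dim) sentence-transformer embeddings of the entity texts
        # edge_index: (2, E) edges of the KNN graph (including self-loops)
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.classifier(h)        # per-entity class logits
```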
All models are trained on 16G V100 GPUs and implemented with PyTorch, except for StructuralLM LARGE, for which we use the official repository implemented in TensorFlow 1.14 and train on CPU because of a CUDA version mismatch. We use APIs open-sourced by Huggingface for BERT, LayoutLM BASE, and LayoutLMv2 BASE. SPADE is implemented using the official implementation released by ClovaAI. BROS is implemented using its released official repository. Only text inputs are passed to BERT BASE for classification, while bounding box coordinates are neglected. Results are obtained after training for 100 epochs. We trained the SPADE

Figure 1: An illustration of KNN-Former. Bounding box texts are embedded using a sentence transformer and concatenated with embeddings of bounding box size to form the input embeddings. The concatenated embeddings are then passed to the transformer layers with KNN Hop Attention, which incorporates the pair-wise relative hop distance between entities on the KNN graph in attention calculation. The output entity representations of the transformer layers are passed to combinatorial matching for set prediction.

Figure 2: Macro average F1 scores of KNN-Former and various baseline models under random split and country split on the MIDV2020 dataset.

Table 1: Number of documents in training and testing.

Table 2: Statistics of entity distribution in documents. Ent. stands for entities and Doc. stands for documents. There are 8 field categories in total: last name, first name, date of birth, date of issue, date of expiry, ID number, key, and others. Key represents entities that indicate the field names for the important entities (e.g., Last Name) that we are interested in extracting. The first 6 field categories appear in each document image once and only once, creating a special mapping constraint unseen in other datasets. The last 2 field categories (key and others) are not subject to the constraint.

Table 3: Document types in the POI dataset.

MIDV2020: We utilize the 1000 synthesized ID documents from the initial MIDV2020 dataset (Bulatov et al., 2021). These documents are generated from 10 templates, with 100 documents for each template. Each document image is annotated with a list of bounding box coordinates and field values. We find that only artificially generated entities, such as the values of names and ID numbers, are annotated, while entities that belong to the original templates, such as the document title and field names, are not. We proceed to annotate the remaining entities. The newly annotated ground truths of MIDV2020 will be released alongside POI. These enhanced annotations enable us to perform the information extraction task in a setting that is closer to real-world application, where all texts recognized by the OCR engine are used. The train/test split we introduce for MIDV2020 is a split by country; this ensures that the document templates in testing are unseen during training. The country split simulates real-world scenarios where extension to new countries or new versions of documents is needed. More details can be found in the Appendix.

Table 4: Entity-level F1 scores of KNN-Former compared to baselines. Columns L.Name, F.Name, DoB, DoI, DoE, and ID No. correspond to results for Last Name, First Name, Date of Birth, Date of Issue, Date of Expiry, and ID Number. GCN and KNN-Former have an additional 22.7 M fixed parameters since we employ a lightweight 6-layer sentence transformer (Reimers and Gurevych, 2019) to get the text embeddings.

Table 5: Ablation results on the POI dataset. (-) indicates the component is absent compared to KNN-Former; (+) indicates the component is additional.

Table 7: Runtime comparison with baselines. Time taken is reported in milliseconds.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579-2591, Online. Association for Computational Linguistics.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pages 1192-1200. Association for Computing Machinery.

• LayoutLM BASE (Xu et al., 2020): LayoutLM is a transformer-based model for document image understanding. It is pre-trained on the IIT-CDIP Test Collection with 11 million scanned images.